library(gbm)
## Loaded gbm 2.1.8.1
library(MASS)
df <- Boston
set.seed(1234)
sp <- sample(1:nrow(df), 354)
df.train <- df[sp,]
df.test <- df[-sp,]
str(df)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(df.train)
## [1] 354 14
dim(df.test)
## [1] 152 14
gbm.boston <- gbm(medv~., data = df.train, n.trees = 5000)
## Distribution not specified, assuming gaussian ...
This code fits a gradient boosting machine (GBM) model using the gbm() function from the gbm package in R.
GBM is a popular machine learning algorithm for both regression and classification problems. It is an ensemble method that combines many weak predictive models, typically shallow decision trees, into a more powerful model. The basic idea is to add trees to the model iteratively, each one fit to the errors (residuals) of the trees before it; the final model is the sum of all the trees.
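To make the "trees fitted to residuals" idea concrete, here is a minimal hand-rolled sketch of least-squares boosting using shallow rpart trees. This illustrates the principle only; it is not how gbm() is implemented internally, and the boost_sketch() helper, the shrinkage value, and the tree depth are all choices made up for this example.
library(rpart)
# Hand-rolled least-squares boosting: each shallow tree is fit to the
# residuals of the current ensemble, and its shrunken predictions are
# added to the running fit.
boost_sketch <- function(x, y, n.trees = 100, shrinkage = 0.1) {
  fit <- rep(mean(y), length(y))     # start from the mean of the response
  trees <- vector("list", n.trees)
  for (b in seq_len(n.trees)) {
    d <- cbind(x, .res = y - fit)    # residuals of the current fit
    trees[[b]] <- rpart(.res ~ ., data = d, maxdepth = 2)
    fit <- fit + shrinkage * predict(trees[[b]], x)
  }
  list(trees = trees, fitted = fit)
}
bst <- boost_sketch(df.train[, -14], df.train$medv)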
In this case, the GBM model is trained with medv as the response variable and all other variables in the df.train dataset as predictors. The n.trees parameter is set to 5000, so the model fits 5000 decision trees. Because no distribution was specified, gbm() assumed a gaussian (squared-error) loss, as the message above indicates.
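It is worth checking whether 5000 trees is actually the right number, since with the default shrinkage of 0.1 the model may start overfitting well before that. Because bag.fraction defaults to 0.5, gbm keeps out-of-bag improvement estimates, so one quick check, sketched below, is gbm.perf() with method = "OOB" (the gbm documentation notes that the OOB estimate tends to underestimate the optimal number of iterations).
# Estimate the optimal number of trees from the out-of-bag improvements.
best.iter <- gbm.perf(gbm.boston, method = "OOB", plot.it = FALSE)
best.iter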
names(gbm.boston)
## [1] "initF" "fit" "train.error"
## [4] "valid.error" "oobag.improve" "trees"
## [7] "c.splits" "bag.fraction" "distribution"
## [10] "interaction.depth" "n.minobsinnode" "num.classes"
## [13] "n.trees" "nTrain" "train.fraction"
## [16] "response.name" "shrinkage" "var.levels"
## [19] "var.monotone" "var.names" "var.type"
## [22] "verbose" "data" "Terms"
## [25] "cv.folds" "call" "m"
The gbm() function in R returns an object of class gbm that contains the various components of the fitted GBM model, and names() lists them. Here are some common components you will see (a quick way to inspect a few of them is sketched after this list):
- distribution: the distribution (loss function) used for the response variable
- shrinkage: the shrinkage parameter (learning rate) controlling the contribution of each tree to the model
- n.trees: the number of trees in the model
- interaction.depth: the maximum depth of each tree
- n.minobsinnode: the minimum number of observations allowed in each terminal node
- bag.fraction: the fraction of observations used to train each tree (a value between 0 and 1)
- train.fraction: the fraction of the data used for training, with the remainder used to estimate out-of-sample error
- cv.folds: the number of cross-validation folds used to tune the model
- data: the training data, stored when keep.data = TRUE (the default)
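For example, individual components can be pulled out of the fitted object with $, and summary() on a gbm object reports the relative influence of each predictor:
gbm.boston$shrinkage                 # learning rate used in the fit (default 0.1)
gbm.boston$interaction.depth         # tree depth used in the fit (default 1)
summary(gbm.boston, plotit = FALSE)  # relative influence of each predictor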
Next, we generate predictions on the test set. For gbm objects, predict() takes an n.trees argument giving the number of trees to use:
gbm.predictions <- predict(gbm.boston, newdata = df.test, n.trees = 5000)
gbm.predictions
## [1] 23.935971 32.608486 32.893069 21.539574 17.115374 13.413440 15.918031
## [8] 16.860827 11.732292 15.040487 15.271480 16.904246 20.059114 15.610719
## [15] 17.198863 19.994315 25.079467 23.943777 20.130015 13.938996 14.999024
## [22] 19.947518 13.512369 32.075211 20.793629 20.242334 16.445325 25.071896
## [29] 24.251273 21.946026 25.716360 20.959580 19.353139 22.083354 15.401075
## [36] 17.337369 21.691178 14.753193 15.459220 16.707022 20.024781 14.118165
## [43] 15.252928 17.208194 15.631414 10.162659 12.007802 13.577960 43.181358
## [50] 47.244481 27.056571 17.625174 22.801004 25.327921 35.377638 27.461126
## [57] 32.296838 32.625246 39.700423 35.944790 29.854919 35.339376 24.721686
## [64] 22.557887 19.702711 16.974687 40.784192 41.911594 33.030486 45.252078
## [71] 22.846642 25.386085 28.702969 12.798120 31.355405 32.424056 21.249219
## [78] 19.402364 39.303846 34.724007 40.276481 43.858744 24.361039 35.661047
## [85] 33.403324 32.553205 45.897830 19.222601 24.580476 26.924984 31.673783
## [92] 28.496781 21.416587 26.381558 25.070364 15.696453 21.068919 22.756717
## [99] 22.363258 17.499260 19.602081 23.744260 23.323494 23.735277 21.404417
## [106] 21.559026 32.497149 29.784905 20.335786 21.098606 19.735515 17.339285
## [113] 27.895922 40.929592 46.356357 20.895468 18.058086 14.617031 7.196315
## [120] 13.426398 15.543709 1.130734 18.979508 21.399623 12.214506 14.262533
## [127] 15.326965 13.754721 23.067952 15.356723 10.364479 14.089803 15.452866
## [134] 13.192462 11.680304 27.435300 11.551694 12.776991 25.578895 21.025716
## [141] 17.546829 22.026402 22.049506 26.765847 17.420399 22.572812 21.706963
## [148] 22.619581 19.851625 21.442159 20.972083 18.565331
Finally, we evaluate the test-set predictions with the Metrics package:
library(Metrics)
rmse(actual = df.test$medv, predicted = gbm.predictions)
## [1] 3.798868
mae(actual = df.test$medv, predicted = gbm.predictions)
## [1] 2.925564
mape(actual = df.test$medv, predicted = gbm.predictions)
## [1] 0.1458156
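These metrics are easy to compute by hand, which makes their definitions explicit; the MAPE of roughly 0.146 means the predictions are off by about 15% of the true median home value on average:
err <- df.test$medv - gbm.predictions
sqrt(mean(err^2))              # RMSE: root mean squared error
mean(abs(err))                 # MAE: mean absolute error
mean(abs(err / df.test$medv))  # MAPE: mean absolute percentage error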
?gbm
The help page shows the function's full argument list and its defaults:
gbm(
  formula = formula(data),
  distribution = "bernoulli",
  data = list(),
  weights,
  var.monotone = NULL,
  n.trees = 100,
  interaction.depth = 1,
  n.minobsinnode = 10,
  shrinkage = 0.1,
  bag.fraction = 0.5,
  train.fraction = 1,
  cv.folds = 0,
  keep.data = TRUE,
  verbose = FALSE,
  class.stratify.cv = NULL,
  n.cores = NULL
)
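Several of these defaults are worth tuning rather than accepting, notably shrinkage, interaction.depth, and n.trees. As a sketch (the specific values below are illustrative choices, not recommendations), a common pattern is to lower the learning rate, allow deeper trees, and let cross-validation pick the number of trees:
set.seed(1234)
gbm.tuned <- gbm(medv ~ ., data = df.train,
                 distribution = "gaussian",  # state the loss explicitly
                 n.trees = 5000,
                 shrinkage = 0.01,           # smaller learning rate
                 interaction.depth = 4,      # deeper trees than default stumps
                 cv.folds = 5)
best.iter <- gbm.perf(gbm.tuned, method = "cv")  # CV-optimal number of trees
gbm.tuned.pred <- predict(gbm.tuned, newdata = df.test, n.trees = best.iter)
rmse(actual = df.test$medv, predicted = gbm.tuned.pred)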