library(gbm)
## Loaded gbm 2.1.8.1
library(MASS)
df <- Boston
set.seed(1234)
sp <- sample(1:nrow(df), 354)
df.train <- df[sp,]
df.test <- df[-sp,]
str(df)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(df.train)
## [1] 354 14
dim(df.test)
## [1] 152 14
gbm.boston <- gbm(medv~., data = df.train, n.trees = 5000)
## Distribution not specified, assuming gaussian ...
This code fits a gradient boosting machine (GBM) model using the gbm() function from the gbm package in R.
GBM is a popular machine learning algorithm for both regression and classification problems. It is an ensemble method that combines many weak predictive models, typically shallow decision trees, into a more powerful model. The basic idea is to add trees to the model iteratively, each one fit to the errors (residuals) of the trees before it; the final model is the sum of all the trees.
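To make the "trees fitted to residuals" idea concrete, here is a minimal hand-rolled sketch of least-squares boosting using shallow rpart trees. This illustrates the principle only; it is not how gbm() is implemented internally, and the boost_sketch() helper, the shrinkage value, and the tree depth are all choices made up for this example.
library(rpart)
# Hand-rolled least-squares boosting: each shallow tree is fit to the
# residuals of the current ensemble, and its shrunken predictions are
# added to the running fit.
boost_sketch <- function(x, y, n.trees = 100, shrinkage = 0.1) {
  fit <- rep(mean(y), length(y))     # start from the mean of the response
  trees <- vector("list", n.trees)
  for (b in seq_len(n.trees)) {
    d <- cbind(x, .res = y - fit)    # residuals of the current fit
    trees[[b]] <- rpart(.res ~ ., data = d, maxdepth = 2)
    fit <- fit + shrinkage * predict(trees[[b]], x)
  }
  list(trees = trees, fitted = fit)
}
bst <- boost_sketch(df.train[, -14], df.train$medv)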
In this case, the GBM model is trained with medv as the response variable and all other variables in the df.train dataset as predictors. The n.trees parameter is set to 5000, so the model fits 5000 decision trees. Because no distribution was specified, gbm() assumed a gaussian (squared-error) loss, as the message above indicates.
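It is worth checking whether 5000 trees is actually the right number, since with the default shrinkage of 0.1 the model may start overfitting well before that. Because bag.fraction defaults to 0.5, gbm keeps out-of-bag improvement estimates, so one quick check, sketched below, is gbm.perf() with method = "OOB" (the gbm documentation notes that the OOB estimate tends to underestimate the optimal number of iterations).
# Estimate the optimal number of trees from the out-of-bag improvements.
best.iter <- gbm.perf(gbm.boston, method = "OOB", plot.it = FALSE)
best.iter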
names(gbm.boston)
## [1] "initF" "fit" "train.error"
## [4] "valid.error" "oobag.improve" "trees"
## [7] "c.splits" "bag.fraction" "distribution"
## [10] "interaction.depth" "n.minobsinnode" "num.classes"
## [13] "n.trees" "nTrain" "train.fraction"
## [16] "response.name" "shrinkage" "var.levels"
## [19] "var.monotone" "var.names" "var.type"
## [22] "verbose" "data" "Terms"
## [25] "cv.folds" "call" "m"
The gbm() function in R returns an object of class gbm that contains the various components of the fitted GBM model, and names() lists them. Here are some common components you will see (a quick way to inspect a few of them is sketched after this list):
- distribution: the distribution (loss function) used for the response variable
- shrinkage: the shrinkage parameter (learning rate) controlling the contribution of each tree to the model
- n.trees: the number of trees in the model
- interaction.depth: the maximum depth of each tree
- n.minobsinnode: the minimum number of observations allowed in each terminal node
- bag.fraction: the fraction of observations used to train each tree (a value between 0 and 1)
- train.fraction: the fraction of the data used for training, with the remainder used to estimate out-of-sample error
- cv.folds: the number of cross-validation folds used to tune the model
- data: the training data, stored when keep.data = TRUE (the default)
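For example, individual components can be pulled out of the fitted object with $, and summary() on a gbm object reports the relative influence of each predictor:
gbm.boston$shrinkage                 # learning rate used in the fit (default 0.1)
gbm.boston$interaction.depth         # tree depth used in the fit (default 1)
summary(gbm.boston, plotit = FALSE)  # relative influence of each predictor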
Next, we generate predictions on the test set. For gbm objects, predict() takes an n.trees argument giving the number of trees to use:
gbm.predictions <- predict(gbm.boston, newdata = df.test, n.trees = 5000)
gbm.predictions
## [1] 23.935971 32.608486 32.893069 21.539574 17.115374 13.413440 15.918031
## [8] 16.860827 11.732292 15.040487 15.271480 16.904246 20.059114 15.610719
## [15] 17.198863 19.994315 25.079467 23.943777 20.130015 13.938996 14.999024
## [22] 19.947518 13.512369 32.075211 20.793629 20.242334 16.445325 25.071896
## [29] 24.251273 21.946026 25.716360 20.959580 19.353139 22.083354 15.401075
## [36] 17.337369 21.691178 14.753193 15.459220 16.707022 20.024781 14.118165
## [43] 15.252928 17.208194 15.631414 10.162659 12.007802 13.577960 43.181358
## [50] 47.244481 27.056571 17.625174 22.801004 25.327921 35.377638 27.461126
## [57] 32.296838 32.625246 39.700423 35.944790 29.854919 35.339376 24.721686
## [64] 22.557887 19.702711 16.974687 40.784192 41.911594 33.030486 45.252078
## [71] 22.846642 25.386085 28.702969 12.798120 31.355405 32.424056 21.249219
## [78] 19.402364 39.303846 34.724007 40.276481 43.858744 24.361039 35.661047
## [85] 33.403324 32.553205 45.897830 19.222601 24.580476 26.924984 31.673783
## [92] 28.496781 21.416587 26.381558 25.070364 15.696453 21.068919 22.756717
## [99] 22.363258 17.499260 19.602081 23.744260 23.323494 23.735277 21.404417
## [106] 21.559026 32.497149 29.784905 20.335786 21.098606 19.735515 17.339285
## [113] 27.895922 40.929592 46.356357 20.895468 18.058086 14.617031 7.196315
## [120] 13.426398 15.543709 1.130734 18.979508 21.399623 12.214506 14.262533
## [127] 15.326965 13.754721 23.067952 15.356723 10.364479 14.089803 15.452866
## [134] 13.192462 11.680304 27.435300 11.551694 12.776991 25.578895 21.025716
## [141] 17.546829 22.026402 22.049506 26.765847 17.420399 22.572812 21.706963
## [148] 22.619581 19.851625 21.442159 20.972083 18.565331
Finally, we evaluate the test-set predictions with the Metrics package:
library(Metrics)
rmse(actual = df.test$medv, predicted = gbm.predictions)
## [1] 3.798868
mae(actual = df.test$medv, predicted = gbm.predictions)
## [1] 2.925564
mape(actual = df.test$medv, predicted = gbm.predictions)
## [1] 0.1458156
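These metrics are easy to compute by hand, which makes their definitions explicit; the MAPE of roughly 0.146 means the predictions are off by about 15% of the true median home value on average:
err <- df.test$medv - gbm.predictions
sqrt(mean(err^2))              # RMSE: root mean squared error
mean(abs(err))                 # MAE: mean absolute error
mean(abs(err / df.test$medv))  # MAPE: mean absolute percentage error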
?gbm
The help page shows the function's full argument list and its defaults:
gbm(
  formula = formula(data),
  distribution = "bernoulli",
  data = list(),
  weights,
  var.monotone = NULL,
  n.trees = 100,
  interaction.depth = 1,
  n.minobsinnode = 10,
  shrinkage = 0.1,
  bag.fraction = 0.5,
  train.fraction = 1,
  cv.folds = 0,
  keep.data = TRUE,
  verbose = FALSE,
  class.stratify.cv = NULL,
  n.cores = NULL
)
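Several of these defaults are worth tuning rather than accepting, notably shrinkage, interaction.depth, and n.trees. As a sketch (the specific values below are illustrative choices, not recommendations), a common pattern is to lower the learning rate, allow deeper trees, and let cross-validation pick the number of trees:
set.seed(1234)
gbm.tuned <- gbm(medv ~ ., data = df.train,
                 distribution = "gaussian",  # state the loss explicitly
                 n.trees = 5000,
                 shrinkage = 0.01,           # smaller learning rate
                 interaction.depth = 4,      # deeper trees than default stumps
                 cv.folds = 5)
best.iter <- gbm.perf(gbm.tuned, method = "cv")  # CV-optimal number of trees
gbm.tuned.pred <- predict(gbm.tuned, newdata = df.test, n.trees = best.iter)
rmse(actual = df.test$medv, predicted = gbm.tuned.pred)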