Random Forest with Boston

Load the libraries

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(MASS)

Explore the data

df <- Boston
head(df)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(df)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Split the data

set.seed(1)
split.index <- sample(1:nrow(df), nrow(df)/2)
df.train <- df[split.index,]
df.test <- df[-split.index,]

This code snippet creates a 50/50 train-test split of the Boston data already loaded into “df” above (Boston ships with the MASS package and records housing values and related variables for Boston suburbs). Here’s a step-by-step explanation of the code:

  1. The first line sets the seed of the random number generator to 1, so that the same train-test split is reproduced each time the code is run.

  2. The second line draws a random sample of row indices with the “sample” function. The first argument specifies the population to sample from (here the row indices 1:nrow(df)), and the second argument specifies the sample size (here half the number of rows in “df”). The result is a vector of 253 randomly selected row indices.

  3. The third line subsets “df” with those indices to create the training set: the “[split.index, ]” notation selects all rows of “df” whose indices are in the “split.index” vector.

  4. The fourth line creates the test set from the remaining rows: the “[-split.index, ]” notation selects all rows of “df” whose indices are not in the “split.index” vector.
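
As a quick sanity check, here is a minimal sketch (using only the objects created above) that confirms the two halves are disjoint and together cover the whole dataset:

nrow(df.train)                              # 253 rows in the training set
nrow(df.test)                               # 253 rows in the test set
nrow(df.train) + nrow(df.test) == nrow(df)  # TRUE: every row is used exactly once
any(rownames(df.train) %in% rownames(df.test))  # FALSE: no row appears in both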

Build the model

rf.boston = randomForest(medv~., data = df.train, importance=TRUE)
rf.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = df.train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 9.94747
##                     % Var explained: 87.06

This code snippet fits a random forest regression model to the “df.train” dataset using the “randomForest” function in R. Here’s a step-by-step explanation of the code:

  1. The first line fits a random forest regression model to the “df.train” dataset using the “randomForest” function. The formula “medv~.” specifies that we want to predict the “medv” variable using all the other variables in the dataset. The argument “data = df.train” specifies that we want to use the “df.train” data frame for training the model. The argument “importance = TRUE” specifies that we want to compute variable importance measures for the model.

  2. The second line prints a summary of the fitted random forest object “rf.boston”. This includes the number of trees in the forest (500), the number of variables tried at each split (4), the mean of squared residuals, and the percentage of variance explained; the last two are estimated from the out-of-bag samples.

Random forests are an ensemble learning method that combines many decision trees to improve the accuracy and robustness of the predictions. The “randomForest” function fits the model by growing each tree on a bootstrap sample of the training data, considering a random subset of the predictors at each split, and (for regression) averaging the predictions of all trees to produce the final prediction.
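
To see this averaging directly, here is a small sketch (reusing the rf.boston model and df.test from above): asking predict() for the individual trees’ predictions shows that the forest’s prediction is simply their mean.

# sketch: the ensemble prediction is the average of the per-tree predictions
all.pred <- predict(rf.boston, newdata = df.test[1:3, ], predict.all = TRUE)
all.pred$aggregate             # the forest's predictions for the first 3 test rows
rowMeans(all.pred$individual)  # mean over the 500 trees; matches the aggregate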

The variable importance measures computed by the random forest model can be used to identify the most important predictors in the dataset, which can help in feature selection and understanding the underlying patterns in the data.

Making Predictions

prediction <- predict(rf.boston, newdata = df.test)
plot(prediction, df.test$medv)
abline(0,1)

This code snippet makes predictions on the “df.test” dataset using the previously fitted random forest model “rf.boston” and creates a scatter plot of the predicted values versus the actual values of the “medv” variable. Here’s a step-by-step explanation of the code:

  1. The first line makes predictions on the “df.test” dataset using the previously fitted random forest model “rf.boston”. Specifically, the “predict” function takes two arguments: the first argument specifies the model object to use for prediction (in this case, “rf.boston”), and the second argument specifies the new data to make predictions on (in this case, “df.test”).

  2. The second line creates a scatter plot of the predicted values versus the actual values of the “medv” variable. The “plot” function takes two arguments: the first argument specifies the x-axis values (in this case, the predicted values), and the second argument specifies the y-axis values (in this case, the actual values of “medv” in the “df.test” dataset).

  3. The third line adds a diagonal line to the plot with intercept 0 and slope 1 using the “abline” function. This line represents perfect agreement between the predicted and actual values, and is useful for visualizing how well the model is performing.

By comparing the predicted values to the actual values, we can get a sense of how well the random forest model is able to predict the “medv” variable on unseen data. If the points in the scatter plot fall close to the diagonal line, it suggests that the model is accurately predicting the target variable. If the points are widely scattered, it suggests that the model may not be accurately predicting the target variable.

The diagonal added with abline(0, 1) is the y = x line: points that fall exactly on it are observations whose predicted value equals the actual value.
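
For a more readable version of the same plot, here is a small sketch that only adds axis labels and a coloured reference line (same objects as above):

plot(prediction, df.test$medv,
     xlab = "Predicted medv", ylab = "Actual medv",
     main = "Random forest: predicted vs. actual")
abline(0, 1, col = "red")  # the y = x line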

library(Metrics)
rmse(actual = df.test$medv, predicted = prediction)
## [1] 4.320301
mae(actual = df.test$medv, predicted = prediction)
## [1] 2.538842
mape(actual = df.test$medv, predicted = prediction)
## [1] 0.1158074

We can also calculate these metrics manually:

# RMSE: square root of the mean squared error
er = df.test$medv - prediction
sq = er^2
sqrt(mean(sq))
## [1] 4.320301
# MAE: mean absolute error
mean(abs(df.test$medv - prediction))
## [1] 2.538842
# MAPE: mean absolute percentage error
ape <- abs((df.test$medv - prediction) / df.test$medv)
mape <- mean(ape)
mape
## [1] 0.1158074

Comparison with rpart (with default parameters)

library(rpart)
library(rpart.plot)
rp.boston = rpart(medv~., data = df.train)
prediction.rp = predict(rp.boston, newdata=df.test)
rpart.plot(rp.boston)

rmse(actual = df.test$medv, predicted = prediction.rp)
## [1] 5.940276
mae(actual = df.test$medv, predicted = prediction.rp)
## [1] 3.692607
mape(actual = df.test$medv, predicted = prediction.rp)
## [1] 0.1786559
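
To put the two models side by side, here is a short sketch (reusing the prediction and prediction.rp vectors computed above) that collects the three test-set metrics in one table; the random forest beats the single tree on all of them:

data.frame(
  model = c("randomForest", "rpart"),
  rmse  = c(rmse(df.test$medv, prediction), rmse(df.test$medv, prediction.rp)),
  mae   = c(mae(df.test$medv, prediction),  mae(df.test$medv, prediction.rp)),
  mape  = c(mape(df.test$medv, prediction), mape(df.test$medv, prediction.rp))
)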

Parameters of the Random Forest and Fine-tuning

set.seed(580)
rf.boston2 <- randomForest(medv~., data = df.train, mtry = 8, ntree = 200, importance = TRUE)
prediction2 <- predict(rf.boston2, newdata = df.test)
plot(prediction2, df.test$medv)
abline(0,1)

rmse(actual = df.test$medv, predicted = prediction2)
## [1] 4.533262
mae(actual = df.test$medv, predicted = prediction2)
## [1] 2.679588
mape(actual = df.test$medv, predicted = prediction2)
## [1] 0.1245878
importance(rf.boston2)
##           %IncMSE IncNodePurity
## crim    10.034111    1065.02609
## zn       3.373794      53.00401
## indus    2.177063     298.52180
## chas     1.309599      20.06688
## nox      9.743032     578.24666
## rm      23.980715    8874.45705
## age     10.942343     549.76843
## dis      4.400117     493.26584
## rad      2.015368      62.67513
## tax      7.505069     195.56709
## ptratio  5.809140     702.12363
## black    4.779008     239.25420
## lstat   18.759408    5990.43981
round(importance(rf.boston2),2)
##         %IncMSE IncNodePurity
## crim      10.03       1065.03
## zn         3.37         53.00
## indus      2.18        298.52
## chas       1.31         20.07
## nox        9.74        578.25
## rm        23.98       8874.46
## age       10.94        549.77
## dis        4.40        493.27
## rad        2.02         62.68
## tax        7.51        195.57
## ptratio    5.81        702.12
## black      4.78        239.25
## lstat     18.76       5990.44
varImpPlot(rf.boston2)

IncNodePurity: how much node purity increases (for regression, how much the residual sum of squares decreases) when this attribute is used to split a node, accumulated over all trees in the forest.

This code snippet displays the variable importance measures for the “rf.boston2” random forest regression model using the “importance” function in R. Here’s a step-by-step explanation of the code and the output:

  1. The first line uses the “importance” function to compute the variable importance measures for the “rf.boston2” model; its main argument is the fitted model object.

  2. The second line rounds the variable importance measures to two decimal places using the “round” function.

  3. The output of the code is a table that shows the variable importance measures for each predictor variable in the random forest model. There are two columns in the table: “%IncMSE” and “IncNodePurity”.

    • “%IncMSE” stands for “percent increase in mean squared error” and measures how much the model’s out-of-bag MSE increases when the values of that predictor are randomly permuted (shuffled). Higher values of “%IncMSE” mean that breaking the variable’s relationship with the response hurts the model more, so the variable is more important.

    • “IncNodePurity” measures the improvement in node purity (for a regression forest, the reduction in residual sum of squares) obtained from splits on each predictor, accumulated over all trees. Higher values of “IncNodePurity” indicate that splitting on the variable yields purer nodes, and hence the variable is more important in the model.

    The table shows that the most important predictor in the model is “rm”, with a %IncMSE of 23.98 and an IncNodePurity of 8874.46: permuting “rm” causes the largest increase in MSE, and splits on “rm” yield the largest improvement in node purity. Other important variables include “lstat”, “age”, “nox”, and “crim”. A short sketch after this list sorts the table by %IncMSE to make the ranking explicit.
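
Here is that sketch: sorting the importance table by %IncMSE ranks the predictors explicitly (same rf.boston2 object as above).

# sketch: rank predictors by permutation importance (%IncMSE)
imp <- importance(rf.boston2)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]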

Random Forest with Caret

library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
## 
##     precision, recall
set.seed(580)
# mtry is the number of variables randomly sampled as candidates at each split
rf.caret.boston = train(medv~., data = df.train, method = "rf", metric = 'RMSE',
                        trControl = trainControl(method = "cv", number = 10),
                        nodesize = 10, ntree = 20,
                        tuneGrid = expand.grid(.mtry = (3:7)))

rf.caret.boston
## Random Forest 
## 
## 253 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   3     3.422793  0.8631455  2.431940
##   4     3.420192  0.8438907  2.509287
##   5     3.401368  0.8481789  2.390278
##   6     3.176863  0.8666879  2.293596
##   7     3.252512  0.8543216  2.319867
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
  • nodesize: the minimum number of observations in a terminal node.
  • ntree: the number of trees to grow in the forest. The default is 500.
  • mtry: the number of variables randomly sampled as candidates at each split.
plot(rf.caret.boston)
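
Beyond the plot, the train object keeps the winning hyper-parameter and the refit final model; a brief sketch of the components worth inspecting (same rf.caret.boston object as above):

rf.caret.boston$bestTune     # the mtry value with the smallest cross-validated RMSE
rf.caret.boston$results      # the resampling table printed above
rf.caret.boston$finalModel   # the underlying randomForest fit, refit on all of df.train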

Fine-tuning the ntree parameter

model_list <- list()
train.control = trainControl(method = "cv",number = 10)
tune.grid = expand.grid(.mtry = (3:7))

# loop over several values of ntree
for (n in c(100,200,500,700)){
  set.seed(580)
  fit <- train(medv~., data = df.train, method = "rf", metric = 'RMSE',
               trControl = train.control, nodesize = 10, ntree = n,
               tuneGrid = tune.grid)
  key <- paste("n",toString(n),sep = "")
  model_list[[key]] <- fit
}
results <- resamples(model_list)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: n100, n200, n500, n700 
## Number of resamples: 10 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## n100 1.778837 2.011302 2.250448 2.268086 2.411398 3.077195    0
## n200 1.767857 2.044762 2.195395 2.245818 2.400452 3.180050    0
## n500 1.699889 1.902565 2.255466 2.254859 2.434125 3.174340    0
## n700 1.717491 1.998307 2.247530 2.255484 2.391076 3.181974    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## n100 2.322907 2.843467 2.915235 3.119676 3.284388 4.385675    0
## n200 2.254598 2.800779 2.919671 3.119946 3.319442 4.275150    0
## n500 2.218280 2.736850 2.858810 3.111931 3.350704 4.528454    0
## n700 2.337179 2.782756 2.917308 3.117822 3.295820 4.273877    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## n100 0.7861330 0.8111349 0.8821698 0.8672525 0.9154392 0.9369114    0
## n200 0.7711460 0.8138233 0.8827169 0.8663666 0.9160556 0.9366047    0
## n500 0.7742968 0.8146007 0.8870368 0.8704782 0.9253158 0.9456700    0
## n700 0.7661994 0.8117141 0.8835658 0.8656041 0.9199288 0.9357088    0
model_list$n100
## Random Forest 
## 
## 253 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   3     3.212181  0.8737953  2.282146
##   4     3.178164  0.8699424  2.315820
##   5     3.183194  0.8632331  2.291329
##   6     3.119676  0.8672525  2.268086
##   7     3.142122  0.8649630  2.253553
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(model_list$n100)

model_list$n200
## Random Forest 
## 
## 253 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   3     3.216919  0.8725994  2.299372
##   4     3.157080  0.8704253  2.280789
##   5     3.137741  0.8682082  2.270369
##   6     3.133181  0.8647284  2.278779
##   7     3.119946  0.8663666  2.245818
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.
plot(model_list$n200)

model_list$n500
## Random Forest 
## 
## 253 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   3     3.263612  0.8672341  2.328121
##   4     3.159279  0.8694840  2.279159
##   5     3.111931  0.8704782  2.254859
##   6     3.127452  0.8659253  2.268094
##   7     3.126252  0.8648255  2.257493
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
plot(model_list$n500)

model_list$n700
## Random Forest 
## 
## 253 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   3     3.275341  0.8657472  2.331924
##   4     3.164930  0.8686753  2.283142
##   5     3.126110  0.8687028  2.263612
##   6     3.131358  0.8659074  2.268510
##   7     3.117822  0.8656041  2.255484
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.
plot(model_list$n700)
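
As a closing sketch (using the model_list built in the loop above), each ntree setting’s best cross-validated RMSE can be pulled into a single named vector for an at-a-glance comparison:

# sketch: best CV RMSE for each ntree setting
sapply(model_list, function(fit) min(fit$results$RMSE))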