library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(MASS)
df <- Boston
head(df)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(df)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
split.index <- sample(1:nrow(df), nrow(df)/2)
df.train <- df[split.index,]
df.test <- df[-split.index,]
This code snippet creates a 50/50 train-test split of the Boston dataset. Here’s a step-by-step explanation of the code:
Earlier, the Boston dataset was assigned to a data frame named “df”. Boston ships with the MASS package and contains information about housing prices and other variables in Boston suburbs.
“set.seed(1)” sets the seed of the random number generator to 1, so that the same train-test split is reproduced each time the code is run.
The call to “sample” draws a random sample of row indices. It takes two arguments: the first specifies the population of indices (in this case “1:nrow(df)”, the row indices of the “df” data frame), and the second specifies the sample size (in this case “nrow(df)/2”, half the number of rows). The result is a vector of randomly selected row indices.
“df[split.index, ]” subsets the “df” data frame using the randomly selected row indices to create the training dataset: it selects all rows of “df” whose indices are in the “split.index” vector.
“df[-split.index, ]” creates the test dataset from the remaining rows: the negative index selects all rows of “df” whose indices are not in the “split.index” vector.
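As a quick sanity check (a sketch using the objects just created), we can confirm that the two halves are disjoint and together cover every row:

```r
# The training and test sets should partition the original data:
nrow(df.train) + nrow(df.test) == nrow(df)
# Subsetting preserves row names, so the two sets should share none:
length(intersect(rownames(df.train), rownames(df.test))) == 0
```

Both expressions should evaluate to TRUE.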
rf.boston = randomForest(medv~., data = df.train, importance=TRUE)
rf.boston
##
## Call:
## randomForest(formula = medv ~ ., data = df.train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 9.94747
## % Var explained: 87.06
This code snippet fits a random forest regression model to the “df.train” dataset using the “randomForest” function in R. Here’s a step-by-step explanation of the code:
The first line fits a random forest regression model to the “df.train” dataset using the “randomForest” function. The formula “medv~.” specifies that we want to predict the “medv” variable using all the other variables in the dataset. The argument “data = df.train” specifies that we want to use the “df.train” data frame for training the model. The argument “importance = TRUE” specifies that we want to compute variable importance measures for the model.
The second line prints out the details of the fitted random forest model object “rf.boston”. This includes the number of trees in the forest, the number of variables tried at each split, the mean of squared residuals, and the percentage of variance explained.
Random forests are an ensemble learning method that combines multiple decision trees to improve the accuracy and robustness of the predictions. The “randomForest” function in R fits a random forest model to the data by building multiple decision trees on different subsets of the data and using the average of the predictions from all trees as the final prediction.
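To make the averaging concrete, here is a small sketch (using the “rf.boston” model and “df.test” from above): the “predict” method in randomForest can return every individual tree’s prediction via “predict.all = TRUE”, and for regression the aggregate prediction is simply the row-wise mean of the per-tree predictions.

```r
pred.all <- predict(rf.boston, newdata = df.test, predict.all = TRUE)
dim(pred.all$individual)  # one row per test case, one column per tree
# The forest's prediction is the average over all trees:
all.equal(rowMeans(pred.all$individual), pred.all$aggregate)
```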
The variable importance measures computed by the random forest model can be used to identify the most important predictors in the dataset, which can help in feature selection and understanding the underlying patterns in the data.
prediction <- predict(rf.boston, newdata = df.test)
plot(prediction, df.test$medv)
abline(0,1)
This code snippet makes predictions on the “df.test” dataset using the previously fitted random forest model “rf.boston” and creates a scatter plot of the predicted values versus the actual values of the “medv” variable. Here’s a step-by-step explanation of the code:
The first line makes predictions on the “df.test” dataset using the previously fitted random forest model “rf.boston”. Specifically, the “predict” function takes two arguments: the first argument specifies the model object to use for prediction (in this case, “rf.boston”), and the second argument specifies the new data to make predictions on (in this case, “df.test”).
The second line creates a scatter plot of the predicted values versus the actual values of the “medv” variable. The “plot” function takes two arguments: the first argument specifies the x-axis values (in this case, the predicted values), and the second argument specifies the y-axis values (in this case, the actual values of “medv” in the “df.test” dataset).
The third line adds a diagonal line to the plot with intercept 0 and slope 1 using the “abline” function. This line represents perfect agreement between the predicted and actual values, and is useful for visualizing how well the model is performing.
By comparing the predicted values to the actual values, we can get a sense of how well the random forest model is able to predict the “medv” variable on unseen data. If the points in the scatter plot fall close to the diagonal line, it suggests that the model is accurately predicting the target variable. If the points are widely scattered, it suggests that the model may not be accurately predicting the target variable.
The diagonal added by “abline(0,1)” is the y = x line.
library(Metrics)
rmse(actual = df.test$medv, predicted = prediction)
## [1] 4.320301
mae(actual = df.test$medv, predicted = prediction)
## [1] 2.538842
mape(actual = df.test$medv, predicted = prediction)
## [1] 0.1158074
We can also calculate these metrics by hand:
# RMSE: square root of the mean squared residuals
er = df.test$medv - prediction
sq = er^2
sqrt(mean(sq))
## [1] 4.320301
# MAE: mean absolute error
mean(abs(df.test$medv-prediction))
## [1] 2.538842
# MAPE: mean absolute percentage error
ape <- abs((df.test$medv-prediction) / df.test$medv)
mape <- mean(ape)
mape
## [1] 0.1158074
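From the same quantities we can also sketch a test-set R-squared, i.e. the share of variance in “medv” explained on the held-out data:

```r
sst <- sum((df.test$medv - mean(df.test$medv))^2)  # total sum of squares
sse <- sum((df.test$medv - prediction)^2)          # residual sum of squares
1 - sse / sst                                      # test-set R-squared
```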
library(rpart)
library(rpart.plot)
rp.boston = rpart(medv~., data = df.train)
prediction.rp = predict(rp.boston, newdata=df.test)
rpart.plot(rp.boston)
rmse(actual = df.test$medv, predicted = prediction.rp)
## [1] 5.940276
mae(actual = df.test$medv, predicted = prediction.rp)
## [1] 3.692607
mape(actual = df.test$medv, predicted = prediction.rp)
## [1] 0.1786559
set.seed(580)
rf.boston2 <- randomForest(medv~., data = df.train, mtry=8, ntree=200,importance=TRUE)
prediction2 <- predict(rf.boston2, newdata = df.test)
plot(prediction2, df.test$medv)
abline(0,1)
rmse(actual = df.test$medv, predicted = prediction2)
## [1] 4.533262
mae(actual = df.test$medv, predicted = prediction2)
## [1] 2.679588
mape(actual = df.test$medv, predicted = prediction2)
## [1] 0.1245878
importance(rf.boston2)
## %IncMSE IncNodePurity
## crim 10.034111 1065.02609
## zn 3.373794 53.00401
## indus 2.177063 298.52180
## chas 1.309599 20.06688
## nox 9.743032 578.24666
## rm 23.980715 8874.45705
## age 10.942343 549.76843
## dis 4.400117 493.26584
## rad 2.015368 62.67513
## tax 7.505069 195.56709
## ptratio 5.809140 702.12363
## black 4.779008 239.25420
## lstat 18.759408 5990.43981
round(importance(rf.boston2),2)
## %IncMSE IncNodePurity
## crim 10.03 1065.03
## zn 3.37 53.00
## indus 2.18 298.52
## chas 1.31 20.07
## nox 9.74 578.25
## rm 23.98 8874.46
## age 10.94 549.77
## dis 4.40 493.27
## rad 2.02 62.68
## tax 7.51 195.57
## ptratio 5.81 702.12
## black 4.78 239.25
## lstat 18.76 5990.44
varImpPlot(rf.boston2)
IncNodePurity: the total decrease in node impurity from splitting on this variable, summed over all trees (for regression, impurity is measured by the residual sum of squares).
This code snippet displays the variable importance measures for the “rf.boston2” random forest regression model using the “importance” function in R. Here’s a step-by-step explanation of the code and the output:
The first line uses the “importance” function to compute the variable importance measures for the “rf.boston2” model; it takes the fitted model object as its argument.
The second line rounds the variable importance measures to two decimal places using the “round” function.
The output of the code is a table that shows the variable importance measures for each predictor variable in the random forest model. There are two columns in the table: “%IncMSE” and “IncNodePurity”.
“%IncMSE” stands for “percent increase in mean squared error” and measures the increase in MSE that results from permuting (randomly shuffling) the values of each predictor variable. Higher values of “%IncMSE” indicate that permuting the variable results in a larger increase in MSE, and hence the variable is more important in the model.
“IncNodePurity” measures the improvement in node purity (for a regression forest, the reduction in residual sum of squares) that results from splitting on each predictor variable. Higher values of “IncNodePurity” indicate that splitting on the variable yields a larger improvement in node purity, and hence the variable is more important in the model.
The table shows that the most important predictor variable in the model is “rm”, which has a %IncMSE of 23.98 and an IncNodePurity of 8874.46. This indicates that permuting the “rm” variable and splitting on the “rm” variable result in the largest increase in MSE and improvement in node purity, respectively. Other important variables include “lstat”, “age”, “nox”, and “crim”.
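To read the table at a glance, we can sort the predictors by permutation importance (a sketch using “rf.boston2” from above):

```r
imp <- importance(rf.boston2)
# Rank predictors from most to least important by %IncMSE
round(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 2)
```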
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
##
## precision, recall
set.seed(580)
# mtry: the number of variables randomly sampled as candidates at each split
rf.caret.boston = train(medv~., data = df.train, method = "rf", metric = 'RMSE',
                        trControl = trainControl(method = "cv", number = 10),
                        nodesize = 10, ntree = 20,
                        tuneGrid = expand.grid(.mtry = (3:7)))
rf.caret.boston
## Random Forest
##
## 253 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 3 3.422793 0.8631455 2.431940
## 4 3.420192 0.8438907 2.509287
## 5 3.401368 0.8481789 2.390278
## 6 3.176863 0.8666879 2.293596
## 7 3.252512 0.8543216 2.319867
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(rf.caret.boston)
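The train object returned by caret stores both the winning tuning value and the model refit on the full training set with that value (a sketch assuming the “rf.caret.boston” object above):

```r
rf.caret.boston$bestTune     # the mtry value selected by cross-validation
rf.caret.boston$finalModel   # the randomForest fit using that mtry
```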
Next, we tune the ntree parameter.
model_list <- list()
train.control = trainControl(method = "cv",number = 10)
tune.grid = expand.grid(.mtry = (3:7))
# loop over different values of ntree
for (n in c(100, 200, 500, 700)){
  set.seed(580)
  fit <- train(medv~., data = df.train, method = "rf", metric = 'RMSE',
               trControl = train.control, nodesize = 10, ntree = n,
               tuneGrid = tune.grid)
  key <- paste("n", toString(n), sep = "")
  model_list[[key]] <- fit
}
results <- resamples(model_list)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: n100, n200, n500, n700
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## n100 1.778837 2.011302 2.250448 2.268086 2.411398 3.077195 0
## n200 1.767857 2.044762 2.195395 2.245818 2.400452 3.180050 0
## n500 1.699889 1.902565 2.255466 2.254859 2.434125 3.174340 0
## n700 1.717491 1.998307 2.247530 2.255484 2.391076 3.181974 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## n100 2.322907 2.843467 2.915235 3.119676 3.284388 4.385675 0
## n200 2.254598 2.800779 2.919671 3.119946 3.319442 4.275150 0
## n500 2.218280 2.736850 2.858810 3.111931 3.350704 4.528454 0
## n700 2.337179 2.782756 2.917308 3.117822 3.295820 4.273877 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## n100 0.7861330 0.8111349 0.8821698 0.8672525 0.9154392 0.9369114 0
## n200 0.7711460 0.8138233 0.8827169 0.8663666 0.9160556 0.9366047 0
## n500 0.7742968 0.8146007 0.8870368 0.8704782 0.9253158 0.9456700 0
## n700 0.7661994 0.8117141 0.8835658 0.8656041 0.9199288 0.9357088 0
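Beyond the numeric summary, the resamples object supports lattice plots that make the model comparison visual (a sketch using the “results” object above):

```r
bwplot(results, metric = "RMSE")   # box-and-whisker of RMSE across the 10 folds
dotplot(results, metric = "RMSE")  # dot plot with confidence intervals
```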
model_list$n100
## Random Forest
##
## 253 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 3 3.212181 0.8737953 2.282146
## 4 3.178164 0.8699424 2.315820
## 5 3.183194 0.8632331 2.291329
## 6 3.119676 0.8672525 2.268086
## 7 3.142122 0.8649630 2.253553
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(model_list$n100)
model_list$n200
## Random Forest
##
## 253 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 3 3.216919 0.8725994 2.299372
## 4 3.157080 0.8704253 2.280789
## 5 3.137741 0.8682082 2.270369
## 6 3.133181 0.8647284 2.278779
## 7 3.119946 0.8663666 2.245818
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.
plot(model_list$n200)
model_list$n500
## Random Forest
##
## 253 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 3 3.263612 0.8672341 2.328121
## 4 3.159279 0.8694840 2.279159
## 5 3.111931 0.8704782 2.254859
## 6 3.127452 0.8659253 2.268094
## 7 3.126252 0.8648255 2.257493
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
plot(model_list$n500)
model_list$n700
## Random Forest
##
## 253 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 228, 227, 227, 228, 226, 229, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 3 3.275341 0.8657472 2.331924
## 4 3.164930 0.8686753 2.283142
## 5 3.126110 0.8687028 2.263612
## 6 3.131358 0.8659074 2.268510
## 7 3.117822 0.8656041 2.255484
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.
plot(model_list$n700)