library(rpart)
library(rpart.plot)
library(caTools)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
rpart is a package for recursive partitioning and regression tree analysis. rpart.plot is a package for plotting decision trees generated by the rpart package. caTools is a package of utility functions, including sample.split() for splitting data into training and test sets. caret is a package that provides a consistent interface for data splitting, pre-processing, feature selection, model tuning, and evaluation.
bank = read.csv("UniversalBank.csv")
str(bank)
## 'data.frame': 5000 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Securities.Account: int 1 1 0 0 0 0 0 0 0 0 ...
## $ CD.Account : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Online : int 0 0 0 0 0 1 1 0 1 0 ...
## $ CreditCard : int 0 0 0 0 1 0 0 1 0 0 ...
head(bank)
## ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1 1 25 1 49 91107 4 1.6 1 0
## 2 2 45 19 34 90089 3 1.5 1 0
## 3 3 39 15 11 94720 1 1.0 1 0
## 4 4 35 9 100 94112 1 2.7 2 0
## 5 5 35 8 45 91330 4 1.0 2 0
## 6 6 37 13 29 92121 4 0.4 2 155
## Personal.Loan Securities.Account CD.Account Online CreditCard
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 1
## 6 0 0 0 1 0
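Since caTools is loaded above but not otherwise used in this walkthrough, here is a minimal sketch of how its sample.split() function could create a stratified train/test split of this data. The 70/30 ratio and the seed are arbitrary choices, not part of the original analysis, which trains on the full data set.
set.seed(123)                                               # arbitrary seed for reproducibility
split = sample.split(bank$Personal.Loan, SplitRatio = 0.7)  # stratified on the outcome
train.df = subset(bank, split == TRUE)                      # 70% training rows
test.df = subset(bank, split == FALSE)                      # 30% holdout rows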
Next we convert the output attribute and the other categorical attributes from numeric values to factors, which is important for classification problems.
bank$Personal.Loan=as.factor(bank$Personal.Loan)
bank$Securities.Account=as.factor(bank$Securities.Account)
bank$CD.Account=as.factor(bank$CD.Account)
bank$Online=as.factor(bank$Online)
bank$CreditCard=as.factor(bank$CreditCard)
str(bank)
## 'data.frame': 5000 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ Securities.Account: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ CD.Account : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Online : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
## $ CreditCard : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
This code converts several of the variables in the bank data frame to factors using the as.factor() function. Converting variables to factors is useful for data mining and machine learning tasks such as classification, where the algorithms require categorical variables to be represented as factors rather than numeric values. The variables Personal.Loan, Securities.Account, CD.Account, Online, and CreditCard are all converted to factors.
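A more compact equivalent (a sketch, not part of the original code) converts the same columns in a single pass:
cat.cols = c("Personal.Loan", "Securities.Account", "CD.Account", "Online", "CreditCard")
bank[cat.cols] = lapply(bank[cat.cols], as.factor)   # convert each listed column to a factor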
ctrl1 = trainControl(method = "repeatedcv",number = 10,repeats = 5)
trainControl is a function from the caret package that specifies how machine learning models are trained and cross-validated. In this line of code, method = "repeatedcv" requests repeated cross-validation to evaluate model performance. number = 10 sets the number of folds, so the data are split into 10 equal parts and the model is trained and tested on each part in turn. repeats = 5 repeats the whole cross-validation process 5 times, which helps make the results robust to chance variation in the data splits. The resulting object ctrl1 contains the configuration settings for the cross-validation procedure used in the subsequent steps of the data mining process.
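For comparison, a lighter-weight configuration (a sketch with an arbitrary seed, not from the original) would run a single pass of 10-fold cross-validation; repeated cross-validation simply averages over several such passes:
set.seed(42)                                              # arbitrary seed so the folds are reproducible
ctrl.simple = trainControl(method = "cv", number = 10)    # one round of 10-fold CV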
fit1=train(Personal.Loan~., data=bank, method = "rpart", trControl = ctrl1, tuneGrid = expand.grid(cp=(1:100)*0.001))
fit1
## CART
##
## 5000 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.98420 0.9073671
## 0.002 0.98468 0.9099792
## 0.003 0.98508 0.9121076
## 0.004 0.98532 0.9137404
## 0.005 0.98592 0.9176308
## 0.006 0.98628 0.9199238
## 0.007 0.98596 0.9182239
## 0.008 0.98596 0.9182239
## 0.009 0.98568 0.9167434
## 0.010 0.98520 0.9143061
## 0.011 0.98496 0.9126174
## 0.012 0.98396 0.9056459
## 0.013 0.98252 0.8962825
## 0.014 0.98072 0.8836553
## 0.015 0.98072 0.8836553
## 0.016 0.97908 0.8713028
## 0.017 0.97892 0.8697958
## 0.018 0.97904 0.8700263
## 0.019 0.97904 0.8700263
## 0.020 0.97920 0.8708355
## 0.021 0.97920 0.8708355
## 0.022 0.97920 0.8708355
## 0.023 0.97920 0.8708355
## 0.024 0.97920 0.8708355
## 0.025 0.97920 0.8708355
## 0.026 0.97920 0.8708355
## 0.027 0.97920 0.8708355
## 0.028 0.97920 0.8708355
## 0.029 0.97920 0.8708355
## 0.030 0.97920 0.8708355
## 0.031 0.97920 0.8708355
## 0.032 0.97920 0.8708355
## 0.033 0.97920 0.8708355
## 0.034 0.97920 0.8708355
## 0.035 0.97920 0.8708355
## 0.036 0.97920 0.8708355
## 0.037 0.97920 0.8708355
## 0.038 0.97920 0.8708355
## 0.039 0.97920 0.8708355
## 0.040 0.97920 0.8708355
## 0.041 0.97920 0.8708355
## 0.042 0.97920 0.8708355
## 0.043 0.97920 0.8708355
## 0.044 0.97920 0.8708355
## 0.045 0.97920 0.8708355
## 0.046 0.97920 0.8708355
## 0.047 0.97920 0.8708355
## 0.048 0.97920 0.8708355
## 0.049 0.97920 0.8708355
## 0.050 0.97920 0.8708355
## 0.051 0.97920 0.8708355
## 0.052 0.97920 0.8708355
## 0.053 0.97920 0.8708355
## 0.054 0.97920 0.8708355
## 0.055 0.97920 0.8708355
## 0.056 0.97920 0.8708355
## 0.057 0.97920 0.8708355
## 0.058 0.97920 0.8708355
## 0.059 0.97920 0.8708355
## 0.060 0.97920 0.8708355
## 0.061 0.97920 0.8708355
## 0.062 0.97920 0.8708355
## 0.063 0.97920 0.8708355
## 0.064 0.97920 0.8708355
## 0.065 0.97920 0.8708355
## 0.066 0.97920 0.8708355
## 0.067 0.97920 0.8708355
## 0.068 0.97920 0.8708355
## 0.069 0.97920 0.8708355
## 0.070 0.97920 0.8708355
## 0.071 0.97920 0.8708355
## 0.072 0.97920 0.8708355
## 0.073 0.97920 0.8708355
## 0.074 0.97920 0.8708355
## 0.075 0.97920 0.8708355
## 0.076 0.97920 0.8708355
## 0.077 0.97920 0.8708355
## 0.078 0.97920 0.8708355
## 0.079 0.97920 0.8708355
## 0.080 0.97920 0.8708355
## 0.081 0.97920 0.8708355
## 0.082 0.97920 0.8708355
## 0.083 0.97920 0.8708355
## 0.084 0.97920 0.8708355
## 0.085 0.97920 0.8708355
## 0.086 0.97920 0.8708355
## 0.087 0.97920 0.8708355
## 0.088 0.97920 0.8708355
## 0.089 0.97920 0.8708355
## 0.090 0.97920 0.8708355
## 0.091 0.97920 0.8708355
## 0.092 0.97920 0.8708355
## 0.093 0.97920 0.8708355
## 0.094 0.97920 0.8708355
## 0.095 0.97920 0.8708355
## 0.096 0.97920 0.8708355
## 0.097 0.97920 0.8708355
## 0.098 0.97920 0.8708355
## 0.099 0.97920 0.8708355
## 0.100 0.97920 0.8708355
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.006.
This code uses the train() function from the caret package to train a classification model for predicting the Personal.Loan variable in the bank data set. The formula Personal.Loan~. specifies Personal.Loan as the target variable and all other variables in the data set as predictors. The argument data=bank specifies that the data set to be used is the bank data frame. The argument method = "rpart" specifies that the classification model is a decision tree implemented with the rpart package. The argument trControl = ctrl1 supplies the cross-validation configuration set up earlier with trainControl(); it is used to tune the model's hyperparameter and estimate its performance. The argument tuneGrid = expand.grid(cp=(1:100)*0.001) specifies the grid of values to try for the cp (complexity) parameter of the decision tree. cp controls tree size: a split is kept only if it improves the overall fit by at least a factor of cp, so larger values produce smaller trees. The resulting object fit1 is a trained model that can be used to make predictions on new data and to evaluate performance with metrics such as accuracy, sensitivity, specificity, and AUC (area under the ROC curve).
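A few follow-up calls (a sketch, not part of the original output) can be used to inspect the tuned model: plot the accuracy profile over cp, check the winning cp value, and draw the final tree with rpart.plot.
plot(fit1)                    # cross-validated accuracy versus cp
fit1$bestTune                 # the cp value selected by the largest accuracy (0.006 above)
rpart.plot(fit1$finalModel)   # the tree refit on the full data with the best cp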
fit1.pred = predict(fit1)
This line of code uses the predict() function to make predictions on the training data with the fit1 model trained above. The resulting object fit1.pred contains the predicted classes of the Personal.Loan variable for each row of the training data. Because these predictions are made on the same data the model was fitted to, the accuracy computed below is a resubstitution estimate and slightly optimistic; the cross-validated results above are the better guide to out-of-sample performance.
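Two related uses of predict() (a sketch, not shown in the original) are class probabilities rather than class labels, and predictions for new rows via newdata:
fit1.prob = predict(fit1, type = "prob")   # per-row probabilities for classes "0" and "1"
head(fit1.prob)
predict(fit1, newdata = bank[1:5, ])       # class predictions for the first five rows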
confusionMatrix(fit1.pred, bank$Personal.Loan, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4495 37
## 1 25 443
##
## Accuracy : 0.9876
## 95% CI : (0.9841, 0.9905)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9278
##
## Mcnemar's Test P-Value : 0.1624
##
## Sensitivity : 0.9229
## Specificity : 0.9945
## Pos Pred Value : 0.9466
## Neg Pred Value : 0.9918
## Prevalence : 0.0960
## Detection Rate : 0.0886
## Detection Prevalence : 0.0936
## Balanced Accuracy : 0.9587
##
## 'Positive' Class : 1
##
This code uses the confusionMatrix() function from the caret package to compute the confusion matrix for the predictions made by the fit1 model. The first argument, fit1.pred, contains the predicted values of the target variable (Personal.Loan) generated by the predict() function. The second argument, bank$Personal.Loan, contains the actual values of the target variable in the bank data set. The argument positive = "1" specifies that the positive class for the confusion matrix is Personal.Loan equal to 1 (i.e., a customer who has taken a personal loan). The confusionMatrix() function reports several performance metrics alongside the table itself, including accuracy, kappa, sensitivity, specificity, positive and negative predictive value, prevalence, detection rate, and balanced accuracy. This output summarizes how well the fit1 model predicts whether a customer has taken a personal loan, based on the other variables in the data set.
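Storing the result makes it easy to pull out individual metrics programmatically instead of reading them off the printed table; a minimal sketch:
cm = confusionMatrix(fit1.pred, bank$Personal.Loan, positive = "1")
cm$overall["Accuracy"]       # overall accuracy
cm$byClass["Sensitivity"]    # recall for the positive class "1"
cm$byClass["Specificity"]    # true negative rate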