library(rpart)
library(rpart.plot)
library(caTools)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
rpart is a package for recursive partitioning and regression tree analysis. rpart.plot is a package for plotting decision trees generated by the rpart package. caTools is a package of utility functions, including sample.split() for splitting data into training and test sets. caret is a package that provides a consistent interface for data splitting, pre-processing, feature selection, model tuning, and evaluation.
bank = read.csv("UniversalBank.csv")
str(bank)
## 'data.frame': 5000 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Securities.Account: int 1 1 0 0 0 0 0 0 0 0 ...
## $ CD.Account : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Online : int 0 0 0 0 0 1 1 0 1 0 ...
## $ CreditCard : int 0 0 0 0 1 0 0 1 0 0 ...
head(bank)
## ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1 1 25 1 49 91107 4 1.6 1 0
## 2 2 45 19 34 90089 3 1.5 1 0
## 3 3 39 15 11 94720 1 1.0 1 0
## 4 4 35 9 100 94112 1 2.7 2 0
## 5 5 35 8 45 91330 4 1.0 2 0
## 6 6 37 13 29 92121 4 0.4 2 155
## Personal.Loan Securities.Account CD.Account Online CreditCard
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 1
## 6 0 0 0 1 0
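Since caTools is loaded above but not otherwise used in this walkthrough, here is a minimal sketch of how its sample.split() function could create a stratified train/test split of this data. The 70/30 ratio and the seed are arbitrary choices, not part of the original analysis, which trains on the full data set.
set.seed(123)                                               # arbitrary seed for reproducibility
split = sample.split(bank$Personal.Loan, SplitRatio = 0.7)  # stratified on the outcome
train.df = subset(bank, split == TRUE)                      # 70% training rows
test.df = subset(bank, split == FALSE)                      # 30% holdout rows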
Next we convert the output attribute and the other categorical attributes from numeric values to factors, which is important for classification problems.
bank$Personal.Loan=as.factor(bank$Personal.Loan)
bank$Securities.Account=as.factor(bank$Securities.Account)
bank$CD.Account=as.factor(bank$CD.Account)
bank$Online=as.factor(bank$Online)
bank$CreditCard=as.factor(bank$CreditCard)
str(bank)
## 'data.frame': 5000 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ Securities.Account: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ CD.Account : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Online : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
## $ CreditCard : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
This code converts several of the variables in the bank data frame to factors using the as.factor() function. Converting variables to factors is useful for data mining and machine learning tasks such as classification, where the algorithms require categorical variables to be represented as factors rather than numeric values. The variables Personal.Loan, Securities.Account, CD.Account, Online, and CreditCard are all converted to factors.
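A more compact equivalent (a sketch, not part of the original code) converts the same columns in a single pass:
cat.cols = c("Personal.Loan", "Securities.Account", "CD.Account", "Online", "CreditCard")
bank[cat.cols] = lapply(bank[cat.cols], as.factor)   # convert each listed column to a factor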
ctrl1 = trainControl(method = "repeatedcv",number = 10,repeats = 5)
trainControl is a function from the caret package that specifies how machine learning models are trained and cross-validated. In this line of code, method = "repeatedcv" requests repeated cross-validation to evaluate model performance. number = 10 sets the number of folds, so the data are split into 10 equal parts and the model is trained and tested on each part in turn. repeats = 5 repeats the whole cross-validation process 5 times, which helps make the results robust to chance variation in the data splits. The resulting object ctrl1 contains the configuration settings for the cross-validation procedure used in the subsequent steps of the data mining process.
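For comparison, a lighter-weight configuration (a sketch with an arbitrary seed, not from the original) would run a single pass of 10-fold cross-validation; repeated cross-validation simply averages over several such passes:
set.seed(42)                                              # arbitrary seed so the folds are reproducible
ctrl.simple = trainControl(method = "cv", number = 10)    # one round of 10-fold CV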
fit1=train(Personal.Loan~., data=bank, method = "rpart", trControl = ctrl1, tuneGrid = expand.grid(cp=(1:100)*0.001))
fit1
## CART
##
## 5000 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.98420 0.9073671
## 0.002 0.98468 0.9099792
## 0.003 0.98508 0.9121076
## 0.004 0.98532 0.9137404
## 0.005 0.98592 0.9176308
## 0.006 0.98628 0.9199238
## 0.007 0.98596 0.9182239
## 0.008 0.98596 0.9182239
## 0.009 0.98568 0.9167434
## 0.010 0.98520 0.9143061
## 0.011 0.98496 0.9126174
## 0.012 0.98396 0.9056459
## 0.013 0.98252 0.8962825
## 0.014 0.98072 0.8836553
## 0.015 0.98072 0.8836553
## 0.016 0.97908 0.8713028
## 0.017 0.97892 0.8697958
## 0.018 0.97904 0.8700263
## 0.019 0.97904 0.8700263
## 0.020 0.97920 0.8708355
## 0.021 0.97920 0.8708355
## 0.022 0.97920 0.8708355
## 0.023 0.97920 0.8708355
## 0.024 0.97920 0.8708355
## 0.025 0.97920 0.8708355
## 0.026 0.97920 0.8708355
## 0.027 0.97920 0.8708355
## 0.028 0.97920 0.8708355
## 0.029 0.97920 0.8708355
## 0.030 0.97920 0.8708355
## 0.031 0.97920 0.8708355
## 0.032 0.97920 0.8708355
## 0.033 0.97920 0.8708355
## 0.034 0.97920 0.8708355
## 0.035 0.97920 0.8708355
## 0.036 0.97920 0.8708355
## 0.037 0.97920 0.8708355
## 0.038 0.97920 0.8708355
## 0.039 0.97920 0.8708355
## 0.040 0.97920 0.8708355
## 0.041 0.97920 0.8708355
## 0.042 0.97920 0.8708355
## 0.043 0.97920 0.8708355
## 0.044 0.97920 0.8708355
## 0.045 0.97920 0.8708355
## 0.046 0.97920 0.8708355
## 0.047 0.97920 0.8708355
## 0.048 0.97920 0.8708355
## 0.049 0.97920 0.8708355
## 0.050 0.97920 0.8708355
## 0.051 0.97920 0.8708355
## 0.052 0.97920 0.8708355
## 0.053 0.97920 0.8708355
## 0.054 0.97920 0.8708355
## 0.055 0.97920 0.8708355
## 0.056 0.97920 0.8708355
## 0.057 0.97920 0.8708355
## 0.058 0.97920 0.8708355
## 0.059 0.97920 0.8708355
## 0.060 0.97920 0.8708355
## 0.061 0.97920 0.8708355
## 0.062 0.97920 0.8708355
## 0.063 0.97920 0.8708355
## 0.064 0.97920 0.8708355
## 0.065 0.97920 0.8708355
## 0.066 0.97920 0.8708355
## 0.067 0.97920 0.8708355
## 0.068 0.97920 0.8708355
## 0.069 0.97920 0.8708355
## 0.070 0.97920 0.8708355
## 0.071 0.97920 0.8708355
## 0.072 0.97920 0.8708355
## 0.073 0.97920 0.8708355
## 0.074 0.97920 0.8708355
## 0.075 0.97920 0.8708355
## 0.076 0.97920 0.8708355
## 0.077 0.97920 0.8708355
## 0.078 0.97920 0.8708355
## 0.079 0.97920 0.8708355
## 0.080 0.97920 0.8708355
## 0.081 0.97920 0.8708355
## 0.082 0.97920 0.8708355
## 0.083 0.97920 0.8708355
## 0.084 0.97920 0.8708355
## 0.085 0.97920 0.8708355
## 0.086 0.97920 0.8708355
## 0.087 0.97920 0.8708355
## 0.088 0.97920 0.8708355
## 0.089 0.97920 0.8708355
## 0.090 0.97920 0.8708355
## 0.091 0.97920 0.8708355
## 0.092 0.97920 0.8708355
## 0.093 0.97920 0.8708355
## 0.094 0.97920 0.8708355
## 0.095 0.97920 0.8708355
## 0.096 0.97920 0.8708355
## 0.097 0.97920 0.8708355
## 0.098 0.97920 0.8708355
## 0.099 0.97920 0.8708355
## 0.100 0.97920 0.8708355
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.006.
This code uses the train() function from the caret package to train a classification model for predicting the Personal.Loan variable in the bank data set. The formula Personal.Loan~. specifies Personal.Loan as the target variable and all other variables in the data set as predictors. The argument data=bank specifies that the data set to be used is the bank data frame. The argument method = "rpart" specifies that the classification model is a decision tree implemented with the rpart package. The argument trControl = ctrl1 supplies the cross-validation configuration set up earlier with trainControl(); it is used to tune the model's hyperparameter and estimate its performance. The argument tuneGrid = expand.grid(cp=(1:100)*0.001) specifies the grid of values to try for the cp (complexity) parameter of the decision tree. cp controls tree size: a split is kept only if it improves the overall fit by at least a factor of cp, so larger values produce smaller trees. The resulting object fit1 is a trained model that can be used to make predictions on new data and to evaluate performance with metrics such as accuracy, sensitivity, specificity, and AUC (area under the ROC curve).
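A few follow-up calls (a sketch, not part of the original output) can be used to inspect the tuned model: plot the accuracy profile over cp, check the winning cp value, and draw the final tree with rpart.plot.
plot(fit1)                    # cross-validated accuracy versus cp
fit1$bestTune                 # the cp value selected by the largest accuracy (0.006 above)
rpart.plot(fit1$finalModel)   # the tree refit on the full data with the best cp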
fit1.pred = predict(fit1)
This line of code uses the predict() function to make predictions on the training data with the fit1 model trained above. The resulting object fit1.pred contains the predicted classes of the Personal.Loan variable for each row of the training data. Because these predictions are made on the same data the model was fitted to, the accuracy computed below is a resubstitution estimate and slightly optimistic; the cross-validated results above are the better guide to out-of-sample performance.
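Two related uses of predict() (a sketch, not shown in the original) are class probabilities rather than class labels, and predictions for new rows via newdata:
fit1.prob = predict(fit1, type = "prob")   # per-row probabilities for classes "0" and "1"
head(fit1.prob)
predict(fit1, newdata = bank[1:5, ])       # class predictions for the first five rows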
confusionMatrix(fit1.pred, bank$Personal.Loan, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4495 37
## 1 25 443
##
## Accuracy : 0.9876
## 95% CI : (0.9841, 0.9905)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9278
##
## Mcnemar's Test P-Value : 0.1624
##
## Sensitivity : 0.9229
## Specificity : 0.9945
## Pos Pred Value : 0.9466
## Neg Pred Value : 0.9918
## Prevalence : 0.0960
## Detection Rate : 0.0886
## Detection Prevalence : 0.0936
## Balanced Accuracy : 0.9587
##
## 'Positive' Class : 1
##
This code uses the confusionMatrix() function from the caret package to compute the confusion matrix for the predictions made by the fit1 model. The first argument, fit1.pred, contains the predicted values of the target variable (Personal.Loan) generated by the predict() function. The second argument, bank$Personal.Loan, contains the actual values of the target variable in the bank data set. The argument positive = "1" specifies that the positive class for the confusion matrix is Personal.Loan equal to 1 (i.e., a customer who has taken a personal loan). The confusionMatrix() function reports several performance metrics alongside the table itself, including accuracy, kappa, sensitivity, specificity, positive and negative predictive value, prevalence, detection rate, and balanced accuracy. This output summarizes how well the fit1 model predicts whether a customer has taken a personal loan, based on the other variables in the data set.
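Storing the result makes it easy to pull out individual metrics programmatically instead of reading them off the printed table; a minimal sketch:
cm = confusionMatrix(fit1.pred, bank$Personal.Loan, positive = "1")
cm$overall["Accuracy"]       # overall accuracy
cm$byClass["Sensitivity"]    # recall for the positive class "1"
cm$byClass["Specificity"]    # true negative rate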