FE581 – 0313 - R scripts Walkthrough

Classification Example game.csv

We have our data game.csv with contents as:

Outlook,Temperature,Humidity,Windy,Play
sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
rainy,mild,high,false,yes
rainy,cool,normal,false,yes
rainy,cool,normal,true,no
overcast,cool,normal,true,yes
sunny,mild,high,false,no
sunny,cool,normal,false,yes
rainy,mild,normal,false,yes
sunny,mild,normal,true,yes
overcast,mild,high,true,yes
overcast,hot,normal,false,yes
rainy,mild,high,true,no

This data set represents a hypothetical game played outdoors and contains five variables: Outlook, Temperature, Humidity, Windy, and Play. The goal is to predict whether the game will be played or not based on the other variables.

To begin exploring this data set, let's load it into R. We can do this by first saving the contents into a file called "game.csv" and then using the read.csv() function to read the file:
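A minimal version of that call (assuming the file is in the working directory; game is the object name used throughout the rest of the walkthrough):

```r
game <- read.csv("game.csv")
```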

Now that we have loaded the data into R, let's take a look at its structure. We can use the str() function to do this:
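That is:

```r
str(game)
```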

The output shows that the data frame has 14 observations of 5 variables, and that the text columns have been read in as characters rather than factors.

In R, factors are used to represent categorical variables, while characters are used to represent text data. The difference between them is that factors have a predefined set of possible values (i.e., levels), while characters can have any possible value.

Using factors instead of characters for categorical variables can have several advantages in data mining and statistical modeling:

  1. Factors take up less memory than characters, which can be important when working with large data sets.

  2. Factors can be ordered, which can be useful for variables that have a natural order (e.g., temperature categories: low, medium, high).

  3. Factors can be used to specify contrasts and reference levels in statistical models, which can help to avoid issues with multicollinearity.

  4. Functions that operate on factors (e.g., table(), summary()) automatically generate output that is formatted for categorical data, which can make it easier to understand and interpret.

  5. More generally, it is good practice to use factors rather than characters for categorical variables in R, since many modeling functions expect factors for categorical inputs.

Since the variables are categorical, they should be read as factors instead of characters. Here's the correct way to load the data into R:
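One way to do this (a sketch: stringsAsFactors = TRUE converts the text columns to factors, and Windy is converted explicitly because lowercase "true"/"false" values may otherwise be read as a logical column):

```r
game <- read.csv("game.csv", stringsAsFactors = TRUE)
game$Windy <- as.factor(game$Windy)   # make sure Windy is a factor as well
```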

Now, if we run str(game), we should see that the variables are factors.

We can see that there are 14 observations (rows) and 5 variables (columns), all of which are factors. Factors in R are used to represent categorical variables.

To get a summary of the data set, we can use the summary() function:
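For example:

```r
summary(game)
# Level counts (from the 14 rows above): Outlook: overcast 4, rainy 5, sunny 5;
# Temperature: cool 4, hot 4, mild 6; Humidity: high 7, normal 7;
# Windy: false 8, true 6; Play: no 5, yes 9
```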

We can see that there are three levels of the Outlook variable (overcast, rainy, and sunny), three levels of the Temperature variable (cool, hot, and mild), two levels of the Humidity variable (high and normal), two levels of the Windy variable (false and true), and two levels of the Play variable (no and yes). We can also see that 9 out of the 14 games were played.

Now that we have a basic understanding of the data set, let's start exploring it in more detail using various data mining techniques.

rpart

rpart() is a function in R that can be used to create decision trees. In this case, gametree1 is a decision tree model that predicts the Play variable (whether the game was played or not) based on the other variables in the game data set.

The ~. syntax in the formula specifies that all other variables in the data set should be used to predict Play. The method="class" argument tells rpart() to treat the Play variable as a categorical variable (i.e., a factor) instead of a continuous variable.

To fit the decision tree model to the data, we use the rpart() function with the formula and data as arguments:
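A minimal version of that call (assuming the rpart package, which provides rpart()):

```r
library(rpart)

gametree1 <- rpart(Play ~ ., data = game, method = "class")
print(gametree1)   # prints the tree, one line per node
```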

Here's a breakdown of what each line means:

  1. n=14: The total number of observations in the data set.

  2. node: The ID number of the current node in the tree.

  3. split: The variable and value that were used to split the data at this node.

  4. n: The number of observations at this node.

  5. loss: The number of misclassified observations at this node.

  6. yval: The predicted class for this node based on the majority class at the node.

  7. (yprob): The probabilities of each class at this node.

  8. *: Indicates that this is a terminal node, meaning that no further splits are made after this point.

In this case, the output shows that the root node (node 1) includes all 14 observations in the data set. The split field simply reads "root" because no split has been made yet. The node predicts the majority class, "yes" (i.e., the game will be played), with a probability of about 0.64.

This is just the beginning of the decision tree model. As we continue to split the data based on the other variables, the tree will become more complex and accurate. We can use the plot() function to visualize the entire tree.

image-20230319155724710

The printcp() function prints the complexity parameter (CP) table for a fitted tree: for each candidate value of CP it lists the number of splits (nsplit), the relative error on the training data (rel error), and the cross-validated error (xerror) together with its standard error (xstd).
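For gametree1 this is simply:

```r
printcp(gametree1)
```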

In this case, the output shows that gametree1 has not yet split on any variables, which is why there are no variables listed in the "Variables actually used in tree construction" line. The root node error (i.e., the misclassification rate at the root node) is 0.35714, which is calculated as the number of misclassified observations divided by the total number of observations in the data set.

The table lists only one CP value (CP = 0.01, the default threshold), which means that no split improved the fit enough to be kept. With zero splits the tree consists of a single node, and its relative error is 1.0 by definition, since errors in this table are measured relative to the root node; in other words, the model simply predicts the majority class and is not a perfect predictor.

In general, we want to choose a CP value that balances model complexity with prediction accuracy. This can be done by selecting a CP value that results in a relatively small tree size while still maintaining a reasonable level of accuracy in predicting the outcome variable.

We can use this tree to make predictions for new data (or old data) by passing it to the predict() function:
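The exact new_data used originally is not shown; an illustrative example would be:

```r
# A hypothetical new observation with the same columns as the training data
new_data <- data.frame(Outlook = "sunny", Temperature = "cool",
                       Humidity = "normal", Windy = "true")
predict(gametree1, newdata = new_data, type = "class")
```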

The output of predict() is "yes", which indicates that the decision tree model predicts that the game will be played based on the values of the predictor variables in new_data.

Better fit

gametree2 is another decision tree model that predicts the Play variable based on the other variables in the game data set, just like gametree1.

The main difference between gametree2 and gametree1 is that gametree2 includes two additional arguments: minbucket and minsplit.

minbucket specifies the minimum number of observations allowed in any terminal node: a split is not made if it would create a child node with fewer than minbucket observations.

minsplit specifies the minimum number of observations that must exist in a node in order for it to be considered for splitting. If the number of observations in a node is less than minsplit, the node will not be split, and it will become a terminal node.

In this case, minbucket = 1 and minsplit = 2, which means that any node with only one observation will become a terminal node, and any node with less than two observations will not be considered for splitting. These values are often used as starting points for tuning the model, and can be adjusted to find the optimal values for the data set.
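A sketch of the corresponding call (minsplit and minbucket are passed on to rpart.control()):

```r
gametree2 <- rpart(Play ~ ., data = game, method = "class",
                   minsplit = 2, minbucket = 1)
summary(gametree2)   # detailed node-by-node summary discussed below
```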

The output above is the summary of the decision tree model gametree2.

For example, at the root node, the yprob values are 0.36 for "no" and 0.64 for "yes", indicating that the majority class at this node is "yes".

image-20230319160741708
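The complexity parameter table for gametree2 is printed in the same way as before:

```r
printcp(gametree2)
```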

In this case, the output shows that gametree2 was fit with minbucket = 1 and minsplit = 2.

The table shows three CP values: 0.30, 0.10, and 0.01. Each row represents a candidate tree, where nsplit is the number of splits in the tree, rel error is the relative error rate for the tree, xerror is the cross-validation error rate for the tree, and xstd is the standard deviation of the cross-validation error.


We can use this table to select the optimal tree for our data set. The idea is to choose the CP value that gives the smallest tree (fewest splits) while keeping the cross-validation error close to its minimum. Pruning gametree2 at a larger CP value (e.g., 0.30) collapses it back towards the root, an intermediate value (0.10) keeps only the strongest splits, and the smallest value (0.01) retains the full tree.

To select the optimal tree, we can use the plotcp() function to visualize the relationship between CP values and tree size, and select the CP value that provides the optimal tradeoff between complexity and accuracy:
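That is:

```r
plotcp(gametree2)
```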

image-20230319161135413

This generates a plot of the cross-validated error against the CP values and corresponding tree sizes, which can be used to pick the CP value with the best tradeoff between complexity and accuracy.

Make Prediction

We can also use the predict() function to obtain the predicted class labels for the observations in the original data set (game), like this:
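A sketch of that call (game_pred1 is the object name used below):

```r
game_pred1 <- predict(gametree2, type = "class")
game_pred1
```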

The output of predict(gametree2, type = "class") is a vector of predicted class labels for each observation in the original data set game.

The predicted class labels are "no" or "yes", which are the levels of the Play variable. The Levels: no yes part of the output indicates that the predicted class label can take on one of these two levels.

For example, the first observation in game has predictor variables Outlook = "sunny", Temperature = "hot", Humidity = "high", and Windy = "false". According to the decision tree model gametree2, this observation should be classified as "no", which means that the game will not be played.

The output shows that game_pred1 is a vector of predicted class labels for each observation in game. For example, the first predicted label is "no", which corresponds to the first observation in game.

Confusion Matrix

The table() function can be used to create a contingency table that shows the actual and predicted class labels for a given set of data.

To create a contingency table for the original data set game, using the predicted class labels in game_pred2 obtained from the decision tree model gametree2, we can use the following code:
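A sketch, assuming game_pred2 holds the class predictions from gametree2 (the same kind of vector as game_pred1 above):

```r
game_pred2 <- predict(gametree2, type = "class")
table(actual = game$Play, predicted = game_pred2)
```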

This will create a 2x2 contingency table with the rows representing the actual class labels ("no" and "yes") and the columns representing the predicted class labels ("no" and "yes").

The output will look something like this:
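Given the accuracy figures reported below (5 "no" games and 9 "yes" games, all classified correctly), the table should come out as:

```
        predicted
actual   no yes
    no    5   0
    yes   0   9
```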

This table shows that the decision tree model gametree2 correctly predicted all 5 observations in the "no" class and all 9 observations in the "yes" class, for an overall accuracy of 100%.

Accuracy

The accuracy() function can be used to calculate the accuracy of a classification model, by comparing the predicted class labels to the actual class labels.

To calculate the accuracy of the decision tree model gametree2, using the predicted class labels in game_pred2 and the actual class labels in game$Play, we can use the following code:
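A sketch, assuming accuracy() comes from the Metrics package (whose rmse(), mae(), and mape() functions are also used in the regression example later):

```r
library(Metrics)

accuracy(actual = game$Play, predicted = game_pred2)
```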

This will calculate the accuracy of the model as the proportion of correctly classified observations out of the total number of observations.

The output will be a numeric value representing the accuracy rate, expressed as a proportion between 0 and 1. For example, an accuracy rate of 1.0 would indicate that all observations were correctly classified, while an accuracy rate of 0.5 would indicate that only half of the observations were correctly classified.

In this case, the output will be 1, indicating that all of the observations in game were correctly classified by the gametree2 model.

(11) $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$

Accuracy = (5 + 9) / 14 = 1

Precision

The precision() function can be used to calculate the precision of a classification model, by comparing the predicted positive class labels to the actual positive class labels.

In our case, precision() issues a warning because it expects numeric (0/1) class labels: with the factor labels "no"/"yes", no observation matches the positive label 1, and the mean() call used internally receives a factor rather than a numeric or logical vector, so it returns NA with a warning.

To avoid this warning and calculate the precision correctly, we can convert the class labels to numeric values (for example, 1 for "yes" and 0 for "no").

Here's how we can do this:
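A sketch, assuming precision() from the Metrics package, which expects 0/1 labels with 1 as the positive class ("yes" here):

```r
actual_num    <- ifelse(game$Play == "yes", 1, 0)
predicted_num <- ifelse(game_pred2 == "yes", 1, 0)
precision(actual = actual_num, predicted = predicted_num)
```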

(12) $\text{Precision} = \dfrac{TP}{TP + FP}$

Precision = 9 / (9 + 0) = 1

Recall

The recall() function in the caret package can be used to calculate the recall (also known as sensitivity or true positive rate) of a classification model. Recall is defined as the number of true positives divided by the sum of true positives and false negatives.
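A sketch using caret's recall(), which takes the predictions and the reference labels as factors; "yes" is treated as the positive (relevant) class:

```r
library(caret)

recall(data = factor(game_pred2, levels = levels(game$Play)),
       reference = game$Play,
       relevant = "yes")
```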

(13) $\text{Recall} = \dfrac{TP}{TP + FN}$

Recall = 9 / (9 + 0) = 1

Regression Example (Boston)

The Boston dataset is a classic dataset used for regression analysis. It contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset has 506 instances and 14 variables, including the median value of owner-occupied homes in $1000s (the target variable) and predictors such as the crime rate, the average number of rooms per dwelling, and the proportion of non-retail business acres per town.

Here is a brief description of each variable:

  crim – per capita crime rate by town
  zn – proportion of residential land zoned for lots over 25,000 sq. ft.
  indus – proportion of non-retail business acres per town
  chas – Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
  nox – nitrogen oxides concentration (parts per 10 million)
  rm – average number of rooms per dwelling
  age – proportion of owner-occupied units built prior to 1940
  dis – weighted distances to five Boston employment centres
  rad – index of accessibility to radial highways
  tax – full-value property-tax rate per $10,000
  ptratio – pupil-teacher ratio by town
  black – 1000(Bk − 0.63)^2, where Bk is the proportion of Black residents by town
  lstat – percentage of the population with lower socioeconomic status
  medv – median value of owner-occupied homes in $1000s (the target variable)

The first six rows of the data set look like this:

     crim    zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.00632 18.0  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0.0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0.0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0.0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0.0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0.0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

For convenience, we copy the Boston data into a variable named data:
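Assuming the Boston data comes from the MASS package:

```r
library(MASS)

data <- Boston
```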

Check for missing values in the dataset
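One common way to do this (the original code is not shown) is to count NA values per column:

```r
colSums(is.na(data))   # all zeros: Boston has no missing values
sum(is.na(data))       # overall count of missing values
```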

image-20230319210522520

A decision tree is made up of nodes: internal nodes, where the data are split on a predictor variable, and leaf (terminal) nodes, where no further splits are made and a prediction is assigned.

The decision to stop splitting further in a decision tree is based on a stopping criterion. This criterion can be either a pre-defined threshold for a certain parameter, such as the minimum number of samples required in a node to allow a split, or a threshold for the improvement in model performance that is achieved by the split.

In the context of a decision tree, deviance is a measure of the impurity or error in a node that is used to determine the optimal split. How it is calculated depends on whether the tree is a regression tree or a classification tree: for a regression tree it is based on squared differences between the observed and predicted values in the node, while for a classification tree it is based on an impurity measure.

In regression trees, the deviance in a node is typically measured by the residual sum of squares (RSS), or equivalently the mean squared error (MSE), which is based on the squared differences between the observed values and the node's predicted value (the mean of the responses in that node). The RSS is calculated as follows:

(14) $\text{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$

where $y_i$ is the observed value for the i-th data point, $\hat{y}_i$ is the predicted value for the i-th data point in the node, and the sum is taken over all $n$ data points in the node.

In classification trees, the deviance in a node is typically measured by an impurity measure such as the Gini index or cross-entropy, which represents the probability of misclassification of a randomly chosen data point in the node. The Gini index is calculated as follows:

(15) $G = 1 - \sum_{i=1}^{K} p_i^2$

where $p_i$ is the proportion of data points in the node that belong to the i-th class, and the sum is taken over all $K$ classes.

In the output of the rpart function in R, the deviance in each node is reported as the "deviance" or "deviance reduction" in that node, which represents the reduction in deviance achieved by splitting that node. The algorithm selects the split that maximizes the reduction in deviance, which leads to the best split in terms of reducing the impurity or error in the resulting nodes.

We can also manually calculate the deviance for the root:
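That is:

```r
sum((Boston$medv - mean(Boston$medv))^2)
# 42716.3
```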

The output of the R command sum((Boston$medv - mean(Boston$medv))^2) is the total sum of squares (TSS) for the "medv" variable in the "Boston" dataset.

The TSS is calculated as the sum of the squared differences between each observation and the mean of the response variable. In this case, the TSS for the "medv" variable is 42716.3, which represents the total variation in the median value of owner-occupied homes in the "Boston" dataset.
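The regression tree itself is the object called model below; the original call is not shown, but a minimal version consistent with the rest of the script would be:

```r
library(rpart)

model <- rpart(medv ~ ., data = data)   # the "anova" (regression) method is the default for a numeric response
printcp(model)                          # complexity parameter table discussed next
```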

The complexity parameter table shows the complexity parameter (CP), the number of splits (nsplit), the relative error (rel error), the cross-validated error (xerror), and the standard error of the cross-validation error (xstd) for each candidate tree. The optimal tree size is selected based on the complexity parameter that minimizes the cross-validation error rate.

The code reg <- model assigns the previously fitted regression tree model (model) to a new object called reg. This is a common practice in R to store the results of a previous computation for later use or comparison.

The code pre = predict(reg) creates a new object called pre containing the predicted value of the response variable (medv) for each observation in the data dataframe, using the previously fitted regression tree model (reg). The predict() function takes the fitted model object as its first argument and, optionally, new data via the newdata argument; when newdata is omitted, as here, predictions are made for the training data, i.e., the data dataframe.

After executing this code, the object pre will contain a vector of predicted values for the response variable based on the regression tree model.

The code rmse(actual = data$medv,predicted = pre) calculates the root mean squared error (RMSE) between the actual values of the response variable (medv) in the data dataframe and the predicted values from the regression tree model stored in the pre object.

The code mae(actual = data$medv,predicted = pre) calculates the mean absolute error (MAE) between the actual values of the response variable in the data dataframe and the predicted values from the regression tree model stored in the pre object.

The code mape(actual = data$medv,predicted = pre) calculates the mean absolute percentage error (MAPE) between the actual values of the response variable in the data dataframe and the predicted values from the regression tree model stored in the pre object.
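Putting these steps together (assuming rmse(), mae(), and mape() come from the Metrics package):

```r
reg <- model
pre <- predict(reg)    # fitted values for the training data

library(Metrics)
rmse(actual = data$medv, predicted = pre)
mae(actual = data$medv, predicted = pre)
mape(actual = data$medv, predicted = pre)
```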

(16) $\text{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$

(17) $\text{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|$

(18) $\text{MAPE} = \dfrac{1}{n}\sum_{i=1}^{n} \left|\dfrac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$

We can also manually calculate them:
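Assuming pre holds the fitted values computed above, the three metrics can be reproduced directly from their formulas:

```r
sqrt(mean((data$medv - pre)^2))                  # RMSE
mean(abs(data$medv - pre))                       # MAE
mean(abs((data$medv - pre) / data$medv)) * 100   # MAPE in percent (Metrics::mape returns the proportion)
```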

The RMSE formula, for example, can be broken down into the following steps:

  1. Calculate the difference between the predicted and actual values: data$medv - pre

  2. Square the differences: (data$medv - pre)^2

  3. Calculate the mean of the squared differences: mean((data$medv - pre)^2)

  4. Take the square root of the mean squared difference to get the RMSE: sqrt(mean((data$medv - pre)^2))

Larger tree

image-20230319214128981

 

Largest tree
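The exact settings used to grow this largest tree are not shown; one way to grow a tree as far as possible (a sketch, reusing the reg3 name referenced below) is to relax the stopping rules:

```r
reg3 <- rpart(medv ~ ., data = data,
              cp = 0, minsplit = 2, minbucket = 1)   # effectively no stopping criterion
plot(reg3)
text(reg3, cex = 0.6)
```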

image-20230319222208247

We can use the printcp() function to print the complexity parameter table for the reg3 decision tree model. This table shows the complexity parameter (CP), the number of splits (nsplit), the relative error (rel error), the cross-validated error (xerror), and its standard error (xstd).
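That is:

```r
printcp(reg3)
```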

image-20230319222412017

 

Next, we add another custom helper function, bestcp(), which returns the best complexity parameter value:
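The definition of bestcp() is not shown in this walkthrough; a sketch of what such a helper might look like (returning the CP value with the smallest cross-validated error) is:

```r
bestcp <- function(tree) {
  cp_table <- tree$cptable
  cp_table[which.min(cp_table[, "xerror"]), "CP"]   # CP value with the lowest xerror
}

best <- bestcp(reg3)   # "best" is the value used in the pruning step below
best
```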

We use the prune.rpart() function to prune the reg3 decision tree model based on the best complexity parameter value, which we obtained using the bestcp() function.

The prune.rpart() function takes two arguments: the decision tree model to be pruned (reg3 in this case), and the best complexity parameter value (best in this case). The function returns a new decision tree model that has been pruned based on the specified complexity parameter value.
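For example (reg3_pruned is just an illustrative name for the pruned tree):

```r
reg3_pruned <- prune.rpart(reg3, cp = best)
plot(reg3_pruned)
text(reg3_pruned, cex = 0.8)
```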

image-20230319223302892

Classification Example (Iris) with tree

 

The "iris" dataset is a built-in dataset in R that contains information on the length and width of sepals and petals for three different species of iris flowers: setosa, versicolor, and virginica.

sepals vs petals

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
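The walkthrough fits this model with the tree package; the original code is not shown, but a minimal version (assuming df is simply a copy of the built-in iris data) would be:

```r
library(tree)

df  <- iris                           # assumed: df is a copy of the iris data frame
cls <- tree(Species ~ ., data = df)   # classification tree for Species

plot(cls)
text(cls, cex = 0.8)                  # label nodes with split variables and values

summary(cls)
```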

The model is being trained to predict the "Species" variable based on the other variables in the "df" data frame.

After creating the model, we use the plot() function to visualize the resulting decision tree, and the text() function to label each node in the tree with the corresponding variable and split point. The cex = 0.8 argument reduces the size of the text labels so that they are easier to read.

Finally, we use the summary() function to display some basic information about the tree model, including the variables actually used in the tree, the number of terminal nodes (i.e., leaf nodes), the residual mean deviance, and the overall misclassification error rate.

The output of summary(cls) provides some useful information about the decision tree model that we have created.

The decision tree algorithm works by recursively partitioning the data based on the most informative variables. At each step of the algorithm, the variable that provides the most information gain (i.e., the most useful for predicting the target variable) is chosen as the splitting variable.

In the case of the iris dataset, the summary(cls) output indicates that only three variables were used in the tree construction: Petal.Length, Petal.Width, and Sepal.Length. This suggests that these variables are the most informative for predicting the species of iris in this dataset.

It is possible that the remaining variable (Sepal.Width) did not provide as much information gain as the three selected variables, or that it is highly correlated with them. Correlated variables can sometimes lead to overfitting and can make the model less generalizable to new data.

We use the predict() function to generate predictions for the "df" dataset using the decision tree model "cls", passing the newdata = df argument to specify that we want predictions for the same dataset that was used to train the model.

The table() function with the true labels in the first argument and predicted labels in the second argument creates a confusion matrix that shows the number of correct and incorrect predictions for each class in the dataset.
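A sketch of those two steps (pred is an illustrative name):

```r
pred <- predict(cls, newdata = df, type = "class")   # predicted species labels
table(df$Species, pred)                              # rows: true labels, columns: predictions
```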

In this output, the rows of the confusion matrix represent the true labels of the iris flowers ("setosa", "versicolor", and "virginica"), and the columns represent the predicted labels based on the decision tree model.

The diagonal elements of the matrix represent the number of correct predictions for each class, while the off-diagonal elements represent the number of incorrect predictions.

For example, in this output, the decision tree model correctly predicted all 50 instances of the "setosa" class, 47 out of 50 instances of the "versicolor" class, and 49 out of 50 instances of the "virginica" class.

Overall, the model achieved high accuracy in predicting the species of iris flowers, with only 4 misclassifications out of 150 total observations (as shown in the output of summary(cls) earlier).

In the context of decision trees, deviance is a measure of the goodness of fit of a model to the data. For a regression tree it is the sum of squared differences between the observed and predicted responses over all observations; for a classification tree it is defined from the log-likelihood of the fitted class probabilities, as shown below.

In the case of a classification problem with K classes, the deviance is defined as:

(19) $D = -2 \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$

where $N$ is the number of observations, $K$ is the number of classes, $y_{ik}$ is an indicator variable that takes the value 1 if observation $i$ belongs to class $k$, and $\hat{p}_{ik}$ is the predicted probability of observation $i$ belonging to class $k$.

In the context of decision trees, the deviance is used to measure how well the tree model fits the training data. The goal of building a decision tree is to minimize the deviance, which means finding the tree that best fits the data.

The residual mean deviance, which is the output of the summary() function for a decision tree model, is the deviance of the model divided by the degrees of freedom (i.e., the number of observations minus the number of terminal nodes in the tree). The residual mean deviance can be used to compare the goodness of fit of different tree models. A lower residual mean deviance indicates a better fit to the data.

In the case of the iris dataset, which has three classes ("setosa", "versicolor", and "virginica"), we can calculate the deviance using the formula:

(20) $D = -2 \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$

where $N$ is the total number of observations, $K$ is the number of classes (in this case, $K = 3$), $y_{ik}$ is an indicator variable that takes the value 1 if observation $i$ belongs to class $k$, and $\hat{p}_{ik}$ is the predicted probability of observation $i$ belonging to class $k$.
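A short sketch of this calculation for the cls model above (probs, p_true, D, and n_leaves are illustrative names); it should reproduce the deviance underlying the residual mean deviance reported by summary(cls):

```r
probs  <- predict(cls, newdata = df)    # N x K matrix of fitted class probabilities
p_true <- probs[cbind(seq_len(nrow(df)), as.integer(df$Species))]   # probability of each point's true class
D      <- -2 * sum(log(p_true))         # only the y_ik = 1 terms contribute to the double sum
D

n_leaves <- sum(cls$frame$var == "<leaf>")   # number of terminal nodes
D / (nrow(df) - n_leaves)                    # residual mean deviance
```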