FE581 – 0227: Intro, Terminology and Notation, Data Mining Process, Classification Tree, Introduction, Classification Trees, Example 1: Riding Mowers, Measure of Impurity, Riding Mowers - R, Riding Mowers - Python
There are many different methods for prediction and classification.
The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis.
A small illustration is shown below,
where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line. Different methods can lead to different results, and their performance can vary.
Algorithm A specific procedure used to implement a particular data mining technique: classification tree, discriminant analysis, and the like.
Attribute see Predictor.
Case see Observation.
Confidence A performance measure in association rules of the type “IF A and B are purchased, THEN C is also purchased.” Confidence is the conditional probability that C will be purchased IF A and B are purchased.
Confidence also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.
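In symbols, one standard way to write the association-rule confidence for the rule above (A, B, C denote the items in the rule) is:
\[ \text{Confidence}(\{A, B\} \Rightarrow C) = P(C \mid A, B) = \frac{P(A, B, C)}{P(A, B)}. \]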
Dependent Variable see Response.
Estimation see Prediction.
Feature see Predictor.
Holdout Data (or holdout set) A sample of data not used in fitting a model, but instead used to assess the performance of that model. Here we use the terms validation set and test set instead of holdout set.
Input Variable see Predictor.
Model An algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).
Observation The unit of analysis on which the measurements are taken (a customer, a transaction, etc.), also called instance, sample, example, case, record, pattern, or row. In spreadsheets, each row typically represents a record; each column, a variable. Note that the use of the term “sample” here is different from its usual meaning in statistics, where it refers to a collection of observations.
Outcome Variable see Response.
Output Variable see Response.
P(A | B) The conditional probability of event A occurring given that event B has occurred, read as “the probability that A will occur given that B has occurred.”
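Written as a formula (the standard definition of conditional probability):
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0. \]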
Prediction The prediction of the numerical value of a continuous output variable; also called estimation.
Predictor A variable, usually denoted by X, used as an input into a predictive model, also called a feature, input variable, independent variable, or, from a database perspective, a field.
Profile A set of measurements on an observation (e.g., the height, weight, and age of a person).
Record see Observation.
Response A variable, usually denoted by Y, which is the variable being predicted in supervised learning, also called dependent variable, output variable, target variable, or outcome variable.
Sample In the statistical community, “sample” means a collection of observations. In the machine learning community, “sample” means a single observation.
Score A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.
Success Class The class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase).
Supervised Learning The process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.
Target see Response.
Test Data (or test set) The portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data.
Training Data (or training set) The portion of the data used to fit a model.
Unsupervised Learning An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.
Validation Data (or validation set) The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.
Variable Any measurement on the records, including both the input (X) variables and the output (Y) variable.
Among the data-driven methods, trees are the most transparent and easy to interpret. Trees are based on separating records into subgroups by creating splits on predictors. These splits create logical rules that are transparent and easily understandable, for example, “IF Age < 55 AND Education > 12 THEN class = 1.”
If one had to choose a classification technique that performs well across a wide range of situations without requiring much effort from the analyst while being readily understandable by the consumer of the analysis, a strong contender would be the tree methodology developed by Breiman et al. (1984).
The program that Breiman et al. created to implement these procedures was called CART (Classification And Regression Trees).
What is a classification tree? Figure 9.1 shows a tree for classifying bank customers who receive a loan offer as either acceptors or nonacceptors, based on information such as their income, education level, and average credit card expenditure.
One of the reasons that tree classifiers are very popular is that they provide easily understandable classification rules (at least if the trees are not too large).
Consider the tree in the example. The gray terminal nodes are marked with 0 or 1, corresponding to a nonacceptor (0) or acceptor (1). The values above the white nodes give the splitting value on a predictor. The values below the nodes give the number of records in the split. This tree can easily be translated into a set of rules for classifying a bank customer. For example, the bottom-left rectangle node under the “Family” circle in this tree can be read as an IF-THEN rule built from the splits along the path leading to it.
Two key ideas underlie classification trees. The first is the idea of recursive partitioning of the space of the predictor variables. The second is the idea of pruning using validation data. In the next few sections, we describe recursive partitioning and in subsequent sections explain the pruning methodology.
Recursive Partitioning
Let us denote the outcome variable by Y and the input (predictor) variables by X1, X2, ..., Xp. Recursive partitioning divides the p-dimensional space of the predictor variables into nonoverlapping rectangles: first, a predictor and a split value are chosen that divide the space into two parts, and then one of those parts is divided in the same way.
This results in three (multidimensional) rectangular regions. This process is continued so that we get smaller and smaller rectangular regions. The idea is to divide the entire X-space up into rectangles such that each rectangle is as homogeneous or “pure” as possible. By pure, we mean containing records that belong to just one class. (Of course, this is not always possible, as there may be records that belong to different classes but have exactly the same values for every one of the predictor variables.) Let us illustrate recursive partitioning with an example.
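To make the splitting step concrete, here is a minimal R sketch (the function name partition and its arguments are made up for illustration; this is not how rpart implements the procedure internally):

# One step of recursive partitioning: split the records on a single predictor
# at a single value, producing two rectangular regions of the X-space.
partition <- function(df, predictor, split.value) {
  list(left  = df[df[[predictor]] <= split.value, ],
       right = df[df[[predictor]] >  split.value, ])
}
# Each of the two resulting regions can then be passed to partition() again,
# which is what makes the procedure recursive.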
We again use the riding-mower example presented in Chapter 3. A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one.
A pilot random sample of 12 owners and 12 nonowners in the city is undertaken.
If we apply the classification tree procedure to these data, the procedure will choose Income for the first split with a splitting value of 60.
Notice how the split has created two rectangles, each of which is much more homogeneous than the rectangle before the split.
The left rectangle contains points that are mostly nonowners (seven nonowners and one owner) and the right rectangle contains mostly owners (11 owners and five nonowners).
How was this particular split selected? The algorithm examined each predictor variable (in this case, Income and Lot Size) and all possible split values for each variable to find the best split. What are the possible split values for a variable? They are simply the values for each predictor. The possible split points for Income are the Income values recorded for the 24 households, and similarly for Lot Size.
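These candidate values can be listed directly from the data. A small sketch, using the mower.df data frame that is created in the R notebook further below, and assuming the columns are named Income and Lot_Size (check the actual column names in RidingMowers.csv):

# Candidate split points for each numeric predictor: the values observed in the sample
sort(unique(mower.df$Income))
sort(unique(mower.df$Lot_Size))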
Categorical Predictors The previous description used numerical predictors; however, categorical predictors can also be used in the recursive partitioning context. To handle categorical predictors, the split choices for a categorical predictor are all ways in which the set of categories can be divided into two subsets. For example, a predictor with categories {a, b, c} yields three candidate splits: {a} vs. {b, c}, {b} vs. {a, c}, and {c} vs. {a, b}.
There are a number of ways to measure impurity. The two most popular measures (defined below) are:
Gini index
Entropy measure
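For a node with m classes, where p_k denotes the proportion of the node's records belonging to class k, the standard definitions of these two measures (the ones consistent with the values computed below) are:
\[ \text{Gini index} = 1 - \sum_{k=1}^{m} p_k^2, \qquad \text{Entropy} = -\sum_{k=1}^{m} p_k \log_2 p_k. \]
Both measures equal 0 for a pure node; for a two-class node with a 50/50 split they reach their maxima of 0.5 and 1, respectively.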
Let us compute the impurity in the riding mower example before and after the first split (using Income with the value of 60).
The unsplit dataset contains 12 owners and 12 nonowners. This is a two-class case with an equal number of records from each class. Both impurity measures are therefore at their maximum value: Gini = 0.5 and entropy = log2(2) = 1.
After the split, the left rectangle contains seven nonowners and one owner. The impurity measures for this rectangle are Gini = 1 - (1/8)^2 - (7/8)^2 = 0.219 and entropy = -(1/8)log2(1/8) - (7/8)log2(7/8) = 0.544. The right rectangle contains 11 owners and five nonowners, giving Gini = 0.430 and entropy = 0.896. The impurity of the split as a whole is the average of the two rectangles' impurity values, weighted by the number of records in each (8 and 16 of the 24, respectively).
Thus, the Gini impurity index decreased from 0.5 before the split to 0.359 after the split. Similarly, the entropy impurity measure decreased from 1 before the split to 0.779 after the split.
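As a check on these numbers, a short R sketch (not part of the original notebook; the helper functions gini() and entropy() are defined here just for this calculation):

# Impurity measures for a vector of class counts
gini    <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
entropy <- function(counts) { p <- counts / sum(counts); -sum(p * log2(p)) }

# Before the split: 12 owners and 12 nonowners
gini(c(12, 12))     # 0.5
entropy(c(12, 12))  # 1

# After the split on Income = 60:
# left rectangle has 1 owner and 7 nonowners, right rectangle has 11 owners and 5 nonowners
left <- c(1, 7); right <- c(11, 5)
w <- c(sum(left), sum(right)) / (sum(left) + sum(right))
w[1] * gini(left)    + w[2] * gini(right)       # 0.359
w[1] * entropy(left) + w[2] * entropy(right)    # 0.779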
By comparing the reduction in impurity across all possible splits in all possible predictors, the next split is chosen.
The reason the method is called a classification tree algorithm is that each split can be depicted as a split of a node into two successor nodes. The first split is shown as a branching of the root node of a tree in Figure 9.7. The full-grown tree is shown in Figure 9.8. (Note that in R the split values are integers.)
Full view online: https://fe581.fe.azat.cc/RidingMowers
# -*- coding: utf-8 -*-

# -- Sheet --

# # Example 1: Riding Mowers
# A riding-mower
# manufacturer would like to find a way of classifying families in a city
# into those likely to purchase a riding mower and those not likely to buy one. A
# pilot random sample of 12 owners and 12 nonowners in the city is undertaken.


# Prepare a list of packages, compare it to the output from installed.packages()[,"Package"], and install the missing packages.


list.of.packages <- c("rpart", "rpart.plot")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# Load packages into memory
library(rpart)
library(rpart.plot)

# read the csv file and create a data.frame from the data
mower.df <- read.csv("RidingMowers.csv")

str(mower.df)

# check our data
mower.df

# use rpart() to run a classification tree.
# define rpart.control() in rpart() to determine the depth of the tree.
class.tree <- rpart(Ownership ~ ., data = mower.df, control = rpart.control(maxdepth = 2), method = "class")

# check the classification tree results
class.tree

## plot the result
# use prp() to plot the tree. we can control plotting parameters such as color, shape, and information displayed (which and where)
prp(class.tree, type = 1, extra = 1, split.font = 1, varlen = -10)

# we can change the tree depth to get more tree nodes
class.tree <- rpart(Ownership ~ ., data = mower.df, method = "class", cp = 0, minsplit = 1)
prp(class.tree, type = 5, extra = 1, split.font = 1, varlen = -10)

# **Notice**:
# - The option minbucket provides the smallest number of observations that are allowed in a terminal node. If a split decision breaks up the data into a node with fewer than minbucket observations, it won't be accepted.
# - The minsplit parameter is the smallest number of observations in the parent node that can be split further. The default is 20. If there are fewer than 20 records in a parent node, it is labeled as a terminal node.
# - Finally, the maxdepth parameter prevents the tree from growing past a certain depth/height.
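
# For illustration only (these particular settings are arbitrary, not from the original notebook):
# the three parameters described above can be combined in a single rpart.control() call.
class.tree.small <- rpart(Ownership ~ ., data = mower.df, method = "class",
                          control = rpart.control(minsplit = 4, minbucket = 2, maxdepth = 3))
prp(class.tree.small, type = 1, extra = 1, split.font = 1, varlen = -10)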

?rpart
Full view online: https://datalore.azat.ai/view/notebook/yxzbLNGw6K3DXI20a69vqL