FE581 – 0227

Intro

There are many different methods for prediction and classification.

The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis.

A small illustration is shown:

[Figure: two methods for separating riding-mower buyers (solid circles) from nonbuyers (hollow circles)]

where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line. Different methods can lead to different results, and their performance can vary.

Terminology and Notation

Data mining process

[Diagram: the data mining process]

Classification Tree

Among the data-driven methods, trees are the most transparent and easy to interpret. Trees are based on separating records into subgroups by creating splits on predictors. These splits create logical rules that are transparent and easily understandable, for example, "IF Age < 55 AND Education > 12 THEN class = 1."

Introduction

If one had to choose a classification technique that performs well across a wide range of situations without requiring much effort from the analyst while being readily understandable by the consumer of the analysis, a strong contender would be the tree methodology developed by Breiman et al. (1984).

The program that Breiman et al. created to implement these procedures was called CART (Classification And Regression Trees).

What is a classification tree? Figure 9.1 shows a tree for classifying bank customers who receive a loan offer as either acceptors or nonacceptors, based on information such as their income, education level, and average credit card expenditure.

[Figure 9.1: classification tree for the bank loan offer example]

One of the reasons that tree classifiers are very popular is that they provide easily understandable classification rules (at least if the trees are not too large).

Consider the tree in the example. The gray terminal nodes are marked with 0 or 1, corresponding to a nonacceptor (0) or acceptor (1). The values above the white nodes give the splitting value on a predictor. The values below the nodes give the number of records in the split. This tree can easily be translated into a set of rules for classifying a bank customer. For example, the bottom-left rectangle node under the "Family" circle in this tree gives us the following rule:

(1) IF (Income ≥ 106) AND (Education < 1.5) AND (Family < 2.5) THEN Class = 0 (nonacceptor).
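As a small illustration of how such a rule reads as code, here is rule (1) written as a Python predicate (the function name is ours, and the threshold operators follow the reconstruction above):

```python
def rule_1(income, education, family):
    """Rule (1) read off the tree: records reaching the bottom-left
    terminal node are classified as 0 (nonacceptor)."""
    if income >= 106 and education < 1.5 and family < 2.5:
        return 0  # nonacceptor
    return None  # the record follows a different branch and rule
```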

Classification Trees

Two key ideas underlie classification trees. The first is the idea of recursive partitioning of the space of the predictor variables. The second is the idea of pruning using validation data. In the next few sections, we describe recursive partitioning, and in subsequent sections we explain the pruning methodology.

Recursive Partitioning

Let us denote the outcome variable by Y and the input (predictor) variables by X1, X2, ..., Xp. In classification, the outcome variable will be a categorical variable. Recursive partitioning divides up the p-dimensional space of the X predictor variables into nonoverlapping multidimensional rectangles.

[Figure: the predictor space is first split into two rectangles on one predictor; one of those rectangles is then split again]

This results in three (multidimensional) rectangular regions. This process is continued so that we get smaller and smaller rectangular regions. The idea is to divide the entire X-space up into rectangles such that each rectangle is as homogeneous or "pure" as possible. By pure, we mean containing records that belong to just one class. (Of course, this is not always possible, as there may be records that belong to different classes but have exactly the same values for every one of the predictor variables.)

Let us illustrate recursive partitioning with an example.

Example 1: Riding Mowers

We again use the riding-mower example presented in Chapter 3. A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one.

A pilot random sample of 12 owners and 12 nonowners in the city is undertaken.

[Table: Income and Lot Size for the sample of 12 owners and 12 nonowners]

[Figure: scatter plot of Lot Size vs. Income for owners and nonowners]

If we apply the classification tree procedure to these data, the procedure will choose Income for the first split, with a splitting value of 60. The (X1, X2) space is now divided into two rectangles, one with Income < 60 and the other with Income > 60.

Notice how the split has created two rectangles, each of which is much more homogeneous than the rectangle before the split.

[Figure: the (X1, X2) space split into two rectangles at Income = 60]

The left rectangle contains points that are mostly nonowners (seven nonowners and one owner), and the right rectangle contains mostly owners (11 owners and five nonowners).

How was this particular split selected? The algorithm examined each predictor variable (in this case, Income and Lot Size) and all possible split values for each variable to find the best split. What are the possible split values for a variable? They are simply the values observed for each predictor. The possible split points for Income are 33.0, 43.2, 47.4, ..., 110.1, and those for Lot Size are 14.0, 14.8, 16.0, ..., 23.6. These split points are ranked according to how much they reduce impurity (heterogeneity) in the resulting rectangles. A pure rectangle is one that is composed of a single class (e.g., owners). The reduction in impurity is defined as the overall impurity before the split minus the weighted sum of the impurities of the two rectangles that result from the split (weighted by the number of records in each).

Categorical Predictors

The previous description used numerical predictors; however, categorical predictors can also be used in the recursive partitioning context. To handle categorical predictors, the split choices for a categorical predictor are all ways in which the set of categories can be divided into two subsets.
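As a minimal sketch of what "all ways in which the set of categories can be divided into two subsets" means, the following Python snippet enumerates the candidate binary splits for a hypothetical categorical predictor:

```python
from itertools import combinations

def binary_splits(categories):
    """Enumerate all ways to divide a set of categories into two
    nonempty subsets. For m categories there are 2**(m-1) - 1 splits."""
    cats = list(categories)
    splits = []
    # Take subsets of size 1 .. m//2; the complement forms the other side.
    for r in range(1, len(cats) // 2 + 1):
        for left in combinations(cats, r):
            right = tuple(c for c in cats if c not in left)
            # Skip the mirror image of an equal-sized split already seen.
            if len(left) == len(right) and right < left:
                continue
            splits.append((left, right))
    return splits

# Example with a made-up predictor taking values {a, b, c, d}:
for left, right in binary_splits(["a", "b", "c", "d"]):
    print(left, "vs", right)  # 7 = 2**(4-1) - 1 candidate splits
```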

[Figure: example of candidate binary splits for a categorical predictor]

Measure of Impurity

There are a number of ways to measure impurity. The two most popular measures are the Gini index and the entropy measure.

Denote the m classes of the outcome variable by k = 1, ..., m, and let p_k be the proportion of records in rectangle A that belong to class k. The Gini impurity index for rectangle A is defined by

$$I(A) = 1 - \sum_{k=1}^{m} p_k^2$$

This measure takes the value 0 when all the records belong to the same class, and it takes its maximum value of (m − 1)/m when all m classes are equally represented.

A second impurity measure is the entropy measure, defined by

$$\text{entropy}(A) = -\sum_{k=1}^{m} p_k \log_2(p_k)$$

The entropy ranges between 0 (most pure) and log2(m) (least pure, where all m classes are equally represented).
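A minimal Python sketch of the two measures, written as small helper functions over class counts (the function names are ours, not from a particular library):

```python
import math

def gini(counts):
    """Gini impurity index I(A) = 1 - sum(p_k^2) for a rectangle
    whose class membership counts are given in `counts`."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy measure -sum(p_k * log2(p_k)); classes with zero
    records contribute nothing, by convention."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# The unsplit riding-mower data: 12 owners, 12 nonowners.
print(gini([12, 12]))     # 0.5  (maximum for two classes)
print(entropy([12, 12]))  # 1.0  (log2(2))
```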

Let us compute the impurity in the riding-mower example before and after the first split (using Income with the value of 60).

The unsplit dataset contains 12 owners and 12 nonowners. This is a two-class case with an equal number of records from each class. Both impurity measures are therefore at their maximum value:

$$I(\text{unsplit}) = 1 - \left(\tfrac{12}{24}\right)^2 - \left(\tfrac{12}{24}\right)^2 = 0.5, \qquad \text{entropy}(\text{unsplit}) = -\tfrac{12}{24}\log_2\tfrac{12}{24} - \tfrac{12}{24}\log_2\tfrac{12}{24} = \log_2(2) = 1$$

After the split, the left rectangle contains seven nonowners and one owner, and the right rectangle contains five nonowners and 11 owners. The impurity measures for the left rectangle are:

$$I(\text{left}) = 1 - \left(\tfrac{7}{8}\right)^2 - \left(\tfrac{1}{8}\right)^2 = 0.219, \qquad \text{entropy}(\text{left}) = -\tfrac{7}{8}\log_2\tfrac{7}{8} - \tfrac{1}{8}\log_2\tfrac{1}{8} = 0.544$$

and for the right rectangle:

$$I(\text{right}) = 1 - \left(\tfrac{11}{16}\right)^2 - \left(\tfrac{5}{16}\right)^2 = 0.430, \qquad \text{entropy}(\text{right}) = -\tfrac{11}{16}\log_2\tfrac{11}{16} - \tfrac{5}{16}\log_2\tfrac{5}{16} = 0.896$$

The combined impurity after the split is the weighted average of the two rectangles' impurities, with weights proportional to the number of records in each (8 and 16 of the 24):

$$I(\text{split}) = \tfrac{8}{24}(0.219) + \tfrac{16}{24}(0.430) = 0.359, \qquad \text{entropy}(\text{split}) = \tfrac{8}{24}(0.544) + \tfrac{16}{24}(0.896) = 0.779$$

Thus, the Gini impurity index decreased from 0.5 before the split to 0.359 after the split. Similarly, the entropy impurity measure decreased from 1 before the split to 0.779 after the split.
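Continuing the sketch above (with gini and entropy as defined there), the whole before/after comparison takes only a few lines:

```python
# Uses the gini/entropy helpers sketched earlier.
left, right = [7, 1], [5, 11]   # [nonowners, owners] in each rectangle
n_left, n_right = sum(left), sum(right)
n = n_left + n_right
after_gini = (n_left / n) * gini(left) + (n_right / n) * gini(right)
after_entropy = (n_left / n) * entropy(left) + (n_right / n) * entropy(right)
print(round(after_gini, 3))     # 0.359, down from 0.5
print(round(after_entropy, 3))  # 0.779, down from 1.0
```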

By comparing the reduction in impurity across all possible splits in all possible predictors, the next split is chosen.
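Here is a minimal sketch of this search for a single numeric predictor, under the convention used above that candidate split points are the observed values themselves; a real implementation repeats the scan over every predictor:

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2), as defined earlier."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Try every observed value of one predictor as a split point and
    return (split_value, impurity_reduction) for the best one."""
    classes = sorted(set(labels))
    impurity_before = gini([labels.count(k) for k in classes])
    best = None
    for v in sorted(set(values)):
        left = [k for x, k in zip(values, labels) if x < v]
        right = [k for x, k in zip(values, labels) if x >= v]
        if not left or not right:
            continue  # a split must produce two nonempty rectangles
        w = len(left) / len(labels)
        impurity_after = (w * gini([left.count(k) for k in classes])
                          + (1 - w) * gini([right.count(k) for k in classes]))
        reduction = impurity_before - impurity_after
        if best is None or reduction > best[1]:
            best = (v, reduction)
    return best
```

Run over both Income and Lot Size in the riding-mower data, a scan like this is what selects Income with the value 60 as the first split.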

The reason the method is called a classification tree algorithm is that each split can be depicted as a split of a node into two successor nodes. The first split is shown as a branching of the root node of a tree in Figure 9.7. The full-grown tree is shown in Figure 9.8. (Note that in R the split values are integers.)

[Figure 9.7: the tree after the first split on Income]

[Figure 9.8: the full-grown tree for the riding-mower data]
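As an end-to-end sketch, a tree like this can be grown in Python with scikit-learn. This is a hedged example: the file name RidingMowers.csv and the column names Income, Lot_Size, and Ownership are assumptions about the dataset layout, not taken from the text above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed file and columns (see above); Ownership holds the class label.
df = pd.read_csv("RidingMowers.csv")
X = df[["Income", "Lot_Size"]]
y = df["Ownership"]

# Grow the full tree: keep splitting until terminal nodes are pure,
# mirroring the full-grown tree in Figure 9.8 (no pruning yet).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=1)
full_tree.fit(X, y)

# Print the splits as text rules.
print(export_text(full_tree, feature_names=["Income", "Lot_Size"]))
```

Note that scikit-learn places split thresholds at midpoints between consecutive observed values rather than at the values themselves, so the printed cutoffs can differ slightly from the split values discussed above.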

Riding Mowers - R

Full view online: https://fe581.fe.azat.cc/RidingMowers

Riding Mowers - Python

Full view online: https://datalore.azat.ai/view/notebook/yxzbLNGw6K3DXI20a69vqL