---
title: "MT-Review Generic"
output: html_notebook
date: 2023-05-08
author: Azat
---

## General 

Statistical Learning (Data mining) methods can be classified as `supervised` or `unsupervised`.

$X$ -> predictors, independent variables, features, variables.  
$Y$ -> response, dependent variable.    
$X = (X_1,X_2,...,X_p)$ and $Y$ are related by $Y = f(X) + \epsilon$,
where $\epsilon$ is a random error term, independent of $X$, with mean zero. Data mining refers to a set of approaches for estimating $f$.    

We estimate $f$ for two purposes:         
1. Prediction: $\hat Y = \hat f(X)$   
2. Inference: learn the relationship between $X$ and $Y$    

The accuracy of $\hat Y$ as a prediction for $Y$ depends on:    
1. reducible error    
2. irreducible error    
![](assets/error-red-irr.png)     

`overfitting` -> an undesirable situation in which the fit follows the training data too closely and therefore will not yield accurate estimates of the response on new observations that were not part of the original training data set.

### Trade-Off between Prediction Accuracy and Model Interpretability 

`DT` is easy to interpret.    
`Bagging`, `Boosting`, and `RF` have higher accuracy but are hard to interpret.

### Supervised vs Unsupervised
- `supervised` - for each observation of the predictor measurement(s) $x_i, i=1,...n$ there is an associated response measurement $y_i$ -> predict $Y$ or learn the relationship between $X$ and $Y$.
- `unsupervised` - for each observation $x_i, i=1,...n$ we observe a vector of measurements but **no** associated response $y_i$ -> understand the relationships among the $X_i$s or among the observations.     

### Regression vs Classification    
`regression` -> quantitative response, numerical response   
`classification` -> qualitative response, categorical response  

### Model Accuracy

#### Regression

$$
MSE = \frac{1}{n} \sum_{i=1}^n {(y_i - \hat f(x_i))^2}
$$
where $\hat f(x_i)$ is the prediction $\hat y_i$ for the $i$th observation.

We are interested in the accuracy of the predictions we obtain when we apply our method to previously unseen test data. There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.   

When a given method yields a small training MSE but a large test MSE, we are said to be `overfitting the data`.    

Cross-Validation (CV) can be used to **estimate** the test MSE on the training data.
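
A minimal sketch in R of that training/test gap (simulated data; every name here is made up for illustration): training MSE keeps falling as flexibility grows, while test MSE is U-shaped.

```{r}
# Simulate data with a known f and a mean-zero error term, then compare
# training and test MSE across polynomial fits of increasing flexibility.
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)   # true f(x) = sin(x)
train <- sample(n, n / 2)          # hold out half the data as a test set

for (d in c(1, 3, 10)) {           # polynomial degree = model flexibility
  fit <- lm(y ~ poly(x, d), subset = train)
  mse <- function(idx) mean((y - predict(fit, data.frame(x = x)))[idx]^2)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n",
              d, mse(train), mse(-train)))
}
```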

#### Classification

Error rate:
$$
\text{error rate} = \frac{1}{n} \sum_{i=1}^n {I(y_i \ne \hat y_i)}
$$
Bayes Classifier: assign a test observation $x$ to the class $k$ that maximizes
$$
Pr(Y=k|X=x)
$$
In the two-class case, predict class 1 when $Pr(Y=1|X=x) > 0.5$.
KNN 
$$
Pr(Y=j|X=x_0) = \frac{1}{K} \sum_{i\in N_0}{I(y_i=j)}
$$
$N_0$ -> the K points in the training data that are closest to $x_0$ 

### The Bias-Variance Trade-Off 

Expected test MSE for a given value $x_0$ can be given as:
![](assets/exp-err.png)   

In order to minimize the expected test error,
we need to select a statistical learning method that simultaneously achieves `low variance` and `low bias`.    
- `Variance` -> the amount by which $\hat f$ would change if we estimated it using a different training data set (due to randomness in the data) 
- `Bias` -> the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model (due to simplification) 

model is too simple -> high bias (uses only a few $X_i$)   
model is too complex -> low bias but high variance (uses too many $X_i$; overfitting)    

## Classification 

Response variable is qualitative.   
simple:    
- Naive Bayes   
- KNN

computationally intensive:   
  - DT    
  - Bagging   
  - RF    
  - Boosting    

### Confusion Matrix    
![](assets/cm.png)

- Accuracy: ratio of correct predictions over all observations ($A0,A1$ = actual class counts; $P0,P1$ = predicted class counts, per the matrix above). $\frac{TP+TN}{A0+A1}$ = $\frac{TP+TN}{P0+P1}$ 
- Sensitivity = Recall = True Positive Rate: # of correct '1's over all actual '1's. $\frac{TP}{A1}$ = $\frac{TP}{TP+FN}$.   
- Specificity: # of correct '0's over all actual '0's. $\frac{TN}{A0}$ = $\frac{TN}{TN+FP}$  
- Precision: # of correct '1's over all predicted '1's. $\frac{TP}{P1}$ = $\frac{TP}{TP+FP}$ 
- F-score: $\frac{2\times precision \ \times\  recall }{precision + recall}$  
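
A hedged sketch of these metrics in R (`actual` and `predicted` are toy vectors, not real model output):

```{r}
# Build the confusion matrix from two label vectors and derive each metric.
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)

cm <- table(Actual = actual, Predicted = predicted)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

accuracy    <- (TP + TN) / sum(cm)   # correct predictions / all observations
sensitivity <- TP / (TP + FN)        # recall, true positive rate
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f.score     <- 2 * precision * sensitivity / (precision + sensitivity)
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision, F = f.score)
```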

### Naive Bayes Classifier 
$X: X_1,X_2,...X_p$  
Predicted label is $Y$  

Bayes theory: 
$$
P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}
$$

`Naive` Bayes: assume the predictors are conditionally independent given $Y$ (i.e., ignore the dependence)    
$$
P(X_1,X_2,...X_p|Y) = P(X_1|Y)P(X_2|Y)...P(X_p|Y) \\
$$
$$
P(Y=1|X_1,X_2,..X_p) = \frac{P(X_1,X_2,...X_p|Y=1)\, P(Y=1)}{P(X_1,X_2,...X_p)} \\
P(Y=0|X_1,X_2,..X_p) = \frac{P(X_1,X_2,...X_p|Y=0)\, P(Y=0)}{P(X_1,X_2,...X_p)} 
$$
The denominator is the same for both classes, so we compare the numerators, and $P(Y=1|X) + P(Y=0|X) = 1$ lets us normalize.

If a variable is numerical, we use a PDF for $P(X|Y)$, e.g. the Gaussian density:

$$
PDF= f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
$$
Pr(X|Y=1)= \frac{1}{\sqrt{2\pi\sigma_{y=1}^2}}e^{-\frac{(X-\mu_{y=1})^2}{2\sigma_{y=1}^2}} \\
Pr(X|Y=0)= \frac{1}{\sqrt{2\pi\sigma_{y=0}^2}}e^{-\frac{(X-\mu_{y=0})^2}{2\sigma_{y=0}^2}} \\
$$
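
A sketch of the numeric case "by hand" (toy data; this is just the two formulas above evaluated with `dnorm()`):

```{r}
# Gaussian naive Bayes for a single numeric predictor.
x <- c(1.0, 1.2, 0.8, 3.1, 2.9, 3.3)   # predictor values
y <- c(0,   0,   0,   1,   1,   1)     # class labels

prior1 <- mean(y == 1)
prior0 <- 1 - prior1
lik1 <- function(x0) dnorm(x0, mean(x[y == 1]), sd(x[y == 1]))  # Pr(X|Y=1)
lik0 <- function(x0) dnorm(x0, mean(x[y == 0]), sd(x[y == 0]))  # Pr(X|Y=0)

x0   <- 2.0
num1 <- lik1(x0) * prior1            # numerator for class 1
num0 <- lik0(x0) * prior0            # numerator for class 0
num1 / (num1 + num0)                 # P(Y=1|X=x0), normalized to sum to 1
```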

### KNN

KNN does not require training. 
All methods that compute distances require normalization of the predictors.

KNN estimates the conditional distribution of $Y$ given $X$, then classifies a given observation to the class with the highest estimated probability.

Step 1: Pick an integer K.  
Step 2: For $x_0$ (the test observation), identify the K points in the training data that are closest to $x_0$ (the set $N_0$).    
Step 3: Estimate the conditional probability for each class $j$ as the fraction of points in $N_0$ whose response equals $j$:
$$
Pr(y=j|X=x_0) = \frac{1}{K} \sum_{i\in N_0} (I(y_i=j))
$$
Step 4: Apply the Bayes rule: classify $x_0$ to the class with the largest probability.

![](assets/knn.png)
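
A sketch of these steps with `class::knn()` (iris as toy data; the predictors are scaled first, per the normalization point above):

```{r}
# KNN: no training step; knn() classifies the test rows directly.
library(class)
set.seed(1)
X <- scale(iris[, 1:4])                 # normalize the distance inputs
train <- sample(nrow(iris), 100)
pred <- knn(train = X[train, ], test = X[-train, ],
            cl = iris$Species[train], k = 5)
mean(pred != iris$Species[-train])      # test error rate
```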

## Resampling Methods   

Resampling methods: draw samples from a training set and refit the same statistical method multiple times using different subsets of the training data.

Resampling Methods:   
1. Validation Set Approach    
2. Cross Validation         
3. Bootstrap           
- Evaluating a model's performance -> model assessment    
- Selecting the proper level of complexity -> model selection 

### Validation Set Approach

We explore the use of the validation set approach in order to estimate the
test error rates that result from fitting various linear models on the Auto
data set.

![](assets/val-set.png)         

- the validation estimate of the test error rate can be highly variable, depending on the split (see the sketch below)
- only a subset of the observations is used to fit the model, so the validation error tends to overestimate the test error
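
A sketch of that variability (same simulated-data pattern as the earlier chunk): each split gives a noticeably different test MSE estimate.

```{r}
# Re-run the validation split under different seeds and collect the estimates.
set.seed(2)
x <- runif(200, -2, 2)
y <- sin(x) + rnorm(200, sd = 0.3)
sapply(1:5, function(s) {
  set.seed(s)
  train <- sample(200, 100)
  fit <- lm(y ~ poly(x, 3), subset = train)
  mean((y - predict(fit, data.frame(x = x)))[-train]^2)  # one estimate per split
})
```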

### Leave-One-Out Cross-Validation (LOOCV)    

LOOCV has a couple of major **advantages**:

 - First, it has far less bias.   
 - Second, there is no randomness in the training/validation set splits.     

**drawbacks**:

- computationally expensive: the model must be fit $n$ times, which is very time consuming for large $n$ or slow-to-fit models

LOOCV is a very general method, and can be used with any kind of
predictive modeling.

![](assets/loocv.png)  
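
A sketch with `boot::cv.glm()` (mtcars stands in for the Auto data here; a gaussian `glm()` fit is the same as `lm()`):

```{r}
# Leaving K at its default (K = n) makes cv.glm() perform LOOCV.
library(boot)
fit <- glm(mpg ~ hp, data = mtcars)
cv.glm(mtcars, fit)$delta[1]   # LOOCV estimate of the test MSE
```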

### K-Fold Cross-Validation   

This approach involves randomly dividing the set of observations into $k$ groups, or folds, of approximately equal size. Each fold in turn is treated as a validation set, and the method is fit on the remaining $k-1$ folds. 

LOOCV is a special case of k-fold CV in which k is set to equal n. 

**advantages:** 

- The most obvious advantage is computational.


There is some variability in the $k$-fold CV estimates, but this variability is typically much lower than the variability in the test error estimates that results from the validation set approach. 


CV can be used for model selection: find the location of the minimum point in the estimated test MSE curve.


$$
CV_{(k)} = \frac{1}{k} \sum_{i=1}^k {MSE_i}
$$
$$
CV_{(k)} = \frac{1}{k} \sum_{i=1}^k {Err_i} 
$$
where $MSE_i$ is the MSE on the $i$th held-out fold; for classification, use the fold error rate $Err_i$ instead (for LOOCV, $Err_i = I(y_i \ne \hat y_i)$).    
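
The same sketch as the LOOCV chunk, with $k = 10$ folds (only 10 fits instead of $n$):

```{r}
# K = 10 gives the usual 10-fold CV estimate.
library(boot)
set.seed(1)
fit <- glm(mpg ~ hp, data = mtcars)
cv.glm(mtcars, fit, K = 10)$delta[1]   # 10-fold CV estimate of the test MSE
```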

### Bootstrap   

Randomly select, with replacement, $n$ observations from the training data set. 
Example (ISLR): invest a fraction $\alpha$ in $X$ and $1-\alpha$ in $Y$ (there is some variability in both returns). We want to
$$
\text{minimize} \ Var(\alpha X+(1-\alpha)Y)
$$
which is minimized at
$$
\hat \alpha = \frac{\hat \sigma_Y^2 - \hat \sigma_{XY}}{\hat \sigma_X^2+\hat \sigma_Y^2-2\hat \sigma_{XY}}
$$
where $\hat \sigma_Y^2 = Var(Y)$, $\hat \sigma_X^2 = Var(X)$, $\hat \sigma_{XY} = Cov(X,Y)$; 
we estimate $\alpha$ using historical (training) data. 
![](assets/bootstrap.png)

The bootstrap approach allows us to use a computer to emulate the process of obtaining new sample sets, so that we can estimate the variability of $\hat\alpha$ without generating additional samples.

Rather than repeatedly obtaining independent data sets from the
population, we instead obtain distinct data sets by repeatedly sampling
observations from the original data set.
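
A sketch of the $\alpha$ example with `boot::boot()` (`returns` is simulated stand-in data for the $X,Y$ returns, not real data):

```{r}
# boot() expects a statistic of the form statistic(data, index), where index
# is the bootstrap resample; the output reports the bootstrap SE of alpha-hat.
library(boot)
set.seed(1)
returns <- data.frame(X = rnorm(100), Y = rnorm(100, sd = 2))

alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
boot(returns, alpha.fn, R = 1000)
```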

## Tree Based Methods 

Decision Tree: Classification Tree + Regression Tree    
Logic: Segment the predictor space into a number of Simple Rectangle Regions.

- Decision-tree-based methods are simple and useful for interpretation. 
- Bagging, Random Forest, and Boosting produce multiple trees, then combine them to yield a single consensus prediction.    

### Regression Tree 

The goal is to find rectangles (boxes) $R_1,R_2,...R_J$ that minimize the RSS given by:

$$
RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat y_{R_j})^2
$$
$R_1,R_2,R_3...$ - terminal nodes, leaf nodes.    
$\hat y_{R_j}$ - mean response of the training observations within the $j$th rectangle (terminal node).   

Recursive Binary Splitting (Greedy):    
Pick $X_j$ and the cut point $s$ that divide the space into $R_1 = \{X|X_j <s\}$ and $R_2 = \{X|X_j \ge s \}$.
$$
RSS_{R_1} = \sum_{i \in R_1} (y_i - \hat y_{R_1})^2 \\
RSS_{R_2} = \sum_{i \in R_2} (y_i - \hat y_{R_2})^2 \\
\text{total RSS from this split:}\ RSS = RSS_{R_1} + RSS_{R_2}
$$

and we seek the value of $X_j$ and $s$ that `minimize` the total RSS.    

We use RSS as the criterion to pick $X_j$ and $s$ at each split.
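
A sketch with `rpart` (mtcars as stand-in data); the printed splits show the chosen $X_j$, the cut point $s$, and the node means $\hat y_{R_j}$:

```{r}
# Grow a regression tree by recursive binary splitting on RSS.
library(rpart)
reg.tree <- rpart(mpg ~ ., data = mtcars, method = "anova")
reg.tree
```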

### Classification Tree 

Instead of RSS, in a classification tree we use an impurity measure to pick the $X_j$ that minimizes the impurity.

impurity measures are: 
$$
\text{error rate} \ E=1-\underset{k}{\max} (\hat p_k) \\
\text{Gini index} \ G=1-\sum_{k=1}^m \hat p_k^2 \\
\text{Entropy} \ E=-\sum_{k=1}^m \hat p_k \log_2(\hat p_k)\\
\text{Deviance} \ D=-2\sum_{k=1}^m n_k \ln(\hat p_k)\\
$$
$\hat p_k$ -> proportion of the training obs in the node that belong to class $k$ (of $m$ classes).    
$n_k$ -> number of training obs in the node that belong to class $k$.

Smaller values are better (the node is purer).
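
A sketch of the measures for a single node with toy class proportions (a nearly pure node, e.g. `p = c(0.99, 0.01)`, gives values near 0):

```{r}
# Impurity of one node with class proportions p.
p <- c(0.7, 0.3)
c(error   = 1 - max(p),
  gini    = 1 - sum(p^2),
  entropy = -sum(p * log2(p)))
```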

### Tree Pruning    

Trees might overfit because the resulting tree is too complex. 
A smaller tree with fewer splits (fewer regions) might lead to:    
- lower variance  
- better interpretation (at the cost of a little bias)

Build a Large Tree $T_0$ -> Prune -> Subtree $T$. 

How do we determine the best way to prune the tree? Intuitively, our goal is to select a subtree that leads to the lowest test error rate (estimated with CV, say). 

Cost-complexity pruning: a sequence of trees indexed by a non-negative tuning parameter $\alpha$; each value of $\alpha$ corresponds to a subtree $T \subset T_0$ that minimizes: 
$$
\sum_{m=1}^{|T|} \sum_{i:\, x_i\in R_m} (y_i-\hat y_{R_m})^2 + \alpha |T|
$$

$|T|$ - # of terminal nodes of the tree $T$.
$R_m$ - the rectangle corresponding to the $m$th terminal node.
$\hat y_{R_m}$ - the predicted response associated with $R_m$.

When $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes, so the penalty favors smaller subtrees.

We can select a value of $\alpha$ using a validation set or CV, then return to the full data set and obtain the subtree corresponding to that $\alpha$ (or tree size).
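
A sketch with `rpart`, whose complexity parameter `cp` plays the role of $\alpha$ (rescaled); `rpart` cross-validates over `cp` internally:

```{r}
# Grow a large T_0, pick the cp with the smallest cross-validated error
# (the xerror column of cptable), and prune back to that subtree.
library(rpart)
set.seed(1)
big.tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(cp = 0.001))   # grow T_0 large
best.cp <- big.tree$cptable[which.min(big.tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big.tree, cp = best.cp)                 # subtree T
pruned
```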

### Ensemble Methods: Bagging, RF, Boosting 
An ensemble method is an approach that combines many simple "building block" models in order to obtain a single and potentially very powerful model.

For bagging, RF, and boosting, the building block is the regression or classification tree. 

### Bagging 

Bootstrap Aggregation, or Bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

Averaging a set of observations reduces variance: given $n$ independent observations each with variance $\sigma^2$, the variance of their mean is $\sigma^2/n$.

Bagging: use the bootstrap to take repeated samples from the single training data set, generating B different bootstrapped training data sets, then aggregate (build a separate prediction model on each training set and average the resulting predictions):
$$
\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat f_b(x) 
$$
Bagging can substantially improve predictions.

1. Split Data: $D$ -> $D_T$, $D_V$
2. Bootstrap: $D_T$ -> $B_1,B_2,B_3,...B_B$ (B bootstrapped training sets)
3. Train: $B_1 \to \hat f_1$,...,$B_B \to \hat f_B$
4. Predict (CV/OOB): $\hat f_1 \to \hat y_1$,...,$\hat f_B \to \hat y_B$
5. Aggregate: $\hat y = \frac{1}{B}\sum_{b=1}^B\hat y_b$ (regression) or $\hat y =\text{Majority Vote}$ (classification)

Bagging = Bootstrapping + Aggregation 

ADV: 
- averaging reduces variance 
- the test error rate is lower than that of a single tree
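
A sketch of bagging via `randomForest` with `mtry = p` (all predictors are split candidates, so the only randomness left is the bootstrap; mtcars as stand-in data):

```{r}
# Bagged regression trees; the print method reports the OOB estimate of MSE.
library(randomForest)
set.seed(1)
p <- ncol(mtcars) - 1   # number of predictors
bag <- randomForest(mpg ~ ., data = mtcars, mtry = p, ntree = 500)
bag
```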

#### OOB-Error
It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform CV.

Each bagged tree $B_b$ makes use of around 2/3 of the observations. The remaining 1/3 of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.

So we can use the OOB observations for testing; when B is sufficiently large, the OOB error is virtually equivalent to the LOOCV error.

OOB error estimation is valid because it tests each tree on observations that tree has not seen.
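
Continuing the `bag` fit above, a sketch of the OOB error; the OOB predictions already live on the fitted object, so no extra fitting is needed:

```{r}
# bag$predicted holds the OOB prediction for each training observation.
mean((mtcars$mpg - bag$predicted)^2)   # OOB estimate of the test MSE
```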

#### Variable Importance 

Bagging: 
- ADV: improves accuracy over a single tree
- DISADV: difficult to interpret 

VarIMP:
- Regression: for each predictor, record the total decrease in RSS due to splits over that predictor, averaged over all B trees.
- Classification: for each predictor, record the total decrease in impurity (e.g. Gini index) due to splits over that predictor, averaged over all B trees.
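
A sketch using the `bag` fit above; for a regression forest, `IncNodePurity` is the total RSS decrease from splits on each predictor, averaged over the trees:

```{r}
# Tabulate and plot variable importance.
library(randomForest)
importance(bag)
varImpPlot(bag)
```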

### Random Forest

Random Forest provides an improvement over bagged trees by way of a small tweak that decorrelates the trees.

In RF, a fresh random sample of $m$ predictors is chosen as split candidates at each split; typically $m=\sqrt{p}$.
RF reduces the effect of very strong predictors by randomly restricting the candidate predictors at each split.   

If there is a very strong predictor, then almost all of the B trees will use it for the top split, and the resulting trees will be highly correlated.

Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging uncorrelated quantities.

RF: adds a step that decorrelates the trees.
RF: reduction in both test error and OOB error over bagging. 
Bagging, RF: will not overfit as we increase B.

1. Split Data: $D$ -> $D_T$, $D_V$
2. Bootstrap: $D_T$ -> $B_1,B_2,B_3,...B_B$
3. Decorrelate: at each split, randomly pick $m$ of the $p$ predictors $\{X_1,X_2,...X_p\}$ as candidates.
4. Train: $B_1 \to \hat f_1$,...,$B_B \to \hat f_B$
5. Predict (CV/OOB): $\hat f_1 \to \hat y_1$,...,$\hat f_B \to \hat y_B$
6. Aggregate: $\hat y = \frac{1}{B}\sum_{b=1}^B\hat y_b$ or $\hat y =\text{Majority Vote}$
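
The decorrelated version as a sketch, with $m=\sqrt{p}$ forced explicitly (`randomForest`'s own defaults are $\sqrt{p}$ for classification and $p/3$ for regression):

```{r}
# Same call as bagging, but only m randomly chosen predictors per split.
library(randomForest)
set.seed(1)
m  <- floor(sqrt(ncol(mtcars) - 1))
rf <- randomForest(mpg ~ ., data = mtcars, mtry = m, ntree = 500)
rf
```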

### Boosting 

Boosting is yet another approach for improving the predictions resulting from a decision tree.

Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.

In boosting, trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

Boosting:
- learns slowly, using stumps (weak learners)
- fits a tree to the current residuals rather than the outcome $Y$, then adds this new tree to the fitted function and updates the residuals:    
$$
\hat f(x) \leftarrow \hat f(x) + \lambda \hat f_b(x) \\
r_i \leftarrow r_i - \lambda \hat f_b (x_i) \\
\hat f(x)=\sum_{b=1}^B \lambda \hat f_b(x)
$$
1. Unlike bagging and random forests, boosting can overfit if B is too large.
2. The shrinkage parameter $\lambda$ controls the learning rate.
3. Often each tree is restricted to a stump (a single split involving only one variable).

Using stumps leads to an additive model.
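
A sketch with `gbm` (mtcars as stand-in data): $B$ = `n.trees`, $\lambda$ = `shrinkage`, and stumps via `interaction.depth = 1`; `n.minobsinnode` is lowered only because the toy data set is small.

```{r}
# Boosted regression stumps, fit sequentially to the residuals.
library(gbm)
set.seed(1)
boost <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
             n.trees = 1000, shrinkage = 0.01, interaction.depth = 1,
             n.minobsinnode = 5)
summary(boost)   # relative influence of each predictor
```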


