---
title: "MT-Review Generic"
output: html_notebook
date: 2023-05-08
author: Azat
---

## General 

Statistical learning (data mining) methods can be classified as `supervised` or `unsupervised`.     

$X$ -> predictors, independent variables, or features.  
$Y$ -> response, dependent variable.    
$X = (X_1,X_2,...,X_p)$ and $Y$ -> $Y = f(X) + \epsilon$,
$\epsilon$ is a random error term and has mean zero. Data mining refers to a set of approaches to estimate $f$.    

We estimate $f$ for:         
1. prediction: $\hat Y = \hat f(X)$   
2. inference: to learn the relationship between $X$ and $Y$    

The accuracy of $\hat Y$ as a prediction for $Y$ depends on:    
1. reducible error    
2. irreducible error    
![](assets/error-red-irr.png)     

`overfitting` -> an undesirable situation in which the fit will not yield accurate estimates of the response on new observations that were not part of the original training data set.

### Trade-Off between Prediction Accuracy and Model Interpretability 

A decision tree (`DT`) is easy to interpret.    
`Bagging`, `Boosting`, and `RF` have higher accuracy but are harder to interpret.

### Supervised vs Unsupervised
- `supervised` - for each observation of the predictor measurement(s) $x_i, i=1,...n$ there is an associated response measurement $y_i$ -> predict $Y$ or learn relationship between $X$ and $Y$ .
- `unsupervised` - for each observation $x_i, i=1,...n$ we observe a vector of measurements $x_i$ but **no** associated response $y_i$ -> understand the relationship between the $X_i$s or the observations.     

### Regression vs Classification    
`regression` -> quantitative response, numerical response   
`classification` -> qualitative response, categorical response  

### Model Accuracy

#### Regression

$$
MSE = \frac{1}{n} \sum_{i=1}^n {(y_i - \hat f(x_i))^2}
$$
where $\hat f(x_i) -> \hat y_i$ (the prediction).

We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.   

When a given method yields a small training MSE but a large test MSE, we are said to be `overfitting the data`.    

Cross-Validation (CV) can be used to **estimate** the test MSE on the training data.
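
A small simulation sketch of this point (everything here is illustrative and not from the notes): as the polynomial degree grows, the training MSE keeps falling while the test MSE eventually rises.

```r
# Illustrative simulation (not from the notes): polynomial fits of increasing
# degree drive the training MSE down, while the test MSE eventually rises.
set.seed(1)
n    <- 100
x    <- runif(n, -2, 2)
f    <- function(z) sin(2 * z)                    # assumed "true" f
y    <- f(x) + rnorm(n, sd = 0.3)
x_te <- runif(n, -2, 2)
y_te <- f(x_te) + rnorm(n, sd = 0.3)

for (d in c(1, 3, 10, 20)) {                      # increasing flexibility
  fit    <- lm(y ~ poly(x, d))
  mse_tr <- mean((y    - predict(fit))^2)
  mse_te <- mean((y_te - predict(fit, newdata = data.frame(x = x_te)))^2)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n", d, mse_tr, mse_te))
}
```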

#### Classification

Error rate:
$$
\text{error rate} = \frac{1}{n} \sum_{i=1}^n I(y_i \ne \hat y_i)
$$
Bayes Classifier: assign each test observation to the class $k$ for which
$$
Pr(Y=k|X=x)
$$
is largest. In the two-class case, predict class 1 if $Pr(Y=1|X=x) > 0.5$.
KNN 
$$
Pr(Y=j|X=x_0) = \frac{1}{K} \sum_{i\in N_0}{I(y_i=j)}
$$
$N_0$ -> the $K$ points in the training data that are closest to $x_0$ 

### The Bias-Variance Trade-Off 

Expected test MSE for a given value $x_0$ can be given as:
![](assets/exp-err.png)   
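
For reference, the standard decomposition shown in the figure is:
$$
E\left[\big(y_0 - \hat f(x_0)\big)^2\right] = Var\big(\hat f(x_0)\big) + \big[Bias\big(\hat f(x_0)\big)\big]^2 + Var(\epsilon)
$$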

In order to minimize the expected test error,
we need to select a statistical learning method that simultaneously achieves `low variance` and `low bias`.    
- `Variance` -> Variance refers to the amount by which $\hat f$ would change if we estimated it using a different training data set. (due to randomness in the data) 
- `Bias` -> bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. (due to simplification) 

model is too simple -> high bias, low variance (uses only a few $X_i$)   
model is too complex -> low bias, high variance (uses too many $X_i$; overfitting)    

## Classification 

Response variable is qualitative.   
Simple methods:    
- Naive Bayes   
- KNN

Computationally intensive methods:   
  - DT    
  - Bagging   
  - RF    
  - Boosting    

### Confusion Matrix    
![](assets/cm.png)

- Accuracy: ratio of correct predictions over all actual obs. $\frac{TP+TN}{A0+A1}$ = $\frac{TP+TN}{P0+P1}$ 
- Sensitivity=Recall=True Positive Rate: # of correct '1' s over all actual '1's. $\frac{TP}{A1}$ = $\frac{TP}{TP+FN}$.   
- Specificity : # of correct '0' s over all actual '0' s. $\frac{TN}{A0}$ = $\frac{TN}{TN+FP}$  
- Precision: # of correct '1' s over all predicted '1' s. $\frac{TP}{P1}$ = $\frac{TP}{TP+FP}$ 
- F-score: $\frac{2\times precision \ \times\  recall }{precision + recall}$  
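
A base-R sketch of these definitions; the `actual` and `predicted` vectors below are made-up 0/1 labels, not from the notes.

```r
# Confusion-matrix metrics from 0/1 labels; `actual` and `predicted` are made up.
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

TP <- sum(predicted == 1 & actual == 1)
TN <- sum(predicted == 0 & actual == 0)
FP <- sum(predicted == 1 & actual == 0)
FN <- sum(predicted == 0 & actual == 1)

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)                 # recall, true positive rate
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f_score     <- 2 * precision * sensitivity / (precision + sensitivity)
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision, F = f_score)
```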

### Naive Bayes Classifier 
$X: X_1,X_2,...X_p$  
Predicted label is $Y$  

Bayes theory: 
$$
P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}
$$

`Naive` Bayes: (ignores the dependence among the predictors)    
$$
P(X_1,X_2,...X_p|Y) = P(X_1|Y)P(X_2|Y)...P(X_p|Y) \\
$$
$$
P(Y=1|X_1,X_2,..X_p) = \frac{P(X_1,X_2,...X_p|Y=1) P(Y=1)}{P(X_1)P(X_2)...P(X_p)} \\
P(Y=0|X_1,X_2,..X_p) = \frac{P(X_1,X_2,...X_p|Y=0) P(Y=0)}{P(X_1)P(X_2)...P(X_p)} 
$$
and $P(Y=1|X) + P(Y=0|X) = 1$ 

If a predictor variable is numerical, we use its (Gaussian) PDF. 

$$
PDF= f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
$$
Pr(X|Y=1)= \frac{1}{\sqrt{2\pi\sigma_{y=1}^2}}e^{-\frac{(X-\mu_{y=1})^2}{2\sigma_{y=1}^2}} \\
Pr(X|Y=0)= \frac{1}{\sqrt{2\pi\sigma_{y=0}^2}}e^{-\frac{(X-\mu_{y=0})^2}{2\sigma_{y=0}^2}} \\
$$
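
A short sketch using `naiveBayes()` from the `e1071` package (an assumption; the package and the built-in `iris` data are only stand-ins). Numeric predictors are modeled with Gaussian PDFs, as above.

```r
# Gaussian naive Bayes on a built-in dataset; numeric predictors use normal PDFs.
library(e1071)

set.seed(1)
train_id <- sample(nrow(iris), 100)
nb_fit   <- naiveBayes(Species ~ ., data = iris[train_id, ])

nb_pred  <- predict(nb_fit, newdata = iris[-train_id, ])   # predicted classes
table(predicted = nb_pred, actual = iris$Species[-train_id])
```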

### KNN

KNN does not require training. 
All methods that compute distances require normalization (scaling) of the predictors.

KNN estimates the conditional distribution of $Y$ given $X$ and then classifies a given observation to the class with the highest probability.

Step 1: Pick an integer K.  
Step 2: For $x_0$ (the test observation), identify the K points in the training data that are closest to $x_0$ ($N_0$).    
Step 3: Estimate the conditional probability for each class j as the fraction of points in $N_0$ whose response values equal j.
$$
Pr(y=j|X=x_0) = \frac{1}{K} \sum_{i\in N_0} I(y_i=j)
$$
Step 4: Apply the Bayes rule: classify to the class with the largest probability.
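
A sketch of these steps with `knn()` from the `class` package (an assumption; `iris` is just a stand-in dataset). The predictors are scaled first because KNN is distance-based.

```r
# KNN with the class package; predictors are scaled first (distance-based method).
library(class)

set.seed(1)
x        <- scale(iris[, 1:4])               # normalize the predictors
train_id <- sample(nrow(iris), 100)

knn_pred <- knn(train = x[train_id, ],
                test  = x[-train_id, ],
                cl    = iris$Species[train_id],
                k     = 5)
mean(knn_pred == iris$Species[-train_id])    # test-set accuracy
```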

![](assets/knn.png)

## Resampling Methods   

Resampling methods draw samples from a training set in order to fit the same statistical method multiple times using different subsets of the training data.

Resampling Methods:   
1. Validation Set Approach    
2. Cross Validation         
3. Bootstrap           
- Evaluating a model's performance -> model assessment    
- Selecting the proper level of complexity -> model selection 

### Validation Set Approach

We explore the use of the validation set approach in order to estimate the
test error rates that result from fitting various linear models on the Auto
data set.

![](assets/val-set.png)         

**drawbacks**:

- the test error rate estimate can be highly variable
- only a subset of the observations is used to fit the model
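
A sketch of the validation set approach, assuming the `Auto` data comes from the `ISLR2` package and using the `mpg ~ horsepower` model as one illustrative linear model:

```r
# Validation set approach: random 50/50 split, fit on one half, test on the other.
library(ISLR2)   # assumed source of the Auto data

set.seed(1)
train <- sample(nrow(Auto), nrow(Auto) / 2)

fit     <- lm(mpg ~ horsepower, data = Auto, subset = train)
val_mse <- mean((Auto$mpg - predict(fit, Auto))[-train]^2)
val_mse           # estimated test MSE; changes with a different random split
```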

### Leave-One-Out Cross-Validation (LOOCV)    

LOOCV has a couple of major **advantages**:

 - First, it has far less bias.   
 - Second, there is no randomness in the training/validation set splits.     

**drawbacks**:

- expensive to implement
- very time consuming
- uses too much computational power

LOOCV is a very general method, and can be used with any kind of
predictive modeling.
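
A sketch of LOOCV with `boot::cv.glm()`, whose default `K = n` gives leave-one-out; the `Auto` data is again assumed to come from `ISLR2`.

```r
# LOOCV with cv.glm(); a glm() fit with the default family is just lm().
library(ISLR2)   # assumed source of the Auto data
library(boot)

fit    <- glm(mpg ~ horsepower, data = Auto)
cv_err <- cv.glm(Auto, fit)     # default K = n, i.e. leave-one-out
cv_err$delta[1]                 # LOOCV estimate of the test MSE
```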

![](assets/loocv.png)  

### K-Fold Cross-Validation   

k-fold CV involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. 

LOOCV is a special case of k-fold CV in which k is set to equal n. 

**advantages:** 

- The most obvious advantage is computational.


There is some variability in the CV estimates, but this variability is typically
much lower than the variability in the test error estimates that results from
the validation set approach. 


CV can be used for model selection: find the location of the minimum point in the estimated test MSE curve.


$$
CV_{(k)} = \frac{1}{k} \sum_{i=1}^k MSE_i
$$
$$
CV_{(k)} = \frac{1}{k} \sum_{i=1}^k Err_i 
$$
where $Err_i = I(y_i \ne \hat y_i)$.    
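
A sketch of 10-fold CV used for model selection, choosing the polynomial degree at the minimum of the estimated test-MSE curve (again assuming `ISLR2::Auto`):

```r
# 10-fold CV over polynomial degrees; pick the degree at the minimum of the curve.
library(ISLR2)   # assumed source of the Auto data
library(boot)

set.seed(1)
cv_err <- sapply(1:8, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.glm(Auto, fit, K = 10)$delta[1]
})
cv_err
which.min(cv_err)   # location of the minimum of the estimated test-MSE curve
```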

### Bootstrap   

Randomly select, with replacement, n observations from the training dataset. 
Example: we invest a fraction $\alpha$ in $X$ and $1-\alpha$ in $Y$ (both have some variability). We want to
$$
\text{minimize} \ Var(\alpha X+(1-\alpha)Y)
$$
and the minimizing value is
$$
\hat \alpha = \frac{\hat \sigma_Y^2 - \hat \sigma_{XY}}{\hat \sigma_X^2+\hat \sigma_Y^2-2\hat \sigma_{XY}}
$$
where $\hat \sigma_Y^2 = Var(Y), \hat \sigma_X^2 = Var(X),\hat \sigma_{XY} = Cov(X,Y)$. 
We estimate $\alpha$ using historical data (training data). 
![](assets/bootstrap.png)

The bootstrap approach allows us to use a computer to emulate the process of
obtaining new sample sets, so that we can estimate the variability of $\hat\alpha$
without generating additional samples.

Rather than repeatedly obtaining independent data sets from the
population, we instead obtain distinct data sets by repeatedly sampling
observations from the original data set.
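
A sketch of the $\hat\alpha$ example with `boot::boot()`, assuming a data frame like `ISLR2::Portfolio` with columns `X` and `Y`:

```r
# Bootstrap estimate of the variability (standard error) of alpha-hat.
library(ISLR2)   # assumed source of the Portfolio data (columns X and Y)
library(boot)

alpha_fn <- function(data, index) {
  x <- data$X[index]
  y <- data$Y[index]
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}

set.seed(1)
boot(Portfolio, alpha_fn, R = 1000)   # reports the bootstrap SE of alpha-hat
```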

## Tree Based Methods 

Decision Tree: Classification Tree + Regression Tree    
Logic: segment the predictor space into a number of simple rectangular regions.

- Decision-tree-based methods are simple and useful for interpretation. 
- Bagging, Random Forest, and Boosting produce multiple trees and then combine them to yield a single consensus prediction.    

### Regression Tree 

The goal is to find rectangles (boxes) $R_1,R_2,...R_J$ that minimize the RSS given by:

$$
RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat y_{R_j})^2
$$
$R_1,R_2,R_3...$ - terminal nodes, leaf nodes.    
$\hat y_{R_j}$ is the mean response for the training observations within the $j$th rectangle (terminal node).   

Recursive Binary Splitting (Greedy):    
Pick $X_j$ and the cut point $s$ that divide the predictor space into $R_1 = \{X|X_j <s\}$ and $R_2 = \{X|X_j \ge s \}$.
$$
RSS_{R_1} = \sum_{i \in R_1} (y_i - \hat y_{i})^2 \\
RSS_{R_2} = \sum_{i \in R_2} (y_i - \hat y_{i})^2 \\
\text{total RSS from this split}\ RSS = RSS_{R_1} + RSS_{R_2}
$$

and we seek the value of $X_j$ and $s$ that `minimize` the total RSS.    

We use RSS as the criterion to pick $X_j$ and $s$ at each split.
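
A base-R sketch of one greedy step (all names here are illustrative): for each candidate predictor and cut point, compute the total RSS of the two resulting regions and keep the best pair.

```r
# One greedy step: scan every predictor X_j and cut point s, keep the pair
# with the smallest total RSS of the two resulting regions.
best_split <- function(X, y) {
  best <- list(rss = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      left <- X[, j] < s
      if (!any(left) || all(left)) next          # skip empty regions
      rss <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[!left] - mean(y[!left]))^2)
      if (rss < best$rss) best <- list(j = j, s = s, rss = rss)
    }
  }
  best
}

set.seed(1)
X <- data.frame(x1 = runif(50), x2 = runif(50))
y <- ifelse(X$x1 < 0.5, 1, 3) + rnorm(50, sd = 0.2)
best_split(X, y)   # should pick x1 with a cut point near 0.5
```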

### Classification Tree 

Instead of RSS, in a classification tree we use an impurity measure to pick the $X_j$ that minimizes the impurity.

impurity measures are: 
$$
\text{error rate} \ E=1-\underset{k}{\max} (\hat p_k) \\
\text{Gini index} \ G=1-\sum_{k=1}^m \hat p_k^2 \\
\text{Entropy} \ E=-\sum_{k=1}^m \hat p_k \log_2(\hat p_k)\\
\text{Deviance} \ D=-2\sum_{k=1}^m n_k \ln(\hat p_k)\\
$$
$\hat p_k$ -> proportion of the training obs in the node belonging to class $k$ ($m$ is the number of classes).    
$n_k$ -> number of training obs belonging to class $k$.

Smaller values are better (the node is purer).
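
A tiny base-R sketch of two of these measures as functions of the class proportions in a node (purely illustrative):

```r
# Node impurity as a function of the class proportions p_k (must sum to 1).
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(p * log2(p))

gini(c(0.5, 0.5)); entropy(c(0.5, 0.5))   # most impure two-class node
gini(c(0.9, 0.1)); entropy(c(0.9, 0.1))   # purer node -> smaller values
```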

### Tree Pruning    

Trees might overfit because the resulting tree can be too complex. 
A smaller tree with fewer splits (fewer regions) might lead to:    
- lower variance  
- better interpretation ( at the cost of a little bit bias)

Build a Large Tree $T_0$ -> Prune -> Sub Tree $T$. 

How do we determine the best way to prune the tree? Intuitively, our goal is to select a subtree that leads to the lowest test error rate (estimated with CV, say). 

Consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$; each value of $\alpha$ corresponds to a subtree $T \subset T_0$ that minimizes: 
$$
\sum_{m=1}^{|T|} \sum_{i:\,x_i\in R_m} (y_i-\hat y_{R_m})^2 + \alpha |T|
$$

$|T|$ # of terminal nodes of the tree $T$.
$R_m$ is the rectangle corresponding to the $m$th terminal node.
$\hat y_{R_m}$ is the predicted response associated with $R_m$.

When $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes.

We can select a value of $\alpha$ using a validation set or the CV approach, then return to the full data set and obtain the subtree corresponding to that $\alpha$ (or tree size).
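
A sketch of cost-complexity pruning with the `rpart` package (an assumption; its `cp` parameter plays the role of $\alpha$, and `Boston` is assumed to come from `ISLR2`):

```r
# Grow a large tree T_0, inspect the cross-validated error for each cp value,
# then prune back to the subtree chosen by CV.
library(rpart)
library(ISLR2)   # assumed source of the Boston data

set.seed(1)
big_tree <- rpart(medv ~ ., data = Boston, method = "anova",
                  control = rpart.control(cp = 0.001))   # large tree T_0

printcp(big_tree)                                        # CV error per cp value
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)                 # subtree selected by CV
```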

### Ensemble methods: Bagging, RF, Boosting 
An ensemble method is an approach that combines many simple "building block" models in order to obtain a single and potentially very powerful model.

For bagging, RF, and boosting, the building block is the "regression or classification tree". 

### Bagging 

Bootstrap Aggregation, or Bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

Averaging a set of observations reduces variance.

Bagging: we use the bootstrap to take repeated samples from the single training data set, so that we generate B different bootstrapped training datasets, and we then aggregate (build a separate prediction model on each training set and average the resulting predictions):
$$
\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat f_b(x) 
$$
Bagging can improve predictions.

1. Split Data: $D$ -> $D_T$, $D_V$
2. Bootstrap: from $D_T$ draw $B$ bootstrapped samples $B_1,B_2,B_3,...B_B$
3. Train: $B_1 -> \hat f_1$,...,$B_B -> \hat f_B$
4. Predict (CV/OOB): $\hat f_1 -> \hat y_1$,...,$\hat f_B -> \hat y_B$
5. Aggregate: $\hat y = \frac{1}{B}\sum_b^B\hat y_b$ or $\hat y =\text{Majority Vote}$

Bagging = Bootstrapping + Aggregation 

ADV: 
- averaging reduces variance 
- the test error rate is lower than the test error rate of a single tree
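
A sketch of bagging via the `randomForest` package with `mtry` set to the number of predictors (an assumption; `Boston` is assumed to come from `ISLR2`). The OOB error discussed next is reported directly by the fit.

```r
# Bagging = a random forest with mtry = p (all predictors tried at every split).
library(randomForest)
library(ISLR2)   # assumed source of the Boston data

set.seed(1)
p   <- ncol(Boston) - 1                 # number of predictors
bag <- randomForest(medv ~ ., data = Boston, mtry = p, ntree = 500)

bag                                     # printout includes the OOB error estimate
bag$mse[bag$ntree]                      # OOB MSE using all 500 trees
```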

#### OOB-Error
It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform CV.

Each bagged tree $B_b$ makes use of around 2/3 of the observations. The remaining 1/3 of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.

So we can use the OOB observations for testing; when B is sufficiently large, the OOB error is virtually equivalent to the LOOCV error.

OOB error estimation is attractive because it uses observations that the model has not seen to test the model.

#### Variable Importance 

Bagging: 
- ADV: improves accuracy over a single tree
- DIS: difficult to interpret 

VarIMP:
- On regression: for each tree, we record the amount of RSS decrease due to splits over a given predictor, then we take the average over all B trees.
- On classification: for each tree, we record the amount of impurity decrease due to splits over a given predictor, then we take the average over all B trees.

### Random Forest

Random Forest provides an improvement over bagged trees by way of a small tweak that decorrelates the trees.

In RF, a fresh random sample of m predictors is taken at each split; typically $m=\sqrt{p}$.
RF reduces the effect of very strong predictors by randomly picking the input predictors considered at each split.   

If there is a strong predictor, then almost all B trees will use this predictor to split the tree, and the resulting trees are highly correlated.

Averaging many highly correlated quantities does not lead to as large a reduction in variance.

RF: adds a step to decorrelate the trees.
RF: reduction in both test error and OOB error over bagging. 
Bagging, RF: will not overfit if we increase B, the number of trees.

1. Split Data: $D$ -> $D_T$, $D_V$
2. Bootstrap: from $D_T$ draw $B$ bootstrapped samples $B_1,B_2,B_3,...B_B$
3. Decorrelate X: at each split, randomly pick $m$ of the $p$ predictors $\{X_1,X_2,...X_p\}$ as split candidates.
4. Train: $B_1 -> \hat f_1$,...,$B_B -> \hat f_B$
5. Predict (CV/OOB): $\hat f_1 -> \hat y_1$,...,$\hat f_B -> \hat y_B$
6. Aggregate: $\hat y = \frac{1}{B}\sum_b^B\hat y_b$ or $\hat y =\text{Majority Vote}$
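
A sketch of these steps with `randomForest` (an assumption, as above; `mtry` is set explicitly to about $\sqrt{p}$ rather than relying on the package default):

```r
# Random forest: only m ~ sqrt(p) randomly chosen predictors are split candidates,
# which decorrelates the trees.
library(randomForest)
library(ISLR2)   # assumed source of the Boston data

set.seed(1)
p  <- ncol(Boston) - 1
rf <- randomForest(medv ~ ., data = Boston,
                   mtry = floor(sqrt(p)), ntree = 500, importance = TRUE)

rf                 # OOB error; typically lower than the bagging fit above
importance(rf)     # variable importance (permutation MSE and node-purity decrease)
varImpPlot(rf)
```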

### Boosting 

Yet another approach for improving the predictions resulting from a decision tree.

Recall that bagging involves creating multiple copies of the original training
data set using the bootstrap, fitting a separate decision tree to each
copy, and then combining all of the trees in order to create a single predictive
model.

In boosting, trees are grown sequentially: each tree is grown using information
from previously grown trees. Boosting does not involve bootstrap sampling;
instead, each tree is fit on a modified version of the original data set.

Boosting:
- learns slowly, with stumps (weak learners)
- fits a tree to the current residuals rather than to the outcome Y; we then add this new decision tree to the fitted function and update the residuals.    
$$
\hat f(x) \leftarrow \hat f(x) + \lambda \hat f_b(x) \\
r_i \leftarrow r_i - \lambda \hat f_b (x_i) \\
\hat f(x)=\sum_{b=1}^B \lambda \hat f_b(x)
$$
1. Unlike bagging and random forests, boosting can overfit if B is too large.
2. Shrinkage controls the learning rate ($\lambda$).
3. Often each tree is a stump (a single split involving only one variable).

Using stumps leads to an additive model.
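
A sketch of boosting with the `gbm` package (an assumption; `Boston` is assumed to come from `ISLR2`). `interaction.depth = 1` gives stumps, i.e. an additive model, and `shrinkage` is $\lambda$.

```r
# Boosted regression trees: slow learning with stumps and shrinkage lambda.
library(gbm)
library(ISLR2)   # assumed source of the Boston data

set.seed(1)
boost <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
             n.trees = 5000,            # B; too large a B can overfit
             interaction.depth = 1,     # stumps -> an additive model
             shrinkage = 0.01)          # lambda, the learning rate

summary(boost)                          # relative influence of each predictor
pred <- predict(boost, newdata = Boston, n.trees = 5000)
mean((pred - Boston$medv)^2)            # training MSE (not a test error estimate)
```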


