FE581 – 0522 - Logistic Regression and Clustering Methods

4.3 Logistic Regression

In the Default data set:

| x1 - student | x2 - balance | x3 - income | Y |
| --- | --- | --- | --- |
| yes / no | num | num | yes / no |

Rather than modeling this response Y directly, logistic regression models the probability that Y belongs to a particular category.

Because we want to estimate the probability that Y belongs to a particular category, the result of our estimation must fall into the range [0, 1]. Ordinary linear regression cannot guarantee this: the fitted line can take any value, including values less than zero and greater than one.

That is why we need logistic regression, which for the Default data can be written as:

$$\Pr(\text{default} = \text{yes} \mid \text{balance}) = p(\text{balance})$$

$p(\text{balance})$ must take a value between 0 and 1 (since it is a probability).

4.3.1 The Logistic Model

Question: How should we model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$?

If we use linear regression to represent this, then it will be:

(4.1)p(X)=β0+β1X But it will introduce new problems, as we said earlier, the result of (4.1) can be negative, can be bigger than 1, so it dose not fit our target of predicting 'probability'. so we must find some other way to model it so that the result just sits between 0 and 1.

In logistic regression, the logistic function (also called the sigmoid function) is used:

$$p(X) = \frac{e^{f(X)}}{1 + e^{f(X)}}$$

In our case, $f(X) = \beta_0 + \beta_1 X$.

So we can rewrite the logistic function as:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{4.2}$$

For example:

*(figure: the logistic function curve)*

Then we can see that no matter what the values of $\beta_0$, $\beta_1$, and $x$ are, the value of the function always lies between 0 and 1 (perfect for a probability).
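A quick numerical check of this claim, as a minimal sketch (the coefficient values below are arbitrary, not fitted from any data):

```python
import math

def logistic(x, b0=0.5, b1=1.2):
    # p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); b0 and b1 are arbitrary here
    f = b0 + b1 * x
    return math.exp(f) / (1 + math.exp(f))

for x in (-50, -5, 0, 5, 50):
    print(x, logistic(x))  # every output lies strictly between 0 and 1
```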

Now that we have a function whose output satisfies our requirement (a range between 0 and 1), our next job is to find the values of $\beta_0$ and $\beta_1$. Since we never have the overall population data, the true values of $\beta_0$ and $\beta_1$ remain unknown, so we use statistical estimation to find $\hat\beta_0$ and $\hat\beta_1$, the best estimates of $\beta_0$ and $\beta_1$.

We can rewrite the function in (4.2) in a slightly different way:

$$
\begin{aligned}
p(X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \\
p(X)\,(1 + e^{\beta_0 + \beta_1 X}) &= e^{\beta_0 + \beta_1 X} \\
p(X) + p(X)\,e^{\beta_0 + \beta_1 X} &= e^{\beta_0 + \beta_1 X} \\
p(X) &= e^{\beta_0 + \beta_1 X} - p(X)\,e^{\beta_0 + \beta_1 X} \\
p(X) &= (1 - p(X))\,e^{\beta_0 + \beta_1 X} \\
\frac{p(X)}{1 - p(X)} &= e^{\beta_0 + \beta_1 X}
\end{aligned}
$$

So we arrive at:

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \tag{4.3}$$

The LHS, $\frac{p(X)}{1 - p(X)}$, is called the odds, and it can take values in $(0, \infty)$.

Odds express a success-to-failure ratio: for example, one success out of five means $p = 1/5 = 0.2$, which corresponds to odds of $0.2 / (1 - 0.2) = 1/4$ (1 success to 4 failures).

By taking the logarithm of (4.3), we obtain:

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \tag{4.4}$$

The LHS is called the logit or log-odds. Note that while the logit is linear in $X$, the relationship between $X$ and $p(X)$ itself is not linear.

4.3.2 Estimating the Regression Coefficients

As discussed earlier, we need to find $\hat\beta_0$ and $\hat\beta_1$ such that they are the closest (or best) estimators of $\beta_0$ and $\beta_1$.

The process of finding $\hat\beta_0$ and $\hat\beta_1$ involves a statistical method called maximum likelihood, based on the likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right)$$

From the formula we can see what maximum likelihood actually does: it chooses the coefficients that maximize the probability of the observed binary outcomes, i.e. it makes $p(x_i)$ large for observations with $y_i = 1$ and $1 - p(x_i)$ large for observations with $y_i = 0$.
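To make this concrete, here is a minimal sketch that maximizes the log of $\ell(\beta_0, \beta_1)$ numerically with SciPy; the five $(x, y)$ pairs are made-up toy data, not the Default data:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: x could be balance in $1000s, y the observed default (1 = yes)
x = np.array([0.5, 0.8, 1.2, 1.5, 2.0])
y = np.array([0, 1, 0, 1, 1])

def neg_log_likelihood(beta):
    b0, b1 = beta
    f = b0 + b1 * x
    # -log l(b0, b1) = sum_i [ log(1 + e^{f_i}) - y_i * f_i ]
    # (a numerically stable rewrite of the product formula above)
    return np.sum(np.logaddexp(0.0, f) - y * f)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)  # the maximum likelihood estimates (b0_hat, b1_hat)
```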

In practice, using R, Python, or even Excel (not really a programming language, but it works), we can obtain the estimated coefficients in (4.2) and (4.3) directly.
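For instance, a minimal Python sketch with statsmodels, assuming the Default data sits in a hypothetical `Default.csv` with columns `default` and `balance`:

```python
import pandas as pd
import statsmodels.api as sm

default = pd.read_csv("Default.csv")            # hypothetical file name
y = (default["default"] == "Yes").astype(int)   # code yes/no as 1/0
X = sm.add_constant(default["balance"])         # add the intercept column

model = sm.Logit(y, X).fit()
print(model.summary())  # coefficients, std. errors, z-statistics, p-values
```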

And statistically we will get the resulting table:

| | Coefficient | Std. error | Z-statistic | p-value |
| --- | --- | --- | --- | --- |
| Intercept | -10.6513 | 0.3612 | -29.5 | 0.000230045 |
| balance | 0.0055 | 0.0002 | 24.9 | 0.00001452 |

which gives the following logistic regression model:

$$f(X) = \hat\beta_0 + \hat\beta_1 x_1 = -10.65 + 0.0055\,x_1$$

$$p(X) = \frac{e^{f(X)}}{1 + e^{f(X)}} = \frac{e^{-10.65 + 0.0055 x_1}}{1 + e^{-10.65 + 0.0055 x_1}}$$

Or we can also get the function for the odds:

$$\frac{p(X)}{1 - p(X)} = e^{-10.65 + 0.0055 x_1}$$

4.3.3 Making Predictions

It is as easy as plugging values back into our function. If $x$ is a quantitative (numerical) variable, we simply substitute $x$ into formula (4.2) or (4.3). If $x$ is a qualitative (categorical) variable, we use a dummy variable that takes the value 1 or 0, then substitute it into the formula.
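For example, a quick sketch of the prediction at balance = 1000, using the fitted coefficients from the table above:

```python
import math

b0, b1 = -10.6513, 0.0055
balance = 1000
f = b0 + b1 * balance                # log-odds: -10.6513 + 5.5 = -5.1513
p = math.exp(f) / (1 + math.exp(f))  # back-transform to a probability
print(round(p, 4))                   # ~0.0058, i.e. well under a 1% chance
```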

4.3.4 Multiple Logistic Regression

With multiple predictors $x_1, \dots, x_p$, the formulas change slightly:

$$f(X) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

$$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}$$

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
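A sketch of the multiple-predictor fit, again with statsmodels and the hypothetical `Default.csv`; note how the categorical student column becomes a 0/1 dummy variable:

```python
import pandas as pd
import statsmodels.api as sm

default = pd.read_csv("Default.csv")  # hypothetical file name
y = (default["default"] == "Yes").astype(int)
X = pd.DataFrame({
    "student": (default["student"] == "Yes").astype(int),  # dummy variable
    "balance": default["balance"],
    "income":  default["income"],
})
X = sm.add_constant(X)  # intercept beta_0

model = sm.Logit(y, X).fit()
print(model.params)  # estimates of beta_0, beta_1, ..., beta_p
```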

 

10.3 Clustering Methods

Clustering is an unsupervised learning method, which aims to find relationships between either:

  1. observations in different rows (grouping)

  2. variables in different columns (reasoning)

Simply put, it is a method for grouping similar things together.

So one of the most important things to consider in clustering methods is how we define whether observations are similar.

Clustering methods:

  1. K-means clustering (fix K classes and form groups by distance)

  2. Hierarchical clustering (build a dendrogram and cut it)

10.3.1 K-Means Clustering


Algorithm:

We have $n$ data observations $x_1, x_2, \dots, x_n$, where each $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ is a vector of $p$ feature values.

  1. Randomly assign each observation to one of the $K$ clusters $K_1, \dots, K_K$ (equivalently, to a cluster center $C_k$).

  2. Compute the centroid (mean) of each cluster: $\mu_1, \mu_2, \dots, \mu_K$.

  3. Calculate the Euclidean distance between each cluster centroid $\mu_k$ and each observation $x_i$, written $d(x_i, \mu_k)$.

  4. Reassign each observation $x_i$ to the cluster $K_k$ whose centroid has the minimum distance $d(x_i, \mu_k)$.

  5. Repeat steps 2–4 until the cluster assignments stop changing, i.e. until the total within-cluster variation, the SSE (Sum of Squared Errors: the sum of squared Euclidean distances from each point to its centroid), stops decreasing; see the sketch below.
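A minimal NumPy sketch of the loop above (the toy 2-D data, K, and the seed are arbitrary, and the empty-cluster edge case is ignored):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two blobs of 20 points each in 2-D
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
K = 2

labels = rng.integers(0, K, len(X))  # step 1: random initial assignment

while True:
    # step 2: centroid (mean) of each cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # step 3: Euclidean distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # step 4: reassign each point to its nearest centroid
    new_labels = dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):  # step 5: stop when stable
        break
    labels = new_labels

sse = np.sum((X - centroids[labels]) ** 2)  # total within-cluster SSE
print(labels, sse)
```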

 

Calculation Example:

*(worked-example figures omitted)*

10.3.2 Hierarchical Clustering

One potential disadvantage of K-means is that we need to define the number of clusters K in advance.

The main tasks in hierarchical clustering include:

  1. How to draw the dendrogram

  2. How to cut the dendrogram (where to cut)

  3. How to evaluate the model

To build a dendrogram (bottom-up approach):

  1. Calculate the dissimilarities between each pair of observations (to build up from the bottom), e.g. using:

    1. Manhattan distance: $|x_1 - x_2| + |y_1 - y_2|$

    2. Squared Euclidean distance: $d^2 = (x_1 - x_2)^2 + (y_1 - y_2)^2$

    3. Euclidean distance: $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

  2. Pick the minimum distance and fuse.

  3. Calculate the linkage between groups of observations. (to build up further branches)

    1. complete

    2. average

    3. single

    4. centroid

  4. Pick the minimum linkage and fuse.

  5. Repeat steps 3–4 until we reach the root of the dendrogram (all observations fused into a single cluster).
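As a sketch, SciPy can carry out this whole bottom-up procedure and the cut; the toy data and the choice of two clusters are arbitrary, complete linkage is explained below, and drawing the dendrogram requires matplotlib:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])

# Fuse pairs/groups bottom-up using complete linkage on Euclidean distances
Z = linkage(X, method="complete", metric="euclidean")

labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the tree into 2 clusters
print(labels)

dendrogram(Z)  # draws the tree (needs matplotlib)
```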

When calculating the linkage there are four different methods, as listed above, each with its own calculation. Suppose at some point we end up with two clusters, $\{x_1, x_2\}$ and $\{x_4, x_5\}$:

  1. Single Linkage: Find the minimum distance between any point in the first set and any point in the second set. That is, find the minimum among the distances $d(x_1, x_4)$, $d(x_1, x_5)$, $d(x_2, x_4)$, and $d(x_2, x_5)$. This minimum distance is the single linkage distance.

  2. Complete Linkage: Find the maximum distance between any point in the first set and any point in the second set. That is, find the maximum among the distances $d(x_1, x_4)$, $d(x_1, x_5)$, $d(x_2, x_4)$, and $d(x_2, x_5)$. This maximum distance is the complete linkage distance.

  3. Average Linkage: Calculate the average of all the distances between points in the first set and points in the second set. That is, calculate the average of the distances $d(x_1, x_4)$, $d(x_1, x_5)$, $d(x_2, x_4)$, and $d(x_2, x_5)$. This average distance is the average linkage distance.

  4. Centroid Linkage: First calculate the centroids of the two sets. The centroid of a set is the point whose coordinates are the averages of the coordinates of all the points in the set. Let $c_1$ be the centroid of $\{x_1, x_2\}$ and $c_2$ be the centroid of $\{x_4, x_5\}$. Then calculate the distance between these two centroids, $d(c_1, c_2)$. This distance is the centroid linkage distance.
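A small sketch computing all four linkage distances between $\{x_1, x_2\}$ and $\{x_4, x_5\}$ directly from these definitions (the coordinates are made up):

```python
import numpy as np

set_a = np.array([[1.0, 2.0], [2.0, 2.0]])  # x1, x2
set_b = np.array([[6.0, 5.0], [7.0, 6.0]])  # x4, x5

# All pairwise Euclidean distances d(xi, xj) between the two sets
pairwise = np.linalg.norm(set_a[:, None, :] - set_b[None, :, :], axis=2)

single   = pairwise.min()   # minimum pairwise distance
complete = pairwise.max()   # maximum pairwise distance
average  = pairwise.mean()  # mean of all four pairwise distances
centroid = np.linalg.norm(set_a.mean(axis=0) - set_b.mean(axis=0))

print(single, complete, average, centroid)
```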