FE581 – 0522 - Logistic Regression and Clustering Methods

4.3 Logistic Regression

In the Default data set:

| x1 - student | x2 - balance | x3 - income | Y |
| --- | --- | --- | --- |
| yes / no | num | num | yes / no |

Rather than modeling this response Y directly, logistic regression models the probability that Y belongs to a particular category.

Because we want to estimate the probability that Y belongs to a particular category, the result of our estimation must fall into the range [0, 1]. Ordinary linear regression cannot guarantee this: the fitted line can take any value, including values less than zero and greater than one.

That is why we need logistic regression, which for the Default data can be written as:

$$\Pr(\text{default} = \text{yes} \mid \text{balance}) = p(\text{balance})$$

$p(\text{balance})$ must take a value between 0 and 1 (since it is a probability).

4.3.1 The Logistic Model

Question: How should we model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$?

If we use linear regression to represent this, then it will be:

(4.1)p(X)=β0+β1X But it will introduce new problems, as we said earlier, the result of (4.1) can be negative, can be bigger than 1, so it dose not fit our target of predicting 'probability'. so we must find some other way to model it so that the result just sits between 0 and 1.

In logistic regression, the logistic function (also called the sigmoid function) is used:

$$p(X) = \frac{e^{f(X)}}{1 + e^{f(X)}}$$

In our case, $f(X) = \beta_0 + \beta_1 X$.

So we can rewrite the logistic function as:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{4.2}$$

For example:

*(figure: the logistic function curve)*

Then we can see that no matter what the values of $\beta_0$, $\beta_1$, and $x$ are, the value of the function always lies between 0 and 1 (perfect for a probability).
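A quick numerical check of this claim, as a minimal sketch (the coefficient values below are arbitrary, not fitted from any data):

```python
import math

def logistic(x, b0=0.5, b1=1.2):
    # p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); b0 and b1 are arbitrary here
    f = b0 + b1 * x
    return math.exp(f) / (1 + math.exp(f))

for x in (-50, -5, 0, 5, 50):
    print(x, logistic(x))  # every output lies strictly between 0 and 1
```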

Now that we have a function whose output satisfies our requirement (a range between 0 and 1), our next job is to find the values of $\beta_0$ and $\beta_1$. Since we never have the overall population data, the true values of $\beta_0$ and $\beta_1$ remain unknown, so we use statistical estimation to find $\hat\beta_0$ and $\hat\beta_1$, the best estimates of $\beta_0$ and $\beta_1$.

We can rewrite the function in (4.2) in a slightly different way:

$$
\begin{aligned}
p(X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \\
p(X)\,(1 + e^{\beta_0 + \beta_1 X}) &= e^{\beta_0 + \beta_1 X} \\
p(X) + p(X)\,e^{\beta_0 + \beta_1 X} &= e^{\beta_0 + \beta_1 X} \\
p(X) &= e^{\beta_0 + \beta_1 X} - p(X)\,e^{\beta_0 + \beta_1 X} \\
p(X) &= (1 - p(X))\,e^{\beta_0 + \beta_1 X} \\
\frac{p(X)}{1 - p(X)} &= e^{\beta_0 + \beta_1 X}
\end{aligned}
$$

So we arrive at:

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \tag{4.3}$$

The LHS, $\frac{p(X)}{1 - p(X)}$, is called the odds, and it can take values in $(0, \infty)$.

Odds express a success-to-failure ratio: for example, one success out of five means $p = 1/5 = 0.2$, which corresponds to odds of $0.2 / (1 - 0.2) = 1/4$ (1 success to 4 failures).

By taking the logarithm of (4.3), we obtain:

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \tag{4.4}$$

The LHS is called the logit or log-odds. Note that while the logit is linear in $X$, the relationship between $X$ and $p(X)$ itself is not linear.

4.3.2 Estimating the Regression Coefficients

As discussed earlier, we need to find $\hat\beta_0$ and $\hat\beta_1$ such that they are the closest (or best) estimators of $\beta_0$ and $\beta_1$.

The process of finding $\hat\beta_0$ and $\hat\beta_1$ involves a statistical method called maximum likelihood, based on the likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right)$$

From the formula we can see what maximum likelihood actually does: it chooses the coefficients that maximize the probability of the observed binary outcomes, i.e. it makes $p(x_i)$ large for observations with $y_i = 1$ and $1 - p(x_i)$ large for observations with $y_i = 0$.
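To make this concrete, here is a minimal sketch that maximizes the log of $\ell(\beta_0, \beta_1)$ numerically with SciPy; the five $(x, y)$ pairs are made-up toy data, not the Default data:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: x could be balance in $1000s, y the observed default (1 = yes)
x = np.array([0.5, 0.8, 1.2, 1.5, 2.0])
y = np.array([0, 1, 0, 1, 1])

def neg_log_likelihood(beta):
    b0, b1 = beta
    f = b0 + b1 * x
    # -log l(b0, b1) = sum_i [ log(1 + e^{f_i}) - y_i * f_i ]
    # (a numerically stable rewrite of the product formula above)
    return np.sum(np.logaddexp(0.0, f) - y * f)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)  # the maximum likelihood estimates (b0_hat, b1_hat)
```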

In practice, using R, Python, or even Excel (not really a programming language, but it works), we can obtain the estimated coefficients in (4.2) and (4.3) directly.
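For instance, a minimal Python sketch with statsmodels, assuming the Default data sits in a hypothetical `Default.csv` with columns `default` and `balance`:

```python
import pandas as pd
import statsmodels.api as sm

default = pd.read_csv("Default.csv")            # hypothetical file name
y = (default["default"] == "Yes").astype(int)   # code yes/no as 1/0
X = sm.add_constant(default["balance"])         # add the intercept column

model = sm.Logit(y, X).fit()
print(model.summary())  # coefficients, std. errors, z-statistics, p-values
```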

And statistically we will get the resulting table:

| | Coefficient | Std. error | Z-statistic | p-value |
| --- | --- | --- | --- | --- |
| Intercept | -10.6513 | 0.3612 | -29.5 | 0.000230045 |
| balance | 0.0055 | 0.0002 | 24.9 | 0.00001452 |

which gives the following logistic regression model:

$$f(X) = \hat\beta_0 + \hat\beta_1 x_1 = -10.65 + 0.0055\,x_1$$

$$p(X) = \frac{e^{f(X)}}{1 + e^{f(X)}} = \frac{e^{-10.65 + 0.0055 x_1}}{1 + e^{-10.65 + 0.0055 x_1}}$$

Or we can also get the function for the odds:

$$\frac{p(X)}{1 - p(X)} = e^{-10.65 + 0.0055 x_1}$$

4.3.3 Making Predictions

It is as easy as plugging values back into our function. If $x$ is a quantitative (numerical) variable, we simply substitute $x$ into formula (4.2) or (4.3). If $x$ is a qualitative (categorical) variable, we use a dummy variable that takes the value 1 or 0, then substitute it into the formula.
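For example, a quick sketch of the prediction at balance = 1000, using the fitted coefficients from the table above:

```python
import math

b0, b1 = -10.6513, 0.0055
balance = 1000
f = b0 + b1 * balance                # log-odds: -10.6513 + 5.5 = -5.1513
p = math.exp(f) / (1 + math.exp(f))  # back-transform to a probability
print(round(p, 4))                   # ~0.0058, i.e. well under a 1% chance
```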

4.3.4 Multiple Logistic Regression

With multiple predictors $x_1, \dots, x_p$, the formulas change slightly:

$$f(X) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

$$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}$$

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
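A sketch of the multiple-predictor fit, again with statsmodels and the hypothetical `Default.csv`; note how the categorical student column becomes a 0/1 dummy variable:

```python
import pandas as pd
import statsmodels.api as sm

default = pd.read_csv("Default.csv")  # hypothetical file name
y = (default["default"] == "Yes").astype(int)
X = pd.DataFrame({
    "student": (default["student"] == "Yes").astype(int),  # dummy variable
    "balance": default["balance"],
    "income":  default["income"],
})
X = sm.add_constant(X)  # intercept beta_0

model = sm.Logit(y, X).fit()
print(model.params)  # estimates of beta_0, beta_1, ..., beta_p
```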

 

10.3 Clustering Methods

Clustering is an unsupervised learning method, which aims to find relationships between either:

  1. observations in different rows (grouping)

  2. variables in different columns (reasoning)

Simply put, it is a method for grouping similar things together.

So one of the most important things to consider in clustering methods is how we define whether observations are similar.

Clustering methods:

  1. K-means clustering (fix K classes and form groups by distance)

  2. Hierarchical clustering (build a dendrogram and cut it)

10.3.1 K-Means Clustering


Algorithm:

We have $n$ data observations $x_1, x_2, \dots, x_n$, where each $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ is a vector of $p$ feature values.

  1. Randomly assign each observation to one of the $K$ clusters $K_1, \dots, K_K$ (equivalently, to a cluster center $C_k$).

  2. Compute the centroid (mean) of each cluster: $\mu_1, \mu_2, \dots, \mu_K$.

  3. Calculate the Euclidean distance between each cluster centroid $\mu_k$ and each observation $x_i$, written $d(x_i, \mu_k)$.

  4. Reassign each observation $x_i$ to the cluster $K_k$ whose centroid has the minimum distance $d(x_i, \mu_k)$.

  5. Repeat steps 2–4 until the cluster assignments stop changing, i.e. until the total within-cluster variation, the SSE (Sum of Squared Errors: the sum of squared Euclidean distances from each point to its centroid), stops decreasing; see the sketch below.
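A minimal NumPy sketch of the loop above (the toy 2-D data, K, and the seed are arbitrary, and the empty-cluster edge case is ignored):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two blobs of 20 points each in 2-D
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
K = 2

labels = rng.integers(0, K, len(X))  # step 1: random initial assignment

while True:
    # step 2: centroid (mean) of each cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # step 3: Euclidean distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # step 4: reassign each point to its nearest centroid
    new_labels = dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):  # step 5: stop when stable
        break
    labels = new_labels

sse = np.sum((X - centroids[labels]) ** 2)  # total within-cluster SSE
print(labels, sse)
```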

 

Calculation Example:

*(worked-example figures omitted)*

10.3.2 Hierarchical Clustering

One potential disadvantage of K-means is that we need to define the number of clusters K in advance.

The main tasks in hierarchical clustering include:

  1. How to draw the dendrogram

  2. How to cut the dendrogram (where to cut)

  3. How to evaluate the model

To build a dendrogram (bottom-up approach):

  1. Calculate the dissimilarities between each pair of observations (to build up from the bottom), e.g. using:

    1. Manhattan distance: $|x_1 - x_2| + |y_1 - y_2|$

    2. Squared Euclidean distance: $d^2 = (x_1 - x_2)^2 + (y_1 - y_2)^2$

    3. Euclidean distance: $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

  2. Pick the minimum distance and fuse.

  3. Calculate the linkage between groups of observations. (to build up further branches)

    1. complete

    2. average

    3. single

    4. centroid

  4. Pick the minimum linkage and fuse.

  5. Repeat steps 3–4 until we reach the root of the dendrogram (all observations fused into a single cluster).
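As a sketch, SciPy can carry out this whole bottom-up procedure and the cut; the toy data and the choice of two clusters are arbitrary, complete linkage is explained below, and drawing the dendrogram requires matplotlib:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])

# Fuse pairs/groups bottom-up using complete linkage on Euclidean distances
Z = linkage(X, method="complete", metric="euclidean")

labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the tree into 2 clusters
print(labels)

dendrogram(Z)  # draws the tree (needs matplotlib)
```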

When calculating the linkage there are four different methods, as listed above, each with its own calculation. Suppose at some point we end up with two clusters, $\{x_1, x_2\}$ and $\{x_4, x_5\}$:

  1. Single Linkage: Find the minimum distance between any point in the first set and any point in the second set. That is, find the minimum among the distances $d(x_1, x_4)$, $d(x_1, x_5)$, $d(x_2, x_4)$, and $d(x_2, x_5)$. This minimum distance is the single linkage distance.

  2. Complete Linkage: Find the maximum distance between any point in the first set and any point in the second set. That is, find the maximum among the distances $d(x_1, x_4)$, $d(x_1, x_5)$, $d(x_2, x_4)$, and $d(x_2, x_5)$. This maximum distance is the complete linkage distance.

  3. Average Linkage: Calculate the average of all the distances between points in the first set and points in the second set. That is, calculate the average of the distances $d(x_1, x_4)$, $d(x_1, x_5)$, $d(x_2, x_4)$, and $d(x_2, x_5)$. This average distance is the average linkage distance.

  4. Centroid Linkage: First calculate the centroids of the two sets. The centroid of a set is the point whose coordinates are the averages of the coordinates of all the points in the set. Let $c_1$ be the centroid of $\{x_1, x_2\}$ and $c_2$ be the centroid of $\{x_4, x_5\}$. Then calculate the distance between these two centroids, $d(c_1, c_2)$. This distance is the centroid linkage distance.
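A small sketch computing all four linkage distances between $\{x_1, x_2\}$ and $\{x_4, x_5\}$ directly from these definitions (the coordinates are made up):

```python
import numpy as np

set_a = np.array([[1.0, 2.0], [2.0, 2.0]])  # x1, x2
set_b = np.array([[6.0, 5.0], [7.0, 6.0]])  # x4, x5

# All pairwise Euclidean distances d(xi, xj) between the two sets
pairwise = np.linalg.norm(set_a[:, None, :] - set_b[None, :, :], axis=2)

single   = pairwise.min()   # minimum pairwise distance
complete = pairwise.max()   # maximum pairwise distance
average  = pairwise.mean()  # mean of all four pairwise distances
centroid = np.linalg.norm(set_a.mean(axis=0) - set_b.mean(axis=0))

print(single, complete, average, centroid)
```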