0410 KNN

library(class)
df <- iris
summary(df)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Scaling

str(df)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# df.scaled <- scale(df)   # fails: scale() needs numeric columns, and Species is a factor

We need to exclude Species because it is not numeric.

df.scaled <- scale(df[,-5])

This code standardizes every column of df except the fifth, Species, which is the class label (the response variable). Each part does the following:

  1. df[,-5]: Selects all columns of df except the fifth, leaving the four numeric measurements.

  2. scale(): Standardizes each column by subtracting the column mean and dividing by the column standard deviation, so every column ends up with mean zero and unit variance.

  3. df.scaled: Stores the result. Note that scale() returns a numeric matrix, not a data frame.

The resulting df.scaled has the same 150 rows as df but only the four numeric columns; Species is not part of it and is kept separately as the class label. Standardizing matters for KNN in particular because the algorithm is distance-based: without it, a variable measured on a larger scale would dominate the distance calculation.
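As a rough check of the description above, the sketch below standardizes the same four columns by hand and compares the result with scale(). The names x, col.means, col.sds, manual, and same are illustrative and do not appear in the original analysis.

```r
# Standardize the four numeric iris columns by hand and compare with scale().
x <- as.matrix(iris[, -5])

col.means <- colMeans(x)        # per-column mean
col.sds <- apply(x, 2, sd)      # per-column sample standard deviation

# sweep() applies a per-column statistic across the whole matrix.
manual <- sweep(x, 2, col.means, "-")
manual <- sweep(manual, 2, col.sds, "/")

# scale() attaches its centering and scaling values as attributes,
# so ignore attributes when comparing the numbers.
same <- isTRUE(all.equal(manual, scale(x), check.attributes = FALSE))
print(same)
```

If same is TRUE, scale() with its default arguments is doing exactly the subtract-the-mean, divide-by-the-standard-deviation step described in the list.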

summary(df.scaled)
##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064
apply(df.scaled, 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##            1            1            1            1

The call apply(df.scaled, 2, sd) above applies sd() to each column of df.scaled and returns a named vector of standard deviations. The second argument, 2, tells apply() to operate over columns; 1 would operate over rows.

The vector has one element per column. Because scale() divided each column by its own standard deviation, every value is 1 (up to floating-point rounding).
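The same kind of check can be extended to the column means, and the values scale() used can be read back from the matrix it returns. The sketch below is only illustrative: scaled, centers, sds, and restored are names introduced here, and the tolerance of 1e-12 is an arbitrary choice.

```r
# scale() records the means and standard deviations it used as attributes.
scaled <- scale(iris[, -5])

centers <- attr(scaled, "scaled:center")   # original column means
sds <- attr(scaled, "scaled:scale")        # original column standard deviations

# After standardizing, the means are zero up to floating-point rounding.
means.ok <- all(abs(colMeans(scaled)) < 1e-12)

# Undo the standardization: multiply by the sd, then add the mean back.
restored <- sweep(sweep(scaled, 2, sds, "*"), 2, centers, "+")
round.trip.ok <- isTRUE(all.equal(restored, as.matrix(iris[, -5]),
                                  check.attributes = FALSE))

print(c(means.ok, round.trip.ok))
```

Keeping centers and sds is useful when new observations need to be put on the same scale as the data the model was built on.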

Partitioning (splitting)

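set.seed(100)                      # make the random split reproducible
sp <- sample(1:nrow(df), 100)      # row indices of the 100 training observations
df.train <- df.scaled[sp,]         # scaled predictors, training rows
df.test <- df.scaled[-sp,]         # scaled predictors, remaining 50 test rows
df.train.y <- df$Species[sp]       # class labels for the training rows
df.test.y <- df$Species[-sp]       # class labels for the test rows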
dim(df.train)
## [1] 100   4
dim(df.test)
## [1] 50  4
length(df.train.y)
## [1] 100
length(df.test.y)
## [1] 50
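Because the split is a simple random sample, the three species are not guaranteed to be equally represented in each part. A minimal sketch for inspecting the balance follows; train.counts and test.counts are names introduced here, and the exact counts depend on the R version's random number generator, so no output is shown.

```r
# Recreate the split and count how many rows of each species land in each part.
set.seed(100)
sp <- sample(1:nrow(iris), 100)

train.counts <- table(iris$Species[sp])    # species counts in the training set
test.counts <- table(iris$Species[-sp])    # species counts in the test set

print(train.counts)
print(test.counts)
```

Each species has 50 rows in iris, so the two counts for a species always add up to 50.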
set.seed(1)  # knn() breaks ties at random, so fix the seed for reproducibility
knn.prediction1 <-
  knn(
    train = df.train,
    test = df.test,
    cl = df.train.y,
    k = 1
  )
table(knn.prediction1, df.test.y)
##                df.test.y
## knn.prediction1 setosa versicolor virginica
##      setosa         16          0         0
##      versicolor      0         15         2
##      virginica       0          2        15
knn.prediction2 <-
  knn(
    train = df.train,
    test = df.test,
    cl = df.train.y,
    k = 3
  )
table(knn.prediction2, df.test.y)
##                df.test.y
## knn.prediction2 setosa versicolor virginica
##      setosa         16          0         0
##      versicolor      0         15         0
##      virginica       0          2        17
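The two confusion tables above can be summarized as a single accuracy figure: the correctly classified cases on the diagonal divided by the total. The sketch below re-enters the printed counts by hand so it runs on its own; classes, cm.k1, cm.k3, and accuracy() are names introduced here for illustration.

```r
# Confusion matrices copied from the printed tables
# (rows: predicted class, columns: actual class).
classes <- c("setosa", "versicolor", "virginica")

cm.k1 <- matrix(c(16,  0,  0,
                   0, 15,  2,
                   0,  2, 15),
                nrow = 3, byrow = TRUE,
                dimnames = list(predicted = classes, actual = classes))

cm.k3 <- matrix(c(16,  0,  0,
                   0, 15,  0,
                   0,  2, 17),
                nrow = 3, byrow = TRUE,
                dimnames = list(predicted = classes, actual = classes))

# Accuracy = correct predictions (the diagonal) / all predictions.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

print(accuracy(cm.k1))   # 46 / 50 = 0.92
print(accuracy(cm.k3))   # 48 / 50 = 0.96
```

On this split, k = 3 classifies 48 of the 50 test rows correctly against 46 for k = 1. With the prediction objects still in the workspace, mean(knn.prediction1 == df.test.y) should give the same figure without re-entering the counts.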