library(class)
df <- iris
summary(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(df)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# df.scaled <- scale(df)
We need to exclude Species, because it is a factor rather than a numeric column.
df.scaled <- scale(df[,-5])
The code above standardizes every column of df except the fifth, Species, which is the response variable. Here is what each part does:
df[,-5]: selects all columns of df except the fifth (Species, the response variable).
scale(): standardizes each column by subtracting the column mean and dividing by the column standard deviation, so that each column is centered at zero and has unit variance.
df.scaled: the name the standardized result is assigned to. Note that scale() returns a numeric matrix, not a data frame.
In other words, df.scaled has the same 150 rows as df but only the four numeric columns, each centered at zero and scaled to unit variance. The response variable is not part of df.scaled. Standardizing is a common preprocessing step in machine learning: it puts all variables on the same scale and prevents any one variable from dominating the distance calculation that k-nearest neighbors relies on simply because it is measured in larger units.
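To make the description of scale() concrete, the following minimal sketch standardizes the four numeric columns by hand and compares the result with scale(). The names x, manual, and same are illustrative and not part of the original code.

```r
# Illustrative names (x, manual, same); not part of the original analysis
x <- as.matrix(iris[, -5])

# Subtract each column mean, then divide by each column standard deviation
centered <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# Compare values only; scale() also attaches centering and scaling attributes
same <- all.equal(manual, scale(x), check.attributes = FALSE)
print(same)
```

The comparison ignores attributes because scale() stores the column means and standard deviations it used as "scaled:center" and "scaled:scale" on its result.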
summary(df.scaled)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :-1.86378 Min. :-2.4258 Min. :-1.5623 Min. :-1.4422
## 1st Qu.:-0.89767 1st Qu.:-0.5904 1st Qu.:-1.2225 1st Qu.:-1.1799
## Median :-0.05233 Median :-0.1315 Median : 0.3354 Median : 0.1321
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5567 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.0805 Max. : 1.7799 Max. : 1.7064
apply(df.scaled, 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
The apply(df.scaled, 2, sd) call above computes the standard deviation of each column (i.e., variable) of df.scaled and returns the results as a named vector. The second argument, 2, tells apply() to operate on columns; a value of 1 would operate on rows instead. The result has one element per column of df.scaled. Because scale() divided each column by its own standard deviation, every value equals 1 (up to floating-point rounding), as the output confirms.
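The role of the margin argument in apply() is easier to see on a small matrix. The sketch below uses a made-up 3 x 2 matrix m (not part of the original data) and applies sd() once per column and once per row.

```r
# Made-up example matrix: column 1 is 1, 2, 3 and column 2 is 10, 20, 30
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)

col.sd <- apply(m, 2, sd)  # margin 2: one value per column, here 1 and 10
row.sd <- apply(m, 1, sd)  # margin 1: one value per row

print(col.sd)
print(row.sd)
```

Margin 2 returns as many values as there are columns, and margin 1 as many as there are rows, which is why the call on df.scaled yields four standard deviations.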
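# Fix the random seed so the split is reproducible
set.seed(100)
# Draw 100 of the 150 row indices for training; the remaining 50 rows form the test set
sp <- sample(1:nrow(df), 100)
df.train <- df.scaled[sp,]
df.test <- df.scaled[-sp,]
# Class labels come from the unscaled df, indexed the same way
df.train.y <- df$Species[sp]
df.test.y <- df$Species[-sp]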
dim(df.train)
## [1] 100 4
dim(df.test)
## [1] 50 4
length(df.train.y)
## [1] 100
length(df.test.y)
## [1] 50
# knn() breaks ties at random, so fix the seed for reproducibility
set.seed(1)
knn.prediction1 <-
  knn(
    train = df.train,
    test = df.test,
    cl = df.train.y,
    k = 1
  )
table(knn.prediction1, df.test.y)
##                df.test.y
## knn.prediction1 setosa versicolor virginica
##      setosa         16          0         0
##      versicolor      0         15         2
##      virginica       0          2        15
knn.prediction2 <-
  knn(
    train = df.train,
    test = df.test,
    cl = df.train.y,
    k = 3
  )
table(knn.prediction2, df.test.y)
##                df.test.y
## knn.prediction2 setosa versicolor virginica
##      setosa         16          0         0
##      versicolor      0         15         0
##      virginica       0          2        17
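The two confusion tables can be condensed into an overall accuracy: the diagonal holds the correctly classified test flowers. The sketch below rebuilds both tables from the printed counts so that it runs on its own. The helper accuracy() and the objects classes, tab.k1, and tab.k3 are hypothetical names, not part of the original code.

```r
# Hypothetical helper: share of test observations that lie on the diagonal
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the printed tables (rows = predicted, columns = actual)
tab.k1 <- matrix(c(16,  0,  0,
                    0, 15,  2,
                    0,  2, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(predicted = classes, actual = classes))
tab.k3 <- matrix(c(16,  0,  0,
                    0, 15,  0,
                    0,  2, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(predicted = classes, actual = classes))

accuracy(tab.k1)  # 46 / 50 = 0.92
accuracy(tab.k3)  # 48 / 50 = 0.96
```

On this particular split, k = 3 misclassifies two test flowers and k = 1 misclassifies four. With the live objects, mean(knn.prediction1 == df.test.y) gives the same accuracy without building a table first.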