LOST-STATS/lost-stats.github.io

Add to KNN page - R walkthrough

RosalieSherry opened this issue · 2 comments

R

The simplest way to perform KNN in R is with the package class. Its knn() function is rather user friendly and does not require you to compute distances yourself, since it runs everything with Euclidean distance. For more advanced kinds of nearest-neighbor matching it would be best to use the matchit() function from the MatchIt package. To verify results, this example also uses the confusionMatrix() function from the package caret.
Because of how class is designed, the easiest place to go wrong is normalization: only normalize the numeric features, not character variables or others that do not require it. Another common source of error is leaving out drop = TRUE when extracting your target, or y, vector, which will prevent the model from running. Finally, given how this example verifies results, it is vital to convert the target into a factor, since confusionMatrix() needs both of its inputs to be of the same type before R will give you an output.
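
As a quick illustration of the drop = TRUE and factor points, here is a minimal sketch on a made-up tibble (read_csv() below also returns a tibble; the toy object and its columns are hypothetical):

library(tibble)

#Hypothetical stand-in for the real data
toy <- tibble(diagnosis = c("B", "M", "B"), radius = c(1.2, 3.4, 2.1))

#Without drop = TRUE, a single-column subset is still a tibble, which knn() will reject for its cl argument
class(toy[1:3, 1])

#With drop = TRUE you get the plain vector that knn() expects
class(toy[1:3, 1, drop = TRUE])

#confusionMatrix() needs both of its inputs to be factors with the same levels
factor(toy$diagnosis, levels = c("B", "M"))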


library(tidyverse)
library(readr)

#For KNN
library(class)
library(caret)


#Import the Dataset
df <- read_csv("wdbc.csv")
view(df)

#The first column is an identifier, so remove it; anything that does not help with classification can be removed
df <- df[-1]


#See the counts of the target, either B (benign) or M (malignant)
table(df[1])
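
#Optional sketch (not part of the original walkthrough): class shares rather than raw counts
prop.table(table(df[1]))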

#Normalize the Dataset

normal <- function(x) { return((x - min(x)) / (max(x) - min(x))) }

#Apply to what needs to be normalized, in this case not the target
df_norm <- as.data.frame(lapply(df[2:31], normal))

#Verify that normalization has occurred
summary(df_norm[1])
summary(df_norm[3])
summary(df_norm[11])
summary(df_norm[23])
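
#Optional sketch: rather than spot-checking a few columns, verify every normalized column at once
#Each column's minimum should now be 0 and its maximum 1
sapply(df_norm, range)
all(sapply(df_norm, min) == 0 & sapply(df_norm, max) == 1)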


#Split the dataframe into train and test datasets - note there will be two pairs of objects, one for the features and one for the target
#First the features: this is roughly a 70/30 train/test split (397 of 568 rows for training)

x_train <- df_norm[1:397,]

x_test <- df_norm[398:568,]


#Now train and test for the target - here it is important to use ", 1" to select only that one column
#It will not work unless you also use drop = TRUE
y_train <- df[1:397, 1, drop = TRUE]

y_test <- df[398:568, 1, drop = TRUE]
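
#Optional alternative (sketch, not part of the original walkthrough): the split above is sequential,
#so if the rows of the file happen to be ordered (for example by diagnosis) the test set may not be representative.
#A random 70/30 split avoids that and would replace the four assignments above.
set.seed(123)
train_idx <- sample(nrow(df_norm), size = round(0.7 * nrow(df_norm)))

x_train <- df_norm[train_idx, ]
x_test <- df_norm[-train_idx, ]

y_train <- df[train_idx, 1, drop = TRUE]
y_test <- df[-train_idx, 1, drop = TRUE]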


#The purpose of installing those packages was to use these next functions, starting with knn()
#As the Python example states, a common rule of thumb for k, unless one is assigned, is the square root of the number of observations
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 23)
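
#Optional sketch: derive k from the data instead of hard-coding it
#floor(sqrt(nrow(df))) gives 23 here, and an odd k also avoids ties between the two classes
k_guess <- floor(sqrt(nrow(df)))
pred <- knn(train = x_train, test = x_test, cl = y_train, k = k_guess)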

#Confusion Matrix from Caret

#knn() returns a factor with two levels, so we need to make sure the test target is in the same form
y_test <- y_test %>% factor(levels = c("B", "M"))

#See how well the model did - caret expects the predictions first and the true (reference) values second
confusionMatrix(pred, y_test)
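
#Optional sketch: store the result to pull out individual metrics instead of reading the printed table
#(the object confusionMatrix() returns is a list with elements $table, $overall, and $byClass)
cm <- confusionMatrix(pred, y_test)
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]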

References for R walkthrough

The dataset used is the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. The RDocumentation page for knn() was used while building this example, along with Statology's guide on how to create a confusion matrix in R.
wdbc.csv

Hi Rosalie,

This looks cool, but I think you meant to submit it as a New Page rather than as an Issue. See Nick's guide here: https://lost-stats.github.io/Contributing/Contributing.html

I'll close in the meantime.

I think this is an existing page, but it doesn't have an R section. Editing this into the existing page rather than submitting it as an Issue would be a good idea. Can you do that, Rosalie?