k-Nearest Neighbors
Introduction
k-Nearest Neighbors (k-NN) is a simple, non-parametric, and lazy learning algorithm used for both classification and regression tasks. It works by finding the k nearest data points in the training set to a given input and making predictions based on the majority class (for classification) or average value (for regression) of these neighbors. In this lecture, we will learn how to perform k-NN analysis in R, including model building, evaluation, and interpretation.
Key Concepts
1. What is k-Nearest Neighbors?
k-NN makes predictions by identifying the k training points nearest to a query point and combining their known values: a majority vote for classification, or an average for regression. The choice of k, the number of neighbors, can significantly impact the model's performance, as shown in the sketch below.
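To make the idea concrete, here is a minimal hand-rolled sketch in base R that classifies a single query point by majority vote among its k nearest neighbors under Euclidean distance. The data and variable names here are illustrative only; later sections use the knn() function instead.

# Toy training data: 20 points, 2 numeric features, a two-class label
set.seed(42)
train_x <- matrix(rnorm(40), ncol = 2)
train_y <- factor(sample(c("A", "B"), 20, replace = TRUE))
query <- c(0.5, -0.2)  # the point we want to classify
k <- 3

# Euclidean distance from the query to every training point
dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))

# Majority vote among the k closest training points
nearest <- order(dists)[1:k]
prediction <- names(which.max(table(train_y[nearest])))
print(prediction)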
2. Advantages and Disadvantages
Advantages:
Simple and intuitive.
No assumptions about data distribution.
Effective with large training datasets.
Disadvantages:
Computationally intensive at prediction time, especially with large datasets.
Sensitive to the choice of k, the distance metric, and feature scaling.
Poor performance with high-dimensional data (the curse of dimensionality).
Performing k-NN Analysis in R
1. Installing Required Packages
We will use the class package for building k-NN models and the caret package for splitting the data.
# Installing the required packages
install.packages(c("class", "caret"))
2. Building the Model
You can build a k-NN model using the knn() function from the class package.
# Loading the required package
library(class)
# Creating a sample dataset
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = factor(sample(c("A", "B"), 100, replace = TRUE))
)

# Splitting the data into training and testing sets
library(caret)
trainIndex <- createDataPartition(data$y, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]

# Extracting features and labels
train_features <- train_data[, c("x1", "x2")]
train_labels <- train_data$y
test_features <- test_data[, c("x1", "x2")]
test_labels <- test_data$y

# Building the k-NN model
k <- 5
predictions <- knn(train = train_features, test = test_features, cl = train_labels, k = k)
print(predictions)
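One practical caveat not shown above: knn() computes Euclidean distances on the raw feature values, so a feature with a larger scale can dominate the result. A common remedy is to standardize the features, fitting the centering and scaling on the training set and reusing them on the test set. A minimal sketch, reusing the variables defined above:

# Standardize training features, then apply the same center/scale to the test set
train_scaled <- scale(train_features)
test_scaled <- scale(test_features,
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
predictions_scaled <- knn(train = train_scaled, test = test_scaled, cl = train_labels, k = k)

Here x1 and x2 happen to be on similar scales already, but with real data this step often changes the results noticeably.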
3. Evaluating the Model
You can evaluate the model’s performance using various metrics such as accuracy and confusion matrix.
# Confusion Matrix
confusion_matrix <- table(predictions, test_labels)
print(confusion_matrix)
# Calculating accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
4. Choosing the Optimal k
The choice of k can significantly impact the performance of the k-NN algorithm. A simple starting point is to compare test-set accuracy across a range of k values, as below; for a less optimistic estimate, prefer cross-validation on the training data (see the sketch after this section).
# Comparing test-set accuracy for k = 1 to 20
accuracy_list <- sapply(1:20, function(k) {
  predictions <- knn(train = train_features, test = test_features, cl = train_labels, k = k)
  confusion_matrix <- table(predictions, test_labels)
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
  return(accuracy)
})
# Plotting accuracy vs k
plot(1:20, accuracy_list, type = "b", xlab = "Number of Neighbors (k)", ylab = "Accuracy", main = "Accuracy vs k")
optimal_k <- which.max(accuracy_list)
print(paste("Optimal k:", optimal_k))
Example: Comprehensive k-NN Analysis
Here’s a comprehensive example of performing k-NN analysis in R.
# Creating a sample dataset
set.seed(123)
data <- data.frame(
x1 = rnorm(100),
x2 = rnorm(100),
y = factor(sample(c("A", "B"), 100, replace = TRUE))
)
# Splitting the data into training and testing sets
library(caret)
trainIndex <- createDataPartition(data$y, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
# Extracting features and labels
train_features <- train_data[, c("x1", "x2")]
train_labels <- train_data$y
test_features <- test_data[, c("x1", "x2")]
test_labels <- test_data$y
# Building the k-NN model
library(class)
k <- 5
predictions <- knn(train = train_features, test = test_features, cl = train_labels, k = k)

# Evaluating the model
confusion_matrix <- table(predictions, test_labels)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
# Choosing k by comparing test-set accuracy
accuracy_list <- sapply(1:20, function(k) {
  predictions <- knn(train = train_features, test = test_features, cl = train_labels, k = k)
  confusion_matrix <- table(predictions, test_labels)
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
  return(accuracy)
})
# Plotting accuracy vs k
plot(1:20, accuracy_list, type = "b", xlab = "Number of Neighbors (k)", ylab = "Accuracy", main = "Accuracy vs k")
optimal_k <- which.max(accuracy_list)
print(paste("Optimal k:", optimal_k))
Summary
In this lecture, we covered how to perform k-NN analysis in R, including building the model, evaluating its performance, making predictions, and selecting the optimal value of k. k-NN is a simple and effective algorithm for both classification and regression tasks, offering flexibility through the choice of k and distance metrics.
Further Reading
For more detailed information, consider exploring the documentation for the class and caret packages.
Call to Action
If you found this lecture helpful, make sure to check out the other lectures in the ML R series. Happy coding!