Decision Trees
Introduction
Decision trees are a popular machine learning algorithm used for both classification and regression tasks. They work by recursively splitting the data into subsets based on the value of input features. In this lecture, we will learn how to perform decision tree analysis in R, including model building, evaluation, and visualization.
Key Concepts
1. What is a Decision Tree?
A decision tree is a flowchart-like structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome. The paths from root to leaf represent classification rules.
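To make this concrete, here is a hand-written sketch (not produced by any package) of how a small tree's root-to-leaf paths translate into plain if/else rules; the split variables and thresholds below are invented for illustration.
# A hypothetical two-level tree expressed as explicit decision rules
classify <- function(x1, x2) {
  if (x1 < 0.5) {                    # root node splits on x1
    if (x2 < 1.2) "A" else "B"       # left branch splits on x2; leaves are outcomes
  } else {
    "B"                              # right branch is a leaf
  }
}
classify(0.3, 0.8)  # follows root -> left branch -> leaf "A"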
2. Advantages and Disadvantages
Advantages:
Easy to understand and interpret.
Requires little data preprocessing.
Can handle both numerical and categorical data.
Disadvantages:
Prone to overfitting.
Can be unstable with small changes in data.
May require pruning to improve generalization (a brief pruning sketch follows this list).
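As a brief sketch of pruning (assuming rpart is installed; it uses the built-in iris dataset so it runs standalone):
# Fitting a tree and pruning it back using cross-validated error
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)  # complexity parameter (cp) table from rpart's built-in cross-validation
# Choose the cp with the lowest cross-validated error and prune to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned_fit <- prune(fit, cp = best_cp)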
Performing Decision Tree Analysis in R
1. Installing Required Packages
We will use the rpart package for building decision trees, the rpart.plot package for visualization, and the caret package for splitting the data into training and testing sets.
# Installing required packages
install.packages("rpart")
install.packages("rpart.plot")
install.packages("caret")
2. Building the Model
You can build a decision tree model using the rpart() function.
# Loading the required packages
library(rpart)
library(rpart.plot)
# Creating a sample dataset
set.seed(123)
data <- data.frame(
x1 = rnorm(100),
x2 = rnorm(100),
y = factor(sample(c("A", "B"), 100, replace = TRUE))
)
# Splitting the data into training and testing sets
library(caret)
trainIndex <- createDataPartition(data$y, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
# Building the decision tree model
model <- rpart(y ~ x1 + x2, data = train_data, method = "class")
print(model)
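If the default tree overfits or underfits, rpart() accepts a control argument for limiting growth. A minimal sketch (the parameter values and the model_tuned name are illustrative, not recommendations):
# Sketch: tuning tree growth via rpart.control
model_tuned <- rpart(y ~ x1 + x2, data = train_data, method = "class",
                     control = rpart.control(minsplit = 10, cp = 0.01, maxdepth = 5))
print(model_tuned)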
3. Visualizing the Tree
You can visualize the decision tree using the rpart.plot() function.
# Plotting the decision tree
rpart.plot(model, main = "Decision Tree")
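rpart.plot() also accepts options controlling what each node displays; the values below are just one plausible choice.
# Sketch: a more detailed plot showing class probabilities and node percentages
rpart.plot(model, type = 4, extra = 104, main = "Decision Tree (detailed)")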
4. Making Predictions
You can use the model to make predictions on new data.
# Making predictions on the test set
predictions <- predict(model, newdata = test_data, type = "class")
# Confusion Matrix
confusion_matrix <- table(predictions, test_data$y)
print(confusion_matrix)
# Calculating accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
Example: Comprehensive Decision Tree Analysis
Here’s a comprehensive example of performing decision tree analysis in R.
# Creating a sample dataset
set.seed(123)
data <- data.frame(
x1 = rnorm(100),
x2 = rnorm(100),
y = factor(sample(c("A", "B"), 100, replace = TRUE))
)
# Splitting the data into training and testing sets
library(caret)
trainIndex <- createDataPartition(data$y, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
# Building the decision tree model
library(rpart)
library(rpart.plot)
model <- rpart(y ~ x1 + x2, data = train_data, method = "class")
# Visualizing the tree
rpart.plot(model, main = "Decision Tree")
# Making predictions on the test set
predictions <- predict(model, newdata = test_data, type = "class")
# Evaluating the model
confusion_matrix <- table(predictions, test_data$y)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
Summary
In this lecture, we covered how to perform decision tree analysis in R, including building the model, evaluating its performance, making predictions, and visualizing the results. Decision trees are a powerful tool for both classification and regression tasks, offering a clear and interpretable model structure.
Further Reading
For more detailed information, consult the rpart package documentation (?rpart) and its vignette "An Introduction to Recursive Partitioning Using the RPART Routines", as well as the rpart.plot package documentation.
Call to Action
If you found this lecture helpful, make sure to check out the other lectures in the ML R series. Happy coding!