Hierarchical Clustering

Machine Learning
R Programming
Clustering
Learn how to perform hierarchical clustering in R, covering data preparation, model building, cutting the dendrogram, evaluating the clusters, and visualizing the results.
Author

Your Name

Published

June 21, 2024

Hierarchical Clustering

Introduction

Hierarchical clustering is an unsupervised learning method used to group similar objects into clusters. Unlike k-means clustering, hierarchical clustering does not require the number of clusters to be specified in advance. Instead, it builds a hierarchy of clusters that can be visualized as a dendrogram. In this lecture, we will learn how to perform hierarchical clustering in R, including model building, evaluation, and visualization.

Key Concepts

1. What is Hierarchical Clustering?

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. There are two main types of hierarchical clustering, both illustrated in the short sketch after this list:

  • Agglomerative (bottom-up): Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive (top-down): All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
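
In R, the hclust() function from the stats package performs agglomerative clustering, which is what this lecture uses. Divisive clustering is not provided by hclust(); one option is the diana() function from the cluster package. The sketch below is only an illustration on a small made-up dataset, and the cluster package is an extra dependency not used elsewhere in this lecture.

# Agglomerative clustering with base R (stats)
set.seed(123)
toy <- scale(matrix(rnorm(60), ncol = 2))  # small illustrative dataset
hc_agg <- hclust(dist(toy), method = "complete")

# Divisive clustering with the 'cluster' package
# install.packages("cluster") if it is not already installed
library(cluster)
hc_div <- diana(toy)

# Both results can be displayed as dendrograms
plot(hc_agg, main = "Agglomerative (hclust)")
pltree(hc_div, main = "Divisive (diana)")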

2. Linkage Criteria

The way clusters are merged or split is determined by the linkage criterion, which defines how the distance between two clusters is measured. Common linkage criteria, compared in the short sketch after this list, include:

  • Single linkage: Minimum distance between points in two clusters.
  • Complete linkage: Maximum distance between points in two clusters.
  • Average linkage: Average distance between points in two clusters.
  • Ward’s method: Minimizes the total within-cluster variance.
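
To see how the choice of linkage affects the result, you can pass different values to the method argument of hclust(). The sketch below reuses a small made-up dataset for illustration; note that in hclust() Ward's method is requested as "ward.D2".

# Comparing linkage criteria on the same distance matrix
set.seed(123)
toy <- scale(matrix(rnorm(60), ncol = 2))  # small illustrative dataset
d <- dist(toy)                             # Euclidean distance by default

hc_single   <- hclust(d, method = "single")    # minimum inter-cluster distance
hc_complete <- hclust(d, method = "complete")  # maximum inter-cluster distance
hc_average  <- hclust(d, method = "average")   # mean inter-cluster distance
hc_ward     <- hclust(d, method = "ward.D2")   # minimizes within-cluster variance

# Cross-tabulate the 3-cluster solutions from two of the linkages
table(single = cutree(hc_single, k = 3), complete = cutree(hc_complete, k = 3))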

Performing Hierarchical Clustering in R

1. Installing Required Packages

We will use the stats package (included with base R) for hierarchical clustering and the ggplot2 package for visualization, so only ggplot2 needs to be installed.

# Installing required packages
install.packages("ggplot2")

2. Data Preparation

Before clustering, we need to prepare the data, which may include scaling the features.

# Loading the required packages
library(ggplot2)

# Creating a sample dataset
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100)
)

# Scaling the data
scaled_data <- scale(data)
print(head(scaled_data))
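
Scaling matters because dist() uses Euclidean distance by default, so a variable measured on a much larger scale would dominate the distances. The short sketch below is only an illustration with made-up numbers.

# Why scaling matters: a variable on a larger scale dominates Euclidean distance
set.seed(123)
unscaled <- data.frame(x1 = rnorm(5), x2 = rnorm(5) * 1000)  # x2 has a much larger scale
round(dist(unscaled), 2)         # distances are driven almost entirely by x2
round(dist(scale(unscaled)), 2)  # after scaling, both variables contribute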

3. Building the Model

You can build a hierarchical clustering model using the hclust() function, which takes a distance matrix produced by dist().

# Computing the distance matrix
dist_matrix <- dist(scaled_data)

# Performing hierarchical clustering using complete linkage
hc_model <- hclust(dist_matrix, method = "complete")

# Plotting the dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.9)
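
By default, dist() computes Euclidean distances, but it supports other metrics through its method argument. As a variation on the model above, the sketch below uses Manhattan distance instead; treat the choice of metric as something to decide based on your data.

# The same clustering with Manhattan distance instead of Euclidean
dist_manhattan <- dist(scaled_data, method = "manhattan")
hc_manhattan <- hclust(dist_manhattan, method = "complete")
plot(hc_manhattan, main = "Dendrogram with Manhattan Distance", xlab = "", sub = "", cex = 0.9)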

4. Cutting the Dendrogram

To create clusters, you cut the dendrogram with cutree(), either into a chosen number of clusters (k) or at a specified height (h).

# Cutting the dendrogram to create 3 clusters
clusters <- cutree(hc_model, k = 3)

# Adding cluster labels to the original data
data$cluster <- as.factor(clusters)
print(head(data))
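
Alternatively, you can cut at a specific height rather than asking for a fixed number of clusters, and highlight the resulting groups on the dendrogram with rect.hclust(). The height value below is chosen only for illustration; the appropriate value depends on your dendrogram.

# Cutting at a specified height instead of a fixed number of clusters
clusters_by_height <- cutree(hc_model, h = 3)  # height chosen for illustration
table(clusters_by_height)                      # observations per cluster

# Drawing rectangles around the 3-cluster solution on the dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.9)
rect.hclust(hc_model, k = 3, border = "red")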

5. Visualizing the Clusters

You can visualize the clusters using a scatter plot.

# Plotting the clusters
ggplot(data, aes(x = x1, y = x2, color = cluster)) +
  geom_point() +
  labs(title = "Hierarchical Clustering", x = "Feature 1", y = "Feature 2")
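
6. Evaluating the Clusters

There is no single built-in accuracy measure for a clustering, but a common way to assess cluster quality is the average silhouette width, which is close to 1 when clusters are compact and well separated. The sketch below uses the silhouette() function from the cluster package; treat it as one option for evaluation (and an extra dependency) rather than a required part of the workflow.

# Assessing cluster quality with silhouette widths
# install.packages("cluster") if it is not already installed
library(cluster)

sil <- silhouette(clusters, dist_matrix)

# Average silhouette width across all observations (closer to 1 is better)
mean(sil[, "sil_width"])

# Silhouette plot: one bar per observation, grouped by cluster
plot(sil, main = "Silhouette Plot for Hierarchical Clustering")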

Example: Comprehensive Hierarchical Clustering Analysis

Here’s a comprehensive example of performing hierarchical clustering analysis in R.

# Loading the required packages
library(ggplot2)

# Creating a sample dataset
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100)
)

# Scaling the data
scaled_data <- scale(data)

# Computing the distance matrix
dist_matrix <- dist(scaled_data)

# Performing hierarchical clustering using complete linkage
hc_model <- hclust(dist_matrix, method = "complete")

# Plotting the dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.9)

# Cutting the dendrogram to create 3 clusters
clusters <- cutree(hc_model, k = 3)

# Adding cluster labels to the original data
data$cluster <- as.factor(clusters)

# Plotting the clusters
ggplot(data, aes(x = x1, y = x2, color = cluster)) +
  geom_point() +
  labs(title = "Hierarchical Clustering", x = "Feature 1", y = "Feature 2")

Summary

In this lecture, we covered how to perform hierarchical clustering in R, including building the model, cutting the dendrogram, and visualizing the results. Hierarchical clustering is a powerful tool for grouping similar objects without needing to specify the number of clusters in advance.

Call to Action

If you found this lecture helpful, make sure to check out the other lectures in the ML R series. Happy coding!