Hierarchical clustering is an unsupervised machine learning algorithm used to group data objects into clusters. Unlike K-means clustering, it does not require the number of clusters to be specified in advance; instead, it builds a hierarchy of groups, and the result can be inspected by plotting it as a dendrogram. This makes the method useful when you are not confident about the number of clusters or simply want to explore the structure of the data set before committing to a partition.
Types of Hierarchical Clustering
Hierarchical clustering can be of two types:
- Agglomerative (bottom-up approach) Clustering
- Divisive (top-down approach) Clustering
Dendrogram
Dendrogram: A tree-like diagram showing the structure of all clusters generated by a hierarchical clustering algorithm. The dendrogram connects data points and clusters with U-shaped links, and the height of each U indicates how far apart the two points or clusters it joins are from one another.
Agglomerative clustering
Agglomerative clustering, also known as the bottom-up hierarchical clustering approach, starts with each data point as a separate cluster and iteratively merges the closest clusters until all points belong to a single cluster.
Algorithm of Agglomerative Clustering
a. First, start with N clusters, where N is the number of data points.
b. Next, merge the two closest clusters.
c. Then, recompute the distances between the newly merged cluster and all remaining clusters.
d. Finally, repeat steps b and c until only one cluster remains or a stopping criterion is met (a minimal sketch of this merge loop is shown below).
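The merge loop described in steps a–d can be sketched directly in plain NumPy. The snippet below is a minimal illustration only; the single-linkage distance and the helper name naive_agglomerative are assumptions for this sketch, not part of the original text. In practice, the library-based implementation in the next section is what you would normally use.

import numpy as np

def naive_agglomerative(X, n_clusters=1):
    # Step a: start with N singleton clusters, one index list per data point.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Steps b-c: find the two closest clusters
        # (single linkage: distance between their closest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(X[a] - X[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Merge the closest pair and repeat (step d).
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Example: stop once two clusters remain.
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
print(naive_agglomerative(X, n_clusters=2))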
Implementation in Python
# Importing Libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

# Sample data
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Create clustering model
clustering = AgglomerativeClustering(n_clusters=2)

# Fit the model and get cluster labels
labels = clustering.fit_predict(X)

# Compute the linkage matrix
linkage_matrix = linkage(X, method='ward')

# Plot the dendrogram
dendrogram(linkage_matrix)
plt.title('Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
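Because the dendrogram encodes the full hierarchy, a flat clustering can also be recovered afterwards by cutting the tree at a chosen level. Below is a brief follow-up sketch using SciPy's fcluster on the linkage matrix computed above; cutting into two clusters is just an example choice, not something prescribed by the data.

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram so that exactly two flat clusters remain.
flat_labels = fcluster(linkage_matrix, t=2, criterion='maxclust')
print(flat_labels)  # one cluster id per sample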
Advantages:
- No need to specify the number of clusters in advance
- Can capture clusters of different shapes
- Produces a hierarchical representation (dendrogram)
- Works well with small to medium-sized datasets
Disadvantages:
- Computationally expensive for large datasets (O(n^3) time complexity)
- Cannot undo previous steps (merges)
- Sensitive to noise and outliers
Applications:
- Taxonomy creation in biology
- Document clustering in information retrieval
- Customer segmentation in marketing
- Anomaly detection in various fields
Divisive Clustering
Divisive clustering, also known as the top-down approach, is the less commonly used of the two hierarchical clustering methods. Initially, all data points belong to a single cluster; the algorithm then iteratively splits this cluster into smaller sub-clusters until each data point lies in its own cluster or a stopping criterion is met.
Algorithm of Divisive Clustering
a. Begin with all points in a single cluster.
b. First, select a cluster to split.
c. Then, use a flat clustering algorithm (like K-means) to split the chosen cluster into two subclusters.
d. Afterward, repeat steps b and c until each cluster contains only one data point or a stopping criterion is met.
Implementation in Python
import numpy as np
from sklearn.cluster import KMeans

def div_cluster(X, min_cluster_size=2):
    # Start with all points in a single cluster.
    clusters = [X]
    while True:
        # Select the largest remaining cluster to split.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        cluster_to_split = clusters[idx]
        if len(cluster_to_split) <= min_cluster_size:
            break
        # Split the chosen cluster into two sub-clusters with K-means.
        kmeans = KMeans(n_clusters=2, n_init=10)
        labels = kmeans.fit_predict(cluster_to_split)
        # Replace the split cluster with its two sub-clusters
        # (removal by index avoids comparing NumPy arrays for equality).
        del clusters[idx]
        clusters.extend([
            cluster_to_split[labels == 0],
            cluster_to_split[labels == 1]
        ])
    return clusters

# Example usage
X = np.random.rand(100, 2)
result = div_cluster(X)
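The function returns a list of NumPy arrays, one array of points per cluster. The quick inspection below is only one possible way to consume that output, not part of the original code:

# Inspect the resulting partition: number of clusters and their sizes.
print(len(result), [len(c) for c in result])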
Advantages:
- Can be faster than agglomerative clustering for large datasets.
- Often produces more balanced hierarchies.
- Can be more accurate for top-level partitions.
Disadvantages:
- More complex to implement than agglomerative clustering.
- Less commonly used and thus less supported in standard libraries.
- Can struggle with non-globular cluster shapes.
Applications:
- Document classification
- Image segmentation
Conclusion
Thus, hierarchical clustering is a powerful method for discovering the natural structure of a data set, and it is particularly useful when you do not know beforehand how many clusters there are. It produces a dendrogram, which clearly illustrates how data points are linked into clusters and is easy to interpret in terms of data groupings.
For this reason, the approach is widely used on small and medium-sized datasets, although specialized and optimized methods exist for handling larger ones.