Coders Packet

Hierarchical Clustering using Python

By Shivang Kohli

This article shows how to perform hierarchical clustering and plot dendrograms using the Python libraries pandas, NumPy, scikit-learn, Matplotlib, seaborn, and SciPy.


We have taken a dataset of students' performance over 5 tests in a year and are going to group the students into clusters using a hierarchical clustering algorithm. Hierarchical clustering groups similar objects into groups called clusters. The agglomerative variant used here starts with every data point in its own cluster and repeatedly merges the two closest clusters. As a result, each cluster differs from the other clusters, while the data points belonging to a single cluster are similar to each other.
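To make the merging idea concrete, here is a minimal sketch on a hypothetical toy dataset (four 2-D points that are not part of the students' data). It uses SciPy's linkage function to record the merges and fcluster to cut the resulting tree into two flat clusters.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: two tight pairs of 2-D points, far apart from each other.
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 0.0],
    [10.0, 1.0],
])

# Each row of the linkage matrix records one merge:
# [cluster_i, cluster_j, merge distance, size of the new cluster].
merges = sch.linkage(points, method="single")
print(merges)

# Cutting the tree into two flat clusters recovers the two pairs.
labels = sch.fcluster(merges, t=2, criterion="maxclust")
print(labels)
```

The last row of the linkage matrix is the final merge that joins everything into a single cluster, which is why its distance is the largest.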

First, we import the required libraries:

import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch
import plotnine

%matplotlib inline


Next, we read the dataset using pandas:

tests = pd.read_csv("")

tests_wide = pd.read_csv("")

Then we extract the features from the dataset:

features = ['zero','one','two','three']
x = tests_wide[features]
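The dataset file itself is not shown here, so the following sketch builds a hypothetical stand-in for tests_wide with the same four feature columns. The group means, the noise level, and the 30 rows are assumptions chosen only so that the later steps can be tried out; they are not the real students' scores.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tests_wide: 30 students, one column per test.
rng = np.random.default_rng(0)
group_means = [40, 60, 85]  # assumed score levels, for illustration only
scores = np.vstack([
    rng.normal(loc=mean, scale=4, size=(10, 4)) for mean in group_means
])

features = ['zero', 'one', 'two', 'three']
tests_wide = pd.DataFrame(scores, columns=features)

# Same feature selection as in the article.
x = tests_wide[features]
print(x.head())
print(x.shape)
```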

Dataset head

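We can visualize the relationships between the features using a heatmap from the seaborn library. Once the cluster labels are available (they are computed further down as clusters = hac.labels_), they can also be stored in the dataset as a new column:

tests_wide["cluster2"] = clusters  #run this only after clusters has been defined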
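The heatmap code is not shown above. One common choice, assumed here, is a heatmap of the correlation matrix of the features. The sketch below uses hypothetical random scores in place of the real dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical random scores standing in for the real features.
rng = np.random.default_rng(1)
features = ['zero', 'one', 'two', 'three']
x = pd.DataFrame(rng.normal(60, 10, size=(30, 4)), columns=features)

# Pairwise Pearson correlation between the test columns.
corr = x.corr()

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation between tests")
```

In a notebook with %matplotlib inline the figure is rendered below the cell; in a plain script, call plt.show() from matplotlib.pyplot afterwards.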


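Now we create an AgglomerativeClustering object and fit it to our data:

hac = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'ward')
hac.fit(x)
#affinity = method for finding distance: Euclidean, Manhattan, Hamming, Cosine (ward linkage requires Euclidean)
#linkage = rule for checking the closeness of 2 clusters: single, average, complete, ward (centroid linkage is available in SciPy, not in AgglomerativeClustering)
#Note: scikit-learn 1.2 renamed affinity to metric and 1.4 removed affinity; on those versions pass metric = 'euclidean' instead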
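To see how the linkage choice plays out, the sketch below fits the four linkage options that AgglomerativeClustering accepts on hypothetical, well-separated synthetic blobs (not the students' data). The distance parameter is left at its default, which is Euclidean, so the snippet does not depend on whether the installed version calls it affinity or metric.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical data: three well-separated blobs of 10 points each.
rng = np.random.default_rng(2)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(10, 2))
    for center in (0.0, 5.0, 10.0)
])

cluster_sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(data)
    # Count how many points ended up in each cluster.
    cluster_sizes[linkage] = np.bincount(labels)
    print(linkage, cluster_sizes[linkage])
```

On data this cleanly separated every linkage finds the same three groups. On noisier data the results can differ, with single linkage in particular prone to chaining points into one long cluster.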

The model is now fitted to our data. Next, we plot the dendrogram, which SciPy builds from the same features using Ward linkage:

dendro = sch.dendrogram(sch.linkage(x,method = 'ward'))
#High density at the bottom and high separation at the top
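A dendrogram can also be cut into a fixed number of flat clusters. The sketch below, on hypothetical synthetic blobs rather than the students' data, draws the dendrogram on an explicit Matplotlib axis and then uses SciPy's fcluster to cut the tree into three clusters.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Hypothetical data: three well-separated blobs of 10 points with 4 features each.
rng = np.random.default_rng(3)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(10, 4))
    for center in (0.0, 5.0, 10.0)
])

# Compute the Ward linkage matrix once and reuse it.
Z = sch.linkage(data, method="ward")

fig, ax = plt.subplots(figsize=(8, 4))
dendro = sch.dendrogram(Z, ax=ax)
ax.set_xlabel("Sample index")
ax.set_ylabel("Ward merge distance")

# Cut the tree into at most 3 flat clusters; fcluster labels start at 1.
flat = sch.fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(flat)[1:])
```

Long vertical lines near the top of the plot correspond to merges between well-separated groups, which is a common visual cue for choosing the number of clusters.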


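We can also inspect the cluster label assigned to each data point in the dataset:

clusters = hac.labels_
print(clusters)

#OUTPUT: [0 0 0 1 0 1 0 1 0 2 1 0 1 0 0 0 1 1 1 0 2 2 2 2 2 2 2 2 2 2]

Finally, we evaluate the clustering using the silhouette score.

The Silhouette Coefficient is calculated for each sample from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b), as (b - a) / max(a, b). Here, a is the mean distance between the sample and the other points in its own cluster, and b is the mean distance between the sample and the points of the nearest cluster that the sample is not a part of. The silhouette score is the mean of this coefficient over all samples; it ranges from -1 to 1, and higher values indicate better-separated clusters.

print(silhouette_score(x, clusters))

#OUTPUT: 0.6555623791318137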
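The formula (b - a) / max(a, b) can be checked by hand. The sketch below computes it per sample with NumPy on hypothetical synthetic blobs and compares the result with scikit-learn's silhouette_samples and silhouette_score. The helper silhouette_of is an illustrative name, not a library function, and it assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: three well-separated blobs of 10 points each.
rng = np.random.default_rng(4)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(10, 2))
    for center in (0.0, 5.0, 10.0)
])
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(data)


def silhouette_of(i):
    """Silhouette coefficient of sample i (illustrative helper)."""
    dists = np.linalg.norm(data - data[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # exclude the sample itself from its own cluster
    a = dists[own].mean()  # mean intra-cluster distance
    # Mean distance to each other cluster; keep the nearest one.
    b = min(
        dists[labels == k].mean()
        for k in np.unique(labels) if k != labels[i]
    )
    return (b - a) / max(a, b)


manual = np.array([silhouette_of(i) for i in range(len(data))])
print(np.allclose(manual, silhouette_samples(data, labels)))
print(manual.mean(), silhouette_score(data, labels))
```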



