This tutorial shows how to perform hierarchical clustering and plot dendrograms using the Python libraries pandas, NumPy, scikit-learn, Matplotlib, Seaborn, and SciPy.
We have taken a dataset of students' performance over 5 tests in a year and are going to categorize the data into clusters using a hierarchical clustering algorithm. Hierarchical clustering groups similar objects into groups called clusters. As a result, each cluster is distinct from the others, while the data points within a single cluster are similar to each other.
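To make the idea concrete before touching the student data, here is a minimal sketch on a tiny synthetic dataset (the points and labels here are illustrative, not part of the tutorial's dataset): points that lie close together end up in the same cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two well-separated groups of 1-D points
points = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]])

model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(points)

# The three points near 1.0 share one label,
# and the three points near 8.0 share the other
print(labels)
```

The exact label values (0 or 1) are arbitrary; what matters is which points are grouped together.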
First, we import the required libraries:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch

%matplotlib inline  # notebook magic: render plots inline
Next, we read the dataset using pandas:
tests = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/testperform_long.csv")
tests_wide = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/testperform.csv")
Then we extract the feature columns from the dataset:
features = ['zero', 'one', 'two', 'three']
x = tests_wide[features]
x.head()
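One point worth keeping in mind: hierarchical clustering is distance-based, so features on very different scales can dominate the result. The tutorial's test scores are on a common scale, so no scaling is applied here, but if your features were not comparable, standardizing first would be one option. A small sketch on made-up numbers (the `x_demo` array is a hypothetical example, not the tutorial's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical scores where the second column uses a much larger scale
x_demo = np.array([[0.7, 55.0],
                   [0.9, 90.0],
                   [0.5, 40.0]])

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_demo)

# After scaling, each column has mean 0 and unit variance,
# so both features contribute comparably to Euclidean distances
print(x_scaled.mean(axis=0).round(6))
print(x_scaled.std(axis=0).round(6))
```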
Once the model has been fitted and the cluster labels extracted (both shown below), we can visualize the clustered data with a heatmap from the seaborn library:

tests_wide["cluster2"] = clusters  # clusters = hac.labels_, obtained after fitting the model
sns.heatmap(tests_wide)
Now we create an AgglomerativeClustering object and fit it to our data:
hac = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
# affinity: the distance metric between points,
#   e.g. euclidean, manhattan, cosine
#   (newer scikit-learn versions name this parameter `metric`)
# linkage: how the distance between two clusters is measured,
#   e.g. single, average, complete, ward
hac.fit(x)
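The choice of linkage can change the clustering noticeably, so it can be worth comparing a few options. A hedged sketch on synthetic blobs standing in for the student scores (the data here is generated, not the tutorial's dataset), scoring each linkage with the silhouette coefficient:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs with 4 features each
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 4))
                  for c in (0.0, 2.0, 4.0)])

scores = {}
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(data)
    scores[linkage] = silhouette_score(data, labels)
    print(linkage, round(scores[linkage], 3))
```

On clean, well-separated data all linkages tend to agree; the differences show up mainly on noisy or elongated clusters.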
We have now fitted the model to our data. Next, we plot the dendrogram:
dendro = sch.dendrogram(sch.linkage(x, method='ward'))
# points merge at low heights near the bottom;
# well-separated clusters only join near the top
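The linkage matrix behind the dendrogram can also be cut directly to obtain flat cluster labels, using `scipy.cluster.hierarchy.fcluster`. A small sketch on synthetic 2-D data (generated here for illustration, not the tutorial's dataset):

```python
import numpy as np
import scipy.cluster.hierarchy as sch

rng = np.random.default_rng(1)
# Two well-separated synthetic groups of 10 points each
data = np.vstack([rng.normal(0.0, 0.2, size=(10, 2)),
                  rng.normal(3.0, 0.2, size=(10, 2))])

Z = sch.linkage(data, method='ward')
# Cut the tree so that exactly 2 flat clusters remain
flat = sch.fcluster(Z, t=2, criterion='maxclust')
print(flat)
```

`fcluster` labels start at 1 rather than 0; with `criterion='maxclust'` the cut height is chosen automatically to yield at most `t` clusters.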
We can also inspect the label assigned to each data point in the dataset. Finally, we evaluate the clustering using the silhouette score.
The silhouette coefficient for a sample is computed from its mean intra-cluster distance (a) and its mean nearest-cluster distance (b) as
(b - a) / max(a, b), where
a is the mean distance between the sample and the other points in its own cluster, and
b is the mean distance between the sample and the points of the nearest cluster that the sample isn't a part of.
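The formula can be checked by hand on a tiny example (four made-up 1-D points, not the tutorial's data), comparing the manual computation against scikit-learn's per-sample `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four points forming two obvious clusters
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the first point (0.0):
#   a = mean distance to the rest of its own cluster = 1.0
#   b = mean distance to the nearest other cluster = (10 + 11) / 2 = 10.5
a, b = 1.0, 10.5
manual = (b - a) / max(a, b)

print(round(manual, 4))                          # → 0.9048
print(round(silhouette_samples(pts, labels)[0], 4))  # → 0.9048
```

A value near 1 means the point sits firmly inside its cluster; values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong one.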
clusters = hac.labels_
print(clusters)
# Output: [0 0 0 1 0 1 0 1 0 2 1 0 1 0 0 0 1 1 1 0 2 2 2 2 2 2 2 2 2 2]

silhouette_score(x, clusters)
# Output: 0.6555623791318137
Submitted by Shivang Kohli (shivangkohli)