# Understanding and Implementing Hierarchical Clustering Algorithm in Python

Hierarchical Clustering Algorithm implementation in Python on Human Gene Dataset for multi-class classification

Clustering means grouping data points by a given attribute: we divide the points into groups so that each point is more similar to the other points in its group than to points outside it. No single clustering algorithm is ideal for every case; the right choice depends on the type of output you need and expect. Combining multiple clustering algorithms can therefore produce better, more accurate classifications than any one of them alone.

Hierarchical Clustering is an unsupervised clustering technique in which similar data points are arranged in a hierarchy, and the result is displayed graphically as a dendrogram. Hierarchical clustering is further divided into Agglomerative Clustering (bottom-up) and Divisive Clustering (top-down).

The Algorithm:

Agglomerative Clustering:

In the Agglomerative, or bottom-up, clustering method, we start with each individual point as its own cluster. We then compute the distance between every pair of clusters and join the two most similar ones, repeating until we are left with a single cluster containing all the points.
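As a toy illustration (my own sketch, not the article's implementation), the bottom-up loop can be written for 1-D points using single-linkage distances:

```python
# Illustrative bottom-up clustering on 1-D points with single linkage.
# All names here are my own, chosen for this sketch.

def agglomerate(points):
    """Merge the closest pair of clusters repeatedly until one remains."""
    clusters = [[p] for p in points]          # start: each point is a cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]   # join the two clusters
        del clusters[j]                           # remove the absorbed cluster
    return clusters[0], merges

cluster, merges = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0])
print(len(merges))  # 4: five singleton clusters need four merges to become one
```

Each iteration records exactly one merge, so N points always produce N − 1 merges.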

When a cluster is formed, the corresponding rows of the distance matrix are merged, and the points in the newly formed cluster are marked so that they are not considered again when forming the next cluster. The distance from the newly formed cluster to every point outside it is then computed using one of the linkage heuristics (min, max, or mean).

| Heuristic | Distance after merging clusters A and B |
| --- | --- |
| Maximum or complete-linkage clustering | max d(a, b) |
| Minimum or single-linkage clustering | min d(a, b) |
| Mean or average-linkage clustering | Σ d(a, b) / (\|A\| · \|B\|) |

*where a belongs to A and b belongs to B*
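These update rules can be sketched on two rows of a distance matrix. The arrays and sizes below are made up for illustration; the size-weighted mean shown last mirrors the 'Centroid' heuristic used in the merging code later in this article:

```python
import numpy as np

row_A = np.array([0.0, 0.4, 0.9])  # distances from cluster A to clusters 0..2
row_B = np.array([0.3, 0.0, 0.7])  # distances from cluster B to clusters 0..2
size_A, size_B = 3, 2              # member counts |A| and |B|

# Complete linkage: keep the larger of the two distances to each other cluster
complete = np.maximum(row_A, row_B)
# Single linkage: keep the smaller of the two distances
single = np.minimum(row_A, row_B)
# Size-weighted mean of the two rows (the 'Centroid' heuristic)
average = (size_A * row_A + size_B * row_B) / (size_A + size_B)

print(complete)  # larger distance per column
print(single)    # smaller distance per column
```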

Distance Matrix:

The distance matrix is an NxN matrix where entry (i, j) holds the alignment distance between the ith and jth DNA sequence strings. Because Python is much slower than C++ at this kind of computation, building the edit-distance matrix for the DNA sequences takes a very long time (around 22,000 seconds), so the computation cannot reasonably be repeated. The finished matrix is therefore stored as a pickle file and reused later.
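The caching pattern described above can be sketched as follows. The file name and the stand-in compute function are my own; in the real project the expensive step is the pairwise edit-distance computation:

```python
import pickle
import numpy as np
from pathlib import Path

CACHE = Path("distance_matrix.pkl")  # illustrative cache file name

def compute_matrix():
    # stand-in for the expensive pairwise edit-distance computation
    return np.array([[0.0, 2.0], [2.0, 0.0]])

if CACHE.exists():
    with CACHE.open("rb") as f:
        matrix = pickle.load(f)      # reuse the cached result on later runs
else:
    matrix = compute_matrix()        # the slow step, done only once
    with CACHE.open("wb") as f:
        pickle.dump(matrix, f)       # persist it for next time

print(matrix.shape)  # (2, 2)
```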

SciPy uses a special matrix called a linkage matrix to draw dendrograms. For N observations its shape is (N − 1) x 4, where the ith row represents the merge of two clusters to form cluster (N + i). The first and second columns contain the indices of the clusters being merged, the third column contains the distance between the two clusters, and the fourth column contains the number of elements in the merged cluster.

Visual Summary:

• The first diagram shows the dendrogram produced by Max (complete-linkage) clustering.
• The second diagram shows the dendrogram produced by Min (single-linkage) agglomerative clustering.
• The third diagram shows the agglomerative clustering classification based on the Mean heuristic.
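For reference, the linkage matrix described above can be built and inspected on a toy dataset (my own example, not the article's gene data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0], [0.1], [5.0], [5.1]])  # N = 4 observations
Z = linkage(points, method="complete")           # complete-linkage merges

print(Z.shape)  # (3, 4): N - 1 merges, 4 columns each
for left, right, dist, count in Z:
    # columns: merged cluster indices, merge distance, merged cluster size
    print(int(left), int(right), round(dist, 2), int(count))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the corresponding tree.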

Python Implementation:

The Dataset:

The dataset used here is a human gene sequence collection in text-file format, where each key is the gene sequence's name and the value contains the entire gene string.
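The exact file layout is not shown here, so the parser below assumes a simple name-and-sequence-per-line format purely for illustration; the function name and field separator are my own:

```python
# Hypothetical parser: assumes each line holds "name<TAB>sequence".
def load_sequences(text):
    """Parse lines of 'name<TAB>sequence' into a {name: sequence} dict."""
    sequences = {}
    for line in text.strip().splitlines():
        name, seq = line.split("\t", 1)
        sequences[name] = seq
    return sequences

sample = "gene_a\tACGTACGT\ngene_b\tGGCCTTAA"
genes = load_sequences(sample)
print(sorted(genes))  # ['gene_a', 'gene_b']
```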

Implementing Agglomerative Clustering:

The necessary packages and modules are imported, and we define functions to build the distance matrix, find the minimum-distance cluster pair, and merge clusters based on their distance. Here I've used the hierarchy clustering module and dendrogram from the SciPy package.

Code:

Importing the Libraries:

```
import math
import pickle
import pprint
import sys
import weakref
from collections import defaultdict
from pathlib import Path
from time import time

import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

from similarity import computeDistance
```

Compute Distance Matrix:

```
# Compute distances among DNA sequences
cls.simMatrix = np.ones((cls.ClusterCount, cls.ClusterCount))
for cID in range(cls.ClusterCount):
    clusterA = cls.getClusterById(cID)
    for _cID in range(cID, cls.ClusterCount):
        clusterB = cls.getClusterById(_cID)
        seq1 = clusterA.sequences
        seq2 = clusterB.sequences
        similarity_1 = computeDistance(seq1, seq2)
        # the matrix is symmetric, so fill both (i, j) and (j, i)
        cls.simMatrix[cID, _cID] = similarity_1
        cls.simMatrix[_cID, cID] = similarity_1
        print("similarity between {} and {} = {}\r".format(cID, _cID, similarity_1), end='', flush=True)
print('')
```

Merge Clusters based on Max, Min, or Mean:

```
def mergeSimilarClusters(cls, mergedRC, toDelete, iteration, dist, heuristic='Centroid'):
    '''Cluster merging and new cluster creation based on the preset heuristic.
    Available heuristic values are:
    - Centroid
    - Max
    - Min'''
    outdated_m = mergedRC
    outdated_d = toDelete

    delCluster = cls.getClusterById(outdated_d)
    toDelete = delCluster._id
    mergedCluster = cls.getClusterById(outdated_m)
    mergedRC = mergedCluster._id

    d_mem = delCluster.memberCount
    m_mem = mergedCluster.memberCount

    if heuristic == 'Centroid':  # size-weighted mean of the two rows
        rowSum = (cls.simMatrix[outdated_m, :] * m_mem + cls.simMatrix[outdated_d, :] * d_mem) / (m_mem + d_mem)
    elif heuristic == 'Max':     # complete linkage: element-wise maximum
        rowStack = np.vstack((cls.simMatrix[outdated_m, :], cls.simMatrix[outdated_d, :]))
        rowSum = np.amax(rowStack, axis=0)
    elif heuristic == 'Min':     # single linkage: element-wise minimum
        rowStack = np.vstack((cls.simMatrix[outdated_m, :], cls.simMatrix[outdated_d, :]))
        rowSum = np.amin(rowStack, axis=0)

    # Update the surviving cluster's row and column with the new distances
    cls.simMatrix[:, outdated_m] = rowSum
    cls.simMatrix[outdated_m, :] = rowSum
    # Set the deleted cluster's row and column to infinity so it is never chosen again
    cls.simMatrix[:, outdated_d] = math.inf
    cls.simMatrix[outdated_d, :] = math.inf

    # Merge the data points into the surviving cluster
    for key in delCluster.clusterMembers.keys():
        mergedCluster.updateID(iteration)
        mergedCluster.incrementFactor()
    print('Union ({} - {}), distance {}'.format(toDelete, mergedRC, dist))
```

Conclusion:

Hence, using the agglomerative algorithm, clustering was performed on the human gene dataset and dendrograms were obtained for the max-, min-, and mean-based linkage heuristics.