By Jacin Jacob
Hierarchical Clustering Algorithm implementation in Python on a Human Gene Dataset for multi-class classification
Clustering means grouping data based on a given attribute. We divide the data points into groups so that each point is more similar to the other points in its own group than to points elsewhere. No single clustering algorithm is ideal for every case; the right choice depends on the type of output you need and expect. So we combine multiple algorithms to bring out the output we expect, and using multiple clustering algorithms can result in higher accuracy and better classification.
Hierarchical clustering is an unsupervised clustering method in which similar data points are arranged in a hierarchy. It uses dendrograms to present the output graphically, and it is further divided into Agglomerative Clustering (bottom-up) and Divisive Clustering (top-down).
The Algorithm:
Agglomerative Clustering:
In the agglomerative, or bottom-up, clustering method, we start with each individual point as its own cluster. We then compute the distance between every pair of clusters and join the two most similar ones, repeating until we are left with a single cluster containing all the points.
When two clusters are merged, their rows in the distance matrix are combined, and the points of the newly formed cluster are marked so that they are not considered again when forming further clusters. The newly formed cluster's distance to every point outside it is then recomputed using one of the linkage heuristics (min, max, mean).
MODIFICATION AFTER CLUSTERING | FORMULA
Maximum or complete-linkage clustering | max d(a, b)
Minimum or single-linkage clustering | min d(a, b)
Mean or average-linkage clustering | ( Σ d(a, b) ) / ( |A| · |B| )
*where a belongs to cluster A and b belongs to cluster B
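The merge loop and the three heuristics above can be sketched as follows. This is a minimal illustrative version, not the project's actual implementation: the function name `agglomerative`, the naive full-matrix search for the closest pair, and the tuple return format are all assumptions made for clarity.

```python
import numpy as np

def agglomerative(dist, linkage="min"):
    """Naive agglomerative clustering over a precomputed distance matrix.

    dist    : symmetric (n x n) NumPy array of pairwise distances
    linkage : 'min' (single), 'max' (complete), or 'mean' (average)
    Returns the merge order as (cluster_a, cluster_b, distance) tuples.
    """
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)            # never merge a cluster with itself
    sizes = {i: 1 for i in range(len(d))}  # member counts, used by 'mean'
    active = set(sizes)
    merges = []
    while len(active) > 1:
        # Find the closest pair among the still-active clusters
        idx = sorted(active)
        sub = d[np.ix_(idx, idx)]
        r, c = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = idx[r], idx[c]
        merges.append((a, b, d[a, b]))
        # Rebuild cluster a's row with the chosen heuristic, retire cluster b
        if linkage == "min":
            row = np.minimum(d[a, :], d[b, :])
        elif linkage == "max":
            row = np.maximum(d[a, :], d[b, :])
        else:  # 'mean': member-count-weighted average of the two rows
            row = (d[a, :] * sizes[a] + d[b, :] * sizes[b]) / (sizes[a] + sizes[b])
        d[a, :] = row
        d[:, a] = row
        d[a, a] = np.inf
        d[b, :] = np.inf
        d[:, b] = np.inf
        sizes[a] += sizes[b]
        active.remove(b)
    return merges
```

Marking retired rows with infinity mirrors the trick described above: once a cluster is absorbed, its row can never again be the minimum, so it is effectively removed from consideration.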
Distance Matrix:
The distance matrix is an NxN matrix in which entry (i, j) holds the alignment distance between the i-th and the j-th DNA sequence strings. Because Python is much slower than C++ for this kind of computation, building the full matrix of edit distances between DNA sequences takes a very long time (around 22,000 seconds), so the computation cannot reasonably be repeated. Hence the distance matrix is stored as a pickle file to be reused later.
Linkage Matrix
SciPy uses a special matrix called a linkage matrix to draw dendrograms. The shape of the matrix is (N-1) x 4, where the i-th row represents the merging of two clusters to form the (N+i)-th cluster. The first and second columns of the matrix contain the indices of the clusters being merged, the third column contains the distance between the two clusters, and the fourth column contains the number of elements in the merged cluster.
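This layout can be checked on a small example. The distance values below are illustrative stand-ins, not the gene data; the sketch builds a linkage matrix with SciPy's `linkage` (which expects a condensed distance vector, hence `squareform`) and could then be passed to `dendrogram`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# A small symmetric distance matrix standing in for the gene distances
dist = np.array([[0.0, 1.0, 5.0, 6.0],
                 [1.0, 0.0, 4.0, 6.0],
                 [5.0, 4.0, 0.0, 2.0],
                 [6.0, 6.0, 2.0, 0.0]])

# Convert the square matrix to condensed form and build the linkage matrix
Z = linkage(squareform(dist), method="average")
print(Z)  # each row: [cluster_a, cluster_b, distance, member_count]

# dendrogram(Z) would draw the tree from this matrix
```

With N = 4 points, Z has exactly N-1 = 3 rows, and the last row's member count is N, since the final merge produces the cluster containing all points.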
Visual Summary:
Python Implementation:
The Dataset:
The dataset used here is a human gene data sequence in a text file format, where each key is a gene sequence's name and the corresponding value contains the entire gene string.
Implementing Agglomerative Clustering:
The necessary packages and modules are imported, and functions are defined to build the distance matrix, find the minimum-distance cluster pair, and merge clusters based on their distance. Here I've used the hierarchy clustering module and the dendrogram function from the SciPy package.
Code:
Importing the Libraries:
import numpy as np
import weakref
from collections import defaultdict
from data_reader import DataReader
from similarity import computeDistance
from time import time
import math
from pathlib import Path
import pickle
import pprint
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
import sys
Compute Distance Matrix:
# Compute distances among DNA sequences
cls.simMatrix = np.ones((cls.ClusterCount, cls.ClusterCount))
for cID in range(cls.ClusterCount):
    clusterA = cls.getClusterById(cID)
    for _cID in range(cID, cls.ClusterCount):
        clusterB = cls.getClusterById(_cID)
        seq1 = clusterA.sequences[0]
        seq2 = clusterB.sequences[0]
        similarity_1 = computeDistance(seq1, seq2)
        # The matrix is symmetric, so fill both (cID, _cID) and (_cID, cID)
        cls.simMatrix[cID, _cID] = similarity_1
        cls.simMatrix[_cID, cID] = similarity_1
        print("similarity between {} and {} = {}\r".format(cID, _cID, similarity_1),
              end='', flush=True)
        sys.stdout.flush()
print('')
Merge Clusters based on Max, Min, or Mean:
def mergeSimilarClusters(cls, mergedRC, toDelete, iteration, dist, heuristic='Centroid'):
    '''Cluster merging and new cluster creation based on the preset heuristic.
    Available heuristic values are:
    - Centroid
    - Max
    - Min'''
    outdated_m = mergedRC
    outdated_d = toDelete
    delCluster = cls.getClusterById(outdated_d)
    toDelete = delCluster._id
    mergedCluster = cls.getClusterById(outdated_m)
    mergedRC = mergedCluster._id
    d_mem = delCluster.memberCount
    m_mem = mergedCluster.memberCount
    if heuristic == 'Centroid':
        # Member-count-weighted mean of the two rows (average linkage)
        rowSum = (cls.simMatrix[outdated_m, :] * m_mem +
                  cls.simMatrix[outdated_d, :] * d_mem) / (m_mem + d_mem)
    elif heuristic == 'Max':
        # Complete linkage: element-wise maximum of the two rows
        rowStack = np.vstack((cls.simMatrix[outdated_m, :], cls.simMatrix[outdated_d, :]))
        rowSum = np.amax(rowStack, axis=0)
    elif heuristic == 'Min':
        # Single linkage: element-wise minimum of the two rows
        rowStack = np.vstack((cls.simMatrix[outdated_m, :], cls.simMatrix[outdated_d, :]))
        rowSum = np.amin(rowStack, axis=0)
    # Update the merged cluster's row/column; retire the deleted cluster's
    cls.simMatrix[:, outdated_m] = rowSum
    cls.simMatrix[outdated_m, :] = rowSum
    cls.simMatrix[:, outdated_d] = math.inf
    cls.simMatrix[outdated_d, :] = math.inf
    # Merge data points into the surviving cluster
    for key in delCluster.clusterMembers.keys():
        mergedCluster.addMember(key, delCluster.clusterMembers[key])
    mergedCluster.updateID(iteration)
    mergedCluster.incrementFactor()
    print('Union ({} - {}), distance {}'.format(toDelete, mergedRC, dist))
Conclusion:
Hence, using the agglomerative algorithm, clustering was performed and dendrograms were obtained for the max-, min-, and mean-based linkage heuristics.
Submitted by Jacin Jacob (JacinJacob)