Coders Packet

Understanding and Implementing Hierarchical Clustering Algorithm in Python

By Jacin Jacob

Hierarchical Clustering Algorithm implementation in Python on Human Gene Dataset for multi-class classification

Clustering means grouping data based on a given attribute. We divide the data points into various groups so that each data point is more similar to other data points in that given group. There is no ideal clustering algorithm in every case, but it depends on the type of output You need and expect. So, we combine multiple algorithms to bring out the output we expect. Using Multiple clustering algorithms results in higher accuracy and better classification.

Hierarchical Clustering is an unsupervised type of clustering where similar data points are arranged in a hierarchical order. Hierarchical Clustering uses dendrograms to showcase the output in a graphical format. Hierarchical clustering is further divided into Agglomerative Clustering (Bottom Up) and Divisive Clustering (Top Down).

The Algorithm:

Agglomerative Clustering:

In Agglomerative or Bottom-up clustering method, we start with individual points as a cluster. Then, compute the distance between each of the clusters and join the two most similar clusters until we are only left with a single cluster consisting of all the points.

Rows in the distance matrix are merged when we form a cluster and points in the newly formed cluster are modified so that they are not taken up for consideration again while forming a new cluster. The newly formed cluster will have distances to all points outside the cluster using one of the heuristics (min, max, mean).

MODIFICATION AFTER CLUSTERING

FORMULA

Maximum or complete-linkage clustering

Max(d(a,b))

Minimum or single-linkage clustering

Min(d(a,b))

Mean or average linkage clustering

sum of all d(a,b) |A|+|B|

*a belongs to A, b belongs to B

Distance Matrix:

Distance Matrix is an NxN matrix where a point (i, j) denotes the alignment distance between the ith and the jth DNA sequence strings. Due to the slower computation power of Python over C++, computation of edit distance between DNA sequence strings takes a much longer time (around 22000 seconds). As a result, this computation can not be repeated. Hence the distance matrix is stored as a pickle file to be reused later.

Linkage Matrix

Scipy uses a special matrix called a linkage matrix to draw dendrograms. The shape of the matrix is 2N-2 x 4, where the ith row represents the merging of two clusters to form the (n+i)th cluster. The first and second columns of the matrix contain the clusters being merged, the third column contains the distance between the two clusters being merged, and the fourth column contains the number of elements in the merged cluster.

Agglomerative Clustering Max

Agglomerative Clustering - Min

Agglomerative Clustering - Mean

Visual Summary:

  • The first diagram implies the dendrogram diagram with Max or complete-linkage clustering.
  • The second graphical diagram shows the dendrogram diagram of Min Agglomerative Clustering
  • The third diagram shows the Agglomerative clustering classification based on Mean.

 

Python Implementation:

The Dataset:

The Dataset used here is a Human Gene Data sequence in a text file format, where the key is the gene sequence’s name and the value contains the entire gene string.

Implementing Agglomerative Clustering:

The necessary packages and modules are imported and we create the necessary functions to create the distance matrix, the minimum distance cluster pair and we merge the clusters based on their distance respectively. Here I’ve used the hierarchy cluster module and dendrogram from scipy package.

Code:

Importing the Libraries:

import numpy as np

import weakref

from collections import defaultdict

from data_reader import DataReader

from similarity import computeDistance

from time import time

import math

from pathlib import Path

import pickle

import pprint

from matplotlib import pyplot as plt

from scipy.cluster.hierarchy import dendrogram

import sys

Compute Distance Matrix:

# Compute Distance among DNA Sequence
                cls.simMatrix = np.ones((cls.ClusterCount, cls.ClusterCount))
                for cID in range(cls.ClusterCount):
                    clusterA = cls.getClusterById(cID)
                    for _cID in range(cID, cls.ClusterCount):
                        clusterB = cls.getClusterById(_cID)
                        seq1 = clusterA.sequences[0]
                        seq2 = clusterB.sequences[0]
                        similarity_1 = computeDistance(seq1, seq2)
                        cls.simMatrix[cID, _cID] = similarity_1
                        cls.simMatrix[_cID, cID] = similarity_1
                        print("similarity between {} and {} = {}\r".format(cID, _cID, similarity_1), end='', flush=True)
                        sys.stdout.flush()
                    print('')

Merge Clusters based on Max, Min, or Mean:

def mergeSimilarClusters(cls, mergedRC, toDelete, iteration, dist, heuristic = 
'Centroid'):
    
''' Cluster merging and new CLuster Creation based on the preset Heuristic
Available Heuristic Values are,
                - Centroid
                - Max
                - Min'''
        outdated_m = mergedRC
        outdated_d = toDelete
        
        delCluster = cls.getClusterById(outdated_d)
        toDelete = delCluster._id
        mergedCluster = cls.getClusterById(outdated_m)
        mergedRC = mergedCluster._id

        d_mem = delCluster.memberCount
        m_mem = mergedCluster.memberCount

        if heuristic == 'Centroid': #Compute using Centroid Calculation
            rowSum = (cls.simMatrix[outdated_m, :]*m_mem + cls.simMatrix[outdated_d, :]*d_mem)/(m_mem+d_mem)
        elif heuristic == 'Max':    #Compute using Max Calculation
            rowStack = np.vstack((cls.simMatrix[outdated_m, :], cls.simMatrix[outdated_d, :]))
            rowSum = np.amax(rowStack, axis=0)
        elif heuristic == 'Min':    #Compute using Min Calculation
            rowStack = np.vstack((cls.simMatrix[outdated_m, :], cls.simMatrix[outdated_d, :]))
            rowSum = np.amin(rowStack, axis=0)

        #Update the new row
        cls.simMatrix[:, outdated_m] = rowSum
        cls.simMatrix[outdated_m, :] = rowSum
        cls.simMatrix[:, outdated_d] = 1*math.inf
        cls.simMatrix[outdated_d, :] = 1*math.inf

        #Merge Data Points into the cluster
        for key in delCluster.clusterMembers.keys():
            mergedCluster.addMember(key, delCluster.clusterMembers[key])
        mergedCluster.updateID(iteration)
        mergedCluster.incrementFactor()
        print('Union ({} - {}), distance {}'.format(toDelete, mergedRC, dist))

 

Conclusion:

Hence using the Agglomerative Algorithm, Clustering was performed and the dendrograms were obtained on max, min, and mean-based clustering attributes.

Download Complete Code

Comments

No comments yet