Coders Packet

Implementing K-Means Clustering Algorithm in C++ with an Example.

By Yash Kothekar

In this project, K-Means Clustering is used to group Mall Customers based on their age, Annual Income, and Spending Score. C++ 17 is used.

This is an implementation of the K-Means Clustering algorithm. The data of Mall Customers was taken from Kaggle, then refined to get precise clustering(eg: binary data was avoided). The data consists of 3 columns-Age(Years), Annual Income(in thousand dollars), and Spending Score.

Standardization is a good practice if data is in varying scales to prevent larger-scale variables from dominating cluster formation. It was performed here too, even though scales are quite similar here. Struct data structure is used to represent 'Points' on the coordinate axes. First, the required number of clusters was found via 'The elbow method'. The program was run for K=1-9 and 'elbow was observed at K=4. Above this value of K, we can observe the program overfitting. Then initial cluster centroid values were chosen from the data points and were then updated via the algorithm until the difference was less than the error limit specified. Initial cluster and final points were displayed along with SSE(sum of squared error).

The final Centroids were recorded and the data points were put into a CSV file along with their cluster-id. Below you can see the clusters and centroids(in the 2nd Image, darker and diamond-shaped ). The axis is based on standardized data, hence negative values may occur.

Thus K-Means clustering was implemented and visualized via C++.

 

Download project

Reviews Report

Submitted by Yash Kothekar (YashK)

Download packets of source code on Coders Packet