Product purchase prediction using K Nearest Neighbors classification model in Python
By Nitya Harshitha T
This project uses the K-Nearest Neighbors (KNN) algorithm to predict whether a product will be purchased by a customer based on their age and estimated salary.
Kindly install Jupyter Notebook, or if you are already familiar with VS Code, add the Jupyter extension. This is where we will execute the code as interlinked snippets (cells).
We will be using the pandas, NumPy, matplotlib, scikit-learn, and seaborn Python libraries, so have them installed too.
- This project implements the KNN classifier concept in machine learning, which lets the program predict whether a product will be bought or not when the age and salary of a customer are given as inputs.
- The concept has been implemented both from scratch and using the inbuilt library, to compare how well the model performs in each case.
- The scratch model also helps us understand the inner workings of KNN classification, for learning purposes.
- The inbuilt model uses the Python library ‘sklearn’ (scikit-learn). This library provides almost all popular ML algorithms, such as linear regression, logistic regression, decision trees, the Naïve Bayes classifier, etc.
The dataset used is a product dataset consisting of the User ID, Gender, Age, and Estimated Salary of each customer, along with whether they purchased the product (given as 1) or not (given as 0). Since the first two columns aren’t of much use for classification, we drop them and use only the age and salary attributes as inputs and the purchased attribute as the output variable.
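A minimal sketch of this preprocessing step. The column names below ('User ID', 'Gender', 'Age', 'EstimatedSalary', 'Purchased') are assumptions standing in for the actual headers in Product.csv, and the two rows are made-up stand-in data:

```python
import pandas as pd

# Tiny stand-in for Product.csv (column names and values are assumed)
df = pd.DataFrame({
    "User ID": [15624510, 15810944],
    "Gender": ["Male", "Female"],
    "Age": [19, 35],
    "EstimatedSalary": [19000, 20000],
    "Purchased": [0, 0],
})

# Drop the two identifier columns that carry no predictive signal
df = df.drop(columns=["User ID", "Gender"])

X = df[["Age", "EstimatedSalary"]].values  # input features
y = df["Purchased"].values                 # output variable
print(df.columns.tolist())
```

In the real project, `df` would come from `pd.read_csv("Product.csv")` instead of being built by hand.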
We will be executing the code individually as separate snippets which will work together to form the KNN model for our prediction purpose.
INBUILT LIBRARY MODEL:
- Import the necessary libraries, load the CSV file, and print the DataFrame to check that it loaded correctly. We use a Product.csv data file as the input for this program.
- Assign the input columns to X and the output column to Y, then perform the train/test split to train the KNN model. Standardize the age and salary values so they are on a common scale.
- Fit the KNeighborsClassifier model from the scikit-learn library on the training data.
- Use this model to make predictions on the test data and compare them with the original values to get the accuracy.
- We can use the classification report and confusion matrix functions to get a detailed analysis of the number of correct and incorrect classifications made by the model trained with the inbuilt function.
- Finally, the seaborn module is used to plot the confusion matrix.
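The steps above can be sketched as follows. Since Product.csv is not included here, a synthetic Age/Salary dataset (with a made-up labeling rule) stands in for it; the choice of k = 5 is also an assumption, not taken from the original code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Synthetic stand-in for Product.csv: in the real project, load it with
# pd.read_csv("Product.csv") and keep only the Age and EstimatedSalary columns.
rng = np.random.default_rng(0)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15000, 150000, size=200)
# Assumed toy rule: older, higher-paid customers tend to purchase
purchased = ((age > 40) | (salary > 90000)).astype(int)
X = np.column_stack([age, salary])
y = purchased

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Standardize age and salary to a common scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit the inbuilt KNN classifier (k = 5 neighbors, an assumed value)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the test data and compare with the original values
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("Accuracy:", acc)
print(classification_report(y_test, y_pred))
```

To visualize the confusion matrix as in the project, `sns.heatmap(cm, annot=True)` from seaborn can be plotted with matplotlib.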
SCRATCH IMPLEMENTATION MODEL:
- Here we perform the same X and Y split as before.
- Many distance measures can be used for KNN classification, but in my case I have used the Euclidean distance to measure the distance between the input instance to be predicted and the other points in the dataset.
- Next, a KNN function is declared that takes a particular point in X_test for which to perform KNN. This point is converted to a NumPy array to simplify the distance calculation, and an empty distance list is declared.
- Iterating through the length of X_train, the Euclidean distance from the X_test point to every point in X_train is calculated, and these values are appended to the distance list.
- Next, this distance list is converted into a pandas DataFrame so that the inbuilt pandas function sort_values can sort the distances in ascending order.
- Find the most frequent class among the nearest neighbors using the SciPy mode function to determine which class the new point belongs to.
- This mode is then appended to the Y_bar list. The process is repeated for every test point to give the full list of y_predicted values.
- After performing the train/test split, apply the scratch KNN function to the training data and run it on the test data to get the predicted values.
- Find the accuracy score by comparing the model's predicted values with the original values in the dataset.
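The scratch steps above can be sketched as follows. This is a minimal version under stated assumptions: the function and variable names (`knn_predict`, `y_bar`) and the default k = 5 follow the description, not the original code, and the usage data at the bottom is made up:

```python
import numpy as np
import pandas as pd
from scipy import stats

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((p - q) ** 2))

def knn_predict(X_train, y_train, X_test, k=5):
    y_bar = []
    for point in X_test:
        point = np.asarray(point)  # simplify the distance calculation
        distances = []
        # Distance from the test point to every point in X_train
        for i in range(len(X_train)):
            distances.append(euclidean_distance(np.asarray(X_train[i]), point))
        # Sort the distances in ascending order via a pandas DataFrame
        df = pd.DataFrame({"dist": distances, "label": y_train})
        nearest = df.sort_values("dist").head(k)
        # Most frequent class among the k nearest neighbors (SciPy mode)
        y_bar.append(stats.mode(nearest["label"], keepdims=False).mode)
    return np.array(y_bar)

# Made-up toy data: two "did not buy" and two "bought" customers
X_train = np.array([[25, 30000], [30, 40000], [45, 90000], [50, 120000]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[26, 32000], [48, 100000]])
preds = knn_predict(X_train, y_train, X_test, k=3)
print(preds)
```

The accuracy can then be computed by comparing `preds` against the true test labels, e.g. with `sklearn.metrics.accuracy_score`. (The `keepdims=False` argument to `stats.mode` assumes SciPy 1.9 or newer.)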
The accuracy of most scratch implementations tends to be lower than that of the inbuilt functions, which are heavily optimized. This applies not only to KNN but also to most other popular machine learning algorithms, such as linear regression, logistic regression, decision trees, Naïve Bayes, K-means clustering, etc.