In this Python project, we apply machine learning classification algorithms to a patient dataset. The main aim is to predict whether each patient in the dataset has diabetes.
The dataset was obtained from Kaggle.
The dataset consists of several medical predictor variables and one target variable. The columns are as follows:
Import the necessary packages
Import the dataset from the local folder
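A minimal sketch of the loading step, assuming the data is a CSV file read with pandas. The helper name `load_dataset` and the file name `diabetes.csv` are placeholders for illustration and are not taken from the project.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read a CSV file from disk into a pandas DataFrame."""
    path = Path(csv_path)
    if not path.is_file():
        # Fail early with a clear message if the local file is missing.
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)


# Example call (hypothetical file name, adjust to your local folder):
# df = load_dataset("diabetes.csv")
# print(df.head())
```

Calling `df.head()` right after loading is a quick way to confirm that the columns were parsed as expected.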
Exploratory data analysis: This step is about getting an overall understanding of the data. We inspect its properties, visualize it, and check that the data is correct and ready to use with the machine learning algorithms.
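The kind of checks described above could look roughly like the sketch below. The helper `quick_overview` and the tiny `toy` frame are made up for illustration; with the real data you would pass the loaded DataFrame instead.

```python
import pandas as pd


def quick_overview(df):
    """Collect a few basic facts about a DataFrame for a first look."""
    return {
        "shape": df.shape,                              # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),      # column types
        "missing": df.isna().sum().to_dict(),           # missing values per column
        "duplicates": int(df.duplicated().sum()),       # fully duplicated rows
        "summary": df.describe(),                       # numeric summary statistics
    }


# Tiny made-up frame, only to show the output format (not project data).
toy = pd.DataFrame({"feature_a": [1.0, 2.0, None, 2.0], "label": [0, 1, 0, 1]})
overview = quick_overview(toy)
print(overview["shape"])
print(overview["missing"])
print(overview["duplicates"])
```

Missing values and duplicated rows are worth checking before modeling, because many scikit-learn estimators do not accept missing values.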
Splitting the dataset for training and testing the model.
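One common way to do the split is scikit-learn's `train_test_split`. The sketch below is run on a made-up frame; the target column name `Outcome`, the 80/20 ratio, the stratified split, and the fixed seed are all assumptions, since the text does not state them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df, target_col, test_size=0.2, seed=42):
    """Separate features from the target and split into train and test sets."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    # stratify=y keeps the class proportions the same in both parts.
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )


# Made-up data standing in for the real dataset (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "Outcome": np.repeat([0, 1], 50),
})

X_train, X_test, y_train, y_test = split_features_target(demo, "Outcome")
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same test set.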
RANDOM FOREST CLASSIFIER
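A minimal sketch of training a random forest with scikit-learn. It uses synthetic data from `make_classification` as a stand-in for the real dataset, and the hyperparameters shown (`n_estimators=100`, fixed seeds) are assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in, not the Kaggle dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the forest on the training part only.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# score() returns mean accuracy on the given data.
test_accuracy = rf.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

The printed accuracy refers to the synthetic data only and says nothing about the results reported for the actual dataset.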
Finally, we evaluated our models on the basis of the following metrics.
I have also included the Area Under the Curve (AUC), as AUC is a good way of comparing which model is better.
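As a small illustration of how AUC can be computed, the sketch below assumes that AUC here means the area under the ROC curve and uses scikit-learn's `roc_auc_score`. The labels and scores are made-up numbers chosen so the result can be checked by hand.

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC equals the fraction of (positive, negative) pairs in which the
# positive example receives the higher score. Here 3 of the 4 pairs
# are ordered correctly, so the AUC is 0.75.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

For a fitted scikit-learn classifier, the scores would typically come from `model.predict_proba(X_test)[:, 1]`. Because AUC is computed from the ranking of scores rather than from a single decision threshold, it is convenient for comparing models.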
In this dataset, correctly identifying true positives (patients who do have diabetes) matters more than correctly identifying true negatives. As such, greater focus will be placed on Accuracy and Recall.
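To make the two metrics concrete, here is a hand-checkable sketch using scikit-learn's metric functions. The labels and predictions are made-up numbers, not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1 = has diabetes, 0 = does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)  # (tp + tn) / total = 7 / 10
recall = recall_score(y_true, y_pred)      # tp / (tp + fn)   = 3 / 4

print(tn, fp, fn, tp)    # 4 2 1 3
print(accuracy, recall)  # 0.7 0.75
```

Recall only looks at the patients who actually have diabetes, so a missed case (a false negative) lowers recall, while a false alarm (a false positive) does not.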
We can see that the test accuracies of the various models are generally within the same range, from approximately 73% to 81%.
Based on the Accuracy and Recall scores, KNN produced the best results overall, and it has a good AUC score as well.
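A comparison like the one described above can be sketched as a loop over models that records accuracy, recall, and AUC for each. The example uses synthetic data and default-style hyperparameters as assumptions, so its numbers are unrelated to the 73% to 81% range reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the Kaggle dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Two of the model types mentioned in the text; settings are assumptions.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard class labels
    y_prob = model.predict_proba(X_test)[:, 1]   # score for the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Evaluating every model on the same held-out test set keeps the comparison fair. KNN is distance-based, so in practice it is often paired with feature scaling, which this sketch leaves out.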