This is a Machine Learning project in Python that performs Data Exploration and Model Training (using 5 Classification Algorithms) on the Diabetes Dataset, using the NumPy, pandas, Seaborn, scikit-learn (sklearn), and Matplotlib libraries.
The overall structure of the project is given below.
Importing the Libraries: The required Python libraries are imported first: NumPy, pandas, Seaborn, scikit-learn (sklearn), and Matplotlib. NumPy and pandas are used for numerical computation and data analysis, Matplotlib and Seaborn for data visualization, and scikit-learn for predictive modeling.
Reading Data from File: The Diabetes CSV file is read into a pandas DataFrame.
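As a minimal sketch of this step, the snippet below reads CSV content with `pd.read_csv`. The actual file name and column names are not stated in this description, so the inline sample, its columns (`Glucose`, `BMI`, `Age`, `Outcome`), and its values are illustrative placeholders only; in the project, a path to the real CSV file would be passed instead.

```python
import io

import pandas as pd

# Illustrative placeholder standing in for the real CSV file.
# The column names and values here are assumptions, not the project's data.
SAMPLE_CSV = io.StringIO(
    "Glucose,BMI,Age,Outcome\n"
    "148,33.6,50,1\n"
    "85,26.6,31,0\n"
    "183,23.3,32,1\n"
    "89,28.1,21,0\n"
)

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(SAMPLE_CSV)

# Quick look at the first rows to confirm the file was parsed as expected.
print(df.head())
```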
Data Exploration: This includes inspecting the data, visualizing the data, and cleaning the data. Some of the steps used are as follows:
1. Viewing the data statistics.
2. Finding out the dimensions of the dataset, the variable names, the data types, etc.
3. Checking for null values.
4. Inspecting the target variable using a pie chart and a count plot.
5. Finding the correlation among the features using a heatmap, and the bivariate relation between each pair of features using a pair plot.
Model Training: 5 Classification Algorithms are trained and compared to find the best-performing one. These are Logistic Regression, Support Vector Machine, Random Forest, K-Nearest Neighbours, and Naive Bayes.
For each algorithm, the steps are as follows:
1. Importing the library for the algorithm.
2. Creating an instance of the classifier (with default parameter values, or with specific values in certain cases).
3. Training the model on the train set.
4. Making predictions on the test set using the trained model.
5. Calculating the accuracy of the prediction.
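The five steps above might look like this for one of the algorithms, Logistic Regression. This is a sketch under assumptions: the data is generated with `make_classification` as a stand-in for the diabetes features, and the 80/20 split, `random_state`, and `max_iter=1000` are illustrative choices not specified in this description.

```python
# Step 1: import the library for the algorithm.
from sklearn.linear_model import LogisticRegression

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (assumption: 8 numeric features, binary target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Assumed 80/20 train/test split; the project's actual split is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: create an instance of the classifier.
model = LogisticRegression(max_iter=1000)

# Step 3: train the model on the train set.
model.fit(X_train, y_train)

# Step 4: predict on the test set using the trained model.
y_pred = model.predict(X_test)

# Step 5: calculate the accuracy of the prediction.
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression accuracy: {accuracy:.3f}")
```

The other four algorithms follow the same pattern; only the imported class and its constructor arguments change.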
Prediction Accuracy Comparison: The accuracies of the 5 algorithms are compared to identify the one that predicts best.
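One possible way to carry out this comparison is to train all five classifiers in a loop and collect their test accuracies in a dictionary. The sketch below again uses synthetic data and an assumed 80/20 split. `GaussianNB` is chosen here as the Naive Bayes variant, which is an assumption, since the description does not say which variant was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the dataset; not the project's data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The five algorithms named above; parameter values are illustrative.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),  # assumed variant
}

# Train each model and record its accuracy on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Report the accuracies from highest to lowest and pick the best one.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")

best_name = max(accuracies, key=accuracies.get)
print(f"Best algorithm on this split: {best_name}")
```

A single train/test split gives only a rough ranking; the ordering can change with a different split or random seed.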