By Jeet Chawla
This project contains a Jupyter Notebook in which several machine learning algorithms are used to predict gender from voice data. The project is written in Python.
Key points of the project:
- The dataset was built from many voice samples of male and female speakers. The source preprocessed the WAV recordings and converted the result to a CSV file, which was downloaded from Kaggle.
- The dataset has 1 target variable, label (male or female), and 20 independent variables (different attributes that help distinguish female voices from male ones, or vice versa).
- The data was loaded into the notebook and checked for outliers and null values. Because it had already been preprocessed by the source, we only needed to apply label encoding to the target column.
- After loading, the features were selected and the data was split into a training set (80%) and a testing set (20%).
- The machine learning models applied were Decision Tree, Random Forest, SVM, and Naive Bayes. Random Forest had the best accuracy for this problem (an accuracy score of 97.79% on the test data). Decision Tree was second best, while Support Vector Machine and Naive Bayes gave average performance.
- The evaluation metrics used were F1-Score, Recall, Precision, Confusion Matrix, and ROC Curve. In addition to these metrics, a few more graphs were plotted to visualize the performance of the different models.
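The steps listed above (label encoding, an 80/20 split, and training the four models) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes scikit-learn and pandas are installed, and it uses a synthetic stand-in dataset so that it runs without the Kaggle CSV. The helper names `make_stand_in_data` and `compare_models`, the `feature_i` column names, the choice of `GaussianNB` for Naive Bayes, the default hyperparameters, and the stratified split are all assumptions made for illustration. The accuracies it prints come from synthetic data and say nothing about the project's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_stand_in_data(n_samples=400, seed=0):
    """Hypothetical stand-in for the Kaggle CSV: 20 numeric features plus 'label'."""
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=8, random_state=seed
    )
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(20)])
    df["label"] = pd.Series(y).map({0: "female", 1: "male"})
    return df


def compare_models(df, seed=0):
    """Encode the target, split 80/20, fit four classifiers, return test accuracies."""
    if df.isnull().values.any():
        raise ValueError("dataset contains null values")

    X = df.drop(columns="label")
    # LabelEncoder sorts classes alphabetically: female -> 0, male -> 1.
    y = LabelEncoder().fit_transform(df["label"])

    # 80% training data, 20% testing data (stratification is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )

    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
        "Naive Bayes": GaussianNB(),  # assumed variant; the source only says "Naive Bayes"
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


scores = compare_models(make_stand_in_data())
# Print models from highest to lowest test accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

To try this on the real data, the stand-in frame could be replaced by a `pd.read_csv(...)` call on the downloaded file, provided its target column is named `label` as described above.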
Conclusion: Based on the precision, F1-score, and recall values, the prediction pie chart, accuracy score, confusion matrix, ROC curve, and AUC line graph, Random Forest classification is best suited to this problem of predicting whether a person is male or female from their voice signals. Support Vector Machine is the least suitable based on the same analysis.
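The precision, recall, and F1-score values used in this comparison all follow from the four cells of a binary confusion matrix. The sketch below shows that relationship in plain Python. The function names and the toy label lists are invented for illustration and are not the project's predictions. It assumes the encoding 1 = male, 0 = female, with "male" treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Derive precision, recall and F1 from the confusion-matrix counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels (made up): 1 = male, 0 = female.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

print(confusion_counts(y_true, y_pred))      # (3, 1, 1, 3)
print(precision_recall_f1(y_true, y_pred))   # (0.75, 0.75, 0.75)
```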