This project contains Python code that predicts whether a given individual will be employed based on certain criteria. The packages used are scikit-learn, pandas, NumPy and the DEAP framework.
The algorithm is trained on a dataset describing different people, and it then predicts whether a test person will be employed or not. The dataset contains features such as the person's relevant experience, nature of studies, enrollment in an institution, and the nature of the job and company they are applying to. The data is cleaned and passed through a feature selection model (a genetic algorithm), after which it is passed to the prediction model, a random forest classifier.
The first step is to load the entire dataset and separate the target (output) vector from the features. The data then goes through a series of preprocessing steps: missing values are identified and filled in (based on the nature of the data and the values of other features), and categorical features are encoded so that only numerical values are passed on to the later stages. All of this is done using pandas and NumPy.
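The preprocessing steps described above can be sketched roughly as follows. This is only an illustration on a tiny made-up table: the column names, the median/most-frequent fill rules and the one-hot encoding are assumptions, not necessarily what the project does for its own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "relevant_experience": ["yes", "no", None, "yes", "no"],
    "education_level": ["Graduate", "Masters", "Graduate", None, "Graduate"],
    "training_hours": [40.0, np.nan, 12.0, 85.0, 30.0],
    "employed": [1, 0, 0, 1, 0],
})

# Separate the target (output) vector from the features.
y = df["employed"]
X = df.drop(columns="employed")

# Fill missing values (assumed rule): median for numeric columns,
# most frequent value for categorical columns.
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
for col in cat_cols:
    X[col] = X[col].fillna(X[col].mode().iloc[0])

# Encode categorical features so that only numerical values remain.
X_encoded = pd.get_dummies(X, columns=list(cat_cols), dtype=int)
print(X_encoded.head())
```

For a real dataset, the fill rule for each column would be chosen after inspecting that column rather than applied uniformly like this.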
After this, the initial dataset is split into train, test and cross-validation sets. These are passed to the feature selection model, a genetic algorithm implemented with the DEAP framework. A genetic algorithm is modelled on Darwin's theory of evolution, where only the fittest survive: it generates various combinations of features and keeps the fittest among them based on how they perform in a prediction task. Each combination is evaluated against the cross-validation data, and the fittest individuals (feature combinations) are those that give the highest accuracy. These features are then extracted from the original dataset as the best combination to use for training the model. This is the engineered data.
The train-test split is performed again on the engineered dataset. The train data is used to train the prediction model, a random forest classifier. The trained classifier then predicts the test data, and the model's accuracy is measured. The accuracy was found to increase by about 7-10% when the genetic algorithm was used for feature selection instead of using all the features given in the dataset.
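The training and comparison step can be sketched as below. The data is synthetic, the `selected` mask is a hypothetical stand-in for whatever the genetic algorithm returns, and `holdout_accuracy` is a helper name introduced only for this example. The printed numbers say nothing about the project's real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, encoded dataset (assumption).
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)

# Hypothetical feature mask, standing in for the genetic algorithm's output.
selected = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0], dtype=bool)
X_engineered = X[:, selected]


def holdout_accuracy(features, target):
    """Train a random forest on a train split and score it on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


acc_all = holdout_accuracy(X, y)
acc_selected = holdout_accuracy(X_engineered, y)
print(f"All features:      {acc_all:.3f}")
print(f"Selected features: {acc_selected:.3f}")
```

Using the same `random_state` for both runs keeps the split and the forest identical, so any difference in accuracy comes from the feature set alone.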
This code can be used for other datasets too. The main difference is that handling missing values and outliers depends entirely on the dataset being used, so the user has to go through the dataset, get an idea of which values and features are present, and then fill in the missing values and detect or remove outliers accordingly. An important and helpful technique for data preprocessing is data visualisation. If the data is visualised in the form of histograms, heatmaps and other plots, the user gets a much better idea of what kind of data they are dealing with. This can be done using libraries like matplotlib and seaborn.
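As a rough illustration of this kind of exploration, the sketch below counts missing values, draws a histogram of one column and shows a correlation heatmap. The data and column names are invented, and plain matplotlib is used for both plots; seaborn's `heatmap` function would be a common alternative for the second one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric data with some missing values injected.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "training_hours": rng.exponential(50, 200),
    "experience_years": rng.integers(0, 20, 200).astype(float),
    "city_index": rng.uniform(0.4, 1.0, 200),
})
df.loc[rng.choice(200, 15, replace=False), "training_hours"] = np.nan

# How many values are missing in each column?
print(df.isna().sum())

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows skew and possible outliers in a single feature.
axes[0].hist(df["training_hours"].dropna(), bins=30)
axes[0].set_title("training_hours")

# Heatmap of pairwise correlations between features.
corr = df.corr()
im = axes[1].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1].set_xticks(range(len(corr.columns)))
axes[1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1].set_yticks(range(len(corr.columns)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("correlation")
fig.colorbar(im, ax=axes[1])
fig.tight_layout()
```

The figure can then be written to disk with `fig.savefig(...)` or displayed with `plt.show()` when an interactive backend is used.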
Submitted by Srivatsan Sridhar (srivatsansridhar99)