Introduction
Whenever we work on real-world machine learning projects, dealing with missing data is one of the first steps. Datasets collected from various sources, such as surveys, web scraping, or Kaggle, often contain missing or incomplete values. So, in this blog I am going to walk through techniques for handling missing data.
Table of Contents
- Introduction
- What is Data?
- Why Do We Need to Handle Missing Values?
- Techniques Used
- Procedure
- Conclusion
What is Data?
Data is a collection of information.
There are two types of data:
- Structured data – Data that can be organized into rows and columns. (Ex: Numbers)
- Unstructured data – Data that cannot be organized into rows and columns. (Ex: Images, Audio, etc.)
Why do we need to handle missing data in machine learning?
- To improve the performance of our model
- By handling missing values, we get more accurate outputs
- When we handle the missing data, the model is easier to train
Techniques Used to Handle Missing Data in Machine Learning:
- Drop Rows – Use when only a few rows have missing values.
- Fill With Mean/Median/Mode – Use when the dataset contains numerical features; the mode can also be used for categorical features.
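The fill technique can be sketched as below. This is a minimal sketch that reuses the blog's sample data; the name `df_filled` and the choice of which statistic goes with which column (mean for Age, median for Salary, mode for Gender) are assumptions made for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Sample data from this blog (np.nan marks a missing value)
data = {
    'Age': [20, 30, 45, np.nan, 12, 49, np.nan],
    'Gender': ["Male", np.nan, "Female", np.nan, "Male", "Male", np.nan],
    'Salary': [2000, 3000, np.nan, 1000, np.nan, 4500, np.nan],
    'Purchased': ["yes", "no", "yes", "no", "yes", "yes", "no"],
}
df = pd.DataFrame(data)

# Work on a copy so the original data frame stays unchanged
df_filled = df.copy()

# Numerical column: replace missing values with the column mean
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())

# Numerical column: replace missing values with the column median
df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].median())

# Categorical column: replace missing values with the most frequent value (mode)
df_filled['Gender'] = df_filled['Gender'].fillna(df_filled['Gender'].mode()[0])

print(df_filled)
```

Unlike dropping rows, filling keeps all seven rows of the sample data. The median is usually less sensitive to extreme values than the mean, which is one reason to prefer it for a column like Salary.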
Procedure:
Step-01: Import the necessary libraries.
# Install the libraries (notebook syntax)
!pip install numpy pandas

import numpy as np
import pandas as pd
Step-02: Load the dataset (sample data)
# Sample data (np.nan marks a missing value)
data = {
    'Age': [20, 30, 45, np.nan, 12, 49, np.nan],
    'Gender': ["Male", np.nan, "Female", np.nan, "Male", "Male", np.nan],
    'Salary': [2000, 3000, np.nan, 1000, np.nan, 4500, np.nan],
    'Purchased': ["yes", "no", "yes", "no", "yes", "yes", "no"]
}

# Create a data frame
df = pd.DataFrame(data)
print(df)  # display the data in the form of rows and columns
Step-03: Display and drop the rows that contain missing values
# Find and display the rows that contain at least one null value
dropped_rows = df[df.isnull().any(axis=1)]
print(dropped_rows)

# Drop the rows that contain null values
df_cleaned = df.dropna()
print(df_cleaned)
Output:
    Age Gender  Salary Purchased
0  20.0   Male  2000.0       yes
5  49.0   Male  4500.0       yes
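Before dropping rows, it can help to check how much data would be lost. The sketch below counts the missing values per column and the rows that contain at least one missing value. It rebuilds the same sample data so it runs on its own; the names `missing_per_column` and `rows_with_missing` are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from this blog (np.nan marks a missing value)
data = {
    'Age': [20, 30, 45, np.nan, 12, 49, np.nan],
    'Gender': ["Male", np.nan, "Female", np.nan, "Male", "Male", np.nan],
    'Salary': [2000, 3000, np.nan, 1000, np.nan, 4500, np.nan],
    'Purchased': ["yes", "no", "yes", "no", "yes", "yes", "no"],
}
df = pd.DataFrame(data)

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Number of rows with at least one missing value (the rows dropna() removes)
rows_with_missing = int(df.isnull().any(axis=1).sum())
print(f"Rows with missing values: {rows_with_missing} of {len(df)}")
```

On this sample, 5 of the 7 rows contain at least one missing value, so dropping rows keeps only 2 of them. With that much loss, filling the missing values may be a better choice than dropping.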
Conclusion:
From this blog, I conclude that handling missing values in the dataset is crucial for reducing errors during training and for getting accurate outputs from the model.