How To Handle Missing Data in Machine Learning

Introduction

In real-world machine learning projects, dealing with missing data is one of the first steps. Datasets collected from sources such as surveys, web scraping, or Kaggle often contain missing or incomplete values. In this blog, I am going to walk through techniques for handling missing data.

Table of Contents

  • Introduction
  • What is Data?
  • Why Do We Need to Handle Missing Data?
  • Techniques Used
  • Procedure
  • Conclusion

What is Data?

Data is a collection of information.

There are two types of data:

  1. Structured data – Data that can be organized into rows and columns. (Ex: Numbers)
  2. Unstructured data – Data that cannot be organized into rows and columns. (Ex: Images, Audio, etc.)

Why Do We Need to Handle Missing Data in Machine Learning?

  • To improve the performance of our model
  • Handling missing values helps the model produce more accurate outputs
  • Many machine learning algorithms cannot work with missing values directly, so handling them first makes the model easier to train

Techniques Used to Handle Missing Data in Machine Learning:

  1. Drop Rows – Use when only a few rows have missing values.
  2. Fill With Mean/Median/Mode – Use the mean or median for numerical features. The mode also works for categorical features.

Procedure:

Step-01: Import the necessary libraries.

!pip install numpy pandas
import numpy as np
import pandas as pd

Step-02: Load the dataset – Sample data

# sample data
data = {
    'Age': [20, 30, 45, np.nan, 12, 49, np.nan],
    'Gender': ["Male", np.nan, "Female", np.nan, "Male", "Male", np.nan],
    'Salary': [2000, 3000, np.nan, 1000, np.nan, 4500, np.nan],
    'Purchased': ["yes", "no", "yes", "no", "yes", "yes", "no"]
}
# create a DataFrame
df = pd.DataFrame(data)
print(df)  # display the DataFrame as rows and columns
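
Before dropping anything, it can help to check how many values are missing in each column. The following is a minimal sketch using the same sample data. The names missing_per_column and missing_share are my own and are used only for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step-02
data = {
    'Age': [20, 30, 45, np.nan, 12, 49, np.nan],
    'Gender': ["Male", np.nan, "Female", np.nan, "Male", "Male", np.nan],
    'Salary': [2000, 3000, np.nan, 1000, np.nan, 4500, np.nan],
    'Purchased': ["yes", "no", "yes", "no", "yes", "yes", "no"],
}
df = pd.DataFrame(data)

# Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Fraction of missing values in each column (0.0 means nothing is missing)
missing_share = df.isnull().mean()
print(missing_share)
```

If only a small share of rows is affected, dropping them is usually acceptable. If a column is missing in many rows, filling is often the safer choice.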

Step-03: Display the rows with missing values, then drop them

dropped_rows = df[df.isnull().any(axis=1)]  # select the rows that contain at least one null value
print(dropped_rows)
df_cleaned = df.dropna()  # drop the rows that contain null values
print(df_cleaned)

Output (df_cleaned):
    Age Gender  Salary Purchased
0  20.0   Male  2000.0       yes
5  49.0   Male  4500.0       yes
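
The steps above cover only the first technique, dropping rows. The second technique from the list, filling with the mean, median, or mode, could be sketched as follows. Which statistic goes with which column is my assumption for illustration (mean for Age, median for Salary, mode for Gender), and df_filled is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step-02
data = {
    'Age': [20, 30, 45, np.nan, 12, 49, np.nan],
    'Gender': ["Male", np.nan, "Female", np.nan, "Male", "Male", np.nan],
    'Salary': [2000, 3000, np.nan, 1000, np.nan, 4500, np.nan],
    'Purchased': ["yes", "no", "yes", "no", "yes", "yes", "no"],
}
df = pd.DataFrame(data)

# Work on a copy so the original DataFrame stays unchanged
df_filled = df.copy()

# Numerical column: fill with the mean of the known values
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())

# Numerical column: fill with the median, which is less sensitive to outliers
df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].median())

# Categorical column: fill with the mode (the most frequent value)
df_filled['Gender'] = df_filled['Gender'].fillna(df_filled['Gender'].mode()[0])

print(df_filled)
```

Unlike dropping rows, this keeps all seven rows, at the cost of replacing unknown values with estimates.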

 

Conclusion:

Handling missing values is a crucial step before training a model: it reduces errors during training and helps the model give more accurate outputs.

