Data Preprocessing Using Pandas and NumPy

In this tutorial, we will learn data preprocessing using pandas and NumPy.

Why is Data Preprocessing Important?

Data preprocessing is a crucial task in machine learning because raw datasets often contain inconsistencies, noise, missing values, and redundant information. These problems can degrade the performance of a model. Preprocessing ensures that the dataset is clean, structured, and suitable for modeling.

To understand the basics of pandas, you can visit: Data Wrangling using Pandas

Step 1: Import required libraries

In this tutorial, we will focus on pandas and NumPy, so we import these two libraries.

import pandas as pd
import numpy as np

Step 2: Load the Dataset

data = {
    'Age': [25, 25, np.nan, 30, 35, 40, np.nan, 50, np.nan],
    'Salary': [50000, 50000, 54000, np.nan, 62000, 72000, 80000, np.nan, np.nan],
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Pune', np.nan, 'Kolkatta', 'Benguluru', 'Hyderabad', np.nan]
}
df = pd.DataFrame(data)
print(df)

Output:

 

Step 3: Handling Missing Values

There are two methods to handle missing values.

  1. Removing Missing Values
  2. Imputing Missing Values

First, let's look at the column types and the number of non-null entries in each column.

df.info()

Output:

Let’s check if there are any missing values.

df.isnull().sum()

Output:

3.1 Removing Missing Values

This method is useful when the dataset is large and some records have most or all of their fields empty. In our dataset, the last row has no values at all. Passing how='all' drops only rows in which every field is missing, so partially filled rows are kept.
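The difference between dropping fully empty rows and dropping any row that has a gap can be sketched on a small frame. The toy DataFrame `demo` below is an assumption made for illustration, not the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete row, one partial row, one fully empty row
demo = pd.DataFrame({
    'Age': [25, np.nan, np.nan],
    'Salary': [50000, 54000, np.nan],
})

# how='all' drops a row only when every field is missing
only_empty_removed = demo.dropna(how='all')

# how='any' (the default) drops a row when at least one field is missing
any_gap_removed = demo.dropna(how='any')

print(len(only_empty_removed))  # 2 rows remain
print(len(any_gap_removed))     # 1 row remains
```

Because how='any' would also discard the partially filled rows, the tutorial's dataset is cleaned with how='all':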

df.dropna(how='all', inplace=True)
print(df)

Output:

 

3.2 Imputing Missing Values

Imputation fills missing values with a substitute instead of dropping the record. For numerical fields, common choices are the mean, median, or mode of the column.

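Here we have two numerical columns, ‘Age’ and ‘Salary’. For Age, we use the median of the non-missing records, which is robust to outliers. Note that the median is not guaranteed to be an integer: with an even number of values it is the average of the two middle ones (32.5 for this dataset), so round it if whole-number ages are required. For Salary, we use the mean of the non-missing records, which leaves the column's average unchanged. City is categorical, so we fill its missing values with the placeholder 'Unknown'.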
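As a quick check of the statistics involved, the sketch below computes the median and mean on standalone copies of the two columns (with the fully empty row already removed). The variable names are illustrative, and the flooring step is an optional assumption for cases where whole-number ages are required.

```python
import numpy as np
import pandas as pd

# Copies of the tutorial's columns after the fully empty row is dropped
ages = pd.Series([25, 25, np.nan, 30, 35, 40, np.nan, 50])
salaries = pd.Series([50000, 50000, 54000, np.nan, 62000, 72000, 80000, np.nan])

# median() and mean() skip NaN values by default
age_median = ages.median()      # 32.5: average of the two middle values, 30 and 35
salary_mean = salaries.mean()   # 368000 / 6

# Optional: floor the median if ages must stay whole numbers
ages_filled = ages.fillna(np.floor(age_median)).astype(int)

print(age_median)
print(ages_filled.tolist())
```

The imputation applied to the tutorial's DataFrame follows.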

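# Assign the result back to the column: calling fillna(inplace=True) on a
# selected column may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df['City'] = df['City'].fillna('Unknown')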
print(df)

Output:

After handling missing values, we can see that the first two rows are identical, so we remove the duplicate.
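Before dropping anything, `duplicated()` can be used to see which rows would be removed. A minimal sketch on a hypothetical three-row frame (`demo` is illustrative, not the tutorial's data):

```python
import pandas as pd

# Hypothetical frame whose first two rows are identical
demo = pd.DataFrame({
    'Age': [25, 25, 30],
    'City': ['Delhi', 'Delhi', 'Pune'],
})

# duplicated() marks every repeat of an earlier row as True
dup_mask = demo.duplicated()
print(dup_mask.tolist())  # [False, True, False]

# drop_duplicates() keeps the first occurrence by default (keep='first')
deduped = demo.drop_duplicates()
print(len(deduped))  # 2
```

On the tutorial's DataFrame, the duplicate is removed in place: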

df.drop_duplicates(inplace=True)
print(df)

Output:

Step 4: Data Transformation

Most machine learning models require numerical inputs, so we convert categorical data into a numerical format. Two common types of encoding are:

  1. One Hot Encoding

  2. Label Encoding
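One-hot encoding is not applied in this tutorial, but for comparison it could look like the sketch below, which uses `pd.get_dummies` on a small hypothetical `City` column. The frame `demo` and the `City` prefix are choices made for this example.

```python
import pandas as pd

# Hypothetical column of city names
demo = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi']})

# One new indicator column per distinct city
one_hot = pd.get_dummies(demo, columns=['City'], prefix='City')

print(one_hot.columns.tolist())  # ['City_Delhi', 'City_Mumbai']
```

One-hot encoding avoids implying an order between cities, at the cost of one extra column per category. Label encoding keeps a single column but assigns integers whose ordering carries no meaning.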

Here we perform label encoding, which assigns an integer label to each city as follows.

print("Label Encoding for City:")
print(dict(enumerate(df['City'].astype("category").cat.categories)))

Output:

Label Encoding for City:
{0: 'Benguluru', 1: 'Delhi', 2: 'Hyderabad', 3: 'Kolkatta', 4: 'Mumbai', 5: 'Pune', 6: 'Unknown'}

Now we replace the city names in the records with these labels.
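Once the column is overwritten, the city names are no longer in the DataFrame, so it can be useful to store the mapping first. A minimal sketch on a hypothetical Series (`cities`, `code_to_city`, and `decoded` are illustrative names):

```python
import pandas as pd

cities = pd.Series(['Delhi', 'Mumbai', 'Pune', 'Delhi'])

# Categories are sorted alphabetically, so the codes follow that order
as_category = cities.astype('category')
codes = as_category.cat.codes
code_to_city = dict(enumerate(as_category.cat.categories))

# The stored mapping lets us recover the original names later
decoded = codes.map(code_to_city)

print(codes.tolist())    # [0, 1, 2, 0]
print(decoded.tolist())  # ['Delhi', 'Mumbai', 'Pune', 'Delhi']
```

The replacement on the tutorial's DataFrame is done as follows.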

df['City'] = df['City'].astype('category').cat.codes
print(df)

Output:

 

Conclusion

In this tutorial, we covered the basics of data preprocessing:

  • Handled missing values
  • Removed duplicates
  • Converted categorical data to numerical labels

Now the data is clean and ready for machine learning modeling.
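
For reuse, the steps covered here could be collected into a single helper. The function below is a sketch rather than part of the walkthrough itself: the name `preprocess` is hypothetical, the small `raw` frame is made up for the demonstration, and the code assumes the same Age, Salary, and City columns.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the tutorial's steps to a copy of df (hypothetical helper)."""
    out = df.dropna(how='all').copy()                    # drop fully empty rows
    out['Age'] = out['Age'].fillna(out['Age'].median())  # impute Age with the median
    out['Salary'] = out['Salary'].fillna(out['Salary'].mean())  # impute Salary with the mean
    out['City'] = out['City'].fillna('Unknown')          # placeholder for missing cities
    out = out.drop_duplicates()                          # remove repeated rows
    # Label-encode City; assign() returns a new frame
    return out.assign(City=out['City'].astype('category').cat.codes)

# Made-up input: a duplicate row, a partially filled row, and an empty row
raw = pd.DataFrame({
    'Age': [25, 25, np.nan, np.nan],
    'Salary': [50000, 50000, 54000, np.nan],
    'City': ['Delhi', 'Delhi', 'Mumbai', np.nan],
})
clean = preprocess(raw)
print(clean)
```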

 
