How to Detect and Remove Outliers in Pandas DataFrame

In this tutorial, we will learn how to detect and remove outliers in a Pandas DataFrame.

Introduction

Outliers are extreme values in your dataset that can distort your analysis. In this tutorial, we will use Pandas and NumPy to find and remove them using the Z-score method.

Why Remove Outliers?

Outliers can skew results and make models less accurate.

  • They distort averages and standard deviations.
  • They can mislead machine learning models.
  • They sometimes come from data entry errors.
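
To make the first point concrete, here is a minimal sketch comparing summary statistics with and without a single extreme value. The numbers are the same illustrative values used later in this tutorial, and the variable names `values` and `typical` are chosen only for this example.

```python
import pandas as pd

# Illustrative values: ten typical readings plus one extreme value (105).
values = pd.Series([10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12])
typical = values[values != 105]  # same data with the extreme value left out

# The mean and standard deviation shift noticeably; the median barely moves.
print("mean with outlier:    ", round(values.mean(), 2))
print("mean without outlier: ", round(typical.mean(), 2))
print("std with outlier:     ", round(values.std(), 2))
print("std without outlier:  ", round(typical.std(), 2))
print("median with / without:", values.median(), "/", typical.median())
```

In this sample, one value out of eleven is enough to pull the mean from 11.6 up to roughly 20.09, while the median only moves from 11.5 to 12.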

Step 1: Importing Required Libraries

We first need to import Pandas and NumPy to handle our dataset.

import pandas as pd
import numpy as np

Step 2: Create a Sample DataFrame

We create a small dataset of eleven numbers, one of which (105) is far larger than the rest.

data = {'col1': [10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12]} 
df = pd.DataFrame(data)

Step 3: Detect Outliers Using Z-Score

The Z-score measures how far a value is from the mean, expressed in standard deviations. If the absolute value of the Z-score is greater than 3 (a common rule of thumb), the value is considered an outlier.

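# Z-score: distance from the mean in units of the (sample) standard deviation
df['z_score'] = (df['col1'] - df['col1'].mean()) / df['col1'].std()
# Rows whose absolute Z-score exceeds 3 are flagged as outliers
outliers = df[df['z_score'].abs() > 3]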
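
One detail worth checking is which standard deviation is used. The Pandas `std()` method defaults to the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population standard deviation (`ddof=0`). The sketch below compares the two on the same illustrative values; the names `z_sample` and `z_population` are made up for this example.

```python
import numpy as np
import pandas as pd

col = pd.Series([10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12])

# Pandas default: sample standard deviation (ddof=1).
z_sample = (col - col.mean()) / col.std()

# NumPy default: population standard deviation (ddof=0).
z_population = (col - col.mean()) / np.std(col.to_numpy())

# Z-score of the extreme value (index 4) under each convention.
print("sample std:    ", round(z_sample[4], 4))
print("population std:", round(z_population[4], 4))
```

The population version always gives slightly larger Z-scores, so a borderline value can land on either side of the threshold depending on the convention. With the sample standard deviation, the largest possible absolute Z-score in a set of n values is (n - 1) / sqrt(n), so a threshold of 3 cannot flag anything when there are 10 or fewer rows.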

Step 4: Remove the Outliers

Once we’ve identified the outliers, we filter them out and drop the extra column.

df_clean = df[df['z_score'].abs() <= 3].drop(columns=['z_score'])
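
If you need to repeat this on several columns or datasets, the detection and removal steps can be wrapped into a single function. This is a minimal sketch: `remove_outliers_zscore` and its `threshold` parameter are hypothetical names introduced for illustration and are not part of Pandas.

```python
import pandas as pd

def remove_outliers_zscore(df, column, threshold=3.0):
    """Return a copy of df without rows whose |Z-score| in `column` exceeds threshold.

    Hypothetical helper for illustration. It uses the sample standard
    deviation (the Pandas default) and leaves the input DataFrame unmodified.
    """
    z = (df[column] - df[column].mean()) / df[column].std()
    # Keep only rows within the threshold; no helper column is added to df.
    return df[z.abs() <= threshold].copy()

df = pd.DataFrame({'col1': [10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12]})
df_clean = remove_outliers_zscore(df, 'col1')
print(df_clean)
```

Because the Z-scores are kept in a temporary Series, there is no extra column to drop afterwards. Note that rows with missing values in `column` are also dropped by this version, since comparisons with NaN evaluate to False.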

Step 5: Display the Results

Let’s print everything out to see the difference before and after removing outliers. Note that df now includes the z_score column added in Step 3.

print("Original DataFrame:") 
print(df) 
print("\nOutliers detected:") 
print(outliers) 
print("\nDataFrame after removing outliers:") 
print(df_clean)

Output

Original DataFrame:
    col1   z_score
0     10 -0.357822
1     12 -0.286902
2     11 -0.322362
3     10 -0.357822
4    105  3.010864
5     13 -0.251443
6     12 -0.286902
7     10 -0.357822
8     11 -0.322362
9     15 -0.180523
10    12 -0.286902

Outliers detected:
   col1   z_score
4   105  3.010864

DataFrame after removing outliers:
    col1
0     10
1     12
2     11
3     10
5     13
6     12
7     10
8     11
9     15
10    12

Conclusion

In this tutorial, we learned how to detect and remove outliers using the Z-score method in Pandas.

  • Detect Outliers: Using Z-score to find values far from the mean.
  • Remove Outliers: Filtering out extreme values to clean the dataset.
