In this tutorial, we will learn how to detect and remove outliers in a Pandas DataFrame.
Introduction
Outliers are those weird, extreme values in your dataset that can mess up your analysis. In this tutorial, we will use Pandas and NumPy to find and remove them using the Z-score method.
Why Remove Outliers?
Outliers can skew results and make models less accurate.
- They distort averages and standard deviations.
- They can mislead machine learning models.
- They sometimes come from data entry errors.
Step 1: Importing Required Libraries
We first import Pandas and NumPy. Only Pandas is strictly needed for the steps below.
import pandas as pd
import numpy as np
Step 2: Create a Sample DataFrame
We create a small dataset of 11 numbers in which 105 is the obvious extreme value.
data = {'col1': [10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12]}
df = pd.DataFrame(data)
Step 3: Detect Outliers Using Z-Score
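The Z-score measures how far a value is from the mean, expressed in standard deviations. If the absolute value of the Z-score is greater than 3, the value is considered an outlier. Note that Pandas' std() computes the sample standard deviation (ddof=1) by default.
# Z-score: distance from the mean in units of the (sample) standard deviation
df['z_score'] = (df['col1'] - df['col1'].mean()) / df['col1'].std()
# Rows whose absolute Z-score exceeds 3 are flagged as outliers
outliers = df[df['z_score'].abs() > 3]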
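The detection step can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not the only one: the helper name `zscore_outliers` and its `threshold` parameter are hypothetical and are not part of Pandas. It assumes a numeric column and uses the sample standard deviation, which is the default for `Series.std()`.

```python
import pandas as pd


def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z| exceeds the threshold.

    Hypothetical helper for illustration; relies on the sample standard
    deviation (ddof=1), the default for Series.std().
    """
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold


df = pd.DataFrame({'col1': [10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12]})
mask = zscore_outliers(df['col1'])   # True only for rows flagged as outliers
outlier_rows = df[mask]
print(outlier_rows)
```

Returning a mask rather than the filtered rows lets you reuse it either to inspect the outliers (`df[mask]`) or to drop them (`df[~mask]`).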
Step 4: Remove the Outliers
Once we’ve identified the outliers, we filter them out and drop the extra column.
df_clean = df[df['z_score'].abs() <= 3].drop(columns=['z_score'])
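If you would rather not add a helper column and then drop it again, the same filter can be applied with a standalone boolean mask. This is a minimal sketch using the same data and threshold as above; the variable names `z` and `keep` are chosen here purely for illustration, and resetting the index is an optional extra.

```python
import pandas as pd

df = pd.DataFrame({'col1': [10, 12, 11, 10, 105, 13, 12, 10, 11, 15, 12]})

# Compute Z-scores as a separate Series so df itself is not modified.
z = (df['col1'] - df['col1'].mean()) / df['col1'].std()

# Keep only the rows whose absolute Z-score is within the threshold.
keep = z.abs() <= 3

# reset_index(drop=True) renumbers the rows 0..n-1; omit it to keep
# the original row labels.
df_clean = df[keep].reset_index(drop=True)

print(df_clean)
```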
Step 5: Display the Results
Let’s print everything out to see the difference before and after removing outliers.
print("Original DataFrame:")
print(df)
print("\nOutliers detected:")
print(outliers)
print("\nDataFrame after removing outliers:")
print(df_clean)
Output
Original DataFrame:
    col1   z_score
0     10 -0.357822
1     12 -0.286902
2     11 -0.322362
3     10 -0.357822
4    105  3.010864
5     13 -0.251443
6     12 -0.286902
7     10 -0.357822
8     11 -0.322362
9     15 -0.180523
10    12 -0.286902

Outliers detected:
   col1   z_score
4   105  3.010864

DataFrame after removing outliers:
    col1
0     10
1     12
2     11
3     10
5     13
6     12
7     10
8     11
9     15
10    12
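Notice that 105 only just crosses the threshold, with a Z-score of about 3.01. When the sample standard deviation is used, the largest possible |Z| in a sample of n values is (n - 1) / sqrt(n), so with 10 or fewer rows no value can ever exceed 3. The sketch below illustrates this by dropping one ordinary row from the sample data; the helper `max_possible_z` is a hypothetical name introduced here for illustration.

```python
import math

import pandas as pd


def max_possible_z(n: int) -> float:
    """Upper bound on |z| for n values when using the sample std (ddof=1)."""
    return (n - 1) / math.sqrt(n)


# Same data as the tutorial, minus the last value (10 rows instead of 11).
small = pd.Series([10, 12, 11, 10, 105, 13, 12, 10, 11, 15])
z = (small - small.mean()) / small.std()
flagged = int((z.abs() > 3).sum())

print(max_possible_z(11))  # bound for the 11-row tutorial data
print(max_possible_z(10))  # bound for the 10-row data
print(z.abs().max())       # Z-score of 105 in the smaller sample
print(flagged)             # number of rows flagged as outliers
```

With only 10 rows, 105 is no longer flagged even though it is still clearly extreme, so for small datasets a lower threshold may be more appropriate.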
Conclusion
In this tutorial, we learned how to detect and remove outliers using the Z-score method in Pandas.
- Detect Outliers: Using Z-score to find values far from the mean.
- Remove Outliers: Filtering out extreme values to clean the dataset.