Using Pandas for Exploratory Data Analysis in Machine Learning

This article introduces the Pandas library and walks through the main steps of EDA (Exploratory Data Analysis) with it, from loading and summarizing a dataset to cleaning it and analyzing it statistically.

Introduction to Pandas

Pandas is a powerful open-source Python library for data manipulation and analysis. It provides two primary data structures, Series (for one-dimensional data) and DataFrame (for two-dimensional data), which analysts use to explore tabular datasets and draw statistical conclusions from them.

Where do we use Pandas in Exploratory Data Analysis?

  • Loading the dataset:-

EDA starts with loading the dataset. Pandas can read many file formats, including CSV, Excel, and JSON.

import pandas as pd
df = pd.read_csv("dataset.csv")
print(df.head())
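
The other formats follow the same pattern. The sketch below is only an illustration: the two small in-memory strings are made-up stand-ins for real files, and in practice you would pass a file path instead. Note that pd.read_excel also needs an Excel engine such as openpyxl to be installed.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for files on disk.
csv_text = "name,age\nAsha,31\nRavi,27\n"
json_text = '[{"name": "Asha", "age": 31}, {"name": "Ravi", "age": 27}]'

# read_csv and read_json accept file paths as well as file-like objects.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.head())
print(df_json.head())
```
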
  • Analyzing the data:-

Pandas gives a quick overview of the dataset: its shape, its column names, and a statistical summary of its numeric columns.

# Shape of the data as (rows, columns)
print(df.shape)

# Statistical summary of the numeric columns
print(df.describe())

# Column names
print(df.columns)
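
A few more calls are often useful at this stage. The following is a minimal sketch on a made-up DataFrame (the column names and values are assumptions for illustration only); it checks data types, counts distinct values, and extends the summary to non-numeric columns.

```python
import pandas as pd

# Hypothetical dataset used only for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "income": [42000.0, 58000.0, None, 61000.0],
})

# Column dtypes and non-null counts, printed directly.
df.info()

# Number of distinct values per column (missing values are not counted).
unique_counts = df.nunique()
print(unique_counts)

# Summary that also covers non-numeric columns such as "city".
summary_all = df.describe(include="all")
print(summary_all)
```
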
  • Handling missing values:-

Pandas provides several ways to handle missing values. We can either remove the rows that contain them or fill them in with the mean, median, or mode, depending on our requirements.

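# To remove rows that contain missing values
df_new = df.dropna()

# To fill missing values in numeric columns with each column's mean
# (numeric_only=True avoids an error when the data has non-numeric columns)
df_modified = df.fillna(df.mean(numeric_only=True))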
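
Before choosing a strategy, it helps to count how many values are missing in each column. The sketch below, on a made-up DataFrame, does that and then fills a numeric column with its median and a categorical column with its mode. Which strategy fits which column is a judgment call, so treat the choices here as assumptions rather than a rule.

```python
import pandas as pd

# Hypothetical dataset with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, None, 47.0, 51.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

df_filled = df.copy()

# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].median())

# Categorical column: fill with the most frequent value (the mode).
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])

print(df_filled)
```
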
  • Handling outliers:-

We can use pandas to compute the IQR (Interquartile Range), which is commonly used to detect outliers.

Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3-Q1
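
To go from the IQR to actual outliers, one common rule of thumb flags values that lie more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that rule to a made-up column. The 1.5 multiplier is a convention rather than something pandas prescribes, and the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical column with one obviously extreme value.
df = pd.DataFrame({"column_name": [10, 12, 11, 13, 12, 11, 95]})

q1 = df["column_name"].quantile(0.25)
q3 = df["column_name"].quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows whose value falls outside the fences are treated as outliers.
is_outlier = ~df["column_name"].between(lower, upper)
outliers = df[is_outlier]
df_no_outliers = df[~is_outlier]

print(outliers)
```

Whether to drop, cap, or keep the flagged rows depends on the dataset, so it is worth inspecting them before removing anything.
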
  • Understanding Correlation:-

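Pandas helps identify which features are highly correlated with each other and which are not. When two features are highly correlated, one of them is largely redundant, so removing it can simplify the model and, for some algorithms, improve its performance.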

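# Pairwise correlation of the numeric columns
# (numeric_only=True avoids an error when the data has non-numeric columns)
print(df.corr(numeric_only=True))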
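
One possible way to act on the correlation matrix is to drop one feature from every pair whose absolute correlation exceeds a threshold. The sketch below assumes a threshold of 0.9 and uses made-up columns. Both are illustrative choices, and the right threshold depends on the model and the data.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: height_in is height_cm in different units,
# so the two columns are almost perfectly correlated.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [9, 6, 10, 7, 8],
})

# Absolute correlations between numeric columns.
corr = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)
print(df_reduced.columns.tolist())
```
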

Advantages of Using Pandas in EDA

  • It provides intuitive data structures like Series and DataFrame that make data manipulation easy.
  • It offers built-in functions to handle missing values, duplicates and incorrect data types.
  • It also helps in understanding data distributions, outliers, and feature relationships.
  • It can read and write data from multiple sources like CSV, Excel, JSON, SQL databases, and APIs.
  • Its documentation and tutorials make it easy to learn and use.
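
The built-in handling of duplicates and incorrect data types mentioned above can be sketched as follows. The DataFrame is made up for illustration, with numbers stored as text and one repeated row. Converting unparseable entries to missing values with errors="coerce" is one option among several.

```python
import pandas as pd

# Hypothetical dataset: numbers stored as strings, one duplicated row,
# and one price that cannot be parsed.
df = pd.DataFrame({
    "id": ["1", "2", "2", "3"],
    "price": ["10.5", "20.0", "20.0", "n/a"],
})

# Count fully duplicated rows, then drop them.
duplicate_count = df.duplicated().sum()
df_clean = df.drop_duplicates().copy()

# Convert text columns to numeric types.
df_clean["id"] = df_clean["id"].astype("int64")
# errors="coerce" turns unparseable entries such as "n/a" into NaN.
df_clean["price"] = pd.to_numeric(df_clean["price"], errors="coerce")

print(duplicate_count)
print(df_clean.dtypes)
```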

 
