Pandas and Seaborn: Exploratory Data Analysis (EDA) in Python

This lesson walks through exploratory data analysis (EDA) in Python with Pandas and Seaborn. It covers loading and cleaning a dataset, visualizing distributions, and examining relationships between variables, using a dataset of car features and prices as the running example. The goal is to help you spot and interpret trends in data efficiently.

EDA in Python Using Pandas and Seaborn

The code below illustrates each step. Follow the steps in order to see how the analysis is built up.

Step 1: Import Libraries

# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Load the Dataset

# Load the dataset
df = pd.read_csv('/content/data.csv')
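
Before cleaning anything, it usually helps to check the shape, column types, and missing-value counts of the loaded data. The following is a minimal sketch on a small made-up DataFrame; the `sample` frame and its values are hypothetical stand-ins, and the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset; values are invented for illustration.
sample = pd.DataFrame({
    "Make": ["BMW", "Audi", "Ford", "Ford"],
    "Engine HP": [300.0, np.nan, 150.0, 200.0],
    "MSRP": [45000, 38000, 21000, 27000],
})

shape = sample.shape            # (number of rows, number of columns)
dtypes = sample.dtypes          # data type of each column
missing = sample.isna().sum()   # count of missing values per column
summary = sample.describe()     # summary statistics for the numerical columns

print(shape)
print(dtypes)
print(missing)
print(summary)
```

The missing-value counts are a useful input to the next step, since they show which columns would lose rows under `dropna()`.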

Step 3: Handling Missing Data

# Drop rows with missing values
df_cleaned = df.dropna()

# Alternatively, you can fill missing values
# df_filled = df.fillna(value=...)
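
The fill value in the commented-out line is left open. One common choice, shown here only as a sketch, is the median for numerical columns and the most frequent value for categorical ones. The `sample` frame and its values are hypothetical, and whether this strategy suits the real dataset is an assumption to check.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
sample = pd.DataFrame({
    "Make": ["BMW", "Audi", None, "Ford", "Ford"],
    "Engine HP": [300.0, np.nan, 150.0, 200.0, 250.0],
})

df_filled = sample.copy()

# Numerical column: fill with the median, which is robust to outliers.
hp_median = df_filled["Engine HP"].median()
df_filled["Engine HP"] = df_filled["Engine HP"].fillna(hp_median)

# Categorical column: fill with the most frequent value (the mode).
make_mode = df_filled["Make"].mode()[0]
df_filled["Make"] = df_filled["Make"].fillna(make_mode)

print(df_filled)
```

Filling keeps every row, at the cost of introducing estimated values; dropping rows keeps only observed values but shrinks the dataset.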

Step 4: Visualizing Distributions of Numerical Variables

# Distribution of Engine HP
sns.histplot(df['Engine HP'], kde=True)
plt.title('Distribution of Engine HP')
plt.show()

# Distribution of MSRP
sns.histplot(df['MSRP'], kde=True)
plt.title('Distribution of MSRP')
plt.show()
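
Price columns are often right-skewed, which can squeeze most of a histogram into a few bins. As a rough check, the sketch below compares the skewness of a made-up price series before and after a log transform. The `prices` values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a long right tail.
prices = pd.Series(
    [15000, 18000, 20000, 22000, 25000, 30000, 40000, 60000, 120000, 400000],
    name="MSRP",
)

raw_skew = prices.skew()            # skewness of the raw values
log_skew = np.log10(prices).skew()  # skewness after a base-10 log transform

print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

If the raw skew is large, passing `log_scale=True` to `sns.histplot` draws the histogram on a log-scaled axis, which may make the shape easier to read.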

Step 5: Analyzing Categorical Variables

# Count plot for 'Make'
plt.figure(figsize=(10, 6))
sns.countplot(y='Make', data=df_cleaned, order=df_cleaned['Make'].value_counts().index)
plt.title('Count of Cars by Make')
plt.show()

# Count plot for 'Transmission Type'
sns.countplot(x='Transmission Type', data=df_cleaned)
plt.title('Count of Cars by Transmission Type')
plt.show()
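
A count plot shows relative sizes at a glance, but the exact numbers are sometimes easier to read from a table. The sketch below uses `value_counts` on a small made-up column; the `sample` frame and its category values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column.
sample = pd.DataFrame({
    "Transmission Type": ["AUTOMATIC", "AUTOMATIC", "MANUAL", "AUTOMATIC"],
})

# Absolute counts per category, sorted from most to least frequent.
counts = sample["Transmission Type"].value_counts()

# Share of each category as a fraction of all rows.
shares = sample["Transmission Type"].value_counts(normalize=True)

print(counts)
print(shares)
```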

Step 6: Correlation Analysis

# Select all numerical columns (any integer or float dtype, not just 64-bit)
numerical_df = df_cleaned.select_dtypes(include='number')
# Calculate the correlation matrix for numerical columns only
corr = numerical_df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
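
With many columns, a heatmap can be hard to scan for the strongest relationships. One option, sketched below on made-up numbers, is to flatten the upper triangle of the correlation matrix into a ranked list of column pairs. The `sample` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical data.
sample = pd.DataFrame({
    "Engine HP": [150, 200, 250, 300, 400],
    "Engine Cylinders": [4, 4, 6, 6, 8],
    "city mpg": [30, 27, 22, 19, 14],
})

corr = sample.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each
# pair of columns appears once and self-correlations are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to a Series indexed by (column A, column B) and rank by
# absolute correlation, strongest first.
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(pairs)
```

Ranking by absolute value keeps strong negative correlations near the top alongside strong positive ones.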

Step 7: Pairplots for Numerical Variables

# Pairplot for selected numerical features
sns.pairplot(df_cleaned[['Engine HP', 'Engine Cylinders', 'highway MPG', 'city mpg', 'MSRP']])
plt.show()

Step 8: Analyzing Relationships Between Features

# Boxplot of MSRP vs. Make
plt.figure(figsize=(15, 8))
sns.boxplot(x='Make', y='MSRP', data=df_cleaned)
plt.xticks(rotation=90)
plt.title('Boxplot of MSRP by Make')
plt.show()

# Scatter plot of Engine HP vs. MSRP
sns.scatterplot(x='Engine HP', y='MSRP', hue='Make', data=df_cleaned)
plt.title('Engine HP vs. MSRP')
plt.show()
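
The boxplot compares price distributions across makes visually. A grouped summary table gives the same comparison in numbers. The sketch below uses a small made-up frame; the `sample` data is hypothetical, and the choice of statistics is only one reasonable option.

```python
import pandas as pd

# Hypothetical prices for a few makes.
sample = pd.DataFrame({
    "Make": ["BMW", "BMW", "Ford", "Ford", "Audi"],
    "MSRP": [50000, 60000, 20000, 30000, 40000],
})

# Per-make count, median, minimum, and maximum price,
# sorted so the highest median price comes first.
msrp_by_make = (
    sample.groupby("Make")["MSRP"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)

print(msrp_by_make)
```

The median is used here rather than the mean because it is less affected by a few very expensive models.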
