Data wrangling using Pandas

Data wrangling is the process of preparing raw data for analysis. It involves cleaning, structuring, and enriching raw data.
It is often also called data preprocessing. Pandas is a popular Python library that provides most of the tools needed for data preprocessing.
The process includes:

  • Data Cleaning
  • Data Transformation
  • Data Integration
  • Data Filtering
  • Data Validation
  • Data Formatting

Loading the dataset

Let’s start by importing the pandas library and loading the dataset.

import pandas as pd

# Load data from a CSV file
df = pd.read_csv('data.csv')
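
To make the loading step reproducible without a file on disk, here is a minimal sketch that reads CSV text from memory and takes a first look at the result. The column names and values are hypothetical stand-ins for whatever data.csv actually contains.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of data.csv
csv_text = """ID,name,score
1,alice,82
2,bob,
3,carol,91
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type inferred for each column
print(df.isna().sum())  # number of missing values per column
```

Checking the shape, the inferred types, and the missing-value counts right after loading is a cheap way to spot problems before any cleaning starts.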

Exploring the dataset

Accessing a single column

A Series is a one-dimensional labelled array. Selecting a single column from a DataFrame returns a Series.

series = df['column_name']
print(series.head())  # first five rows of the Series

The code above selects one column from the dataset, on which further operations can be performed.
A column can also be accessed with dot notation, provided its name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

series = df.column_name
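
The two access styles are not fully interchangeable. The sketch below, built on a small hypothetical DataFrame, shows that bracket and dot notation return the same Series for a simple name, and that bracket notation is the safer choice when a name contains a hyphen or matches a DataFrame method.

```python
import pandas as pd

# Hypothetical data; the column names are chosen to show the edge cases
df = pd.DataFrame({
    "price": [10, 20, 30],
    "unit-cost": [4, 9, 12],   # hyphen: not a valid Python identifier
    "count": [1, 2, 3],        # same name as the DataFrame.count method
})

by_bracket = df["price"]
by_dot = df.price
print(by_bracket.equals(by_dot))   # both forms return the same Series

hyphenated = df["unit-cost"]       # only bracket notation works here
print(type(df["count"]).__name__)  # bracket access returns the column
print(callable(df.count))          # df.count is the method, not the column
```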

Accessing multiple columns

Selecting multiple columns returns a DataFrame.
A DataFrame is a two-dimensional labelled data structure.

data_frame = df[['column1', 'column2']]
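
The difference between single and double brackets is easy to miss, so here is a short sketch on hypothetical data. Passing a list of names returns a DataFrame, even when the list has one item, while passing a bare name returns a Series.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],
    "column3": [0.1, 0.2, 0.3],
})

subset = df[["column1", "column2"]]   # list of names -> DataFrame
single = df[["column1"]]              # one-item list -> still a DataFrame
as_series = df["column1"]             # bare name -> Series

print(type(subset).__name__, subset.shape)
print(type(single).__name__, single.shape)
print(type(as_series).__name__, as_series.shape)
```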

We can also merge two DataFrames into one:

merged_df = pd.merge(df1, df2, on='ID')  # 'ID' is the column common to both DataFrames
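
Since df1 and df2 are not defined in this article, the following sketch creates two small hypothetical DataFrames to show what the merge does. By default pd.merge performs an inner join, so only IDs present in both inputs survive; passing how='left' keeps every row of the first DataFrame instead.

```python
import pandas as pd

# Hypothetical inputs sharing an 'ID' column
df1 = pd.DataFrame({"ID": [1, 2, 3], "name": ["alice", "bob", "carol"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "score": [75, 91, 60]})

inner = pd.merge(df1, df2, on="ID")              # default how="inner"
left = pd.merge(df1, df2, on="ID", how="left")   # keep every row of df1

print(inner)   # rows for IDs 2 and 3 only
print(left)    # rows for IDs 1, 2, 3; score is missing for ID 1
```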

Pandas also lets us apply functions to a Series and perform further operations on a DataFrame.

df['new_column'] = df['column_name'].apply(lambda x: x ** 2)

# String operations on a Series (the column must contain strings, not numbers)
df['text_column'] = df['text_column'].str.upper()  # Convert to uppercase
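
The snippets above can be tried end to end with the sketch below. The data and the 'label' column are hypothetical. It also shows a vectorized alternative to apply, which gives the same result for simple arithmetic and is usually faster on large columns.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column with a gap
df = pd.DataFrame({
    "column_name": [1, 2, 3],
    "label": ["red", None, "blue"],
})

# Element-wise function via apply, as in the snippet above
df["squared_apply"] = df["column_name"].apply(lambda x: x ** 2)

# Equivalent vectorized arithmetic on the whole column
df["squared_vec"] = df["column_name"] ** 2

# .str methods leave missing values as missing instead of raising
df["label_upper"] = df["label"].str.upper()

print(df)
```

Note that the .str accessor only works on columns holding text; calling it on a numeric column raises an AttributeError.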

 
