Data wrangling using Pandas

Data wrangling is the process of preparing raw data for analysis. It involves cleaning, structuring and enriching the raw data.
It is also known as data preprocessing. Pandas is a Python library that provides the tools needed for most preprocessing tasks.
The process includes:

Data Cleaning
Data Transformation
Data Integration
Data Filtering
Data Validation
Data Formatting
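
To make a few of these steps concrete, here is a minimal sketch on a tiny made-up table. The column names (name, age), the values and the 0 to 120 age range are assumptions chosen only for illustration, and a real dataset will usually need different rules.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing value, numbers stored as text
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "age": ["34", "29", "29", None],
})

cleaned = raw.drop_duplicates()                 # cleaning: remove repeated rows
cleaned = cleaned.dropna(subset=["age"])        # cleaning: drop rows with a missing age
cleaned = cleaned.astype({"age": "int64"})      # formatting: convert text to integers
valid = cleaned["age"].between(0, 120).all()    # validation: ages fall in an assumed plausible range
adults = cleaned[cleaned["age"] >= 30]          # filtering: keep rows that match a condition

print(cleaned)
print(valid)
print(adults)
```

Each step returns a new object instead of changing raw, so the original data stays available for comparison.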

Loading dataset

Let’s start by importing the pandas library and loading the dataset.

import pandas as pd

# Load data from a CSV file
df = pd.read_csv('data.csv')
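
Once the data is loaded, it helps to take a quick look at it before changing anything. The sketch below reads from an in-memory string instead of data.csv so that it runs anywhere. The CSV contents and the column names (ID, name, score) are made up for this example.

```python
import io
import pandas as pd

# Stand-in for data.csv (hypothetical contents) so the example needs no file on disk
csv_text = "ID,name,score\n1,Ann,90\n2,Bob,85\n3,Cara,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(df.head())         # first five rows
print(df.dtypes)         # data type of each column
print(df.isna().sum())   # count of missing values per column
```

The empty score in the last row is read as a missing value (NaN), which is why isna() reports one missing entry for that column.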

Exploring the dataset

Accessing single column

A Series is a one-dimensional labelled array. Selecting a single column from a dataframe returns a Series.

series = df['column_name']
print(series.head())  # first five rows of the Series

The code above selects one column from the dataset, on which further operations can be performed.
A column can also be accessed with dot notation, provided its name is a valid Python identifier and does not clash with an existing dataframe attribute or method.

series = df.column_name
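
The two access styles are not always interchangeable. Here is a small sketch of where dot notation falls short, using a made-up dataframe whose column names were picked to show the problem cases.

```python
import pandas as pd

# Hypothetical columns: a plain name, a hyphenated name, and a name that matches a method
df = pd.DataFrame({
    "column_name": [1, 2, 3],
    "column-name": [4, 5, 6],
    "count": [7, 8, 9],
})

same = df.column_name.equals(df["column_name"])   # both forms return the same Series
hyphenated = df["column-name"]                    # a hyphen is not allowed in an attribute name
count_col = df["count"]                           # df.count is the DataFrame.count method, not the column

print(same)
print(callable(df.count))
print(count_col.tolist())
```

Bracket notation works for every column name, so it is the safer default. It is also the form to use when creating a new column.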

Accessing multiple columns

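Selecting more than one column returns a dataframe.
A dataframe is a two-dimensional labelled data structure. To select several columns, pass a list of column names inside the brackets.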

data_frame = df[['column1', 'column2']]
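
The difference between single and double brackets is easy to miss, so here is a short sketch on a made-up dataframe (the values are placeholders).

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "column3": [5, 6]})

one_col_series = df["column1"]          # a single name returns a Series
one_col_frame = df[["column1"]]         # a list with one name returns a dataframe
two_cols = df[["column1", "column2"]]   # a list of names returns a dataframe

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```

The outer brackets do the selecting and the inner brackets are an ordinary Python list, which is why a one-item list still gives back a dataframe.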

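We can also merge two dataframes into one by joining them on a column they share. By default, pd.merge keeps only the rows whose key appears in both dataframes.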

merged_df = pd.merge(df1, df2, on='ID')    #ID being common column between both dataframes
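
Here is a runnable sketch of the merge with two small made-up dataframes. The names df1 and df2 follow the line above, while the columns name and score and all the values are assumptions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "name": ["Ann", "Bob", "Cara"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "score": [85, 70, 60]})

# Default how="inner": keep only the IDs present in both dataframes
inner = pd.merge(df1, df2, on="ID")

# how="left": keep every row of df1 and fill unmatched scores with NaN
left = pd.merge(df1, df2, on="ID", how="left")

print(inner)
print(left)
```

Which how value to use depends on whether unmatched rows should be kept. An inner merge drops them silently, so it is worth comparing row counts before and after.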

Pandas also lets us apply functions to a Series and perform further operations on a dataframe.

df['new_column'] = df['column_name'].apply(lambda x: x ** 2)

# String operations on a Series (the column must hold text, so a separate placeholder column is used here)
df['text_column'] = df['text_column'].str.upper()  # Convert to uppercase
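
Here is a small sketch of both operations on made-up data. The column names value and label are placeholders. It also shows that simple arithmetic can be written directly on the Series without apply, and that string methods leave missing values as missing.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3], "label": ["a", None, "c"]})

squared_apply = df["value"].apply(lambda x: x ** 2)   # calls the function once per element
squared_vec = df["value"] ** 2                        # vectorised form, same result
upper = df["label"].str.upper()                       # the missing entry stays missing

print(squared_apply.tolist())
print(squared_apply.equals(squared_vec))
print(upper.isna().tolist())
```

For plain arithmetic the vectorised form is usually shorter and faster, while apply is handy when the logic does not map onto a built-in operation.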
