Data wrangling is the process of preparing raw data for analysis. It involves cleaning, structuring, and enriching raw data.
It is also known as data preprocessing. Pandas is a great tool that provides the features required for data preprocessing.
The process includes:
- Data Cleaning
- Data Transformation
- Data Integration
- Data Filtering
- Data Validation
- Data Formatting
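To make several of the steps above concrete, here is a minimal sketch on a small made-up table. The column names and values are hypothetical and chosen only for illustration; real datasets usually need more careful handling at each step.

```python
import pandas as pd

# Hypothetical raw data with typical problems: a duplicate row,
# a missing value, inconsistent text, and numbers stored as strings.
raw = pd.DataFrame({
    "name": ["alice", "Bob", "Bob", "carol"],
    "age": ["30", "25", "25", None],
    "city": [" Paris", "london", "london", "Berlin "],
})

# Cleaning: drop exact duplicate rows and rows with a missing age.
clean = raw.drop_duplicates().dropna(subset=["age"]).copy()

# Transformation: convert age from string to integer.
clean["age"] = clean["age"].astype(int)

# Formatting: normalise whitespace and capitalisation in text columns.
clean["name"] = clean["name"].str.title()
clean["city"] = clean["city"].str.strip().str.title()

# Filtering: keep only the rows that match a condition.
older_than_26 = clean[clean["age"] > 26]

# Validation: check that the cleaned data meets a basic expectation.
assert clean["age"].between(0, 120).all()

print(clean)
```

Data integration, the remaining step in the list, typically means combining tables, for example with `pd.merge`.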
Loading dataset
Let’s start by importing the pandas library and loading the dataset.
import pandas as pd
# Load data from a CSV file
df = pd.read_csv('data.csv')
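Since the contents of `data.csv` are not shown here, the following sketch reads comparable content from an in-memory string instead. The columns (`ID`, `name`, `score`) are assumptions made up for the example. It also shows a few calls that are commonly used for a first look at freshly loaded data.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'.
csv_text = """ID,name,score
1,Alice,82.5
2,Bob,
3,Carol,91.0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows (up to five)
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type inferred for each column
print(df.isna().sum())  # number of missing values per column
```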
Exploring the dataset
Accessing single column
A Series is a one-dimensional labelled array. Selecting a single column from a DataFrame returns a Series.
series = df['column-name']
print(series.head())  # first five rows of the Series
The code above selects one column from the dataset, on which further operations can be performed.
A column can also be accessed using dot notation, provided its name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.
series = df.column_name
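The two access styles can be compared in a short sketch. The DataFrame below is hypothetical; it deliberately contains one column whose name is a valid identifier and one whose name is not.

```python
import pandas as pd

# Hypothetical DataFrame: one column name is a valid identifier, one is not.
df = pd.DataFrame({"column_name": [1, 2, 3], "column-name": [4, 5, 6]})

by_bracket = df["column_name"]
by_dot = df.column_name

# Both forms return the same Series.
print(by_bracket.equals(by_dot))

# A hyphenated name only works with brackets: Python would parse
# df.column-name as (df.column) - name.
hyphenated = df["column-name"]
print(type(hyphenated).__name__)
```

Bracket notation works for every column name, so it is usually the safer default; dot notation is a convenience for interactive use.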
Accessing multiple columns
Selecting multiple columns returns a DataFrame.
A DataFrame is a two-dimensional labelled data structure.
data_frame = df[['column1', 'column2']]
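The difference between single and double brackets is easy to miss, so here is a small sketch of it. The data is made up for the example; only the column names `column1` and `column2` come from the snippet above.

```python
import pandas as pd

# Hypothetical data; 'column3' is added only to show that it is left out.
df = pd.DataFrame({
    "column1": [1, 2],
    "column2": ["a", "b"],
    "column3": [0.1, 0.2],
})

one_col_series = df["column1"]          # single brackets: a Series
one_col_frame = df[["column1"]]         # list with one name: a DataFrame
two_cols = df[["column1", "column2"]]   # list of names: a DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```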
We can also merge two DataFrames into one:
merged_df = pd.merge(df1, df2, on='ID')  # 'ID' is a column common to both DataFrames
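As an illustration of how the merge behaves, the sketch below builds two small hypothetical tables that share an `ID` column and merges them with different values of the `how` parameter. The table contents are assumptions made for the example.

```python
import pandas as pd

# Two hypothetical tables that share the 'ID' column.
df1 = pd.DataFrame({"ID": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "score": [75, 88, 93]})

# Default (how="inner"): keep only IDs present in both tables.
inner = pd.merge(df1, df2, on="ID")

# how="left": keep every row of df1; unmatched scores become NaN.
left = pd.merge(df1, df2, on="ID", how="left")

# how="outer": keep IDs from either table.
outer = pd.merge(df1, df2, on="ID", how="outer")

print(inner)
print(left)
print(outer)
```

The default inner join silently drops rows without a match, so it is worth checking the row count of the result against the inputs.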
Pandas gives us the flexibility to apply functions to a Series and to perform further operations on a DataFrame.
# Apply a function to each value of a numeric column
df['new_column'] = df['column_name'].apply(lambda x: x ** 2)
# String operations on a Series (the column must contain strings)
df['column_name'] = df['column_name'].str.upper()  # Convert to uppercase
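Because squaring needs a numeric column and `.str.upper()` needs a text column, the sketch below uses two separate hypothetical columns, `price` and `product`. It also shows, as an addition to the snippet above, the vectorised equivalent of the `apply` call and a row-wise `apply` on the whole DataFrame.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({"price": [2, 3, 4], "product": ["apple", "pear", "fig"]})

# Element-wise function on a Series.
df["price_squared"] = df["price"].apply(lambda x: x ** 2)

# Vectorised equivalent; usually faster than apply for simple arithmetic.
df["price_squared_vec"] = df["price"] ** 2

# String operation on a text Series.
df["product"] = df["product"].str.upper()

# Row-wise function on the DataFrame (axis=1 passes each row as a Series).
df["label"] = df.apply(lambda row: f"{row['product']}:{row['price']}", axis=1)

print(df)
```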