By NIKETH ANNAM
In this tutorial, we will learn how to identify the columns having dates and take the difference to obtain the number of days in a new column in Python.
Feature engineering plays a major role in obtaining a better accuracy for the model. Some of the datasets may contain dates(For example, Start date and End date) which might be important for our model. Hence we will see how we can deal with the dates.
--> Firstly we will see how to find the columns containing dates.
Let us import required libraries and create a small data frame in Python using the pandas library.
import pandas as pd df = pd.DataFrame({'A':['1111', '2222', '3333', '4444'], 'B': ['07-19-2020', '07-20-2020', '07-21-2020', '07-22-2020'], 'C':[10,20,30,40], 'D': ['07-23-2020', '07-24-2020', '07-24-2020', '07-25-2020'], 'E': ['07-26-2020', '07-27-2020', '07-28-2020', '07-29-2020'], 'F': ['2020-08-01', '08-02-2020', '08-03-2020', '08-04-2020'], 'G': ['08-05-2020', '08-06-2020', '08-07-2020', '08-08-2020']}) df.columns
Output:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
Now we will find the columns with dates
def columns_with_dates(df): columns=[] # Creating an empty list for column in df.columns: if df[column].dtype == 'O': # If the column type is of Object try: df[column]= pd.to_datetime(df[column]) # Converting the string to datetime format columns.append(column) except ValueError: pass elif df[column].dtype.str[1] == 'M': # If the column type is of Datetime format columns.append(column) print('The Columns with Dates are:', columns)
columns_with_dates(df)
Output:
The Columns with Dates are: ['B', 'D', 'E', 'F', 'G']
--> Finding the difference of these columns as the number of days
def diff_days(df, columns = columns): for i in range(len(columns)): df[columns[i] + '-' + columns[i-1]] = abs(df[columns[i]].sub(df[columns[i-1]], axis=0)) # Checking if the column difference exists if ((columns[i] + '-' + columns[i-2]) in df.columns) or ((columns[i-2] + '-' + columns[i]) in df.columns): pass else: df[columns[i] + '-' + columns[i-2]] = abs(df[columns[i]].sub(df[columns[i-2]], axis=0)) return df.columns
diff_days(df)
Output:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B-G', 'B-F', 'D-B', 'D-G', 'E-D', 'E-B', 'F-E', 'F-D', 'G-F', 'G-E'], dtype='object')
We have obtained the column differences as the number of days. This can be applied for 'n' columns differences
--> Finally we will remove original columns having dates
def drop_columns(df): for col in df.columns: if df[col].dtype.str[1] == 'M': # If the column is of Datetime format df = df.drop(col, axis=1) else : pass return df.columns
drop_columns(df)
Output:
Index(['A', 'C', 'B-G', 'B-F', 'D-B', 'D-G', 'E-D', 'E-B', 'F-E', 'F-D', 'G-F', 'G-E'], dtype='object')
Submitted by NIKETH ANNAM (nikethannam)
Download packets of source code on Coders Packet
Comments