Dealing with the dates in a Python DataFrame

Packet_1.ipynb

In this tutorial, we will learn how to identify the columns having dates and take the difference to obtain the number of days in a new column in Python.

Feature engineering plays a major role in obtaining a better accuracy for the model. Some of the datasets may contain dates(For example, Start date and End date) which might be important for our model. Hence we will see how we can deal with the dates.

--> Firstly we will see how to find the columns containing dates.

Let us import required libraries and create a small data frame in Python using the pandas library.

import pandas as pd

df = pd.DataFrame({'A':['1111', '2222', '3333', '4444'], 'B': ['07-19-2020', '07-20-2020', '07-21-2020', '07-22-2020'],
                   'C':[10,20,30,40], 'D': ['07-23-2020', '07-24-2020', '07-24-2020', '07-25-2020'], 
                   'E': ['07-26-2020', '07-27-2020', '07-28-2020', '07-29-2020'],
                   'F': ['2020-08-01', '08-02-2020', '08-03-2020', '08-04-2020'],
                   'G': ['08-05-2020', '08-06-2020', '08-07-2020', '08-08-2020']})
df.columns

Output:

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

Now we will find the columns with dates

def columns_with_dates(df): 
  columns=[] # Creating an empty list 
  for column in df.columns: 
    if df[column].dtype == 'O': # If the column type is of Object 
      try: 
        df[column]= pd.to_datetime(df[column]) # Converting the string to datetime format 
        columns.append(column) 
      except ValueError: 
        pass 
    elif df[column].dtype.str[1] == 'M': # If the column type is of Datetime format 
      columns.append(column) 
  print('The Columns with Dates are:', columns)
columns_with_dates(df)

Output:

The Columns with Dates are: ['B', 'D', 'E', 'F', 'G']

--> Finding the difference of these columns as the number of days

def diff_days(df, columns = columns):
  for i in range(len(columns)):
    df[columns[i] + '-' + columns[i-1]] = abs(df[columns[i]].sub(df[columns[i-1]], axis=0))   
# Checking if the column difference exists
    if ((columns[i] + '-' + columns[i-2]) in df.columns) or ((columns[i-2] + '-' + columns[i]) in df.columns):
      pass
    else:
      df[columns[i] + '-' + columns[i-2]] = abs(df[columns[i]].sub(df[columns[i-2]], axis=0))
  return df.columns
diff_days(df)

Output:

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B-G', 'B-F', 'D-B', 'D-G', 'E-D', 'E-B', 'F-E', 'F-D', 'G-F', 'G-E'], dtype='object')

We have obtained the column differences as the number of days. This can be applied for 'n' columns differences

--> Finally we will remove original columns having dates

def drop_columns(df):
  for col in df.columns:
    if df[col].dtype.str[1] == 'M':  # If the column is of Datetime format
      df = df.drop(col, axis=1)
    else :
      pass
  return df.columns
drop_columns(df)

Output:

Index(['A', 'C', 'B-G', 'B-F', 'D-B', 'D-G', 'E-D', 'E-B', 'F-E', 'F-D', 'G-F', 'G-E'], dtype='object')

Coders Packet

Dealing with the dates in a Python DataFrame

Comments