Feature Engineering with Dates

5 minute read

Feature Engineering with Dates

In this tutorial I present the datetime format that Pandas provides to handle datetime features. In the end I create a function that generates 23 features from a single one.

Dates and Times are very precious features for businesses and sales. While a single column can’t help us make meaningful insights and take important decisions, creating a handful of features from it can transform data into profit.

Import Libraries

import numpy as np
import pandas as pd
import datetime

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('max_colwidth', 100)
data = pd.read_csv('dataset_with_dates.csv')
data.head()
card_id purchase_date city_id category_1 category_2 category_3 month_lag purchase_amount
0 C_ID_415bb3a509 3/11/2018 14:57 107 N 1.0 B 1 -0.557574
1 C_ID_415bb3a509 3/19/2018 18:53 140 N 1.0 B 1 -0.569580
2 C_ID_415bb3a509 4/26/2018 14:08 330 N 1.0 B 2 -0.551037
3 C_ID_415bb3a509 3/7/2018 9:43 -1 Y NaN B 1 -0.671926
4 C_ID_ef55cf8d4b 3/22/2018 21:07 -1 Y NaN B 1 -0.659904

First Step

Convert date to datetime format using Pandas.

After conversion each Series (one column) can be used to access and return several datetime properties like *Series.dt.property

data['purchase_date'] = pd.to_datetime(data['purchase_date']) 
data.head(1)
card_id purchase_date city_id category_1 category_2 category_3 month_lag purchase_amount
0 C_ID_415bb3a509 2018-03-11 14:57:00 107 N 1.0 B 1 -0.557574

Properties

Series.dt.date - Returns the date part of datetime without timezone information

data['purchase_date'].dt.date.head(1)
0    2018-03-11
Name: purchase_date, dtype: object

Series.dt.time - Returns the time part of datetime

data['purchase_date'].dt.time.head(1)
0    14:57:00
Name: purchase_date, dtype: object

Series.dt.year - Returns the year (int) of the datetime

data['purchase_date'].dt.year.head(1)
0    2018
Name: purchase_date, dtype: int64

Series.dt.month - Returns the month (int) of the datetime

data['purchase_date'].dt.month.head(1)
0    3
Name: purchase_date, dtype: int64

Series.dt.day - Returns the day (int) of the datetime

data['purchase_date'].dt.day.head(1)
0    11
Name: purchase_date, dtype: int64

Series.dt.hour - Returns the hour (int) of the datetime

data['purchase_date'].dt.hour.head(1)
0    14
Name: purchase_date, dtype: int64

Series.dt.minute - Returns the minute (int) of the datetime

data['purchase_date'].dt.minute.head(1)
0    57
Name: purchase_date, dtype: int64

Series.dt.second/microsecond/nanosecond - Returns the second/microsecond/nanosecond (int) of the datetime

data['purchase_date'].dt.second.head(1)
0    0
Name: purchase_date, dtype: int64

Series.dt.week/weekofyear - Returns the week number of the year. Both properties are the same

data['purchase_date'].dt.week.head(1)
0    10
Name: purchase_date, dtype: int64

Series.dt.dayofweek/weekday - Returns the day of the week with Monday=0 and Sunday=6. Both properties are the same

data['purchase_date'].dt.dayofweek.head(1)
0    6
Name: purchase_date, dtype: int64

Series.dt.dayofyear - Returns the ordinal day of the year

data['purchase_date'].dt.dayofyear.head(1)
0    70
Name: purchase_date, dtype: int64

Series.dt.quarter - Returns the quarter of the date

data['purchase_date'].dt.quarter.head(1)
0    1
Name: purchase_date, dtype: int64

Series.dt.is_month_start - Indicator for whether the date is the first day of the month

data['purchase_date'].dt.is_month_start.head(1)
0    False
Name: purchase_date, dtype: bool

Series.dt.is_month_end - Indicator for whether the date is the last day of the month

data['purchase_date'].dt.is_month_end.head(1)
0    False
Name: purchase_date, dtype: bool

Series.dt.is_quarter_start - Indicator for whether the date is the first day of a quarter

data['purchase_date'].dt.is_quarter_start.head(1)
0    False
Name: purchase_date, dtype: bool

Series.dt.is_quarter_end - Indicator for whether the date is the last day of a quarter

data['purchase_date'].dt.is_quarter_end.head(1)
0    False
Name: purchase_date, dtype: bool

Series.dt.is_year_start - Indicate whether the date is the first day of a year

data['purchase_date'].dt.is_year_start.head(1)
0    False
Name: purchase_date, dtype: bool

Series.dt.is_year_end - Indicate whether the date is the last day of a year

data['purchase_date'].dt.is_year_end.head(1)
0    False
Name: purchase_date, dtype: bool

Series.dt.is_leap_year - Indicate whether the date belongs to a leap year

data['purchase_date'].dt.is_leap_year.head(1)
0    False
Name: purchase_date, dtype: bool

Feature Engineering

# Takes a dataframe and the column name as inputs
def create_datetime_features(df, c):
    if (df[c].dtype!='datetime64[ns]'):
        pd.to_datetime(df[c]) 
        
    df['year'] = df[c].dt.year
    df['month'] = data[c].dt.month
    df['day'] = data[c].dt.day
    df['hour'] = data[c].dt.hour
    df['session'] = df['hour'].apply(lambda x: get_session(x))
    df['minute'] = data[c].dt.minute
    df['second'] = data[c].dt.second
    df['week'] = data[c].dt.week
    df['dayofweek'] = data[c].dt.dayofweek
    df['isWeekend'] = (df[c].dt.dayofweek >=5).astype(int)
    df['isWeekday'] = (df[c].dt.dayofweek < 5).astype(int)
    df['dayofyear'] = data[c].dt.dayofyear
    df['quarter'] = data[c].dt.quarter
    df['isMonthStart'] = data[c].dt.is_month_start.astype(int)
    df['isMonthEnd'] = data[c].dt.is_month_end.astype(int)
    df['isQuarterStart'] = data[c].dt.is_quarter_start.astype(int)
    df['isQuarterEnd'] = data[c].dt.is_quarter_end.astype(int)
    df['isYearStart'] = data[c].dt.is_year_start.astype(int)
    df['isYearEnd'] = data[c].dt.is_year_end.astype(int)
    df['isLeapYear'] = data[c].dt.is_leap_year.astype(int)
    df['monthDifferenceFromToday'] = (datetime.datetime.today() - df[c]).dt.days//30
    df['dayDifferenceFromToday'] = (datetime.datetime.today() - df[c]).dt.days
    # Count days until a specific date arrives (value=0 on the day of the event)
    df['blackFriday2017']=(pd.to_datetime('2017-11-24') - df[c]).dt.days.apply(lambda x: x if x >= 0 and x < 100 else -1)

    
def get_session(time):
    if time > 4 and time<12:
        return 0 # Morning
    elif time >= 12 and time < 17:
        return 1 # Afternoon
    elif time >= 17 and time < 21:
        return 2 # Evening
    else:
        return 3 # Night    
create_datetime_features(data, 'purchase_date')
data.head(10)
card_id purchase_date city_id category_1 category_2 category_3 month_lag purchase_amount year month day hour session minute second week dayofweek isWeekend isWeekday dayofyear quarter isMonthStart isMonthEnd isQuarterStart isQuarterEnd isYearStart isYearEnd isLeapYear monthDifferenceFromToday dayDifferenceFromToday blackFriday2017
0 C_ID_415bb3a509 2018-03-11 14:57:00 107 N 1.0 B 1 -0.557574 2018 3 11 14 1 57 0 10 6 1 0 70 1 0 0 0 0 0 0 0 12 365 -1
1 C_ID_415bb3a509 2018-03-19 18:53:00 140 N 1.0 B 1 -0.569580 2018 3 19 18 2 53 0 12 0 0 1 78 1 0 0 0 0 0 0 0 11 357 -1
2 C_ID_415bb3a509 2018-04-26 14:08:00 330 N 1.0 B 2 -0.551037 2018 4 26 14 1 8 0 17 3 0 1 116 2 0 0 0 0 0 0 0 10 319 -1
3 C_ID_415bb3a509 2018-03-07 09:43:00 -1 Y NaN B 1 -0.671926 2018 3 7 9 0 43 0 10 2 0 1 66 1 0 0 0 0 0 0 0 12 369 -1
4 C_ID_ef55cf8d4b 2018-03-22 21:07:00 -1 Y NaN B 1 -0.659904 2018 3 22 21 3 7 0 12 3 0 1 81 1 0 0 0 0 0 0 0 11 354 -1
5 C_ID_ef55cf8d4b 2018-04-02 12:53:00 231 N 1.0 B 2 -0.633007 2018 4 2 12 1 53 0 14 0 0 1 92 2 0 0 0 0 0 0 0 11 343 -1
6 C_ID_ef55cf8d4b 2018-03-28 19:50:00 69 N 1.0 B 1 5.263697 2018 3 28 19 2 50 0 13 2 0 1 87 1 0 0 0 0 0 0 0 11 348 -1
7 C_ID_ef55cf8d4b 2018-04-05 08:06:00 231 N 1.0 B 2 -0.553787 2018 4 5 8 0 6 0 14 3 0 1 95 2 0 0 0 0 0 0 0 11 340 -1
8 C_ID_ef55cf8d4b 2018-04-07 18:37:00 69 N 1.0 B 2 -0.596643 2018 4 7 18 2 37 0 14 5 1 0 97 2 0 0 0 0 0 0 0 11 338 -1
9 C_ID_ef55cf8d4b 2018-03-17 18:10:00 69 N 1.0 B 1 -0.607191 2018 3 17 18 2 10 0 11 5 1 0 76 1 0 0 0 0 0 0 0 11 359 -1

Boom!

23 new features from a single column

Leave a Comment