Feature Engineering with Dates
Feature Engineering with Dates
In this tutorial I present the datetime format that Pandas provides to handle datetime features. In the end I create a function that generates 23 features from a single one.
Dates and Times are very precious features for businesses and sales. While a single column can’t help us make meaningful insights and take important decisions, creating a handful of features from it can transform data into profit.
Import Libraries
import numpy as np
import pandas as pd
import datetime
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('max_colwidth', 100)
data = pd.read_csv('dataset_with_dates.csv')
data.head()
card_id | purchase_date | city_id | category_1 | category_2 | category_3 | month_lag | purchase_amount | |
---|---|---|---|---|---|---|---|---|
0 | C_ID_415bb3a509 | 3/11/2018 14:57 | 107 | N | 1.0 | B | 1 | -0.557574 |
1 | C_ID_415bb3a509 | 3/19/2018 18:53 | 140 | N | 1.0 | B | 1 | -0.569580 |
2 | C_ID_415bb3a509 | 4/26/2018 14:08 | 330 | N | 1.0 | B | 2 | -0.551037 |
3 | C_ID_415bb3a509 | 3/7/2018 9:43 | -1 | Y | NaN | B | 1 | -0.671926 |
4 | C_ID_ef55cf8d4b | 3/22/2018 21:07 | -1 | Y | NaN | B | 1 | -0.659904 |
First Step
Convert date to datetime format using Pandas.
After conversion each Series (one column) can be used to access and return several datetime properties like *Series.dt.property
data['purchase_date'] = pd.to_datetime(data['purchase_date'])
data.head(1)
card_id | purchase_date | city_id | category_1 | category_2 | category_3 | month_lag | purchase_amount | |
---|---|---|---|---|---|---|---|---|
0 | C_ID_415bb3a509 | 2018-03-11 14:57:00 | 107 | N | 1.0 | B | 1 | -0.557574 |
Properties
Series.dt.date - Returns the date part of datetime without timezone information
data['purchase_date'].dt.date.head(1)
0 2018-03-11
Name: purchase_date, dtype: object
Series.dt.time - Returns the time part of datetime
data['purchase_date'].dt.time.head(1)
0 14:57:00
Name: purchase_date, dtype: object
Series.dt.year - Returns the year (int) of the datetime
data['purchase_date'].dt.year.head(1)
0 2018
Name: purchase_date, dtype: int64
Series.dt.month - Returns the month (int) of the datetime
data['purchase_date'].dt.month.head(1)
0 3
Name: purchase_date, dtype: int64
Series.dt.day - Returns the day (int) of the datetime
data['purchase_date'].dt.day.head(1)
0 11
Name: purchase_date, dtype: int64
Series.dt.hour - Returns the hour (int) of the datetime
data['purchase_date'].dt.hour.head(1)
0 14
Name: purchase_date, dtype: int64
Series.dt.minute - Returns the minute (int) of the datetime
data['purchase_date'].dt.minute.head(1)
0 57
Name: purchase_date, dtype: int64
Series.dt.second/microsecond/nanosecond - Returns the second/microsecond/nanosecond (int) of the datetime
data['purchase_date'].dt.second.head(1)
0 0
Name: purchase_date, dtype: int64
Series.dt.week/weekofyear - Returns the week number of the year. Both properties are the same
data['purchase_date'].dt.week.head(1)
0 10
Name: purchase_date, dtype: int64
Series.dt.dayofweek/weekday - Returns the day of the week with Monday=0 and Sunday=6. Both properties are the same
data['purchase_date'].dt.dayofweek.head(1)
0 6
Name: purchase_date, dtype: int64
Series.dt.dayofyear - Returns the ordinal day of the year
data['purchase_date'].dt.dayofyear.head(1)
0 70
Name: purchase_date, dtype: int64
Series.dt.quarter - Returns the quarter of the date
data['purchase_date'].dt.quarter.head(1)
0 1
Name: purchase_date, dtype: int64
Series.dt.is_month_start - Indicator for whether the date is the first day of the month
data['purchase_date'].dt.is_month_start.head(1)
0 False
Name: purchase_date, dtype: bool
Series.dt.is_month_end - Indicator for whether the date is the last day of the month
data['purchase_date'].dt.is_month_end.head(1)
0 False
Name: purchase_date, dtype: bool
Series.dt.is_quarter_start - Indicator for whether the date is the first day of a quarter
data['purchase_date'].dt.is_quarter_start.head(1)
0 False
Name: purchase_date, dtype: bool
Series.dt.is_quarter_end - Indicator for whether the date is the last day of a quarter
data['purchase_date'].dt.is_quarter_end.head(1)
0 False
Name: purchase_date, dtype: bool
Series.dt.is_year_start - Indicate whether the date is the first day of a year
data['purchase_date'].dt.is_year_start.head(1)
0 False
Name: purchase_date, dtype: bool
Series.dt.is_year_end - Indicate whether the date is the last day of a year
data['purchase_date'].dt.is_year_end.head(1)
0 False
Name: purchase_date, dtype: bool
Series.dt.is_leap_year - Indicate whether the date belongs to a leap year
data['purchase_date'].dt.is_leap_year.head(1)
0 False
Name: purchase_date, dtype: bool
Feature Engineering
# Takes a dataframe and the column name as inputs
def create_datetime_features(df, c):
if (df[c].dtype!='datetime64[ns]'):
pd.to_datetime(df[c])
df['year'] = df[c].dt.year
df['month'] = data[c].dt.month
df['day'] = data[c].dt.day
df['hour'] = data[c].dt.hour
df['session'] = df['hour'].apply(lambda x: get_session(x))
df['minute'] = data[c].dt.minute
df['second'] = data[c].dt.second
df['week'] = data[c].dt.week
df['dayofweek'] = data[c].dt.dayofweek
df['isWeekend'] = (df[c].dt.dayofweek >=5).astype(int)
df['isWeekday'] = (df[c].dt.dayofweek < 5).astype(int)
df['dayofyear'] = data[c].dt.dayofyear
df['quarter'] = data[c].dt.quarter
df['isMonthStart'] = data[c].dt.is_month_start.astype(int)
df['isMonthEnd'] = data[c].dt.is_month_end.astype(int)
df['isQuarterStart'] = data[c].dt.is_quarter_start.astype(int)
df['isQuarterEnd'] = data[c].dt.is_quarter_end.astype(int)
df['isYearStart'] = data[c].dt.is_year_start.astype(int)
df['isYearEnd'] = data[c].dt.is_year_end.astype(int)
df['isLeapYear'] = data[c].dt.is_leap_year.astype(int)
df['monthDifferenceFromToday'] = (datetime.datetime.today() - df[c]).dt.days//30
df['dayDifferenceFromToday'] = (datetime.datetime.today() - df[c]).dt.days
# Count days until a specific date arrives (value=0 on the day of the event)
df['blackFriday2017']=(pd.to_datetime('2017-11-24') - df[c]).dt.days.apply(lambda x: x if x >= 0 and x < 100 else -1)
def get_session(time):
if time > 4 and time<12:
return 0 # Morning
elif time >= 12 and time < 17:
return 1 # Afternoon
elif time >= 17 and time < 21:
return 2 # Evening
else:
return 3 # Night
create_datetime_features(data, 'purchase_date')
data.head(10)
card_id | purchase_date | city_id | category_1 | category_2 | category_3 | month_lag | purchase_amount | year | month | day | hour | session | minute | second | week | dayofweek | isWeekend | isWeekday | dayofyear | quarter | isMonthStart | isMonthEnd | isQuarterStart | isQuarterEnd | isYearStart | isYearEnd | isLeapYear | monthDifferenceFromToday | dayDifferenceFromToday | blackFriday2017 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C_ID_415bb3a509 | 2018-03-11 14:57:00 | 107 | N | 1.0 | B | 1 | -0.557574 | 2018 | 3 | 11 | 14 | 1 | 57 | 0 | 10 | 6 | 1 | 0 | 70 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 365 | -1 |
1 | C_ID_415bb3a509 | 2018-03-19 18:53:00 | 140 | N | 1.0 | B | 1 | -0.569580 | 2018 | 3 | 19 | 18 | 2 | 53 | 0 | 12 | 0 | 0 | 1 | 78 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 357 | -1 |
2 | C_ID_415bb3a509 | 2018-04-26 14:08:00 | 330 | N | 1.0 | B | 2 | -0.551037 | 2018 | 4 | 26 | 14 | 1 | 8 | 0 | 17 | 3 | 0 | 1 | 116 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 319 | -1 |
3 | C_ID_415bb3a509 | 2018-03-07 09:43:00 | -1 | Y | NaN | B | 1 | -0.671926 | 2018 | 3 | 7 | 9 | 0 | 43 | 0 | 10 | 2 | 0 | 1 | 66 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 369 | -1 |
4 | C_ID_ef55cf8d4b | 2018-03-22 21:07:00 | -1 | Y | NaN | B | 1 | -0.659904 | 2018 | 3 | 22 | 21 | 3 | 7 | 0 | 12 | 3 | 0 | 1 | 81 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 354 | -1 |
5 | C_ID_ef55cf8d4b | 2018-04-02 12:53:00 | 231 | N | 1.0 | B | 2 | -0.633007 | 2018 | 4 | 2 | 12 | 1 | 53 | 0 | 14 | 0 | 0 | 1 | 92 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 343 | -1 |
6 | C_ID_ef55cf8d4b | 2018-03-28 19:50:00 | 69 | N | 1.0 | B | 1 | 5.263697 | 2018 | 3 | 28 | 19 | 2 | 50 | 0 | 13 | 2 | 0 | 1 | 87 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 348 | -1 |
7 | C_ID_ef55cf8d4b | 2018-04-05 08:06:00 | 231 | N | 1.0 | B | 2 | -0.553787 | 2018 | 4 | 5 | 8 | 0 | 6 | 0 | 14 | 3 | 0 | 1 | 95 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 340 | -1 |
8 | C_ID_ef55cf8d4b | 2018-04-07 18:37:00 | 69 | N | 1.0 | B | 2 | -0.596643 | 2018 | 4 | 7 | 18 | 2 | 37 | 0 | 14 | 5 | 1 | 0 | 97 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 338 | -1 |
9 | C_ID_ef55cf8d4b | 2018-03-17 18:10:00 | 69 | N | 1.0 | B | 1 | -0.607191 | 2018 | 3 | 17 | 18 | 2 | 10 | 0 | 11 | 5 | 1 | 0 | 76 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 359 | -1 |
Boom!
23 new features from a single column
Leave a Comment