Normalization and Standardization
Normalization and standardization share a goal: bringing features onto similar ranges. Both are widely used in data analysis, because features on comparable scales are easier to compare and to feed into models than raw values.
This notebook includes:
- Normalization
- Why normalize?
- Standardization
- Why standardization?
- Differences?
- When to use and when not
- Python code for Simple Feature Scaling, Min-Max, Z-score, log1p transformation
Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("whitegrid") #possible choices: white, dark, whitegrid, darkgrid, ticks
House Prices Dataset from https://www.kaggle.com/c/house-prices-advanced-regression-techniques
train = pd.read_csv('train.csv')
train['SalePrice'].head()
0 208500
1 181500
2 223500
3 140000
4 250000
Name: SalePrice, dtype: int64
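# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(train['SalePrice'], kde=True)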
[Figure: distribution plot of SalePrice]
Normalization
It is the process of rescaling values into the range [0, 1].
Why normalization?
- Normalization makes training less sensitive to the scale of features, so we can better solve for coefficients. Outliers are not removed: they stay in the normalized data and, because they set the minimum or maximum, they compress the remaining values into a narrow part of the range.
- Normalization improves results for models that depend on feature magnitudes, such as the distance-based and gradient-based methods listed later.
- Normalizing keeps features from having wildly different variances, which helps an optimizer converge.
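The effect of an outlier on normalized data can be sketched on a few made-up numbers. The `prices` array below is hypothetical and is not taken from the House Prices dataset. Because one extreme value sets the maximum, the four typical values end up below 0.06.

```python
import numpy as np

# Hypothetical prices with one extreme value (illustrative only).
prices = np.array([100.0, 120.0, 130.0, 150.0, 1000.0])

# Min-max normalization: (x - min) / (max - min).
scaled = (prices - prices.min()) / (prices.max() - prices.min())

# The outlier maps to 1.0; everything else is squeezed close to 0.
print(scaled.round(3))
```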
Standardization
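It is the process of rescaling features so that they have mean μ=0 and standard deviation σ=1, like a standard Gaussian distribution. Only the location and scale change; the shape of the distribution stays the same, so skewed data remain skewed.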
Why standardization?
- Compare features that have different units or scales.
- Standardizing tends to make the training process well behaved because it improves the numerical conditioning of the optimization problem.
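One point worth checking is what standardization does and does not change. The sketch below uses synthetic log-normal data (an assumption for illustration, not the real dataset) and a small hypothetical `skewness` helper. After standardizing, the mean is 0 and the standard deviation is 1, but the skewness is unchanged.

```python
import numpy as np

def skewness(x):
    """Skewness as the mean of cubed z-scores (population std)."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

rng = np.random.default_rng(0)
# Synthetic right-skewed data (log-normal), for illustration only.
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Standardize: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
print("skewness before:", skewness(x), "after:", skewness(z))
```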
Differences?
- When normalization works poorly, for example because an outlier squeezes most values into a narrow band, standardization may be the better choice.
- When using standardization, your new data aren’t bounded (unlike normalization).
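The bounded versus unbounded difference can be seen by applying both methods to the same values. This is a minimal sketch on hypothetical numbers: min-max output always spans exactly [0, 1], while z-scores can be negative or greater than 1.

```python
import numpy as np

# Hypothetical feature values (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

min_max = (x - x.min()) / (x.max() - x.min())
z_score = (x - x.mean()) / x.std()

print("min-max range:", min_max.min(), "to", min_max.max())
print("z-score range:", z_score.min(), "to", z_score.max())
```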
When to use and when not
Use for algorithms like:
- k-Nearest Neighbors
- k-Means
- Logistic Regression
- SVM
- Perceptrons
- PCA and LDA
Don’t use for algorithms like:
- Decision Tree
- Random Forest
- XGBoost
- LightGBM
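A rough way to see why the two groups differ: distance-based methods compare magnitudes across features, while tree splits only compare values within one feature against a threshold. The sketch below uses three hypothetical houses and a hypothetical `nearest` helper. Scaling changes which house is the nearest neighbor, but a threshold split on one column selects the same rows before and after scaling.

```python
import numpy as np

# Hypothetical houses: [living area in sq ft, number of bedrooms].
houses = np.array([
    [1500.0, 2.0],
    [1550.0, 5.0],
    [1600.0, 2.0],
])

def nearest(data, point):
    """Index of the row closest to `point`, skipping row 0 (the query itself)."""
    dists = np.linalg.norm(data[1:] - point, axis=1)
    return int(np.argmin(dists)) + 1

# Raw units: the area column dominates the Euclidean distance.
raw_nn = nearest(houses, houses[0])

# Min-max scale each column, then repeat the search.
scaled = (houses - houses.min(axis=0)) / (houses.max(axis=0) - houses.min(axis=0))
scaled_nn = nearest(scaled, scaled[0])

# A tree-style split "area <= threshold" picks the same rows either way,
# because scaling preserves the order of values within a column.
mask_raw = houses[:, 0] <= 1550.0
mask_scaled = scaled[:, 0] <= 0.5

print("nearest neighbor (raw):", raw_nn, "(scaled):", scaled_nn)
print("same split:", np.array_equal(mask_raw, mask_scaled))
```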
Normalization Method 1 - Simple Feature Scaling
Each value is divided by the column maximum, so for non-negative data the new values are in [0, 1].
train1 = train.copy()
train1['SalePrice'] = train1['SalePrice']/train1['SalePrice'].max()
train1['SalePrice'].head()
0 0.276159
1 0.240397
2 0.296026
3 0.185430
4 0.331126
Name: SalePrice, dtype: float64
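Dividing by the maximum keeps values in [0, 1] only when the data are non-negative, as SalePrice is. For a column that can be negative, one common variant divides by the largest absolute value instead, which gives values in [-1, 1]. The helper `max_abs_scale` and the `changes` series below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def max_abs_scale(series: pd.Series) -> pd.Series:
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    return series / series.abs().max()

# Hypothetical values with a negative entry (illustrative only).
changes = pd.Series([-40.0, 10.0, 20.0])

# Plain division by the maximum leaves the negative value outside [0, 1].
print((changes / changes.max()).tolist())

# Division by the maximum absolute value keeps everything in [-1, 1].
print(max_abs_scale(changes).tolist())
```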
Normalization Method 2 - Min-Max
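The column minimum is subtracted and the result is divided by the range (max - min), so the new values are in [0, 1], with the minimum mapped to 0 and the maximum to 1.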
train2 = train.copy()
train2['SalePrice'] = (train2['SalePrice']-train2['SalePrice'].min())/(train2['SalePrice'].max()-train2['SalePrice'].min())
train2['SalePrice'].head()
0 0.241078
1 0.203583
2 0.261908
3 0.145952
4 0.298709
Name: SalePrice, dtype: float64
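When the scaled feature feeds a model, the minimum and maximum are usually computed on the training data once and then reused for any new data. The sketch below assumes that workflow. The helpers `fit_min_max` and `apply_min_max` and the two price series are hypothetical. Note that new values outside the training range fall outside [0, 1].

```python
import pandas as pd

def fit_min_max(series):
    """Return the (min, max) pair learned from the training values."""
    return float(series.min()), float(series.max())

def apply_min_max(series, lo, hi):
    """Rescale with a previously learned min and max."""
    return (series - lo) / (hi - lo)

# Hypothetical training and new prices (illustrative only).
train_prices = pd.Series([100_000.0, 200_000.0, 300_000.0])
new_prices = pd.Series([150_000.0, 400_000.0])

lo, hi = fit_min_max(train_prices)
train_scaled = apply_min_max(train_prices, lo, hi)
new_scaled = apply_min_max(new_prices, lo, hi)

print(train_scaled.tolist())
print(new_scaled.tolist())  # 400,000 exceeds the training max, so it maps above 1
```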
Standardization - Z-score
It is the process of rescaling the features so that they have mean μ=0 and standard deviation σ=1,
where μ is the mean and σ is the standard deviation of the feature; standard scores (also called z-scores) of the samples are calculated as follows:
$$z = \frac{x - \mu}{\sigma}$$
train3 = train.copy()
train3['YrSold'] = (train3['YrSold']-train3['YrSold'].mean())/train3['YrSold'].std()
train3['YrSold'].head()
0 0.138730
1 -0.614228
2 0.138730
3 -1.367186
4 0.138730
Name: YrSold, dtype: float64
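One detail to be aware of: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population standard deviation (`ddof=0`). The two give slightly different z-scores. The sketch below shows the difference on a small hypothetical `years` series.

```python
import numpy as np
import pandas as pd

# Hypothetical sale years (illustrative only).
years = pd.Series([2006.0, 2007.0, 2008.0, 2009.0, 2010.0])

sample_std = years.std()            # pandas default: ddof=1
population_std = years.std(ddof=0)  # matches np.std's default

z_sample = (years - years.mean()) / sample_std
z_population = (years - years.mean()) / population_std

print("sample std:", sample_std, "population std:", population_std)
print(z_sample.round(4).tolist())
print(z_population.round(4).tolist())
```

For a column with many rows the difference is small, but it matters when comparing results against a library that uses the other convention.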
Transformation - log1p
log1p computes log(1 + x) and is commonly used to move price values onto a log scale, which often makes statistical learning a lot easier.
Log transformations also usually work well on positively skewed data.
train4 = train.copy()
train4['SalePrice'] = np.log1p(train4['SalePrice'])
train4['SalePrice'].head()
0 12.247699
1 12.109016
2 12.317171
3 11.849405
4 12.429220
Name: SalePrice, dtype: float64
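# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(train4['SalePrice'], kde=True)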
[Figure: distribution plot of log1p-transformed SalePrice]
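Values on the log scale, such as model predictions, usually need to be mapped back to prices. `np.expm1` computes exp(y) - 1 and is the inverse of `np.log1p`. The sketch below checks the round trip on a few hypothetical prices, including 0, where a plain `np.log` would return -inf.

```python
import numpy as np

# Hypothetical prices, including a zero (illustrative only).
prices = np.array([0.0, 99.0, 140000.0, 755000.0])

logged = np.log1p(prices)    # log(1 + x), defined at x = 0
restored = np.expm1(logged)  # exp(y) - 1 undoes log1p

print(logged.round(4))
print(restored.round(2))
```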
Sources:
Medium - Standardize or Normalize? — Examples in Python
Quora - What is the difference between normalization, standardization, and regularization for data?