Pandas Tutorial
Pandas Tutorial
Are you starting with Data Science? Pandas is perhaps the first best thing you will need. And it’s really easy!
After reading (and practising) this tutorial you will learn how to:
- Create, add, remove and rename columns
- Read, select and filter data
- Retrieve statistics for data
- Sort and group data
- Manipulate data
Happy learning!
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('max_colwidth', 100)
1. Rows and Columns (create, add, remove, rename)
Create a DataFrame + add rows and columns
data_to_add = {
'Name':['Greece', 'Paraguay', 'Ireland'],
'Population':[10.77, 6.62, 4.78],
'Area':[128.9, 397.3, 68.89]
}
countries = pd.DataFrame(data_to_add, columns=['Name','Population','Area'])
#Note: if you don't specify 'columns', then they will be placed in alphabetical order
countries.head()
Name | Population | Area | |
---|---|---|---|
0 | Greece | 10.77 | 128.90 |
1 | Paraguay | 6.62 | 397.30 |
2 | Ireland | 4.78 | 68.89 |
Add data (a row) with append
countries = countries.append({'Name':'Japan',
'Population':127.25,
'Area':364.5}, ignore_index=True)
#Note: If you type the name of a column that does not exist, then it will be created with NaNs for all other rows
countries.head()
Name | Population | Area | |
---|---|---|---|
0 | Greece | 10.77 | 128.90 |
1 | Paraguay | 6.62 | 397.30 |
2 | Ireland | 4.78 | 68.89 |
3 | Japan | 127.25 | 364.50 |
names = ['Singapore', 'Gabon']
populations = [4.46, 1.64]
areas = [0.687, 257.67]
for i in range(len(names)):
countries = countries.append({'Name':names[i],
'Population':populations[i],
'Area':areas[i]}, ignore_index=True)
countries.head(10)
Name | Population | Area | |
---|---|---|---|
0 | Greece | 10.77 | 128.900 |
1 | Paraguay | 6.62 | 397.300 |
2 | Ireland | 4.78 | 68.890 |
3 | Japan | 127.25 | 364.500 |
4 | Singapore | 4.46 | 0.687 |
5 | Gabon | 1.64 | 257.670 |
Add feature (a column) with assign. Handcrafted data
countries = countries.assign(GDP = [22.08,3.81,45.83,46.72,51.7,11.43])
Add a feature (a column) from interactions with other features
countries['PopulationPerArea'] = countries['Population']/countries['Area']
countries.head(10)
Name | Population | Area | GDP | PopulationPerArea | |
---|---|---|---|---|---|
0 | Greece | 10.77 | 128.900 | 22.08 | 0.083553 |
1 | Paraguay | 6.62 | 397.300 | 3.81 | 0.016662 |
2 | Ireland | 4.78 | 68.890 | 45.83 | 0.069386 |
3 | Japan | 127.25 | 364.500 | 46.72 | 0.349108 |
4 | Singapore | 4.46 | 0.687 | 51.70 | 6.491994 |
5 | Gabon | 1.64 | 257.670 | 11.43 | 0.006365 |
Remove (drop) a row
countries = countries.drop(3, axis=0) #Drop by index
countries.head(10)
Name | Population | Area | GDP | PopulationPerArea | |
---|---|---|---|---|---|
0 | Greece | 10.77 | 128.900 | 22.08 | 0.083553 |
1 | Paraguay | 6.62 | 397.300 | 3.81 | 0.016662 |
2 | Ireland | 4.78 | 68.890 | 45.83 | 0.069386 |
4 | Singapore | 4.46 | 0.687 | 51.70 | 6.491994 |
5 | Gabon | 1.64 | 257.670 | 11.43 | 0.006365 |
Remove (drop) a column
countries = countries.drop(['PopulationPerArea'], axis=1)
countries.head(10)
Name | Population | Area | GDP | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.900 | 22.08 |
1 | Paraguay | 6.62 | 397.300 | 3.81 |
2 | Ireland | 4.78 | 68.890 | 45.83 |
4 | Singapore | 4.46 | 0.687 | 51.70 |
5 | Gabon | 1.64 | 257.670 | 11.43 |
Rename a column
countries = countries.rename(columns={'GDP':'Money'})
countries.head(10)
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.900 | 22.08 |
1 | Paraguay | 6.62 | 397.300 | 3.81 |
2 | Ireland | 4.78 | 68.890 | 45.83 |
4 | Singapore | 4.46 | 0.687 | 51.70 |
5 | Gabon | 1.64 | 257.670 | 11.43 |
2. Selecting Data
Subsets of columns of DataFrame
print(type(countries['Name'])) # A single column in [] is called Series
countries['Name']
<class 'pandas.core.series.Series'>
0 Greece
1 Paraguay
2 Ireland
4 Singapore
5 Gabon
Name: Name, dtype: object
type(countries[['Name']]) # With [[]] it is again a DataFrame
countries[['Name']]
Name | |
---|---|
0 | Greece |
1 | Paraguay |
2 | Ireland |
4 | Singapore |
5 | Gabon |
countries[['Name','Money']] # More columns
Name | Money | |
---|---|---|
0 | Greece | 22.08 |
1 | Paraguay | 3.81 |
2 | Ireland | 45.83 |
4 | Singapore | 51.70 |
5 | Gabon | 11.43 |
Subsets of rows of DataFrame with slice
countries[0:2] # Return the rows from the 0th to the 1st position
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.9 | 22.08 |
1 | Paraguay | 6.62 | 397.3 | 3.81 |
Subsets of rows of DataFrame with iloc
Same as slice
countries.iloc[0:2]
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.9 | 22.08 |
1 | Paraguay | 6.62 | 397.3 | 3.81 |
Subsets of rows and columns of DataFrame with iloc
countries.iloc[0:4,0:2]
Name | Population | |
---|---|---|
0 | Greece | 10.77 |
1 | Paraguay | 6.62 |
2 | Ireland | 4.78 |
4 | Singapore | 4.46 |
Filter rows and select specific columns
countries[countries['Population']>5][['Name','Population']]
Name | Population | |
---|---|---|
0 | Greece | 10.77 |
1 | Paraguay | 6.62 |
Select rows of DataFrame with loc
loc gets rows (or columns) with particular labels from the index
countries.loc[[1,2]]
Name | Population | Area | Money | |
---|---|---|---|---|
1 | Paraguay | 6.62 | 397.30 | 3.81 |
2 | Ireland | 4.78 | 68.89 | 45.83 |
Select rows and columns of DataFrame with loc
countries.loc[[1,2], ['Name','Area']]
Name | Area | |
---|---|---|
1 | Paraguay | 397.30 |
2 | Ireland | 68.89 |
Select specific cell (row,column) of a DataFrame with at
countries.at[1,'Name'] # index,column_name
'Paraguay'
3. Filtering Data
Relational Operators
countries[countries['Area']>100]
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.90 | 22.08 |
1 | Paraguay | 6.62 | 397.30 | 3.81 |
5 | Gabon | 1.64 | 257.67 | 11.43 |
Logical Operators
countries[(countries['Area']>100) & (countries['Population']>5)] # don't forget brackets
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.9 | 22.08 |
1 | Paraguay | 6.62 | 397.3 | 3.81 |
countries[countries['Name'].isin(['Greece','Gabon'])]
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.90 | 22.08 |
5 | Gabon | 1.64 | 257.67 | 11.43 |
If contains string
countries[countries['Name'].str.contains('e')]
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.900 | 22.08 |
2 | Ireland | 4.78 | 68.890 | 45.83 |
4 | Singapore | 4.46 | 0.687 | 51.70 |
Use query command to find a specific values in a column
countries.query('Area>100 & Population>5')
Name | Population | Area | Money | |
---|---|---|---|---|
0 | Greece | 10.77 | 128.9 | 22.08 |
1 | Paraguay | 6.62 | 397.3 | 3.81 |
4. Reading Data
House Prices Dataset from https://www.kaggle.com/c/house-prices-advanced-regression-techniques
train = pd.read_csv('train.csv', sep=',')
test = pd.read_csv('test.csv', sep=',')
Headers of DataFrame
train.columns.values
array(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu',
'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive',
'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature',
'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition',
'SalePrice'], dtype=object)
Dimensions of DataFrame
train.shape
(1460, 81)
5. Statistics
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
train.describe()
Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1379.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | 567.240411 | 1057.429452 | 1162.626712 | 346.992466 | 5.844521 | 1515.463699 | 0.425342 | 0.057534 | 1.565068 | 0.382877 | 2.866438 | 1.046575 | 6.517808 | 0.613014 | 1978.506164 | 1.767123 | 472.980137 | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | 441.866955 | 438.705324 | 386.587738 | 436.528436 | 48.623081 | 525.480383 | 0.518911 | 0.238753 | 0.550916 | 0.502885 | 0.815778 | 0.220338 | 1.625393 | 0.644666 | 24.689725 | 0.747315 | 213.804841 | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 1900.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | 0.000000 | 223.000000 | 795.750000 | 882.000000 | 0.000000 | 0.000000 | 1129.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 5.000000 | 0.000000 | 1961.000000 | 1.000000 | 334.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | 0.000000 | 477.500000 | 991.500000 | 1087.000000 | 0.000000 | 0.000000 | 1464.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 6.000000 | 1.000000 | 1980.000000 | 2.000000 | 480.000000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | 0.000000 | 808.000000 | 1298.250000 | 1391.250000 | 728.000000 | 0.000000 | 1776.750000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 7.000000 | 1.000000 | 2002.000000 | 2.000000 | 576.000000 | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1474.000000 | 2336.000000 | 6110.000000 | 4692.000000 | 2065.000000 | 572.000000 | 5642.000000 | 3.000000 | 2.000000 | 3.000000 | 2.000000 | 8.000000 | 3.000000 | 14.000000 | 3.000000 | 2010.000000 | 4.000000 | 1418.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
mean, median, quantile, min, max
print('Year Build mean: {:.2f}'.format(train['YearBuilt'].mean()))
print('Year Build median: {:.2f}'.format(train['YearBuilt'].median()))
print('Year Build quantile 25%: {:.2f}'.format(train['YearBuilt'].quantile(0.25)))
print('Year Build quantile 70%: {:.2f}'.format(train['YearBuilt'].quantile(0.7)))
print('Year Build min: {:.2f}'.format(train['YearBuilt'].min()))
print('Year Build max: {:.2f}'.format(train['YearBuilt'].max()))
Year Build mean: 1971.27
Year Build median: 1973.00
Year Build quantile 25%: 1954.00
Year Build quantile 70%: 1997.30
Year Build min: 1872.00
Year Build max: 2010.00
Find the value of the index that has max
train['YearBuilt'].idxmax()
378
Count the different occurencies of a column
train['Fireplaces'].value_counts()
0 690
1 650
2 115
3 5
Name: Fireplaces, dtype: int64
Find unique instances of a column
print(train['Fireplaces'].nunique()) #how many unique
train['Fireplaces'].unique() #displays the unique
4
array([0, 1, 2, 3], dtype=int64)
Count
train['YearBuilt'].count()
1460
Sum
train['SalePrice'].sum()
264144946
Mode (Most common value)
train['YearBuilt'].mode()
0 2006
dtype: int64
6. Sorting and Grouping Data
Sort
train.sort_values(by='SalePrice', ascending=False).head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
691 | 692 | 60 | RL | 104.0 | 21535 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 6 | 1994 | 1995 | Gable | WdShngl | HdBoard | HdBoard | BrkFace | 1170.0 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 1455 | Unf | 0 | 989 | 2444 | GasA | Ex | Y | SBrkr | 2444 | 1872 | 0 | 4316 | 0 | 1 | 3 | 1 | 4 | 1 | Ex | 10 | Typ | 2 | Ex | Attchd | 1994.0 | Fin | 3 | 832 | TA | TA | Y | 382 | 50 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2007 | WD | Normal | 755000 |
1182 | 1183 | 60 | RL | 160.0 | 15623 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 5 | 1996 | 1996 | Hip | CompShg | Wd Sdng | ImStucc | None | 0.0 | Gd | TA | PConc | Ex | TA | Av | GLQ | 2096 | Unf | 0 | 300 | 2396 | GasA | Ex | Y | SBrkr | 2411 | 2065 | 0 | 4476 | 1 | 0 | 3 | 1 | 4 | 1 | Ex | 10 | Typ | 2 | TA | Attchd | 1996.0 | Fin | 3 | 813 | TA | TA | Y | 171 | 78 | 0 | 0 | 0 | 555 | Ex | MnPrv | NaN | 0 | 7 | 2007 | WD | Abnorml | 745000 |
1169 | 1170 | 60 | RL | 118.0 | 35760 | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 5 | 1995 | 1996 | Hip | CompShg | HdBoard | HdBoard | BrkFace | 1378.0 | Gd | Gd | PConc | Ex | TA | Gd | GLQ | 1387 | Unf | 0 | 543 | 1930 | GasA | Ex | Y | SBrkr | 1831 | 1796 | 0 | 3627 | 1 | 0 | 3 | 1 | 4 | 1 | Gd | 10 | Typ | 1 | TA | Attchd | 1995.0 | Fin | 3 | 807 | TA | TA | Y | 361 | 76 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | WD | Normal | 625000 |
898 | 899 | 20 | RL | 100.0 | 12919 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | Norm | 1Fam | 1Story | 9 | 5 | 2009 | 2010 | Hip | CompShg | VinylSd | VinylSd | Stone | 760.0 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 2188 | Unf | 0 | 142 | 2330 | GasA | Ex | Y | SBrkr | 2364 | 0 | 0 | 2364 | 1 | 0 | 2 | 1 | 2 | 1 | Ex | 11 | Typ | 2 | Gd | Attchd | 2009.0 | Fin | 3 | 820 | TA | TA | Y | 0 | 67 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | New | Partial | 611657 |
803 | 804 | 60 | RL | 107.0 | 13891 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | Norm | 1Fam | 2Story | 9 | 5 | 2008 | 2009 | Hip | CompShg | VinylSd | VinylSd | Stone | 424.0 | Ex | TA | PConc | Ex | TA | Gd | Unf | 0 | Unf | 0 | 1734 | 1734 | GasA | Ex | Y | SBrkr | 1734 | 1088 | 0 | 2822 | 0 | 0 | 3 | 1 | 4 | 1 | Ex | 12 | Typ | 1 | Gd | BuiltIn | 2009.0 | RFn | 3 | 1020 | TA | TA | Y | 52 | 170 | 0 | 0 | 192 | 0 | NaN | NaN | NaN | 0 | 1 | 2009 | New | Partial | 582933 |
train.sort_values(['Fireplaces', 'SalePrice'], ascending=[0,1])[['Fireplaces', 'SalePrice']].head(8) # 0 means ascending=False
Fireplaces | SalePrice | |
---|---|---|
1298 | 3 | 160000 |
166 | 3 | 190000 |
605 | 3 | 205000 |
642 | 3 | 345000 |
309 | 3 | 360000 |
393 | 2 | 100000 |
662 | 2 | 110000 |
9 | 2 | 118000 |
Group
train.groupby('Fireplaces')['SalePrice'].mean()
Fireplaces
0 141331.482609
1 211843.909231
2 240588.539130
3 252000.000000
Name: SalePrice, dtype: float64
Group continious variables to distinct classes with cut
train['YearBuilt_Distinct'] = pd.cut(train['YearBuilt'], [-1, 1950, 1980, 2000, 2010], labels=['Before 1951','1951-1980', '1981-1999', '2001-2010'])
train['YearBuilt_Distinct'].head()
0 2001-2010
1 1951-1980
2 2001-2010
3 Before 1951
4 1981-1999
Name: YearBuilt_Distinct, dtype: category
Categories (4, object): [Before 1951 < 1951-1980 < 1981-1999 < 2001-2010]
7. Manipulating Data
Transform
Applies to groupby. The output is the same shape as the input dataframe. The grouped values are placed in each corresponding row.
train['meanLivAreaPerYearBuilt'] = train.groupby('YearBuilt')['GrLivArea'].transform('mean')
train[['YearBuilt','meanLivAreaPerYearBuilt']][25:30] # See that same years have the same value
YearBuilt | meanLivAreaPerYearBuilt | |
---|---|---|
25 | 2007 | 1721.448980 |
26 | 1951 | 1239.166667 |
27 | 2007 | 1721.448980 |
28 | 1957 | 1379.850000 |
29 | 1927 | 838.666667 |
Map
Works with Series (one column). Use it with a lambda function and think of it like a for loop. For each row in column ‘GrLivArea’, run a function
train['GrLivArea'].map(lambda x: x/train['GrLivArea'].max()).head()
0 0.303084
1 0.223680
2 0.316554
3 0.304325
4 0.389578
Name: GrLivArea, dtype: float64
Apply
Works with DataFrame (one or more columns). Use it with lambda like map
train[['YearBuilt','GrLivArea']].apply(lambda x: x - x.min()).head()
YearBuilt | GrLivArea | |
---|---|---|
0 | 131 | 1376 |
1 | 104 | 928 |
2 | 129 | 1452 |
3 | 43 | 1383 |
4 | 128 | 1864 |
Or use it to get single values as answer to a function
train[['YearBuilt','GrLivArea']].apply(lambda x: x.max() - x.min())
YearBuilt 138
GrLivArea 5308
dtype: int64
List Comprehension
train['GrLivArea_binary'] = ['big' if i>1500 else 'normal' for i in train['GrLivArea']]
train[['GrLivArea','GrLivArea_binary']].head()
GrLivArea | GrLivArea_binary | |
---|---|---|
0 | 1710 | big |
1 | 1262 | normal |
2 | 1786 | big |
3 | 1717 | big |
4 | 2198 | big |
Change/Replace values based on condition
train.loc[train['GrLivArea'] > 2000, 'GrLivArea'] = 2000
Leave a Comment