
House Price Prediction in Python Using Machine Learning

By Rahul Makwana

In this tutorial, we will learn how to do exploratory data analysis and feature engineering, and how to apply several regression models to house prices using Python.

In this project, I have applied some supervised-learning regression methods in Python to predict the house price.

We will be using the PyCharm IDE to solve this problem. If you do not have it on your system, you can download it from the official website (make sure you download the Community edition). We will be using the Python language to solve this problem.

What is Supervised Learning?

Supervised learning is when a model is trained on a labelled dataset, i.e. one that has both input and output parameters (for example, our output column is called 'SalePrice'). So we have to apply supervised machine learning algorithms.

It is further divided into two types: classification and regression.

Classification is used to categorize a set of data into classes, for example dog vs. cat, 'red' vs. 'blue', or 'disease' vs. 'no disease'.

Regression is used to predict continuous values based on independent variables. A regression model is evaluated by calculating an error value: the smaller the error, the better the model.
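For example, a tiny regression sketch on made-up numbers (the data here is purely illustrative):

#a minimal, illustrative regression example on made-up data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[1], [2], [3], [4]])     # a single input feature
y = np.array([100, 205, 298, 402])     # the values we want to predict

model = LinearRegression().fit(X, y)
print(mean_squared_error(y, model.predict(X)))   # smaller error means a better fit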

Examples of Supervised Learning Algorithms:

1) Linear Regression
2) Nearest Neighbor
3) Gaussian Naive Bayes
4) Decision Trees
5) Support Vector Machine (SVM)
6) Random Forest
7) XGBoost
8) AdaBoost


Since in this problem we are going to predict continuous values, we will apply regression algorithms.

 

We are going to need some packages and libraries:

1) NumPy - for linear algebraic operations.

2) scikit-learn - includes many statistical models.

3) pandas - to load the dataset.

4) matplotlib and seaborn - for plotting.
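If any of these are not installed yet, they can usually be added with pip (the names below are the standard PyPI package names):

pip install numpy pandas scikit-learn matplotlib seaborn scipy xgboost lightgbm mlxtend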

 

I have taken a very basic approach, and I hope you find it useful.

My main objectives in this project are:

  • Applying exploratory data analysis and trying to get some insights about our dataset
  • Getting the data into better shape through transformations and feature engineering, to help us build better models
  • Building and tuning a couple of models to get stable results when predicting housing prices

Dataset link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

 

STEP:1 Importing necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import mean_squared_error
from mlxtend.regressor import StackingCVRegressor
import xgboost
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

STEP:2 Load the dataset

df_train = pd.read_csv('G:/projects/house price prediction kaggle/dataset/train.csv')
df_test = pd.read_csv('G:/projects/house price prediction kaggle/dataset/test.csv')
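Before going further, it is worth taking a quick first look at the data; a small sketch using the two frames loaded above:

#quick first look at the data
print(df_train.shape)                    # rows and columns in the training set
print(df_test.shape)                     # the test set has no SalePrice column
print(df_train.head())                   # first few rows
print(df_train['SalePrice'].describe())  # summary statistics of the target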

Now I will describe the dataset: how many columns there are and what each column stands for.

Here's a brief version of the data.

  • SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict. (Dependent Variable)
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

 

The training data frame has 1460 observations of 80 variables (excluding Id), and the test set has 1459 rows and 79 columns.

Correlation matrix:

A correlation matrix is a table showing correlation coefficients between variables: each cell shows the correlation between two variables. A correlation matrix is used to summarize data and as an input into more advanced analysis.

#correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=.8, square=True)

 

The code below plots a correlation heatmap of the 10 variables most correlated with SalePrice.

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10,10))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

 

OUTPUT:-

SalePrice, OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt

So, the columns above are the ones most highly correlated with the dependent variable (SalePrice itself naturally tops the list).

#values of correlation
abs(df_train.corr()['SalePrice']).nlargest(10)

OUTPUT:-

SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
Name: SalePrice, dtype: float64

 

Now we will look at the distributions of SalePrice, GrLivArea, and OverallQual using sns.distplot.

We transform the target variable SalePrice, along with GrLivArea and OverallQual, to log values so that their distributions become less skewed and the error is equally impactful across the range.

#before log transformation
#you can see the distribution by plotting it
sns.distplot(df_train['SalePrice']);
fig_saleprice = plt.figure(figsize=(12,5))
result1 = stats.probplot(df_train['SalePrice'],plot = plt)

sns.distplot(df_train['GrLivArea']);
fig_GrLivArea = plt.figure(figsize=(12,5))
result1 = stats.probplot(df_train['GrLivArea'],plot = plt)

sns.distplot(df_train['OverallQual']);
fig_OverallQual = plt.figure(figsize=(12,5))
result1 = stats.probplot(df_train['OverallQual'],plot = plt)

#then apply this and again do plotting you will see the difference
df_train['SalePrice'] = np.log(df_train['SalePrice'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
df_train['OverallQual'] = np.log(df_train['OverallQual'])
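You can quantify the effect of the transform by checking the skewness before and after; values closer to zero indicate a more symmetric distribution (a small sketch; run it before and after the log transform to see the difference):

#skewness check
print('SalePrice skew:', df_train['SalePrice'].skew())
print('GrLivArea skew:', df_train['GrLivArea'].skew())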

 

We're going to merge the two datasets here before we start editing them, so we don't have to do the same operations twice. The combined frame, df, has 2919 observations and 79 features to begin with...

 

frames = [df_train,df_test]
df = pd.concat(frames,keys=['train','test'])
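A quick sanity check on the combined frame (the keys passed to pd.concat let us split it back into train and test later):

print(df.shape)               # all train and test rows together
print(df.loc['train'].shape)  # rows that came from train.csv
print(df.loc['test'].shape)   # rows that came from test.csv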

STEP:3 Handling Missing Values

#sum of missing data
df.isnull().sum().sort_values(ascending=False)

OUTPUT:- (columns with missing values and their counts)

SalePrice       1459
MSZoning           4
LotFrontage      486
Alley           2721
Utilities          2
Exterior1st        1
Exterior2nd        1
MasVnrType        24
MasVnrArea        23
BsmtQual          81
BsmtCond          82
BsmtExposure      82
BsmtFinType1      79
BsmtFinSF1        79
BsmtFinType2      80
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
Electrical         1
BsmtFullBath       2
BsmtHalfBath       2
KitchenQual        1
Functional         2
FireplaceQu     1420
GarageType       157
GarageYrBlt      159
GarageFinish     159
GarageCars         1
GarageArea         1
GarageQual       159
GarageCond       159
PoolQC          2909
Fence           2348
MiscFeature     2814
SaleType           1


Now I am going to fill the missing values: in numerical columns I will replace NaN values with 0, and in categorical columns with 'None'.

# handling missing values of numerical columns
df['LotFrontage'] = df['LotFrontage'].fillna(value=0)
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(value=0)
df['MasVnrArea'] = df['MasVnrArea'].fillna(value=0)
df['BsmtFullBath'] = df['BsmtFullBath'].fillna(value=0)
df['BsmtHalfBath'] = df['BsmtHalfBath'].fillna(value=0)
df['GarageArea'] = df['GarageArea'].fillna(value=0)
df['BsmtFinSF2'] = df['BsmtFinSF2'].fillna(value=0)
df['TotalBsmtSF'] = df['TotalBsmtSF'].fillna(value=0)
df['GarageCars'] = df['GarageCars'].fillna(value=0)
df['BsmtUnfSF'] = df['BsmtUnfSF'].fillna(value=0)
df['BsmtFinSF1'] = df['BsmtFinSF1'].fillna(value=0)

# handling missing values of categorical columns
df['MSZoning'] = df['MSZoning'].fillna(value='None')
df['GarageQual'] = df['GarageQual'].fillna(value='None')
df['GarageCond'] = df['GarageCond'].fillna(value='None')
df['GarageFinish'] = df['GarageFinish'].fillna(value='None')
df['GarageType'] = df['GarageType'].fillna(value='None')
df['BsmtExposure'] = df['BsmtExposure'].fillna(value='None')
df['BsmtCond'] = df['BsmtCond'].fillna(value='None')
df['BsmtQual'] = df['BsmtQual'].fillna(value='None')
df['BsmtFinType2'] = df['BsmtFinType2'].fillna(value='None')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna(value='None')
df['MasVnrType'] = df['MasVnrType'].fillna(value='None')
df['Utilities'] = df['Utilities'].fillna(value='None')
df['Functional'] = df['Functional'].fillna(value='None')
df['Exterior1st'] = df['Exterior1st'].fillna(value='None')
df['Exterior2nd'] = df['Exterior2nd'].fillna(value='None')
df['Electrical'] = df['Electrical'].fillna(value='None')
df['KitchenQual'] = df['KitchenQual'].fillna(value='None')
df['SaleType'] = df['SaleType'].fillna(value='None')
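If you prefer not to write one line per column, the same rule (0 for numerical columns, 'None' for categorical ones) can be applied more compactly; a sketch, leaving SalePrice alone since it is missing for the test rows by design:

#compact alternative: fill numerical columns with 0 and categorical columns with 'None'
num_cols = df.select_dtypes(include='number').columns.drop('SalePrice')
cat_cols = df.select_dtypes(include='object').columns
df[num_cols] = df[num_cols].fillna(0)
df[cat_cols] = df[cat_cols].fillna('None')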

 

Columns with the most missing values:

PoolQC 2909

MiscFeature 2814

Alley 2721

Fence 2348

FireplaceQu 1420

These five features are missing in roughly half to nearly all of the 2919 rows, so we will drop them.

df = df.drop(columns={'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'})
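After these steps it is worth confirming that nothing is left behind; only SalePrice (missing for the test rows by design) should still show up:

#remaining missing values - only SalePrice for the test rows should remain
missing = df.isnull().sum()
print(missing[missing > 0])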

STEP:4 Outliers

Outlier: an outlier is an object that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors. Most data mining methods discard outliers as noise or exceptions; however, in some applications such as fraud detection, rare events can be more interesting than the regularly occurring ones, so outlier analysis becomes important in such cases.

#now we are going to detect outliers by plotting SalePrice against each of the
#highly correlated features found above
for col in ['GrLivArea', 'OverallQual', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
            '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']:
    plt.subplots()
    plt.scatter(x=df_train[col], y=df_train['SalePrice'])
    plt.ylabel('SalePrice', fontsize=13)
    plt.xlabel(col, fontsize=13)
    plt.show()

The loop above plots SalePrice against the nine most highly correlated features, which gives you an idea of which points are outliers. The code below is then used to delete those outliers.

df = df.drop(df[(df['GrLivArea']>4000) & (df['SalePrice']<300000)].index)
df = df.drop(df[(df['GarageArea']>1200) & (df['SalePrice']<500000)].index)
df = df.drop(df[(df['TotalBsmtSF']>3000) & (df['SalePrice']<700000)].index)
df = df.drop(df[(df['1stFlrSF']>2700) & (df['SalePrice']<700000)].index)
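You can check how many rows were actually removed:

#rows left after dropping the outliers
print(df.loc['train'].shape)   # should be a few rows short of the original 1460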

STEP:5 Feature Engineering

Here I have combined some related columns, just to reduce complexity; the same transformation is applied to both the train and the test frames.

#feature engineering
df_train['TotalSF'] = df_train['TotalBsmtSF']+df_train['1stFlrSF']+df_train['2ndFlrSF']
df_train=df_train.drop(columns={'1stFlrSF', '2ndFlrSF','TotalBsmtSF'})
df_train['wholeExterior'] = df_train['Exterior1st']+df_train['Exterior2nd']
df_train=df_train.drop(columns={'Exterior1st','Exterior2nd'})
df_train['Bsmt'] = df_train['BsmtFinSF1']+ df_train['BsmtFinSF2']
df_train = df_train.drop(columns={'BsmtFinSF1','BsmtFinSF2'})
df_train['TotalBathroom'] = df_train['FullBath'] + df_train['HalfBath']
df_train = df_train.drop(columns={'FullBath','HalfBath'})


df_test['TotalSF'] = df_test['TotalBsmtSF']+df_test['1stFlrSF']+df_test['2ndFlrSF']
df_test=df_test.drop(columns={'1stFlrSF', '2ndFlrSF','TotalBsmtSF'})
df_test['wholeExterior'] = df_test['Exterior1st']+df_test['Exterior2nd']
df_test=df_test.drop(columns={'Exterior1st','Exterior2nd'})
df_test['Bsmt'] = df_test['BsmtFinSF1']+ df_test['BsmtFinSF2']
df_test = df_test.drop(columns={'BsmtFinSF1','BsmtFinSF2'})
df_test['TotalBathroom'] = df_test['FullBath'] + df_test['HalfBath']
df_test = df_test.drop(columns={'FullBath','HalfBath'})

 

OVERALL: there are 2919 observations with 76 columns, including the target variable SalePrice and the Id. The train set has 1460 observations while the test set has 1459; the target variable SalePrice is absent in the test set. The aim of this study is to train a model on the train set and use it to predict SalePrice for the test set.

Now it's time to apply the get_dummies function; if you would rather use a one-hot encoder, that will also work (see the sketch after the output below).

#encoded
df_main = pd.get_dummies(df)
df_main.shape

OUTPUT:-

(2919, 339)
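For reference, scikit-learn's OneHotEncoder can produce an equivalent encoding of the categorical columns; a rough sketch (constructor arguments differ slightly between scikit-learn versions):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = df.select_dtypes(include='object').columns
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder='passthrough')   # numerical columns pass through unchanged
encoded = encoder.fit_transform(df)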

 

Now we have the train and test data in one data frame, but we need to predict the sale price for the test set, so here I split the data frame back into two parts:

test and train, where the test shape is (1459, 339) and the train shape is (1456, 339) because we dropped the outliers. Then we build X_train and y_train.

df_test = df_main.loc['test']
df_train = df_main.loc['train']

df_test = df_test.drop(['SalePrice', 'Id'], axis=1)


X_train = df_train.drop(['SalePrice', 'Id'], axis=1)
y_train = df_train['SalePrice']


# scale data before regression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
df_test = scaler.transform(df_test)   # reuse the scaler fitted on the training data

 

STEP:6 Applying algorithms

 

We will apply: an XGBoost regressor, GradientBoostingRegressor, RandomForestRegressor, an LGBMRegressor, a support vector regressor (SVR), and a stacked regressor.

#xgboost regressor (named xgb_reg so we don't shadow the xgboost module)
xgb_reg = xgboost.XGBRegressor(learning_rate=0.05,
                               colsample_bytree=0.5,
                               subsample=0.8,
                               n_estimators=1000,
                               max_depth=5,
                               gamma=5)
xgb_reg.fit(X_train,y_train)
y_pred1 = xgb_reg.predict(df_test)

#GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4, max_features='sqrt',
                                min_samples_leaf=15, min_samples_split=10, loss='huber', random_state=42)
gbr.fit(X_train,y_train)
y_pred2 = gbr.predict(df_test)

#Random forest regressor
rf = RandomForestRegressor(n_estimators=500, max_depth=2, criterion='mse', max_features='sqrt', bootstrap=False,
                           n_jobs=-1, random_state=0, min_samples_leaf=200)
rf.fit(X_train,y_train)
y_pred3 = rf.predict(df_test)

#lightgbm regressor
lightgbm = LGBMRegressor(objective='regression',
                         num_leaves=4,
                         learning_rate=0.01,
                         n_estimators=5000,
                         max_bin=200,
                         bagging_fraction=0.75,
                         bagging_freq=5,
                         bagging_seed=7,
                         feature_fraction=0.2,
                         feature_fraction_seed=7,
                         verbose=-1,
                         )
lightgbm.fit(X_train,y_train)
y_pred4 = lightgbm.predict(df_test)


#support vector regressor
#Fitting SVR to the dataset
from sklearn.svm import SVR
svr_reg = SVR(kernel = 'rbf')
svr_reg.fit(X_train, y_train)
y_pred5 = svr_reg.predict(df_test)

#StackingCVRegressor
stack_gen = StackingCVRegressor(regressors=(rf, gbr, xgb_reg, lightgbm, svr_reg),
                                meta_regressor=xgb_reg,
                                use_features_in_secondary=True)
stack_gen.fit(X_train,y_train)
y_pred6 = stack_gen.predict(df_test)

Checking the Root Mean Square Error (RMSE): RMSE is a quadratic scoring rule that measures the average magnitude of the error. It is the square root of the average of the squared differences between the predictions and the actual observations.
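In code, RMSE is just the square root of the mean squared error; a tiny illustration on made-up numbers:

#RMSE on a handful of made-up values, purely to illustrate the formula
import numpy as np
y_true = np.array([12.0, 11.5, 12.8])
y_hat = np.array([11.8, 11.9, 12.5])
print(np.sqrt(np.mean((y_true - y_hat) ** 2)))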

 

 

#rmse
y_test = y_train.drop([10], axis=0)
from math import sqrt
print('xgb rmse', sqrt(mean_squared_error(y_test, y_pred1)))
print('gbr rmse', sqrt(mean_squared_error(y_test, y_pred2)))
print('rf rmse', sqrt(mean_squared_error(y_test, y_pred3)))
print('lightgbm rmse', sqrt(mean_squared_error(y_test, y_pred4)))
print('svr rmse:', sqrt(mean_squared_error(y_test, y_pred5)))
print('stacked rmse:', sqrt(mean_squared_error(y_test, y_pred6)))

OUTPUT:-

xgb rmse: 0.1223501568206363
gbr rmse: 0.5585375883105338
rf rmse: 0.43600854434323927
lightgbm rmse: 0.5596622356678556
svr rmse: 0.5246953605047906
stacked rmse: 0.5026308085477498

So, as you can see, the XGBoost regressor is the best-fitting model here. These error values may vary from run to run.
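Note that the RMSE values above compare predictions made on the test rows against (trimmed) training targets, so they are only a rough indication. A more reliable estimate is to hold out part of the training data as a validation set; a minimal sketch, assuming X_train and y_train as built above:

#hold out 20% of the training data purely for evaluation
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4,
                                  max_features='sqrt', loss='huber', random_state=42)
model.fit(X_tr, y_tr)
val_pred = model.predict(X_val)
print('validation rmse:', sqrt(mean_squared_error(y_val, val_pred)))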

It's your turn now: you can do more feature engineering on the dataset and try to find an even more accurate model.
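If you want to submit predictions to the Kaggle competition, remember that SalePrice was log-transformed earlier, so map the predictions back with np.exp before writing the file; a rough sketch (re-reading the Id column from the same test.csv path used above):

#invert the log transform and write a Kaggle-style submission file
test_ids = pd.read_csv('G:/projects/house price prediction kaggle/dataset/test.csv')['Id']
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': np.exp(y_pred6)})
submission.to_csv('submission.csv', index=False)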

 

Conclusion: in this tutorial we learned how to explore the house-price dataset, handle missing values and outliers, engineer features, and apply several regression models (XGBoost, gradient boosting, random forest, LightGBM, SVR, and a stacked regressor) to predict house prices.

 
