In this tutorial, we will learn how to do exploratory data analysis and feature engineering, and apply several regression models to house prices using Python.
In this project, I have applied some supervised-learning regression methods in Python to predict house prices.
We will be using the PyCharm IDE to solve this problem; if you don't have that IDE on your system, you can download it from the official website (make sure you download the Community edition). We will be using the Python language throughout.
What is Supervised Learning?
Supervised learning is when a model is trained on a labeled dataset, i.e., one that has both input and output parameters (for example, our output column is called 'SalePrice'). So we have to apply supervised machine learning algorithms.
It is further classified into two types: classification and regression.
Classification is used to categorize a set of data into classes, for example dog vs. cat, 'red' vs. 'blue', or 'disease' vs. 'no disease'.
Regression is used to predict continuous numeric values based on the independent variables. The model's evaluation is done by calculating the error value: the smaller the error, the greater the accuracy of our regression model.
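To make that concrete, here is a minimal, self-contained sketch (on toy data, not the house-price dataset) of fitting a regression model and measuring its error:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# toy example: predict y from a single feature x
X = [[1], [2], [3], [4], [5]]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# the smaller this error, the better the model fits the data
print(mean_squared_error(y, predictions))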
Example of Supervised Learning Algorithms:
1) Linear Regression
2) Nearest Neighbor
3) Gaussian Naive Bayes
4) Decision Trees
5) Support Vector Machine (SVM)
6) Random Forest
7) XGBoost
8) AdaBoost
So in this problem, we are going to predict continuous values (sale prices). Hence, we will apply regression algorithms.
We are going to need some packages and libraries:
1) NumPy - for linear algebraic operations.
2) Scikit-learn - includes many statistical models.
3) Pandas - to load the dataset.
4) Matplotlib and Seaborn - for different kinds of plotting.
I have taken a very basic approach and I hope you find it useful.
My main objectives in this project are exploratory data analysis, feature engineering, and comparing several regression models on the house-price data.
dataset link:- https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import xgboost
from sklearn.metrics import mean_squared_error
from mlxtend.regressor import StackingCVRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
df_train = pd.read_csv('G:/projects/house price prediction kaggle/dataset/train.csv')
df_test = pd.read_csv('G:/projects/house price prediction kaggle/dataset/test.csv')
Now I will explain the dataset: how many columns there are and what each column stands for.
Here's a brief version of the data.
The training data frame has 1460 observations of 80 variables (plus an Id column), and the test set has 1459 rows and 79 feature columns (plus Id).
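If you want to verify those shapes yourself after loading the data above, a quick check is:

print(df_train.shape)  # expected (1460, 81): Id + 79 features + SalePrice
print(df_test.shape)   # expected (1459, 80): Id + 79 features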
correlation matrix:-
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data or as an input to a more advanced analysis.
#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=.8, square=True)
The code below plots the correlation matrix of the 10 variables most correlated with SalePrice.
#saleprice correlation matrix
k = 10  # number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10, 10))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
OUTPUT:- a 10x10 annotated heatmap showing the correlations among SalePrice, OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, and YearBuilt.
So, the above 10 columns are the ones most highly correlated with the dependent variable.
#values of correlation
abs(df_train.corr()['SalePrice']).nlargest(10)
OUTPUT:-
SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
Name: SalePrice, dtype: float64
Now we will look at the distributions of SalePrice, GrLivArea, and OverallQual using sns.distplot, before and after a log transformation.
#before log transformation
#you can see the distribution by plotting it
sns.distplot(df_train['SalePrice'])
fig_saleprice = plt.figure(figsize=(12, 5))
result1 = stats.probplot(df_train['SalePrice'], plot=plt)

sns.distplot(df_train['GrLivArea'])
fig_GrLivArea = plt.figure(figsize=(12, 5))
result2 = stats.probplot(df_train['GrLivArea'], plot=plt)

sns.distplot(df_train['OverallQual'])
fig_OverallQual = plt.figure(figsize=(12, 5))
result3 = stats.probplot(df_train['OverallQual'], plot=plt)

#then apply this and do the plotting again; you will see the difference
df_train['SalePrice'] = np.log(df_train['SalePrice'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
df_train['OverallQual'] = np.log(df_train['OverallQual'])
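The plots show the shape of each distribution visually; if you also want a quick numeric check, pandas' skew() gives one number per column, and it should move much closer to zero after the np.log transform (a small optional check, run before and after the transform above):

# skewness close to 0 means a roughly symmetric distribution;
# compare this value before and after applying np.log
print(df_train['SalePrice'].skew())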
We're going to merge the two datasets here, before we start editing them, so we don't have to do these operations twice; let's call the combined frame df. So our data has 2919 observations and 79 features to begin with.
frames = [df_train, df_test]
df = pd.concat(frames, keys=['train', 'test'])
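Because we passed keys to pd.concat, the combined frame gets an outer index level, so either part can be pulled back out at any time; a quick sanity check might look like this:

# the keys argument adds a 'train'/'test' level to the index
print(df.loc['train'].shape)  # rows that came from train.csv
print(df.loc['test'].shape)   # rows that came from test.csv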
#sum of missing data
df.isnull().sum().sort_values(ascending=False)
OUTPUT:-
SalePrice: 1459
MSZoning: 4
LotFrontage: 486
Alley: 2721
Utilities: 2
Exterior1st: 1
Exterior2nd: 1
MasVnrType: 24
MasVnrArea: 23
BsmtQual: 81
BsmtCond: 82
BsmtExposure: 82
BsmtFinType1: 79
BsmtFinSF1: 79
BsmtFinType2: 80
BsmtFinSF2: 1
BsmtUnfSF: 1
TotalBsmtSF: 1
Electrical: 1
BsmtFullBath: 2
BsmtHalfBath: 2
KitchenQual: 1
Functional: 2
FireplaceQu: 1420
GarageType: 157
GarageYrBlt: 159
GarageFinish: 159
GarageCars: 1
GarageArea: 1
GarageQual: 159
GarageCond: 159
PoolQC: 2909
Fence: 2348
MiscFeature: 2814
SaleType: 1
Length: 36, dtype: int64
Now I am going to fill in the missing values: in numerical columns I will replace NaN values with 0, and in categorical columns with "None".
# handling missing values: numerical columns get 0, categorical columns get 'None'
num_cols = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
            'GarageArea', 'BsmtFinSF2', 'TotalBsmtSF', 'GarageCars', 'BsmtUnfSF', 'BsmtFinSF1']
for col in num_cols:
    df[col] = df[col].fillna(value=0)

cat_cols = ['MSZoning', 'GarageQual', 'GarageCond', 'GarageFinish', 'GarageType',
            'BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1',
            'MasVnrType', 'Utilities', 'Functional', 'Exterior1st', 'Exterior2nd',
            'Electrical', 'KitchenQual', 'SaleType']
for col in cat_cols:
    df[col] = df[col].fillna(value='None')
The columns with the most missing values:
PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
These five features have almost 90% missing values, so we will drop them.
df = df.drop(columns={'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'})
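At this point it's worth confirming that the only missing values left are the SalePrice entries for the test rows; a quick check along these lines should show that:

# after imputation and dropping the sparse columns, only SalePrice
# (missing by definition for the test rows) should remain
remaining = df.isnull().sum()
print(remaining[remaining > 0])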
outlier:- An outlier is an object that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors. Most data mining methods discard outliers as noise or exceptions; however, in some applications such as fraud detection, the rare events can be more interesting than the regularly occurring ones, so outlier analysis becomes important in such cases.
#now we are going to detect outliers in the highly correlated columns
cols_to_check = ['GrLivArea', 'OverallQual', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
                 '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
for col in cols_to_check:
    plt.subplots()
    plt.scatter(x=df_train[col], y=df_train['SalePrice'])
    plt.ylabel('SalePrice', fontsize=13)
    plt.xlabel(col, fontsize=13)
    plt.show()
The above code plots each of the highly correlated columns against SalePrice, and from those scatter plots you get an idea of which points are outliers. The code below is used to delete the outliers.
df = df.drop(df[(df['GrLivArea'] > 4000) & (df['SalePrice'] < 300000)].index)
df = df.drop(df[(df['GarageArea'] > 1200) & (df['SalePrice'] < 500000)].index)
df = df.drop(df[(df['TotalBsmtSF'] > 3000) & (df['SalePrice'] < 700000)].index)
df = df.drop(df[(df['1stFlrSF'] > 2700) & (df['SalePrice'] < 700000)].index)
Here I have merged some columns just to reduce complexity.
#feature engineering: combine related columns, then drop the originals
for frame in (df_train, df_test):
    frame['TotalSF'] = frame['TotalBsmtSF'] + frame['1stFlrSF'] + frame['2ndFlrSF']
    frame['wholeExterior'] = frame['Exterior1st'] + frame['Exterior2nd']
    frame['Bsmt'] = frame['BsmtFinSF1'] + frame['BsmtFinSF2']
    frame['TotalBathroom'] = frame['FullBath'] + frame['HalfBath']

df_train = df_train.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF',
                                  'Exterior1st', 'Exterior2nd',
                                  'BsmtFinSF1', 'BsmtFinSF2',
                                  'FullBath', 'HalfBath'])
df_test = df_test.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF',
                                'Exterior1st', 'Exterior2nd',
                                'BsmtFinSF1', 'BsmtFinSF2',
                                'FullBath', 'HalfBath'])
Now it's time to apply the get_dummies function; if you want to use a one-hot encoder instead, that will also work (see the sketch after the output below).
#encoded
df_main = pd.get_dummies(df)
df_main.shape
OUTPUT:-
(2919, 339)
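As mentioned above, scikit-learn's OneHotEncoder is an alternative to pd.get_dummies; a minimal sketch (assuming you only want to encode the object-typed columns of df) could look like this:

from sklearn.preprocessing import OneHotEncoder

# alternative to pd.get_dummies: encode only the object-typed (categorical) columns
obj_cols = df.select_dtypes(include='object').columns
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[obj_cols])   # sparse matrix of 0/1 indicator columns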
Now we have the test and train data in one data frame, but we have to predict the sale price for the test dataset, so here I split the data back into two parts:
test and train, where the test shape is (1459, 339) and the train shape is (1456, 339) because we dropped the outliers. Then we build X_train and y_train and scale the features.
df_test = df_main.loc['test']
df_train = df_main.loc['train']
df_test = df_test.drop(['SalePrice', 'Id'], axis=1)
X_train = df_train.drop(['SalePrice', 'Id'], axis=1)
y_train = df_train['SalePrice']

# scale data before regression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
df_test = scaler.transform(df_test)  # reuse the scaler fitted on the training data
Now we will fit six models: an XGBoost regressor, GradientBoostingRegressor, a random forest regressor, a LightGBM regressor, a support vector regressor, and a stacked regressor.
#xgboost regressor
xgb_reg = xgboost.XGBRegressor(learning_rate=0.05, colsample_bytree=0.5, subsample=0.8,
                               n_estimators=1000, max_depth=5, gamma=5)
xgb_reg.fit(X_train, y_train)
y_pred1 = xgb_reg.predict(df_test)

#GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4,
                                max_features='sqrt', min_samples_leaf=15,
                                min_samples_split=10, loss='huber', random_state=42)
gbr.fit(X_train, y_train)
y_pred2 = gbr.predict(df_test)

#Random forest regressor
rf = RandomForestRegressor(n_estimators=500, max_depth=2, criterion='mse',
                           max_features='sqrt', bootstrap=False, n_jobs=-1,
                           random_state=0, min_samples_leaf=200)
rf.fit(X_train, y_train)
y_pred3 = rf.predict(df_test)

#lightgbm regressor
lightgbm = LGBMRegressor(objective='regression', num_leaves=4, learning_rate=0.01,
                         n_estimators=5000, max_bin=200, bagging_fraction=0.75,
                         bagging_freq=5, bagging_seed=7, feature_fraction=0.2,
                         feature_fraction_seed=7, verbose=-1)
lightgbm.fit(X_train, y_train)
y_pred4 = lightgbm.predict(df_test)

#support vector regressor
svr_reg = SVR(kernel='rbf')
svr_reg.fit(X_train, y_train)
y_pred5 = svr_reg.predict(df_test)

#StackingCVRegressor
stack_gen = StackingCVRegressor(regressors=(rf, gbr, xgb_reg, lightgbm, svr_reg),
                                meta_regressor=xgb_reg,
                                use_features_in_secondary=True)
stack_gen.fit(X_train, y_train)
y_pred6 = stack_gen.predict(df_test)
Checking the Root Mean Square Error:- RMSE is a quadratic scoring rule that measures the average magnitude of the error. It's the square root of the average of the squared differences between the predictions and the actual observations.
#rmse
y_test = y_train.drop([10], axis=0)
from math import sqrt
print('xgb rmse:', sqrt(mean_squared_error(y_test, y_pred1)))
print('gbr rmse:', sqrt(mean_squared_error(y_test, y_pred2)))
print('rf rmse:', sqrt(mean_squared_error(y_test, y_pred3)))
print('lightgbm rmse:', sqrt(mean_squared_error(y_test, y_pred4)))
print('svr rmse:', sqrt(mean_squared_error(y_test, y_pred5)))
print('stacked rmse:', sqrt(mean_squared_error(y_test, y_pred6)))
OUTPUT:-
xgb rmse: 0.1223501568206363
gbr rmse: 0.5585375883105338
rf rmse: 0.43600854434323927
lightgbm rmse: 0.5596622356678556
svr rmse: 0.5246953605047906
stacked rmse: 0.5026308085477498
So, as you can see, the XGBoost regressor is the best-fitted model here. These error values may vary from run to run.
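If you want a more stable error estimate computed purely on the labeled training data, cross-validation is a common option; a minimal sketch using scikit-learn, shown only for an XGBoost model and assuming the X_train/y_train arrays from above, might look like this:

import numpy as np
import xgboost
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE on the training data; the scorer returns
# negated MSE by convention, so flip the sign before the square root
model = xgboost.XGBRegressor(learning_rate=0.05, n_estimators=1000, max_depth=5)
scores = cross_val_score(model, X_train, y_train,
                         scoring='neg_mean_squared_error', cv=5)
cv_rmse = np.sqrt(-scores)
print('xgb cross-validated rmse:', cv_rmse.mean())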
It's your turn now: you can do more feature engineering on the dataset and try to find an even more accurate model.
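If you want to turn the predictions into a Kaggle submission, remember that SalePrice was log-transformed earlier, so the predictions have to be converted back with np.exp first. A rough sketch (using the XGBoost predictions and the dataset path from above; the Id/SalePrice layout follows the competition's usual submission format):

# read the test Ids again (they were dropped before scaling) and
# convert the log-scale predictions back to dollar prices
test_ids = pd.read_csv('G:/projects/house price prediction kaggle/dataset/test.csv')['Id']
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': np.exp(y_pred1)})
submission.to_csv('submission.csv', index=False)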
conclusion:- In this tutorial we learned how to do exploratory data analysis, feature engineering, and apply several regression models to predict house prices using Python.
Submitted by Rahul Makwana (rahulmakwana)