In this tutorial, we will learn how to do exploratory data analysis and feature engineering, and apply several regression models to house prices using Python.
In this project, I have applied some supervised-learning regression methods in Python to predict house prices.
We will be using the PyCharm IDE to solve this problem; if you don't have that IDE on your system, you can download it from the official website (make sure you download the Community edition). We will be using the Python language throughout.
What is Supervised Learning?
Supervised learning is when a model is trained on a labeled dataset, i.e., one that has both input and output parameters (for example, our output column is called 'SalePrice'). So we have to apply supervised machine learning algorithms.
It is further classified into two types: classification and regression.
Classification is used to categorize a set of data into classes, for example dog vs. cat, 'red' vs. 'blue', or 'disease' vs. 'no disease'.
Regression is used to predict continuous numeric values based on the independent variables. The model's evaluation is done by calculating the error value: the smaller the error, the greater the accuracy of our regression model.
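To make that concrete, here is a minimal, self-contained sketch (on toy data, not the house-price dataset) of fitting a regression model and measuring its error:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# toy example: predict y from a single feature x
X = [[1], [2], [3], [4], [5]]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# the smaller this error, the better the model fits the data
print(mean_squared_error(y, predictions))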
Example of Supervised Learning Algorithms:
1) Linear Regression
2) Nearest Neighbor
3) Gaussian Naive Bayes
4) Decision Trees
5) Support Vector Machine (SVM)
6) Random Forest
7) XGBoost
8) AdaBoost
So in this problem, we are going to predict continuous values (sale prices). Hence, we will apply regression algorithms.
We are going to need some packages and libraries:
1) NumPy - for linear algebraic operations.
2) Scikit-learn - includes many statistical models.
3) Pandas - to load the dataset.
4) Matplotlib and Seaborn - for different kinds of plotting.
I have taken a very basic approach and I hope you find it useful.
My main objectives in this project are exploratory data analysis, feature engineering, and comparing several regression models on the house-price data.
dataset link:- https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import xgboost
from sklearn.metrics import mean_squared_error
from mlxtend.regressor import StackingCVRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
df_train = pd.read_csv('G:/projects/house price prediction kaggle/dataset/train.csv')
df_test = pd.read_csv('G:/projects/house price prediction kaggle/dataset/test.csv')
Now I will explain the dataset: how many columns there are and what each column stands for.
Here's a brief version of the data.
The training data frame has 1460 observations of 80 variables (plus an Id column), and the test set has 1459 rows and 79 feature columns (plus Id).
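If you want to verify those shapes yourself after loading the data above, a quick check is:

print(df_train.shape)  # expected (1460, 81): Id + 79 features + SalePrice
print(df_test.shape)   # expected (1459, 80): Id + 79 features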
correlation matrix:-
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data or as an input to a more advanced analysis.
#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=.8, square=True)
The code below plots the correlation matrix of the 10 variables most correlated with SalePrice.
#saleprice correlation matrix
k = 10  # number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10, 10))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
OUTPUT:- a 10x10 annotated heatmap showing the correlations among SalePrice, OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, and YearBuilt.
So, the above 10 columns are the ones most highly correlated with the dependent variable.
#values of correlation
abs(df_train.corr()['SalePrice']).nlargest(10)
OUTPUT:-
SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
Name: SalePrice, dtype: float64
Now we will look at the distributions of SalePrice, GrLivArea, and OverallQual using sns.distplot, before and after a log transformation.
#before log transformation
#you can see the distribution by plotting it
sns.distplot(df_train['SalePrice'])
fig_saleprice = plt.figure(figsize=(12, 5))
result1 = stats.probplot(df_train['SalePrice'], plot=plt)

sns.distplot(df_train['GrLivArea'])
fig_GrLivArea = plt.figure(figsize=(12, 5))
result2 = stats.probplot(df_train['GrLivArea'], plot=plt)

sns.distplot(df_train['OverallQual'])
fig_OverallQual = plt.figure(figsize=(12, 5))
result3 = stats.probplot(df_train['OverallQual'], plot=plt)

#then apply this and do the plotting again; you will see the difference
df_train['SalePrice'] = np.log(df_train['SalePrice'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
df_train['OverallQual'] = np.log(df_train['OverallQual'])
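The plots show the shape of each distribution visually; if you also want a quick numeric check, pandas' skew() gives one number per column, and it should move much closer to zero after the np.log transform (a small optional check, run before and after the transform above):

# skewness close to 0 means a roughly symmetric distribution;
# compare this value before and after applying np.log
print(df_train['SalePrice'].skew())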
We're going to merge the two datasets here, before we start editing them, so we don't have to do these operations twice; let's call the combined frame df. So our data has 2919 observations and 79 features to begin with.
frames = [df_train, df_test]
df = pd.concat(frames, keys=['train', 'test'])
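Because we passed keys to pd.concat, the combined frame gets an outer index level, so either part can be pulled back out at any time; a quick sanity check might look like this:

# the keys argument adds a 'train'/'test' level to the index
print(df.loc['train'].shape)  # rows that came from train.csv
print(df.loc['test'].shape)   # rows that came from test.csv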
#sum of missing data
df.isnull().sum().sort_values(ascending=False)
OUTPUT:-
SalePrice: 1459
MSZoning: 4
LotFrontage: 486
Alley: 2721
Utilities: 2
Exterior1st: 1
Exterior2nd: 1
MasVnrType: 24
MasVnrArea: 23
BsmtQual: 81
BsmtCond: 82
BsmtExposure: 82
BsmtFinType1: 79
BsmtFinSF1: 79
BsmtFinType2: 80
BsmtFinSF2: 1
BsmtUnfSF: 1
TotalBsmtSF: 1
Electrical: 1
BsmtFullBath: 2
BsmtHalfBath: 2
KitchenQual: 1
Functional: 2
FireplaceQu: 1420
GarageType: 157
GarageYrBlt: 159
GarageFinish: 159
GarageCars: 1
GarageArea: 1
GarageQual: 159
GarageCond: 159
PoolQC: 2909
Fence: 2348
MiscFeature: 2814
SaleType: 1
Length: 36, dtype: int64
Now I am going to fill in the missing values: in numerical columns I will replace NaN values with 0, and in categorical columns with "None".
# handling missing values: numerical columns get 0, categorical columns get 'None'
num_cols = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
            'GarageArea', 'BsmtFinSF2', 'TotalBsmtSF', 'GarageCars', 'BsmtUnfSF', 'BsmtFinSF1']
for col in num_cols:
    df[col] = df[col].fillna(value=0)

cat_cols = ['MSZoning', 'GarageQual', 'GarageCond', 'GarageFinish', 'GarageType',
            'BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1',
            'MasVnrType', 'Utilities', 'Functional', 'Exterior1st', 'Exterior2nd',
            'Electrical', 'KitchenQual', 'SaleType']
for col in cat_cols:
    df[col] = df[col].fillna(value='None')
The columns with the most missing values:
PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
These five features have almost 90% missing values, so we will drop them.
df = df.drop(columns={'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'})
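At this point it's worth confirming that the only missing values left are the SalePrice entries for the test rows; a quick check along these lines should show that:

# after imputation and dropping the sparse columns, only SalePrice
# (missing by definition for the test rows) should remain
remaining = df.isnull().sum()
print(remaining[remaining > 0])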
outlier:- An outlier is an object that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors. Most data mining methods discard outliers as noise or exceptions; however, in some applications such as fraud detection, the rare events can be more interesting than the regularly occurring ones, so outlier analysis becomes important in such cases.
#now we are going to detect outliers in the highly correlated columns
cols_to_check = ['GrLivArea', 'OverallQual', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
                 '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
for col in cols_to_check:
    plt.subplots()
    plt.scatter(x=df_train[col], y=df_train['SalePrice'])
    plt.ylabel('SalePrice', fontsize=13)
    plt.xlabel(col, fontsize=13)
    plt.show()
The above code plots each of the highly correlated columns against SalePrice, and from those scatter plots you get an idea of which points are outliers. The code below is used to delete the outliers.
df = df.drop(df[(df['GrLivArea'] > 4000) & (df['SalePrice'] < 300000)].index)
df = df.drop(df[(df['GarageArea'] > 1200) & (df['SalePrice'] < 500000)].index)
df = df.drop(df[(df['TotalBsmtSF'] > 3000) & (df['SalePrice'] < 700000)].index)
df = df.drop(df[(df['1stFlrSF'] > 2700) & (df['SalePrice'] < 700000)].index)
Here I have merged some columns just to reduce complexity.
#feature engineering: combine related columns, then drop the originals
for frame in (df_train, df_test):
    frame['TotalSF'] = frame['TotalBsmtSF'] + frame['1stFlrSF'] + frame['2ndFlrSF']
    frame['wholeExterior'] = frame['Exterior1st'] + frame['Exterior2nd']
    frame['Bsmt'] = frame['BsmtFinSF1'] + frame['BsmtFinSF2']
    frame['TotalBathroom'] = frame['FullBath'] + frame['HalfBath']

df_train = df_train.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF',
                                  'Exterior1st', 'Exterior2nd',
                                  'BsmtFinSF1', 'BsmtFinSF2',
                                  'FullBath', 'HalfBath'])
df_test = df_test.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF',
                                'Exterior1st', 'Exterior2nd',
                                'BsmtFinSF1', 'BsmtFinSF2',
                                'FullBath', 'HalfBath'])
Now it's time to apply the get_dummies function; if you want to use a one-hot encoder instead, that will also work (see the sketch after the output below).
#encoded
df_main = pd.get_dummies(df)
df_main.shape
OUTPUT:-
(2919, 339)
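As mentioned above, scikit-learn's OneHotEncoder is an alternative to pd.get_dummies; a minimal sketch (assuming you only want to encode the object-typed columns of df) could look like this:

from sklearn.preprocessing import OneHotEncoder

# alternative to pd.get_dummies: encode only the object-typed (categorical) columns
obj_cols = df.select_dtypes(include='object').columns
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[obj_cols])   # sparse matrix of 0/1 indicator columns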
Now we have the test and train data in one data frame, but we have to predict the sale price for the test dataset, so here I split the data back into two parts:
test and train, where the test shape is (1459, 339) and the train shape is (1456, 339) because we dropped the outliers. Then we build X_train and y_train and scale the features.
df_test = df_main.loc['test']
df_train = df_main.loc['train']
df_test = df_test.drop(['SalePrice', 'Id'], axis=1)
X_train = df_train.drop(['SalePrice', 'Id'], axis=1)
y_train = df_train['SalePrice']

# scale data before regression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
df_test = scaler.transform(df_test)  # reuse the scaler fitted on the training data
Now we will fit six models: an XGBoost regressor, GradientBoostingRegressor, a random forest regressor, a LightGBM regressor, a support vector regressor, and a stacked regressor.
#xgboost regressor
xgb_reg = xgboost.XGBRegressor(learning_rate=0.05, colsample_bytree=0.5, subsample=0.8,
                               n_estimators=1000, max_depth=5, gamma=5)
xgb_reg.fit(X_train, y_train)
y_pred1 = xgb_reg.predict(df_test)

#GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4,
                                max_features='sqrt', min_samples_leaf=15,
                                min_samples_split=10, loss='huber', random_state=42)
gbr.fit(X_train, y_train)
y_pred2 = gbr.predict(df_test)

#Random forest regressor
rf = RandomForestRegressor(n_estimators=500, max_depth=2, criterion='mse',
                           max_features='sqrt', bootstrap=False, n_jobs=-1,
                           random_state=0, min_samples_leaf=200)
rf.fit(X_train, y_train)
y_pred3 = rf.predict(df_test)

#lightgbm regressor
lightgbm = LGBMRegressor(objective='regression', num_leaves=4, learning_rate=0.01,
                         n_estimators=5000, max_bin=200, bagging_fraction=0.75,
                         bagging_freq=5, bagging_seed=7, feature_fraction=0.2,
                         feature_fraction_seed=7, verbose=-1)
lightgbm.fit(X_train, y_train)
y_pred4 = lightgbm.predict(df_test)

#support vector regressor
svr_reg = SVR(kernel='rbf')
svr_reg.fit(X_train, y_train)
y_pred5 = svr_reg.predict(df_test)

#StackingCVRegressor
stack_gen = StackingCVRegressor(regressors=(rf, gbr, xgb_reg, lightgbm, svr_reg),
                                meta_regressor=xgb_reg,
                                use_features_in_secondary=True)
stack_gen.fit(X_train, y_train)
y_pred6 = stack_gen.predict(df_test)
Checking the Root Mean Square Error:- RMSE is a quadratic scoring rule that measures the average magnitude of the error. It's the square root of the average of the squared differences between the predictions and the actual observations.
#rmse
y_test = y_train.drop([10], axis=0)
from math import sqrt
print('xgb rmse:', sqrt(mean_squared_error(y_test, y_pred1)))
print('gbr rmse:', sqrt(mean_squared_error(y_test, y_pred2)))
print('rf rmse:', sqrt(mean_squared_error(y_test, y_pred3)))
print('lightgbm rmse:', sqrt(mean_squared_error(y_test, y_pred4)))
print('svr rmse:', sqrt(mean_squared_error(y_test, y_pred5)))
print('stacked rmse:', sqrt(mean_squared_error(y_test, y_pred6)))
OUTPUT:-
xgb rmse: 0.1223501568206363
gbr rmse: 0.5585375883105338
rf rmse: 0.43600854434323927
lightgbm rmse: 0.5596622356678556
svr rmse: 0.5246953605047906
stacked rmse: 0.5026308085477498
So, as you can see, the XGBoost regressor is the best-fitted model here. These error values may vary from run to run.
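If you want a more stable error estimate computed purely on the labeled training data, cross-validation is a common option; a minimal sketch using scikit-learn, shown only for an XGBoost model and assuming the X_train/y_train arrays from above, might look like this:

import numpy as np
import xgboost
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE on the training data; the scorer returns
# negated MSE by convention, so flip the sign before the square root
model = xgboost.XGBRegressor(learning_rate=0.05, n_estimators=1000, max_depth=5)
scores = cross_val_score(model, X_train, y_train,
                         scoring='neg_mean_squared_error', cv=5)
cv_rmse = np.sqrt(-scores)
print('xgb cross-validated rmse:', cv_rmse.mean())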
It's your turn now: you can do more feature engineering on the dataset and try to find an even more accurate model.
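If you want to turn the predictions into a Kaggle submission, remember that SalePrice was log-transformed earlier, so the predictions have to be converted back with np.exp first. A rough sketch (using the XGBoost predictions and the dataset path from above; the Id/SalePrice layout follows the competition's usual submission format):

# read the test Ids again (they were dropped before scaling) and
# convert the log-scale predictions back to dollar prices
test_ids = pd.read_csv('G:/projects/house price prediction kaggle/dataset/test.csv')['Id']
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': np.exp(y_pred1)})
submission.to_csv('submission.csv', index=False)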
conclusion:- In this tutorial we learned how to do exploratory data analysis, feature engineering, and apply several regression models to predict house prices using Python.
Submitted by Rahul Makwana (rahulmakwana)