House price prediction system using linear regression in C++


In this project, multivariate linear regression, implemented in C++, is used to predict house prices. The data has 6 independent variables, and the training and testing errors are analyzed.

This is a multivariate regression model in C++. The data was taken from Kaggle, then modified so that the model performs better. There are 6 independent variables. "Under construction" is a binary input indicating whether the house is under construction. "RERA", "Ready to move", and "Resale" are similar binary categorical inputs indicating whether the house is RERA approved, whether the current occupants are ready to move, and whether it is a resale, respectively. "BHK no." is the number of bedrooms and "SQUARE_FT" is the area of the house. The output is the price in lakhs. An extra column of ones was added to the data for the constant term in the hypothesis function. The model is trained with stochastic gradient descent, and regularization was added to reduce the gap between the training and test errors.
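The hypothesis with its column of ones and a single regularized update can be sketched as follows. This is a minimal illustration and not the project's actual code: the names predict, sgdStep, alpha (learning rate), and lambda (regularization strength) are hypothetical, and the choice of an L2 penalty that skips the constant term is an assumption, since the text does not say which form of regularization was used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothesis h(x) = theta . x. By convention x[0] == 1.0 (the added column
// of ones), so theta[0] plays the role of the constant term.
double predict(const std::vector<double>& theta, const std::vector<double>& x) {
    assert(theta.size() == x.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < theta.size(); ++j) {
        sum += theta[j] * x[j];
    }
    return sum;
}

// One stochastic gradient descent step on a single example (x, y).
// Assumption: L2 regularization, with the constant term theta[0] unpenalized.
void sgdStep(std::vector<double>& theta, const std::vector<double>& x, double y,
             double alpha, double lambda) {
    // Compute the prediction error before any coefficient is changed.
    const double error = predict(theta, x) - y;
    for (std::size_t j = 0; j < theta.size(); ++j) {
        const double penalty = (j == 0) ? 0.0 : lambda * theta[j];
        theta[j] -= alpha * (error * x[j] + penalty);
    }
}
```

Calling sgdStep once per training row, in shuffled order, gives one pass of stochastic gradient descent; a larger lambda pulls the non-constant coefficients toward zero.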

A 75%-25% (training-test) split was chosen. A higher training percentage resulted in the test error being lower than the training error, because the test set turned out to be "easier" than the training set. The data were then normalized to improve accuracy and put all the independent variables on the same scale. The 'fit' method (in the code) implements basic multivariate regression along with regularization. The 'theta' array holds the coefficients that are updated to reduce the error, which is measured here as RMSE (root-mean-square error). Errors were recorded on every 10th iteration and displayed on every 10000th iteration. Training continued for as long as the test error kept decreasing.
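The normalization and the RMSE measure described above might look like the sketch below. The text does not say which normalization was applied, so z-score standardization (subtract the mean, divide by the standard deviation) is assumed here; the names ColumnStats, columnStats, standardize, and rmse are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ColumnStats {
    double mean;
    double stddev;
};

// Mean and (population) standard deviation of one feature column.
ColumnStats columnStats(const std::vector<double>& column) {
    assert(!column.empty());
    double mean = 0.0;
    for (double v : column) mean += v;
    mean /= static_cast<double>(column.size());

    double variance = 0.0;
    for (double v : column) variance += (v - mean) * (v - mean);
    variance /= static_cast<double>(column.size());
    return {mean, std::sqrt(variance)};
}

// Rescale a column in place using previously computed statistics.
// A constant column (such as the column of ones) is left unchanged.
void standardize(std::vector<double>& column, const ColumnStats& stats) {
    if (stats.stddev == 0.0) return;
    for (double& v : column) v = (v - stats.mean) / stats.stddev;
}

// Root-mean-square error between predictions and actual values.
double rmse(const std::vector<double>& predicted, const std::vector<double>& actual) {
    assert(predicted.size() == actual.size() && !actual.empty());
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double diff = predicted[i] - actual[i];
        sumSquares += diff * diff;
    }
    return std::sqrt(sumSquares / static_cast<double>(actual.size()));
}
```

A common design choice is to compute the statistics on the training rows only and reuse them for the test rows, so that no information from the test set leaks into training.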

Here is the analysis of test and training errors for learning rates 1 and 0.1, respectively:

In the end, even though the training and test errors converge, both remain quite high. This is due to the unevenness of the data: looking at the range of the output, a few cases have very high values, which cause high errors in both training and testing.

If we ignore the latter part of the data (which has this unevenness), the errors improve significantly.
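To see why a few very high prices dominate the error, the sketch below computes RMSE only over rows whose actual price is at or below a cap. It is an illustration under assumptions: the function name rmseBelowCap and the idea of filtering by a price cap are hypothetical, and the original analysis may instead have dropped a range of rows.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// RMSE restricted to rows whose actual price is <= cap.
// Returns 0.0 if no row passes the filter.
double rmseBelowCap(const std::vector<double>& predicted,
                    const std::vector<double>& actual, double cap) {
    assert(predicted.size() == actual.size());
    double sumSquares = 0.0;
    std::size_t kept = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (actual[i] > cap) continue;  // skip the extreme-price rows
        const double diff = predicted[i] - actual[i];
        sumSquares += diff * diff;
        ++kept;
    }
    if (kept == 0) return 0.0;
    return std::sqrt(sumSquares / static_cast<double>(kept));
}
```

With made-up predictions {10, 20, 100} against actual prices {12, 18, 600}, the single 600 row contributes 250000 of the 250008 total squared error, so excluding it changes the result by orders of magnitude. Because each difference is squared, one extreme row can outweigh all the others combined.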

Coders Packet
