Build Regression Models in Python for House Price Prediction

Ever wondered how experts predict house prices? This project dives into exactly that! Using Python, we'll build regression models that predict house prices based on factors like location, size, and more. Whether you're into real estate or data science, this project is a fun, hands-on way to explore predictive modeling.

Project Overview

For this project, we will use Linear Regression to predict house prices. First, we load and explore the dataset, then check for missing values and handle outliers. The aim is a model that predicts price from features such as area, bedrooms, and bathrooms.

We split the data into training and test sets first, then apply Min-Max Scaling so that all features share the same range. After that, we use Recursive Feature Elimination (RFE) to keep the features that matter most to the model.

We build the Linear Regression model with the statsmodels library. The key step here is adding a constant (intercept) to the feature set, so that the model can estimate a baseline price even when every other feature is zero.
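
To see what the constant does, here is a minimal sketch that uses plain NumPy rather than statsmodels. The toy numbers and variable names are made up for illustration. Adding a column of ones to the feature matrix lets least squares estimate a baseline price alongside the slope; statsmodels' sm.add_constant adds the same kind of column.

```python
import numpy as np

# Toy data (made up for illustration): price = 50 + 2 * area, no noise
area = np.array([10.0, 20.0, 30.0, 40.0])
price = 50.0 + 2.0 * area

# Without a constant, the fitted line is forced through the origin
X_no_const = area.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, price, rcond=None)

# With a column of ones, the first coefficient is the intercept
X_const = np.column_stack([np.ones_like(area), area])
coef, *_ = np.linalg.lstsq(X_const, price, rcond=None)
intercept, slope = coef

print("slope without constant:", slope_only[0])
print("intercept and slope with constant:", intercept, slope)
```

Without the ones column, the slope has to absorb the baseline price, so it comes out too steep.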

Finally, we evaluate the model's performance using R-squared and Mean Squared Error (MSE) to see how well it predicts house prices. This is a fun first project for practicing data preprocessing, feature selection, and building regression models.

Prerequisites

You should have a few skills in place before starting this project. Ideally, you know:

Basic Python for data analysis and manipulation
Libraries such as Pandas and NumPy for data manipulation, and Matplotlib for data visualization
Data preprocessing steps such as handling missing values, normalization, and scaling
Exploratory data analysis (EDA) for finding patterns and trends in datasets
Elementary regression concepts, to understand how predictive modeling works
Machine learning frameworks such as Scikit-Learn for building, training, and assessing models

Approach

We follow several key steps to build an accurate house price prediction model. First, we load the dataset to understand its structure and check for missing or invalid data. We then use box plots to find outliers and remove them so that the dataset is fit for training. Next, the data is split into a training set and a test set; the test set is held back to evaluate the model's performance later. To put all variables on an equal footing for the model, we apply Min-Max Scaling. We then apply Recursive Feature Elimination (RFE) to choose the variables that matter most to the model. After that, we fit a Linear Regression model with a constant term so that it includes an intercept. Finally, we assess the model with performance metrics such as R-squared and Mean Squared Error.

Workflow and Methodology

Workflow

Data Collection: Collect the dataset from a public dataset repository and load it into a Pandas DataFrame for further analysis.
Data Cleaning: Handle missing values, remove outliers, and check that each column has the right data type so that the data is ready for modeling.
Train-Test Split: Split the dataset into training and testing sets so that the model can be evaluated on unseen data.
Feature Scaling: Apply Min-Max Scaling to normalize the features.
Feature Selection: Use Recursive Feature Elimination to select the most relevant features for the regression model.
Model Building: Train a Linear regression model using the prepared data.
Model Evaluation: Evaluate the model using MSE and R-squared.
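
As a rough illustration of the two evaluation metrics, the sketch below computes MSE and R-squared by hand with NumPy. The function names and the toy values are hypothetical and are not taken from the project's code.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values (made up for illustration)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

mse_value = mse(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print("MSE:", mse_value)
print("R-squared:", r2_value)
```

A lower MSE means smaller prediction errors, while an R-squared closer to 1 means the model explains more of the variation in price.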

Methodology

The methodology takes a systematic approach to estimating house prices with regression-based predictive modeling. The first step is preparing the data: cleaning it, handling any missing values, and treating outliers. Once the data is prepared, features are scaled with Min-Max Scaling so that no attribute dominates the model merely because of its scale. Feature selection is then carried out using RFE to keep the most pertinent features and avoid burdening the model with uninformative ones. With the relevant features chosen, a Linear Regression model is built with an added constant to account for the intercept. Finally, the fitted model is assessed with R-squared, which measures how well it fits the data, and Mean Squared Error (MSE), which measures the accuracy of its predictions.
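
The feature selection step can be sketched on synthetic data as follows. This is only an illustration: the data, the choice of two features to keep, and the variable names are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print("selected mask:", selector.support_)
print("ranking (1 = kept):", selector.ranking_)
```

Because RFE with Linear Regression ranks features by the size of their coefficients, it works best when the features are already on a comparable scale, which is why scaling comes before selection.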

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.

Data Preparation Workflow:

Load the Data: The first step will be to load the dataset into a Pandas DataFrame which can be utilized for analysis.
Exploratory Data Analysis (EDA): The initial analysis focuses on the data’s structure and its distribution.
Handle Missing Values: Fill in missing values or drop the affected rows so that the dataset is complete.
Remove Outliers: Recognize and exclude any outliers as they may affect the outcome of the results.
Encode Categorical Variables: Implement encoding techniques to change categorical values into numerical form.
Feature Scaling: Perform Min-Max Scaling to bring all features into a common range so that differences in scale do not distort the model.
Feature Selection: Employ RFE to determine the key features to be utilized in the model.
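
To make the scaling step concrete, here is a small sketch of Min-Max Scaling written out with Pandas instead of MinMaxScaler. It assumes the minimum and maximum are learned from the training split only and then reused on the test split, which is common practice but an assumption about this project. The toy values are made up.

```python
import pandas as pd

# Toy training and test splits (made up for illustration)
train = pd.DataFrame({"area": [1000.0, 2000.0, 3000.0],
                      "bedrooms": [1.0, 2.0, 5.0]})
test = pd.DataFrame({"area": [1500.0, 3500.0],
                     "bedrooms": [3.0, 2.0]})

# Learn the range from the training data only
col_min = train.min()
col_max = train.max()

# Min-Max Scaling: (x - min) / (max - min)
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

Training values always land between 0 and 1, while a test value outside the training range (such as an area of 3500 here) falls outside that interval.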

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Importing Libraries

This code imports the libraries needed for data analysis, visualization, feature selection, and building a regression model: Pandas, NumPy, Seaborn, Matplotlib, statsmodels, and Scikit-learn tools.

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

STEP 2:

Loading Data and Checking Dimensions:

This code loads the CSV file and then prints the dataset’s shape to check the number of rows and columns. The %time magic command reports how long the statement on the same line takes to run, so it is placed in front of the read_csv call.

# %time measures the statement that follows it on the same line
%time Aionlinecourse_housing = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_3/Data/Housing.csv")
print(Aionlinecourse_housing.shape)

Previewing Data

This block of code displays the first few rows of the dataset to give a quick overview of its structure.

# Check the head of the dataset
Aionlinecourse_housing.head()

This code prints a summary of the DataFrame Aionlinecourse_housing: the number of records, the column names and data types, the non-null count per column, and the memory usage.

Aionlinecourse_housing.info()

Descriptive Statistics

This code displays summary statistics for the numerical columns of the DataFrame: count, mean, standard deviation, minimum, maximum, and quartiles.

Aionlinecourse_housing.describe()

Checking Null Values

This code computes the percentage of missing (null) values in each column of the DataFrame. Every column reports 0%, so the dataset has no null values.

# Checking Null values
Aionlinecourse_housing.isnull().sum()*100/Aionlinecourse_housing.shape[0]
# There are no NULL values in the dataset, hence it is clean.
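
For comparison, the sketch below runs the same kind of check on a toy DataFrame that does contain a missing value. The data and variable names are made up for illustration. Taking the mean of the boolean null mask gives the same percentage as dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one of the four prices is missing
toy = pd.DataFrame({"price": [100.0, np.nan, 300.0, 400.0],
                    "area": [50, 60, 70, 80]})

# Percentage of nulls per column
null_pct = toy.isnull().mean() * 100
print(null_pct)

# Names of the columns that need attention
cols_with_nulls = null_pct[null_pct > 0].index.tolist()
print(cols_with_nulls)
```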

STEP 3:

Creating Pie Chart

The following code first counts the occurrences of each value in the ‘mainroad’ column, then draws a pie chart showing what proportion of the houses have main road access and what proportion do not, labeled with percentages.

# Count how many houses fall under each value of the 'mainroad' feature
mainroad_counts = Aionlinecourse_housing['mainroad'].value_counts()
# Create the pie chart
plt.figure(figsize=(6, 6))
plt.pie(mainroad_counts, labels=mainroad_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Houses with Main Road Access')
plt.show()

Outlier Analysis

This code draws six box plots to reveal outliers in the important numeric features ‘price’, ‘area’, ‘bedrooms’, ‘bathrooms’, ‘stories’, and ‘parking’.

# Outlier Analysis
fig, axs = plt.subplots(2, 3, figsize=(10, 5))
# Pass each column by keyword (x=) so the call behaves the same across seaborn versions
sns.boxplot(x=Aionlinecourse_housing['price'], ax=axs[0, 0])
sns.boxplot(x=Aionlinecourse_housing['area'], ax=axs[0, 1])
sns.boxplot(x=Aionlinecourse_housing['bedrooms'], ax=axs[0, 2])
sns.boxplot(x=Aionlinecourse_housing['bathrooms'], ax=axs[1, 0])
sns.boxplot(x=Aionlinecourse_housing['stories'], ax=axs[1, 1])
sns.boxplot(x=Aionlinecourse_housing['parking'], ax=axs[1, 2])
plt.tight_layout()

Outlier Handling for Price

The following code draws a box plot for the ‘price’ feature, computes the Interquartile Range (IQR), and removes outliers by keeping only the rows whose price lies between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.

# outlier treatment for price
plt.boxplot(Aionlinecourse_housing.price)
Q1 = Aionlinecourse_housing.price.quantile(0.25)
Q3 = Aionlinecourse_housing.price.quantile(0.75)
IQR = Q3 - Q1
Aionlinecourse_housing = Aionlinecourse_housing[(Aionlinecourse_housing.price >= Q1 - 1.5*IQR) & (Aionlinecourse_housing.price <= Q3 + 1.5*IQR)]

Outlier Handling for Area

The following code draws a box plot for the ‘area’ feature, computes the Interquartile Range (IQR), and removes outliers by keeping only the rows whose area lies between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.

# outlier treatment for area
plt.boxplot(Aionlinecourse_housing.area)
Q1 = Aionlinecourse_housing.area.quantile(0.25)
Q3 = Aionlinecourse_housing.area.quantile(0.75)
IQR = Q3 - Q1
Aionlinecourse_housing = Aionlinecourse_housing[(Aionlinecourse_housing.area >= Q1 - 1.5*IQR) & (Aionlinecourse_housing.area <= Q3 + 1.5*IQR)]
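
Since the same IQR filter is applied to both ‘price’ and ‘area’, it could be wrapped in a small helper. The sketch below is one possible version; the function name remove_iqr_outliers and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value in `column` lies within
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() includes both bounds, matching the >= and <= filter
    return df[df[column].between(lower, upper)]

# Toy data (made up): 100 is far above the other prices
toy = pd.DataFrame({"price": [10, 12, 11, 13, 12, 100]})
cleaned = remove_iqr_outliers(toy, "price")
print(cleaned)
```

With a helper like this, the two treatment cells would reduce to one call per column.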

Outlier Analysis After Treatment

This code redraws the 2x3 grid of box plots to check how the outlier treatment changed the distributions. Only ‘price’ and ‘area’ were filtered, so the other features may still show outliers.

# Outlier Analysis after treatment
fig, axs = plt.subplots(2, 3, figsize=(10, 5))
# Pass each column by keyword (x=) so the call behaves the same across seaborn versions
sns.boxplot(x=Aionlinecourse_housing['price'], ax=axs[0, 0])
sns.boxplot(x=Aionlinecourse_housing['area'], ax=axs[0, 1])
sns.boxplot(x=Aionlinecourse_housing['bedrooms'], ax=axs[0, 2])
sns.boxplot(x=Aionlinecourse_housing['bathrooms'], ax=axs[1, 0])
sns.boxplot(x=Aionlinecourse_housing['stories'], ax=axs[1, 1])
sns.boxplot(x=Aionlinecourse_housing['parking'], ax=axs[1, 2])
plt.tight_layout()

Pairplot Visualization

This code draws a pair plot of the Aionlinecourse_housing dataset. Scatter plots show the relationship between each pair of numeric features, and the histograms on the diagonal show the distribution of each feature.

sns.pairplot(Aionlinecourse_housing)
plt.show()

Features vs Price Boxplot Analysis

The code generates the box plots analyzing the relationship between price and different categorical features such as mainroad, guestroom, basement, hotwaterheating, airconditioning, and furnishingstatus in a 2×3 grid.

plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'mainroad', y = 'price', data = Aionlinecourse_housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'guestroom', y = 'price', data = Aionlinecourse_housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'basement', y = 'price', data = Aionlinecourse_housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'hotwaterheating', y = 'price', data = Aionlinecourse_housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'airconditioning', y = 'price', data = Aionlinecourse_housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'furnishingstatus', y = 'price', data = Aionlinecourse_housing)
plt.show()

Use of Boxplot with Hues

This code draws a box plot of ‘price’ against ‘furnishingstatus’ and uses ‘airconditioning’ as the hue, so that within each furnishing status the prices of houses with and without air conditioning can be compared.

plt.figure(figsize = (10, 5))
sns.boxplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = Aionlinecourse_housing)
plt.show()

STEP 4:

Converting Categorical Variables Into Binary

This code defines a function that converts categorical values to binary values (‘yes’ to 1 and ‘no’ to 0) and applies it to the listed columns of the Aionlinecourse_housing dataset.

# List of variables to map
varlist = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
# Defining the map function
def binary_map(x):
    return x.map({'yes': 1, 'no': 0})
# Applying the function to the listed columns of the housing DataFrame
Aionlinecourse_housing[varlist] = Aionlinecourse_housing[varlist].apply(binary_map)
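
One thing worth checking after a dictionary-based mapping is whether any value was left unmapped, because Series.map returns NaN for values that are not keys of the dictionary. The sketch below shows this on toy data; the data (including the deliberately misspelled ‘Yes’) and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data (made up): 'Yes' with a capital letter is not in the mapping
toy = pd.DataFrame({"mainroad": ["yes", "no", "Yes"],
                    "guestroom": ["no", "no", "yes"]})

# Same mapping as binary_map, applied column by column
mapped = toy.apply(lambda col: col.map({"yes": 1, "no": 0}))

# Count values that did not match 'yes' or 'no'
unmapped = mapped.isnull().sum()
print(mapped)
print(unmapped)
```

If every count is zero, all values in the mapped columns were either ‘yes’ or ‘no’.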

Previewing Data

This block of code displays the first few rows of the dataset to give a quick overview of the changes.

# Check the housing dataframe now
Aionlinecourse_housing.head()