Build Regression (Linear, Ridge, Lasso) Models in NumPy Python

Using Python and NumPy, this project introduces Linear, Ridge, and Lasso Regression. We will also see how these models forecast outcomes and quantify the relationships between variables. Whatever your experience with machine learning, the project keeps the concepts simple and easy to follow.

Project Overview

We’ll explore three key regression techniques: Linear Regression, Ridge Regression, and Lasso Regression. These models predict continuous values from input data. Ridge and Lasso are regularized versions of linear regression: they capture the same simple relationships between variables but add a penalty on the coefficients that makes the fit more robust to noise in the data. Using Python and the NumPy library, we’ll go through data preprocessing, model building, model validation, and optimization techniques. By the end of the project, you’ll have a solid grasp of how to use these regression models on real data and enhance your ML projects.
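
To make the link between linear and ridge regression concrete, here is a minimal NumPy sketch on synthetic data. It is an illustration under stated assumptions, not the project's own code: the data, the helper name fit_ridge, and the alpha values are made up, and the intercept is omitted for brevity. Lasso is left out because its L1 penalty has no closed-form solution and needs an iterative solver.

```python
import numpy as np

# Synthetic data (assumption): 100 samples, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)


def fit_ridge(X, y, alpha=0.0):
    """Solve (X^T X + alpha * I) w = X^T y.

    alpha=0 gives ordinary least squares; alpha>0 adds the ridge (L2) penalty.
    Hypothetical helper, no intercept term.
    """
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


w_ols = fit_ridge(X, y, alpha=0.0)
w_ridge = fit_ridge(X, y, alpha=10.0)

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

The ridge weights come out with a smaller overall magnitude than the least-squares weights, which is the shrinkage effect that regularization is meant to provide.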

Prerequisites

You will need a few skills before starting this project. Ideally, you should have:

Basic knowledge of Python for data analysis and manipulation.
Knowledge of libraries such as Pandas and NumPy for data manipulation, and Matplotlib for data visualization.
Understanding of data preprocessing steps such as handling missing values, normalization, and scaling.
Familiarity with exploratory data analysis (EDA) to find patterns and trends in datasets.
Elementary understanding of regression models and how predictive modeling works.
Familiarity with machine learning frameworks such as Scikit-Learn for building, training, and assessing models.

Approach:

In this project, we predict laptop prices using multiple regression models. We first load the dataset and clean it, handling missing values and selecting features. OneHotEncoder encodes the categorical variables, while StandardScaler standardizes the numerical features. The data is then split into training and testing sets. Three models (Linear Regression, Lasso Regression, and Ridge Regression) are trained on the training set and evaluated on the testing set. Each model is evaluated with MAE, MSE, RMSE, and R2. The models’ performance is then compared, and a classification report is generated by converting the predicted prices into binary labels. The results are shown in a comparison table and in bar plots.

Workflow and Methodology

Workflow:

Data Collection: Collect the dataset from a public dataset repository, and load it into a Pandas DataFrame for further analysis.
Data Cleaning: Handle missing values, check that each column has the right data type, and make sure the data is ready for modeling.
Feature Engineering: Transform existing features (encoding categorical variables with OneHotEncoder) so the models perform better.
Data Scaling: Scale the numerical features with StandardScaler so that all of them are on a comparable scale.
Train-Test Split: Split the dataset into training and testing sets to evaluate model performance.
Model Building: Train regression models (Linear, Lasso, Ridge) using the prepared data.
Model Evaluation: Evaluate the models using metrics like MAE, MSE, R2, and RMSE.
Model Comparison: Compare model performance by analyzing evaluation metrics for each model.
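
The model-building and comparison steps above can be sketched as follows. This is a hedged illustration on synthetic data from make_regression rather than the laptop dataset, and the alpha values are placeholder assumptions, not tuned settings from the project.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption; stands in for the prepared laptop data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The three model families used in the project; alpha=1.0 is an illustrative default.
models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),
    "Ridge": Ridge(alpha=1.0),
}

# Fit on the training set and score on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
```

Keeping the models in a dictionary makes it easy to loop over them and collect the same metrics for each, which is what a comparison table needs.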

Methodology:

Data Preprocessing: Categorical features are encoded with OneHotEncoder, and numerical features are scaled with StandardScaler so that the models receive uniformly scaled inputs.
Model Selection: After preprocessing the data, we chose and trained Linear Regression, Lasso Regression, and Ridge Regression models.
Model Evaluation: Use the evaluation function to measure each model’s performance on the test set using MAE, MSE, R2, and RMSE.
Classification Report: Convert the regression output into binary labels and generate a classification report for the resulting binary classification task.
Model Comparison: Compare the models using a comparison table and visualization (like a bar plot) using evaluation metrics.
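
The evaluation and classification-report steps above can be sketched as follows. The price values are made up for illustration, and splitting at the median of the true prices is an assumption: the text does not say which cutoff is used to turn prices into binary labels.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted prices (illustrative values only).
y_true = np.array([500.0, 750.0, 1200.0, 300.0, 980.0, 1500.0])
y_pred = np.array([550.0, 700.0, 1100.0, 400.0, 1000.0, 1400.0])

# Regression metrics named in the methodology.
mae = metrics.mean_absolute_error(y_true, y_pred)
mse = metrics.mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = metrics.r2_score(y_true, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")

# Binary labels: 1 if the price is above the cutoff, else 0.
# The median cutoff is an assumption made for this sketch.
threshold = np.median(y_true)
y_true_bin = (y_true > threshold).astype(int)
y_pred_bin = (y_pred > threshold).astype(int)
print(metrics.classification_report(y_true_bin, y_pred_bin))
```

The same cutoff is applied to both the true and the predicted prices, so the report measures how often a model puts a laptop on the correct side of the cutoff.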

Data Collection and Preparation

Data Collection:
In this project, we use a dataset collected from a public repository. If you want to work on a real-world problem, you can find similar datasets in public repositories such as Kaggle or the UCI Machine Learning Repository, or use company-specific data. We provide the dataset with this project so that you can follow along with the same data.

Data Preparation:
The dataset is loaded into a Pandas DataFrame for easy preparation and analysis. We identify missing values and remove the rows in which every value is missing. Only the relevant features are kept for the regression models, so that unnecessary or redundant ones are left out. OneHotEncoder encodes categorical variables into a numeric format that the machine learning models can work with. We then standardize the numerical features with StandardScaler so that all of them are on a similar scale. Finally, train_test_split() splits the data into training and testing sets so that the models can be evaluated on unseen data.
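
A minimal sketch of these preparation steps on a toy DataFrame is shown below. The Ram, OpSys, and Price column names come from the plotting code in this project; the Weight column and all the values are hypothetical. The sketch also splits the data before fitting the encoder and scaler, which is a design choice of this example (it keeps test-set statistics out of the preprocessing) and may differ from the order used in the project notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumption): "Weight" and every value here are made up.
df = pd.DataFrame({
    "Ram": [4, 8, 16, 8, 4, 16, 8, 32],
    "Weight": [1.2, 1.5, 2.0, 1.8, 1.1, 2.2, 1.4, 2.5],
    "OpSys": ["Windows", "Linux", "macOS", "Windows",
              "Linux", "Windows", "macOS", "Windows"],
    "Price": [400, 650, 1500, 700, 350, 1300, 1200, 2100],
})

X = df.drop(columns="Price")
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale numeric columns and one-hot encode the categorical one.
# handle_unknown="ignore" avoids errors for categories seen only in the test set;
# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Ram", "Weight"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["OpSys"]),
    ],
    sparse_threshold=0,
)

X_train_prepared = preprocessor.fit_transform(X_train)  # learn statistics on train only
X_test_prepared = preprocessor.transform(X_test)        # reuse them on test

print(X_train_prepared.shape, X_test_prepared.shape)
print(np.round(X_train_prepared[:, :2].mean(axis=0), 6))  # scaled columns have mean ~0
```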

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Importing Library

This segment of code imports the requisite libraries for data handling, model building, and graphics rendering. The data operations are carried out with the help of NumPy and Pandas whereas the plotting is done by Seaborn and Matplotlib.

The code also imports the scikit-learn regression models LinearRegression, Ridge, and Lasso, along with their cross-validated variants LassoCV and RidgeCV. OneHotEncoder, StandardScaler, and ColumnTransformer handle data preprocessing, train_test_split divides the data into training and testing sets, and the metrics module provides the functions used to evaluate model performance.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, LassoCV, LinearRegression, Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Loading Data and Checking Dimensions:

This code loads the CSV file and then prints the dataset’s shape to check the number of rows and columns. The %time magic command times the statement that follows it on the same line, so it is placed in front of the read_csv call to record how long loading takes.

# %time must be on the same line as the statement it measures
%time Aionlinecourse = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_2/laptop_eda.csv")
print(Aionlinecourse.shape)

Previewing Data

This block of code displays the first few rows of the dataset to have a quick overview of the structure of the dataset.

Aionlinecourse.head()

This block of code displays the last few rows of the dataset.

Aionlinecourse.tail()

Summary Statistics

The code produces a transposed table containing summary statistics, including mean, standard deviation, minimum, and maximum values, for every numerical column of the DataFrame.

Aionlinecourse.describe().T

This line of code counts the null values in each feature.

Aionlinecourse.isnull().sum()

Eliminating Missing Values

The code removes the rows where all the values are missing and modifies the DataFrame in place. It then prints the new shape to check the changes.

Aionlinecourse.dropna(axis=0, how='all', inplace=True)  # Drop only rows in which every value is missing
print(Aionlinecourse.shape)
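
To see what how='all' does and how it differs from a stricter rule, here is a small sketch on a toy DataFrame. The values are made up, and the thresh=2 variant is shown only for comparison; it is not used in this project.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): row 1 is completely empty, row 2 has a single value.
toy = pd.DataFrame({
    "Ram": [8, np.nan, np.nan, 16],
    "OpSys": ["Windows", None, "Linux", "macOS"],
    "Price": [700.0, np.nan, np.nan, 1500.0],
})

# Share of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# how='all' removes a row only when every value in it is missing.
only_empty_dropped = toy.dropna(axis=0, how="all")

# thresh=2 keeps a row only if it has at least 2 non-missing values.
mostly_empty_dropped = toy.dropna(axis=0, thresh=2)

print(len(toy), len(only_empty_dropped), len(mostly_empty_dropped))
```

With how='all', the row that still has one value is kept, so any remaining gaps in partially filled rows need to be handled separately.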

STEP 2:

Data Visualization

The code constructs a 2x2 grid of plots to visualize various attributes of the dataset. In detail, it comprises:

A histogram exhibiting the distribution of laptop prices with an overlaid kernel density estimation plot.
A scatter plot that depicts the relationship between the RAM of laptops and their price.
A box plot illustrating the price distribution of laptops based on the operating system.
A heatmap of a correlation matrix that showcases the relationship amongst the numeric features.

# Set up the figure with 2 rows and 2 columns
fig, axs = plt.subplots(2, 2, figsize=(16, 12))
# Histogram of a numerical column
sns.histplot(Aionlinecourse['Price'], bins=20, kde=True, ax=axs[0, 0])
axs[0, 0].set_title('Distribution of Laptop Prices')
axs[0, 0].set_xlabel('Price')
axs[0, 0].set_ylabel('Frequency')
# Scatter plot between two numerical columns
sns.scatterplot(x='Ram', y='Price', data=Aionlinecourse, ax=axs[0, 1])
axs[0, 1].set_title('Relationship between RAM and Price')
axs[0, 1].set_xlabel('RAM')
axs[0, 1].set_ylabel('Price')
# Box plot to compare a numerical column across different categories
sns.boxplot(x='OpSys', y='Price', data=Aionlinecourse,ax=axs[1, 0])
axs[1, 0].set_title('Laptop Prices by Operating System')
axs[1, 0].set_xlabel('Operating System')
axs[1, 0].set_ylabel('Price')
# Correlation matrix heatmap
numerical_features = Aionlinecourse.select_dtypes(include=np.number).columns
correlation_matrix = Aionlinecourse[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",ax=axs[1,1])
axs[1, 1].set_title('Correlation Matrix')
plt.tight_layout()
plt.show()

Detecting Features with High Correlations

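The code identifies pairs of features whose absolute correlation exceeds a specified threshold (0.7). Only the lower triangle of the correlation matrix is scanned, so each pair is reported once and the diagonal is skipped. It then prints the highly correlated pairs.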

# Find the highly correlated features.
# Assumes 'correlation_matrix' has already been calculated as in the code above.
# Collect feature pairs whose absolute correlation exceeds a threshold (e.g., 0.7).
threshold = 0.7
highly_correlated_features = set()
for i in range(len(correlation_matrix.columns)):
  for j in range(i):
    if abs(correlation_matrix.iloc[i, j]) > threshold:
      colname_i = correlation_matrix.columns[i]
      colname_j = correlation_matrix.columns[j]
      highly_correlated_features.add((colname_i, colname_j))
print("Highly correlated features:")
for feature_pair in highly_correlated_features:
  print(feature_pair)
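
The same search can be written without explicit nested loops. The sketch below is an alternative formulation on synthetic data, not the project's code: the column names a, b, and c and their values are made up so that only one pair is strongly correlated.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
threshold = 0.7

# Boolean matrix of strong correlations, restricted to the strict lower
# triangle (k=-1) so the diagonal and mirrored duplicates are excluded.
strong = np.tril(corr.abs().to_numpy() > threshold, k=-1)
rows, cols = np.where(strong)

pairs = [
    (corr.index[i], corr.columns[j], corr.iloc[i, j])
    for i, j in zip(rows, cols)
]

print("Highly correlated features:")
for first, second, value in pairs:
    print(first, second, round(value, 3))
```

Returning the correlation value along with each pair makes it easier to decide which feature of a pair to drop.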