
Learn to Build a Polynomial Regression Model from Scratch

Ready to explore polynomial regression? Imagine digging through data and uncovering a hidden structure, a pattern that isn't a simple straight line. Polynomial regression extends the familiar linear model so it can capture complex curves and trends, improving prediction accuracy and making the model useful for real-world tasks.

Project Overview

In this project, you will learn how to build a polynomial regression model from scratch and see how a basic linear model can be extended to your advantage. First, we collect and clean the data. Then we discuss how polynomial regression differs from simple linear regression. Finally, we code the model step by step, explaining each part in detail. By the end of this project, you will have a working model that handles non-linear trends well. Ready to start? Let's dive in!

Prerequisites

  • Basic Python Skills: Be able to write loops, functions, and variables.
  • Understanding of Linear Regression: Familiarity with fitting a straight line to data.
  • Basic Math Knowledge: Comfort with simple algebra, powers, and exponents.
  • Libraries (NumPy and Matplotlib): Familiarity with numerical calculations, data manipulation, and data visualization.

With these basics, you’re ready to start the project!

Approach

Our approach starts by setting up the dataset, which we then use to train and test our model. We begin with a quick review of linear regression to see where polynomial regression adds value. Next, we transform the data by adding polynomial features, which lets the model capture curves instead of just straight lines. Once the data is prepared, we code the polynomial regression model step by step, using Python and its NumPy library for the numerical calculations and transformations. After training, we plot the results with Matplotlib to assess how well the model fits the data.

In the end, we compare the two models and run a more detailed analysis to show the clear benefit of polynomial regression in practical applications that deal with real datasets.

Workflow

  1. Data Collection: Gather a dataset that shows a non-linear relationship between the features and the target.
  2. Data Preprocessing: Handle missing values, scale the features, and split the data into training and validation sets.
  3. Feature Engineering: Add polynomial features to the dataset, which allows the model to fit curves.
  4. Model Building: Implement polynomial regression from scratch using Python and NumPy to handle the math.
  5. Training the Model: Train the model on the training dataset so it can learn the underlying patterns.
  6. Evaluation: Test the model and assess its performance.
  7. Comparison with Linear Regression: Compare results with a basic linear regression model to highlight the power of polynomial regression.

Methodology

To construct a polynomial regression model, we start by expanding our feature set with polynomial terms of the original features, such as their squares and cubes. This lets the model fit higher-order trends. We then train the model with least squares estimation, which chooses the coefficients that minimize the squared difference between the predictions and the actual values. By walking through each step, from feature engineering to training and evaluation, we learn not only how the model works but also why it suits non-linear data. Visualizing and comparing the results of polynomial regression against a straight-line fit makes the methodology practical and useful.
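To make this concrete, here is a minimal from-scratch sketch of the idea on simple synthetic one-dimensional data; the helper names polynomial_features and fit_least_squares and the degree of 2 are illustrative choices, not the project's final code.

import numpy as np

def polynomial_features(x, degree):
    # Build the design matrix [1, x, x^2, ..., x^degree] for a 1-D input
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(X, y):
    # Choose the coefficients w that minimize the squared error ||X @ w - y||^2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Illustrative data with a curved (quadratic) trend plus noise
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 50)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.5, size=x.shape)

X = polynomial_features(x, degree=2)   # columns: 1, x, x^2
w = fit_least_squares(X, y)            # learned coefficients
y_pred = X @ w                         # fitted values
print("Learned coefficients:", w)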

Data Collection

First, we collected a public dataset that shows a non-linear pattern, where values fluctuate in complex ways; this lets us see the benefits of polynomial regression clearly. You can also use real-world data such as housing prices or stock trends.

If you're a beginner, try creating artificial data with a curved pattern. That way you stay in control of the data, which makes it easier to tell whether the model is behaving as expected. A short sketch of this approach follows.
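For example, here is a minimal sketch of how such synthetic curved data could be generated and plotted; the quadratic shape, noise level, and sample size are arbitrary choices for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
# A curved (quadratic) trend with some random noise added
y = 3 + 2 * x - 0.4 * x**2 + rng.normal(scale=2.0, size=x.shape)

plt.scatter(x, y, s=15)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic data with a curved pattern")
plt.show()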

Data Preparation

Once the dataset has been collected, it is time to prepare it. We start by handling missing values so they do not distort the dataset. Next, we normalize the features so that they all fall in the same range, which is especially important in polynomial regression because higher powers magnify differences in scale. Finally, we transform the original features, squaring, cubing, or raising them to higher powers, to create new features. This transformation lets the model fit curves rather than just straight lines, so that more complex relationships can be captured. A short code sketch of these steps follows the workflow checklist below.

Data Preparation Workflow

  • Handle Missing Values: Impute missing values so that incomplete records do not shrink the dataset used for training.
  • Feature Scaling: Bring all features onto the same scale so the model trains accurately.
  • Generate Polynomial Features: Add polynomial features to capture non-linear relationships.
  • Split the Data: Divide the available data into training and validation sets.
  • Final Check: Confirm that all features are ready for modeling, that no outliers are unduly influencing the results, and that no further scale transformations are needed.
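As a rough sketch of this workflow, the snippet below chains scikit-learn utilities that appear later in the project (SimpleImputer, PolynomialFeatures, train_test_split) together with StandardScaler for scaling; the toy arrays X and y and the degree of 2 are placeholders for your own data.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (with one missing value) and target
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([2.1, 4.3, 9.2, 16.8, 25.4, 36.1])

X = SimpleImputer(strategy="mean").fit_transform(X)                     # handle missing values
X = StandardScaler().fit_transform(X)                                   # put features on the same scale
X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)   # add x and x^2 terms

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape)

In practice you would fit the imputer and scaler on the training split only and then apply them to the validation split, to avoid leaking information between the two.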

STEP 1:

Code explanation

Here’s what is happening under the hood. Let’s go through it step by step:

Mount Google Drive

Mount your Google Drive to access and save datasets, models, and other resources.

from google.colab import drive
drive.mount('/content/drive')

Suppress Warnings

This suppresses warning messages in the output, producing a cleaner view of the results.

import warnings
warnings.filterwarnings('ignore')

Install Required Libraries

This installs the LightGBM and Scikit-Learn libraries. LightGBM is often used for gradient boosting in machine learning, while Scikit-Learn provides essential tools for building and evaluating models.

!pip install lightgbm
!pip install scikit-learn

Import Libraries

This section imports the necessary libraries for data manipulation (pandas, numpy), plotting (seaborn, plotly, matplotlib), statistical operations (scipy), and machine learning (sklearn).

import sys
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import linear_model
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

STEP 2:

Load the Dataset

This code uses the pandas library to load the dataset from Google Drive. The head() method shows the first few records to give a quick overview of the dataset.

data = pd.read_csv("/content/drive/MyDrive/New 90Projects/Project_1/Final_NBA_Dataset.csv")
data.head()

Check the Dataset Dimensions

This statement displays the dimensions of the dataset, that is, how many rows and columns it has. For example, an output of (317, 7) means the dataset has 317 rows and 7 columns, which helps verify that the correct amount of data was loaded.

print("Dimension of the dataset is= ", data.shape)

Display Column Names

This prints the names of the dataset's columns, which helps you quickly see which features are available for analysis.

data.columns

Dataset Overview

Before starting data cleaning or analysis, it is wise to understand the structure of the data and check for missing values. The info() method makes this easy by listing each column's data type and non-null count.

data.info()

Renamed Columns for Easy Analysis

This code renames the columns of the data DataFrame to shorter names and stores the result in a new DataFrame df. It then calls df.head(10) to display the first 10 rows and confirm the changes.

df = data.rename(columns={'Points_Scored': 'Points', 'Weightlifting_Sessions_Average': 'WL',
                          'Yoga_Sessions_Average': 'Yoga', 'Laps_Run_Per_Practice_Average': 'Laps',
                          'Water_Intake': 'WI', 'Players_Absent_For_Sessions': 'PAFS'})
df.head(10)

STEP 3:

Visualizing Data Distribution and Outliers for Deeper Insights into Points

This code creates three visualizations for the Points variable: a density plot, a density plot of the square root of Points, and a boxplot. Each plot highlights a different aspect of the distribution, such as spread, skewness, and outliers. The layout is adjusted so the plots sit side by side for easy comparison.

fig, axs = plt.subplots(1, 3, figsize=(20, 6), dpi=80)
# Distribution Plot for Points
sns.distplot(df.Points, ax=axs[0])
axs[0].set_xlabel("Points")
axs[0].set_ylabel("Density")
axs[0].set_title("Distribution Plot for Points")
# Distribution Plot for Square Root of Points
sns.distplot(np.sqrt(df.Points), ax=axs[1])
axs[1].set_xlabel("Square Root of Points")
axs[1].set_ylabel("Density")
axs[1].set_title("Distribution Plot for Square Root of Points")
# Boxplot for Points
sns.boxplot(x=df.Points, ax=axs[2])
axs[2].set_xlabel("Points")
axs[2].set_title("Boxplot for Points")
# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()

Viewing the Last 100 Rows of the Dataset

The code displays the last 100 rows of the df DataFrame.

df.tail(100)

Creating Violin and Box Plots for Each Variable by Team

The function plotting_box_violin_plots() creates violin and box plots for a given variable against the 'Team' column, to better visualize the spread and central tendency of the data. The function is then called in a loop for each variable ('WL', 'Yoga', 'Laps', 'WI', 'PAFS') so that each one can be compared across teams.

def plotting_box_violin_plots(df, x, y):
    fig, axes = plt.subplots(1, 2, figsize=(26, 8))
    fig.suptitle("Violin and Box plots for variable: {}".format(x))
    # Violin plot with Team on the y-axis and the variable on the x-axis, colored by Team
    violin = sns.violinplot(ax=axes[0], x=x, y=y, data=df, hue=y, palette="Set2", split=True)
    axes[0].set_title("Violin plot for variable: {}".format(x))
    # Box plot with Team on the y-axis and the variable on the x-axis, colored by Team
    box = sns.boxplot(ax=axes[1], x=x, y=y, data=df, hue=y, palette="Set2")
    axes[1].set_title("Box plot for variable: {}".format(x))
    # Setting the labels for x-axis and y-axis
    axes[0].set_xlabel(x)
    axes[0].set_ylabel("Team")
    axes[1].set_xlabel(x)
    axes[1].set_ylabel("Team")
    # Add legends only if there are any labeled elements
    if violin.get_legend_handles_labels()[1]:
        axes[0].legend(loc='upper right')
    if box.get_legend_handles_labels()[1]:
        axes[1].legend(loc='upper right')

# Looping through the variables and plotting
for x in ['WL', 'Yoga', 'Laps', 'WI', 'PAFS']:
    plotting_box_violin_plots(df, x, "Team")

STEP 4:

Identifying Outliers in Selected Columns

The function find_outliers() identifies outliers in a given column by computing the interquartile range (IQR) of that column. A for loop then applies the function to each of the columns ('WL', 'Yoga', 'Laps', 'WI', 'PAFS') and prints any values that fall outside the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR, making it easy to spot data points that need additional attention.

def find_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    Upper_End = Q3 + 1.5 * IQR
    Lower_End = Q1 - 1.5 * IQR
    # Keep only the values outside the IQR fences
    outlier = df[column][(df[column] > Upper_End) | (df[column] < Lower_End)]
    return outlier

# Loop over the selected columns and print any outliers found
for column in ['WL', 'Yoga', 'Laps', 'WI', 'PAFS']:
    print('\n Outliers in column "%s"' % column)
    outlier = find_outliers(df, column)
    print(outlier)

Eliminating Certain Rows for Data Cleaning Purposes

This code creates a new DataFrame df_clean by dropping the rows with index labels 142, 143, and 144 from df. The command df_clean.shape then reports the size of the cleaned DataFrame.

df_clean=df.drop([142,143,144])
df_clean.shape

Replacing Invalid Values with NaN in the Cleaned Data

This code replaces every occurrence of the value 1111111.0 in the 'WL' column of df_clean with NaN, marking it as missing data. Displaying df_clean['WL'] then shows the modified column, which helps identify the records that will need further treatment during the analysis.

df_clean.loc[df_clean['WL'] == 1111111.0, 'WL'] = np.nan
df_clean['WL']

STEP 5:

Calculating and Displaying Missing Data Proportion

The DataFrame ncounts shows the proportion of missing data in each column of df_clean. It uses df_clean.isna().mean() to compute the fraction of NaN values per column; the result is then transposed and the column is renamed data_missing for ease of reference.

ncounts = pd.DataFrame([df_clean.isna().mean()]).T
ncounts = ncounts.rename(columns={0: 'data_missing'})
ncounts

Visualizing Missing Data Proportion

This code generates a horizontal bar chart showing the proportion of missing values in each column of ncounts. The call ncounts.plot(kind='barh', title='% of missing values across each column') gives an intuitive picture of where data is missing and helps determine which columns may need imputation or removal. Finally, plt.show() renders the plot.

ncounts.plot(kind='barh', title='% of missing values across each column')
plt.show()

Comparing Data Shapes Before and After Dropping Missing Values

This comparison helps you understand the impact of dropping rows or columns with missing data on the dataset’s size.

df_clean.shape, df_clean.dropna(axis=0).shape, df_clean.dropna(axis=1).shape

Getting an Overview of the Cleaned Data

This summary shows the structure of the cleaned dataset, including column types and remaining missing values, before any further analysis.

df_clean.info()

Filling Missing Values in the 'WL' Column

This returns the 'WL' column with its NaN values replaced by -1. Because the result is not assigned back to df_clean, the DataFrame itself is unchanged; the statement simply previews what such a fill would look like.

df_clean['WL'].fillna(-1)

STEP 6:

Visualizing the Effect of Filling Missing Values with Mean and Median

This code shows how the distribution of the WL column changes when its missing values are filled with the mean versus the median. Placing the two plots side by side makes it easy to compare the two imputation approaches and see which one better preserves the original shape of the data.

fig, axes = plt.subplots(1, 2, figsize=(16, 6), dpi=80)
# Visualizing after filling missing values with mean
sns.distplot(df_clean['WL'].fillna(df_clean['WL'].mean()), ax=axes[0])
axes[0].set_xlabel("WL")
axes[0].set_ylabel("Density")
axes[0].set_title("Distribution Plot for WL (Filled with Mean)")
# Visualizing after filling missing values with median
sns.distplot(df_clean['WL'].fillna(df_clean['WL'].median()), ax=axes[1])
axes[1].set_xlabel("WL")
axes[1].set_ylabel("Density")
axes[1].set_title("Distribution Plot for WL (Filled with Median)")
# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()

Calculating Mean 'WL' per Team

This piece of code computes the average of the WL column for every distinct Team in df_clean, utilizing groupby(), and outputs the result in the form of a dictionary.

mean_WL=df_clean.groupby("Team")['WL'].mean().to_dict()
mean_WL

Replacing NaN Values in the 'WL' Column on a Team-Wise Basis

This code goes through every row of df_clean. Wherever the WL value is missing, the gap is filled with the mean WL value for that row's team, taken from the mean_WL dictionary.
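The corresponding code is not shown here; as a minimal sketch, assuming the mean_WL dictionary from the previous step, the same result can be achieved with a vectorized fill rather than an explicit loop:

# Fill missing 'WL' values with the mean WL of each row's team
df_clean['WL'] = df_clean['WL'].fillna(df_clean['Team'].map(mean_WL))
df_clean['WL']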