Project Overview
This Big Mart sales prediction project is a strong example of how data science methods apply to real-world retail sales data. You will use a dataset from Kaggle containing features such as product type, item visibility, store location, and outlet characteristics to build an accurate sales prediction model.
The project begins with data cleaning and preprocessing, where you’ll handle missing values and scale features for model training. You will then move on to feature engineering and exploratory data analysis (EDA) to uncover patterns and trends in customer buying behavior and product performance.
When building the regression model, you’ll work through core concepts such as scaling the data, selecting features, and optimizing the model. Techniques like linear regression, random forest regression, and hyperparameter tuning will produce a model that predicts sales figures for Big Mart products.
This project is worth including in your portfolio because it gives you hands-on experience in both predictive modeling and retail analytics. If you are aiming for a role as a data scientist, data analyst, or business analyst, it will help you build skill and confidence.
Prerequisites
Learners should develop a few skills before undertaking this Big Mart Sales Prediction project. Here’s what you should ideally know:
- Basic knowledge of Python for data analysis and manipulation
- Familiarity with libraries such as Pandas, NumPy, and Matplotlib for data manipulation and visualization
- Understanding of data preprocessing steps such as handling missing values, normalization, and scaling
- Familiarity with exploratory data analysis (EDA) for finding patterns and trends in datasets
- Elementary understanding of regression models and how predictive modeling works
- Experience with machine learning frameworks such as Scikit-Learn for building, training, and evaluating models
Approach
First, we load and clean the data to ensure it is of high quality. Then, EDA reveals useful insights and key patterns in sales.
After that, feature engineering creates impactful variables. After preprocessing with scaling and encoding, you’ll select and train regression models. Hyperparameter tuning then optimizes model accuracy, evaluated with metrics like MAE and RMSE. Finally, the model generates sales predictions and insights, improving the understanding of retail sales trends for better decision-making.
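A typical first EDA question from the approach above is how sales vary across outlet types. Below is a minimal sketch of that kind of analysis; the column names (`Item_Type`, `Outlet_Type`, `Item_Outlet_Sales`) follow the Kaggle BigMart dataset, but the rows here are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the BigMart data; values are made up.
df = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Snacks", "Snacks", "Drinks"],
    "Outlet_Type": ["Grocery", "Supermarket", "Supermarket", "Grocery", "Supermarket"],
    "Item_Outlet_Sales": [250.0, 900.0, 1200.0, 300.0, 700.0],
})

# Average sales per outlet type -- a common starting point for
# spotting which store formats drive revenue.
avg_by_outlet = df.groupby("Outlet_Type")["Item_Outlet_Sales"].mean()
print(avg_by_outlet)
```

On the real dataset, the same `groupby` pattern works for any categorical column, such as `Item_Type` or `Outlet_Location_Type`.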
Workflow and Methodology
Here's a step-by-step workflow you'll follow to build a successful sales prediction model:
- Data Collection and Loading: Download the Big Mart sales dataset from Kaggle and load it into a Pandas DataFrame.
- Data Cleaning: Detect and handle missing values, ensure each column has the correct data type, and treat outliers to improve data quality.
- Exploratory Data Analysis (EDA): Use EDA to understand the distribution of the data and its prominent features, and to analyze sales patterns and trends.
- Feature Engineering: Create new columns and transform existing ones (e.g., encoding categorical variables) to improve model results.
- Data Preprocessing: Scale the numerical data and convert categorical data into numeric form for better model training.
- Model Selection: Choose regression models, since this is a regression task.
- Model Training: Train the candidate models on the cleaned and prepared data.
- Model Evaluation: Compare the models using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
- Prediction and Insights: Use the final model to predict sales and generate insights that help improve Big Mart’s sales strategies.
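The model selection, training, and evaluation steps above can be sketched as follows. This is an illustration on synthetic data, not the actual BigMart features: the two models named in the workflow are fit and then compared on MAE and RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared BigMart features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit each candidate model, then score it on the held-out test set.
results = {}
for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    results[type(model).__name__] = (mae, rmse)
    print(type(model).__name__, round(mae, 3), round(rmse, 3))
```

On the real dataset, the loop would run over models trained after the cleaning and preprocessing steps, and the model with the lowest error would move on to hyperparameter tuning (e.g., with `GridSearchCV`).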
Data Collection and Preparation
Data Collection
The Big Mart Sales dataset is available on Kaggle. You can conveniently and securely access a Kaggle dataset from within Google Colab after configuring your Kaggle credentials, which prevents exposing sensitive information. The notebook collects the Kaggle API key and username from the user and assigns them as environment variables. This enables Kaggle’s CLI command (!kaggle datasets download -d brijbhushannanda1979/bigmart-sales-data), which authenticates the user and downloads the dataset straight into Colab.
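A minimal sketch of that credential setup is shown below. The placeholder strings are assumptions to be replaced with your own credentials (in a real notebook you would typically collect them interactively, e.g. with `getpass`, rather than hard-coding them); the download commands themselves run in a Colab cell.

```python
import os

# Placeholders -- replace with your own Kaggle username and API key.
# The Kaggle CLI reads these two environment variables to authenticate.
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"
os.environ["KAGGLE_KEY"] = "your_kaggle_api_key"

# In Colab, the leading "!" runs a shell command. With the variables
# above set, these commands download and unpack the dataset:
# !kaggle datasets download -d brijbhushannanda1979/bigmart-sales-data
# !unzip -o bigmart-sales-data.zip
```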
Data Preparation
Data preparation workflow
- Data Cleaning: Handle missing values with the median or mode, then convert columns to the correct data types.
- Outlier Management: Detect and treat outliers using statistical methods such as the IQR rule for better model performance.
- Feature Engineering: Transform categorical variables with label encoding or one-hot encoding, and create additional features if they can improve model performance.
- Scaling and Normalization: Use StandardScaler to standardize numeric columns.
- Data Splitting: Split data into training and testing sets to prepare for model training.
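The five preparation steps above can be sketched end to end on a toy frame. The column names follow the BigMart dataset, but the rows are invented, so treat this as an illustration of the techniques rather than real preprocessing output.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with the kinds of gaps the real dataset has.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan, 8.9],
    "Outlet_Size": ["Medium", "Medium", np.nan, "Small", "Small", "Medium"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 57.0],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7, 556.6],
})

# 1. Data cleaning: median for numeric gaps, mode for categorical gaps.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].median())
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

# 2. Outlier management: cap values outside the 1.5 * IQR fences.
q1, q3 = df["Item_MRP"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Item_MRP"] = df["Item_MRP"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Feature engineering: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["Outlet_Size"])

# 4. Scaling: standardize the numeric feature columns.
num_cols = ["Item_Weight", "Item_MRP"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 5. Splitting: hold out a test set for model evaluation.
X = df.drop(columns=["Item_Outlet_Sales"])
y = df["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
print(X_train.shape, X_test.shape)
```

The resulting `X_train` and `y_train` feed directly into the model training step of the workflow.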