Predictive Analytics on Business License Data Using Deep Learning

This deep learning project covers predictive analysis of business license data. It offers step-by-step guidance on the modeling process, which involves predicting whether a business license application will be approved, renewed, or revoked, using TensorFlow and H2O.

This project is aimed at those who have little or no experience with machine learning and those who want to take their skills several notches higher. It is a practical project that covers the full process, from data preprocessing to developing the model with deep neural networks (DNNs).

Let's explore the topic more!

Project Overview:

The purpose of this project is to predict the status of a business license using machine learning and deep learning techniques. We’ll explore, clean, and prepare a dataset of more than 86,000 businesses for modeling. First, we build a baseline model using H2O’s Random Forest, and then a more complex Deep Neural Network (DNN) using TensorFlow.

At the end of all this, you will have built a system that can foresee what is most likely to happen with a business license application.

Key Features:

Tools: TensorFlow, H2O, Python libraries (pandas, numpy, matplotlib, seaborn, scikit-learn).
Outcome: Predict business license statuses such as Approved, Renewed, or Revoked.
Use case: Predictive analytics for businesses, government regulations, or consultancy services.

Prerequisites

Before working, please ensure that you have the following:

Google Colab or a local working Python environment
Familiarity with the TensorFlow, H2O, pandas, NumPy, seaborn, and scikit-learn libraries
Understanding of basic machine learning and deep learning concepts
Business License dataset

Approach

We follow a structured approach:

Data Collection: Collect a dataset that contains data on 86,000 different businesses and their licensing details.
Data Preparation: Clean and preprocess the available data for model training.
Model Building: We build two models: a Random Forest baseline using H2O and a deep neural network using TensorFlow.
Evaluation: We evaluate the models using essential metrics such as accuracy.

Workflow and Methodology

The overall workflow of this project includes:

Data Preparation: Load and clean the dataset of business licenses. Then handle missing values and normalize data features.
Exploratory Data Analysis: Analyze data distribution and relationships between features to understand patterns.
Baseline Model with H2O: Build a Random Forest baseline model using the H2O framework to predict license statuses.
DNN Setup: Train a DNN model using TensorFlow including dropout regularization.
Evaluation: Test the trained model with test data. Then calculate accuracy and loss metrics for performance evaluation.
Prediction: Use the trained DNN model to make predictions on unseen data.

The methodology involves:

Supervised learning: We train the model with labeled data to predict license statuses.
Feature engineering: Important features like license type, business type, and ZIP code are used for predictions.
Model training: We use cross-entropy loss for the DNN and Gini Impurity for the random forest.
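
To make the two training criteria named above concrete, here is a minimal sketch in plain Python that computes them by hand. The function names `gini_impurity` and `cross_entropy`, the class counts, and the predicted probabilities are illustrative assumptions, not part of the project code; the libraries handle the actual optimization internally.

```python
import math

def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def cross_entropy(true_index, predicted_probs):
    """Cross-entropy for one sample: -log of the probability assigned to the true class."""
    return -math.log(predicted_probs[true_index])

# Hypothetical tree node holding 8 AAI, 1 AAC, and 1 REV licenses.
node_impurity = gini_impurity([8, 1, 1])

# Hypothetical DNN outputs for a sample whose true class has index 0.
confident_loss = cross_entropy(0, [0.9, 0.05, 0.05])
unsure_loss = cross_entropy(0, [0.4, 0.3, 0.3])

print(round(node_impurity, 2))       # 0.34
print(confident_loss < unsure_loss)  # True
```

A pure node (all licenses in one class) has impurity 0, and the cross-entropy shrinks as the model puts more probability on the correct class.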

Data Collection

First, we load a business license dataset from Kaggle. It contains detailed information about businesses, including license number, license description, license status, application type, and so on. This data helps us predict whether a business license will be issued, renewed, or revoked.

Data Preparation

After collecting the dataset, we clean and prepare the data before modeling. This involves handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets.

Data Preparation Workflow:

Handle missing values: To ensure the model's reliability, we either fill in or remove the data points that are missing.
Categorical encoding: Convert categorical features into numerical values using one-hot encoding.
Train-test split: Split the dataset into training and test sets (i.e., 80% for training and 20% for testing) so that we can check how well the model works on new data.
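
The three steps above can be sketched on a tiny stand-in table. This is a rough illustration under stated assumptions, not the project's actual preprocessing: the rows are invented, the choice of which column to fill and which to drop is arbitrary, and `random_state=42` is only there to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the license data; all values are made up for illustration.
toy = pd.DataFrame({
    "LICENSE DESCRIPTION": ["Retail Food", "Limited Business", None, "Retail Food",
                            "Tavern", "Limited Business", "Retail Food", "Tavern",
                            "Limited Business", "Retail Food", "Tavern"],
    "ZIP CODE": [60601, 60602, 60601, np.nan, 60603, 60602,
                 60601, 60603, 60602, 60601, 60603],
    "LICENSE STATUS": ["AAI", "AAC", "AAI", "AAI", "REV", "AAC",
                       "AAI", "AAI", "AAC", "AAI", "REV"],
})

# 1. Handle missing values: fill the categorical gap, drop rows without a ZIP code.
toy["LICENSE DESCRIPTION"] = toy["LICENSE DESCRIPTION"].fillna("Unknown")
toy = toy.dropna(subset=["ZIP CODE"])

# 2. One-hot encode the categorical feature (one 0/1 column per category).
features = pd.get_dummies(toy[["LICENSE DESCRIPTION", "ZIP CODE"]],
                          columns=["LICENSE DESCRIPTION"])
target = toy["LICENSE STATUS"]

# 3. Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

On the real dataset, whether to fill or drop a column's missing values depends on how many rows are affected and how important the column is.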

Code Explanation

To make this easy to follow, let’s walk through the code step by step:

STEP 1:

Mounting Google Drive

This code mounts Google Drive. It allows us to access datasets that are stored there.

from google.colab import drive
drive.mount('/content/drive')

Install Required Packages

This code installs three packages: tensorflow, numpy, and h2o. This setup supports classical machine learning through H2O as well as deep learning through TensorFlow, while NumPy handles the numerical computation.

!pip install tensorflow
!pip install numpy
!pip install h2o

Importing Required Libraries

We import the essential Python libraries: pandas for data manipulation, NumPy for arrays, matrices, and mathematical functions, and seaborn and matplotlib for data visualization. H2O is popular for machine learning on large datasets, and scikit-learn provides train_test_split for splitting the data. Finally, h2o.init() starts (or connects to) a local H2O engine, which lets us build and train machine learning models through the library.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator, H2ORandomForestEstimator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
h2o.init()

This code configures pandas to display up to 500 columns and 500 rows of a DataFrame.

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

Loading the Dataset

This code loads the business license dataset into a DataFrame and prints its shape, which shows the total number of rows and columns.

pdf = pd.read_csv("/content/drive/MyDrive/Aionlinecourse/Data.csv")
print(pdf.shape)

STEP 2:

Exploratory data analysis

This code shows all column names of the dataframe.

pdf.columns

This code shows how many times each unique value appears in the “LICENSE STATUS” column.

pdf["LICENSE STATUS"].value_counts()
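
Raw counts can be hard to compare at a glance, so it may help to look at proportions as well. The sketch below uses a small invented Series rather than the real column; the status codes mimic the dataset's, but the counts are made up.

```python
import pandas as pd

# Toy status column; the counts are invented for illustration.
statuses = pd.Series(["AAI"] * 6 + ["AAC"] * 3 + ["REV"], name="LICENSE STATUS")

# normalize=True returns each class's share of the rows instead of raw counts.
shares = statuses.value_counts(normalize=True)
print(shares)
```

A class with a very small share is a sign of imbalance, which is worth keeping in mind when accuracy is later used as the evaluation metric.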

This expression returns True if 'LICENSE STATUS' is present in the DataFrame's columns and False otherwise.

'LICENSE STATUS' in pdf.columns

This line discards other rows and retains only those rows where the "LICENSE STATUS" is AAI, AAC, or REV.

pdf = pdf[pdf['LICENSE STATUS'].isin(['AAI', 'AAC', 'REV'])]
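
To see what this filter does, here is the same pattern on a five-row toy DataFrame. The rows and the extra status codes are invented for illustration; the `.copy()` call is an optional addition that makes the result an independent DataFrame, which avoids pandas warnings if columns are modified later.

```python
import pandas as pd

# Toy frame with two statuses that should be filtered out (values are made up).
toy = pd.DataFrame({
    "LICENSE NUMBER": [1, 2, 3, 4, 5],
    "LICENSE STATUS": ["AAI", "INQ", "REV", "AAC", "REA"],
})

keep = ["AAI", "AAC", "REV"]
mask = toy["LICENSE STATUS"].isin(keep)  # boolean Series, True where the status is kept
filtered = toy[mask].copy()

print(len(toy), "->", len(filtered))  # 5 -> 3
```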

It returns the total count of missing values for each column. This makes it easy to see where data might be incomplete.

pdf.isna().sum()
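
With tens of thousands of rows, the share of missing values per column is often easier to judge than the raw count. A minimal sketch on a toy DataFrame (the column names and values are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns; values are invented.
toy = pd.DataFrame({
    "WARD": [1.0, np.nan, 3.0, np.nan],
    "CITY": ["CHICAGO", "CHICAGO", None, "CHICAGO"],
    "LICENSE STATUS": ["AAI", "AAC", "AAI", "REV"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns near the top of this ranking are candidates for dropping, while columns with only a few gaps are usually better filled.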

This code prints a concise summary of the DataFrame, including each column's data type and non-null count.

pdf.info()

This code shows the number of unique values for each column in the DataFrame pdf.

pdf.nunique()
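
One way to use these counts is to spot columns that carry no information (a single value) and columns with few enough categories to one-hot encode comfortably. The sketch below is an illustration on invented data; the threshold `max_categories` is a hypothetical choice, not a value from the project.

```python
import pandas as pd

# Toy frame: an ID-like column, a low-cardinality column, and a constant column.
toy = pd.DataFrame({
    "LICENSE NUMBER": [101, 102, 103, 104, 105, 106],
    "APPLICATION TYPE": ["ISSUE", "RENEW", "RENEW", "ISSUE", "RENEW", "RENEW"],
    "STATE": ["IL", "IL", "IL", "IL", "IL", "IL"],
})

unique_counts = toy.nunique()
max_categories = 3  # hypothetical cut-off for one-hot encoding

# Columns with a single value cannot help the model distinguish rows.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

# Columns with a handful of distinct values are candidates for one-hot encoding.
low_cardinality = unique_counts[
    (unique_counts > 1) & (unique_counts <= max_categories)
].index.tolist()

print(constant_cols, low_cardinality)
```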

The command pdf.head() displays the first five rows of the DataFrame pdf. This is useful for quickly inspecting the initial entries, verifying the structure, and getting a feel for the values in each column before further analysis or manipulation.

pdf.head()

This code draws a count plot of the categorical variable “LICENSE STATUS” in the DataFrame pdf. The plot shows how many rows fall into each license status, which makes it easier to spot trends or imbalances in the data.

sns.countplot(data=pdf, x='LICENSE STATUS')
plt.show()