
Predictive Analytics on Business License Data Using Deep Learning

This deep learning project teaches the fundamentals of Deep Neural Networks (DNNs) and how to apply them effectively. It uses a real dataset covering 86,000 companies across a range of industries and walks students through key steps such as Exploratory Data Analysis (EDA), data cleaning, and data preparation. Using Python and libraries such as pandas, seaborn, numpy, matplotlib, scikit-learn, h2o, and TensorFlow, participants learn concepts like activation functions, feedforward computation, backpropagation, and dropout regularization. The project covers not only the theory but also shows how to build DNN models with TensorFlow, establish a baseline model with h2o, and tune hyperparameters. Through this structured approach, participants gain a full understanding of deep learning concepts along with practical skills in data preprocessing, model development, and evaluation.

Explanation of All Code

Step 1:

This piece of code mounts your Google Drive in a Google Colab notebook. This makes it easy to access files stored in Google Drive for data manipulation, analysis, and model training in the Colab environment.

Install required packages

Import required packages

Step 2:

Load the data

This piece of code summarizes the distribution of values in the "LICENSE STATUS" column of the DataFrame "pdf". It shows how often each license status appears in the dataset, which helps you understand the data and guide further analysis or decision-making.

This bit of code gets the number of unique values in the "LICENSE STATUS" column of the DataFrame "pdf", i.e. how many distinct license statuses appear in the dataset. This is useful for understanding the variety of license statuses present before filtering or modeling.

This code filters the DataFrame "pdf" so that it only keeps rows where the "LICENSE STATUS" column has the value "AAI", "AAC", or "REV". Rows with other license statuses are dropped from the dataset.
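As a sketch of the three operations above (the DataFrame here is a small synthetic stand-in, not the real 86,000-row dataset), the pandas calls look like:

```python
import pandas as pd

# Small synthetic stand-in for the real dataset
pdf = pd.DataFrame({
    "LICENSE STATUS": ["AAI", "AAC", "REV", "AAI", "INQ", "AAC"],
})

# Frequency of each license status
status_counts = pdf["LICENSE STATUS"].value_counts()
print(status_counts)

# Number of distinct license statuses
print(pdf["LICENSE STATUS"].nunique())

# Keep only rows whose status is AAI, AAC, or REV
pdf = pdf[pdf["LICENSE STATUS"].isin(["AAI", "AAC", "REV"])]
print(len(pdf))
```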

This piece of code counts how many empty cells there are in each column of the DataFrame "pdf".

This code shows a summary of the DataFrame "pdf", which includes column names, data types, and memory usage.

The number of unique values in each column of the DataFrame "pdf" is found by this piece of code.
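These inspection steps can be sketched as follows (again on a tiny synthetic stand-in with illustrative values):

```python
import pandas as pd

pdf = pd.DataFrame({
    "LICENSE STATUS": ["AAI", "AAC", None, "REV"],
    "ZIP CODE": ["60601", None, None, "60602"],
})

# Count of missing cells per column
missing = pdf.isna().sum()
print(missing)

# Column names, data types, and memory usage
pdf.info()

# Number of distinct values per column
print(pdf.nunique())
```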

This code makes a count plot with seaborn ('sns') to show the distribution of values in the "LICENSE STATUS" column of the DataFrame 'pdf'. It draws a bar for the frequency of each license status along the x-axis. Lastly, the plot is displayed with "plt.show()".

Step 3:

This code counts how many times each unique value appears in the "CONDITIONAL APPROVAL" column of the "pdf" DataFrame. It shows how the different types of conditional approval are distributed in the dataset.

This block of code renames the column "DOING BUSINESS AS NAME" of the DataFrame "pdf". The 'rename()' function takes a dictionary as input, whose keys are the current column names and whose values are the new column names. The "inplace=True" option makes sure the change is applied to the "pdf" DataFrame directly instead of returning a new one.

This piece of code adds a new field called "LEGAL BUSINESS NAME MATCH" to the DataFrame "pdf". For each row, it checks to see if the uppercase "LEGAL NAME" is inside the uppercase "DOING BUSINESS AS NAME" or the other way around. The new column's value is set to 1 if a match is found and to 0 otherwise.
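A sketch of that row-wise comparison, on illustrative business names (the matching rule follows the description above; the exact project code may differ):

```python
import pandas as pd

pdf = pd.DataFrame({
    "LEGAL NAME": ["Acme Inc", "Joe's Garage LLC", "Delta Corp"],
    "DOING BUSINESS AS NAME": ["ACME", "Speedy Repairs", "DELTA CORP STORE"],
})

def names_match(row):
    legal = str(row["LEGAL NAME"]).upper()
    dba = str(row["DOING BUSINESS AS NAME"]).upper()
    # 1 if either uppercase name contains the other, else 0
    return int(legal in dba or dba in legal)

pdf["LEGAL BUSINESS NAME MATCH"] = pdf.apply(names_match, axis=1)
print(pdf["LEGAL BUSINESS NAME MATCH"].tolist())
```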

This code counts how many times each unique value appears in the "LICENSE DESCRIPTION" field of the "pdf" DataFrame. It gives information about how the different kinds of license descriptions are distributed in the dataset.

This piece of code takes similar descriptions from the "LICENSE DESCRIPTION" column of the "pdf" DataFrame and consolidates them into broader terms. It defines categories such as auto repair shop, day care center, peddler, tire shop, repossessor, expediter, and itinerant merchant. This removes near-duplicates and makes the information consistent for analysis.
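One way to sketch that consolidation (the keywords and category names here are illustrative, not the project's exact rules):

```python
import pandas as pd

pdf = pd.DataFrame({
    "LICENSE DESCRIPTION": [
        "MOTOR VEHICLE REPAIR - ENGINE ONLY",
        "DAY CARE CENTER UNDER 2 YEARS",
        "PEDDLER, NON-FOOD",
        "TIRE FACILITY CLASS I",
    ],
})

# Map any description containing a keyword to a broader category
keyword_to_category = {
    "REPAIR": "AUTO REPAIR",
    "DAY CARE": "DAY CARE",
    "PEDDLER": "PEDDLER",
    "TIRE": "TIRE SHOP",
}

def consolidate(desc):
    for keyword, category in keyword_to_category.items():
        if keyword in desc:
            return category
    return desc  # leave unmatched descriptions unchanged

pdf["LICENSE DESCRIPTION"] = pdf["LICENSE DESCRIPTION"].apply(consolidate)
print(pdf["LICENSE DESCRIPTION"].tolist())
```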

This code finds the number of unique values in the "pdf" DataFrame's "LICENSE DESCRIPTION" field. It gives you an idea of the range of license descriptions in the collection.

This piece of code categorizes and standardizes the different types of businesses in the DataFrame "pdf". It first makes the "LEGAL NAME" and "DOING BUSINESS AS NAME" fields more consistent by removing periods, and initializes the "BUSINESS TYPE" field to "PVT". After that, it assigns businesses to groups based on terms like "INC", "LLC", "CORP", and "LTD" found in their names. These steps make the data more consistent and open up business type as a feature for analysis.
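A minimal sketch of that standardization, assuming the keyword rules described above (the sample names are invented):

```python
import pandas as pd

pdf = pd.DataFrame({
    "LEGAL NAME": ["ACME INC.", "JOE'S GARAGE L.L.C.", "SOLO TRADER"],
    "DOING BUSINESS AS NAME": ["ACME", "SPEEDY LLC", "SOLO"],
})

# Drop periods so "L.L.C." and "LLC" compare equally
for col in ["LEGAL NAME", "DOING BUSINESS AS NAME"]:
    pdf[col] = pdf[col].str.replace(".", "", regex=False)

# Default every business to PVT, then reassign by keyword
pdf["BUSINESS TYPE"] = "PVT"
for keyword, btype in [("INC", "INC"), ("LLC", "LLC"),
                       ("CORP", "CORP"), ("LTD", "LTD")]:
    mask = (pdf["LEGAL NAME"].str.contains(keyword)
            | pdf["DOING BUSINESS AS NAME"].str.contains(keyword))
    pdf.loc[mask, "BUSINESS TYPE"] = btype

print(pdf["BUSINESS TYPE"].tolist())
```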

This code finds out how often each unique value in the "BUSINESS TYPE" column of the DataFrame "pdf" appears. It shows how the different types of businesses are distributed in the dataset after standardization and categorization.

Step 4:

This piece of code finds out how often each unique value appears in the "ZIP CODE" column of the "pdf" DataFrame. It gives information about how businesses are distributed across the dataset's ZIP codes.

This code adds a new column called "ZIP CODE MISSING" to the DataFrame 'pdf' to show cases where ZIP codes are missing. It also finds missing values in the "ZIP CODE" column and replaces them with -1.
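Sketched on a few illustrative rows, that missing-value handling looks like:

```python
import pandas as pd

pdf = pd.DataFrame({"ZIP CODE": ["60601", None, "60602", None]})

# Flag rows where the ZIP code is absent, then fill with a sentinel
pdf["ZIP CODE MISSING"] = pdf["ZIP CODE"].isna().astype(int)
pdf["ZIP CODE"] = pdf["ZIP CODE"].fillna(-1)
print(pdf)
```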

Step 5:

This code makes a 12 bin histogram of the "SSA" column in the "pdf" DataFrame. The grid is slightly see-through.
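As a sketch (the "SSA" values here are randomly generated stand-ins for the real column), the histogram call looks like:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pdf = pd.DataFrame({"SSA": rng.integers(1, 60, size=200)})

# 12-bin histogram with a lightly transparent grid
counts, bins, _ = plt.hist(pdf["SSA"], bins=12)
plt.grid(alpha=0.3)
plt.xlabel("SSA")
plt.ylabel("Frequency")
plt.savefig("ssa_hist.png")
```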

This piece of code fills any missing values in the "APPLICATION REQUIREMENTS COMPLETE" column of the DataFrame "pdf" with -1. Then it converts the column to a binary flag: 0 if the value was missing and 1 otherwise. With this transformation, you can tell whether the application requirements are complete or not.
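That fill-then-flag step can be sketched like this (the date strings are invented placeholders for whatever the column holds):

```python
import pandas as pd

pdf = pd.DataFrame({
    "APPLICATION REQUIREMENTS COMPLETE": ["2015-01-02", None, "2016-03-04"],
})

col = "APPLICATION REQUIREMENTS COMPLETE"
pdf[col] = pdf[col].fillna(-1)
# Collapse to a binary flag: 0 = requirements missing, 1 = complete
pdf[col] = (pdf[col] != -1).astype(int)
print(pdf[col].tolist())
```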

Train Test Split


This code divides the DataFrame "pdf" into training and testing sets. The test set holds 20% of the data, and a fixed random state makes the split reproducible.

The pandas DataFrames "train" and "test" are turned into H2OFrame objects by this code. The H2O machine learning platform uses a data structure called H2OFrame to make it easier to work with and handle large datasets.

Using the H2O library and specific settings, this code sets up and trains a random forest model called "h2o_rf". To predict the "LICENSE STATUS" target, it uses predictors such as "APPLICATION TYPE", "CONDITIONAL APPROVAL", "LICENSE CODE", "SSA", "LEGAL BUSINESS NAME MATCH", "ZIP CODE MISSING", "APPLICATION REQUIREMENTS COMPLETE", and "BUSINESS TYPE".
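H2O is not installed everywhere, so as an analogous sketch the same baseline can be expressed with scikit-learn's RandomForestClassifier. The data below is synthetic, and only the numeric predictors from the list above are used (scikit-learn's trees require numeric features, unlike H2O's):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in using a numeric subset of the named predictors
df = pd.DataFrame({
    "LICENSE CODE": [1010, 1010, 4404, 1470] * 25,
    "SSA": [0, 22, 0, 13] * 25,
    "LEGAL BUSINESS NAME MATCH": [1, 0, 1, 0] * 25,
    "ZIP CODE MISSING": [0, 0, 1, 0] * 25,
    "APPLICATION REQUIREMENTS COMPLETE": [1, 1, 0, 1] * 25,
    "LICENSE STATUS": ["AAI", "AAC", "REV", "AAI"] * 25,
})

predictors = [c for c in df.columns if c != "LICENSE STATUS"]
X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df["LICENSE STATUS"],
    test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
score = rf.score(X_test, y_test)
print(score)
```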

This code applies the trained random forest model ('h2o_rf') to the test dataset ('test') to make predictions. Then, it appends the actual "LICENSE STATUS" values from the test dataset alongside the predictions. Lastly, it converts the results into a pandas DataFrame so they can be analyzed further.

Data Conversion for DNN model

This piece of code takes the original DataFrame 'pdf', selects the predictors and the target variable ('LICENSE STATUS'), and builds a new DataFrame called 'final_df'. Categorical fields such as 'APPLICATION TYPE', 'CONDITIONAL APPROVAL', 'LICENSE CODE', 'LICENSE DESCRIPTION', 'BUSINESS TYPE', and 'LICENSE STATUS' are one-hot encoded using 'pd.get_dummies()'. Converting categorical variables to numeric indicators prepares the data for training a machine learning model.
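A minimal sketch of that one-hot encoding, using a few invented category values:

```python
import pandas as pd

pdf = pd.DataFrame({
    "CONDITIONAL APPROVAL": ["Y", "N", "N"],
    "BUSINESS TYPE": ["INC", "LLC", "PVT"],
    "LICENSE STATUS": ["AAI", "AAC", "REV"],
})

# One-hot encode the categorical columns into 0/1 indicator columns
final_df = pd.get_dummies(
    pdf, columns=["CONDITIONAL APPROVAL", "BUSINESS TYPE", "LICENSE STATUS"])
print(final_df.columns.tolist())
```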

The final DataFrame 'final_df' is split into training and testing sets with the 'train_test_split' method from scikit-learn. A fixed random seed makes the split reproducible, and the test set holds 20% of the data.
  • "X_train" holds the predictor variables for the training set.
  • "y_train" holds the target variables for the training set.
  • "X_test" holds the predictor variables for the test set.
  • "y_test" holds the target variables for the test set.

Then, the shape of each dataset is printed to confirm the sizes are correct.
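Sketched with a random array standing in for the encoded 'final_df' (100 rows and 10 predictor columns are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot-encoded final_df
X = np.random.rand(100, 10)            # predictor columns
y = np.random.randint(0, 3, size=100)  # target classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```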

Step 6:

Modeling


This code imports necessary libraries for building neural network models

This code creates a Sequential neural network model for multiclass classification using Keras. It has an input layer, three dense hidden layers with dropout regularization, and an output layer. The model is compiled with the Adam optimizer, the categorical crossentropy loss function, and accuracy as the metric. Lastly, the model summary is printed so it can be reviewed.
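A minimal sketch of such a model, assuming 20 input features and 3 output classes (placeholders; the real counts come from the encoded data, and the layer widths and dropout rate here are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

n_features, n_classes = 20, 3  # placeholders for the encoded data

model = Sequential([
    tf.keras.Input(shape=(n_features,)),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(16, activation="relu"),
    Dropout(0.2),
    Dense(n_classes, activation="softmax"),  # one unit per class
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Dropout randomly zeroes a fraction of each hidden layer's activations during training, which discourages co-adaptation and reduces overfitting.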

This code applies the trained neural network model ('model') to the test data ('X_test') to make predictions. For each instance in the test dataset, it returns the predicted class.

Conclusion

In this project, we learned the basics of Deep Neural Networks (DNNs) and how to apply them in a real-life situation to predict the status of a business license. We started by examining the dataset, which held data on different businesses and their license statuses. Through exploratory data analysis and data cleaning, we learned the dataset's characteristics and got it ready for modeling.


After that, we used the H2O framework to build a baseline model and then TensorFlow to build a DNN model. We learned about important ideas like activation functions, feedforward computation, backpropagation, loss functions, and dropout regularization, which are necessary to understand DNNs. After fine-tuning its hyperparameters and training it, we evaluated our DNN model on the test dataset, and it did a good job of predicting business license statuses.


In the end, this project was a good introduction to deep learning for people who are new to it because it gave them real-world experience with building and training DNN models. After learning about basic ideas and tools, participants are now ready to take on more difficult deep learning projects and help AI technologies progress in many areas.
