Build Your First Machine Learning Project in Python (Step by Step Tutorial)


Are you a complete beginner in machine learning, looking for an interesting first project? Then this article is for you. As the saying goes, you learn best by doing. In this post, you will go step by step through your first machine learning project in Python. This will boost your knowledge of both Python and machine learning.


How can you start making a machine learning project?

Starting your first machine learning project is not that hard. You do not need to know a lot to start developing a new project. All you need is a good guideline and a practical explanation of how a machine learning model works.

With this hands-on tutorial, you will learn-

  1. How to find datasets for machine learning projects

  2. Loading the dataset and understanding its structures through statistical summarization and data visualization

  3. Cleaning and preprocessing the data

  4. Creating a number of machine learning models, comparing them, and finalizing the best model for the dataset

  5. Evaluating the performance of the final model

You will find out how to load and prepare data with pandas, and how to build and evaluate models with scikit-learn. Let’s dive into the tutorial!

Find Datasets for Machine Learning Projects

There are many places where you can find cool datasets for building exclusive machine learning applications. 

A popular place is Kaggle, where you will find a large number of datasets on almost everything from business to health and so on.

The UCI Machine Learning Repository is another good place for datasets. There you will find hundreds of datasets classified by the type of machine learning problem.

Other sources include Amazon datasets, Microsoft datasets, US government data, etc.

You can build your own datasets too. This could be anything from the performance history of your favorite baseball player to data about the videos you most frequently watch on YouTube.

For our project, we took the Bank Marketing Dataset from the UCI Machine Learning Repository.

Bank Marketing Dataset

The dataset comes from the direct marketing campaigns of a Portuguese bank. The data was collected through direct phone calls to clients. The campaigners asked various questions related to the product (a bank term deposit) and recorded information about each client, including age, job, marital status, education, etc. The main goal of the campaigners was to find out whether the client was interested in subscribing to the term deposit.

The dataset we are using (bank.csv) has a total of 16 input variables and a single output variable, which makes 17 columns in all. Let’s have a look at them:

Input variables:

  • age (numeric): Age of the clients.

  • job(categorical): The type of job the clients do. The categories include- 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown'

  • marital(categorical): Marital status of the customers. Includes- 'divorced','married','single','unknown'

  • education(categorical): Educational background of the customers. Includes- 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown'

  • default(categorical): Whether the client has credit in default. Includes- 'no','yes','unknown'

  • housing(categorical): Whether the client has a housing loan. Categories include- 'no','yes','unknown'

  • loan(categorical): Whether the client has a personal loan. Categories include- 'no','yes','unknown'

  • contact(categorical): communication medium.  Categories include- 'cellular','telephone'

  • month(categorical): The month clients are contacted.

  • day_of_week(categorical): The day of the week clients are contacted.

  • duration(numerical): The duration of the last call in seconds.

  • campaign(numeric): Number of times the client was contacted during the current campaign.

  • pdays(numeric): Number of days since the client was last contacted in a previous campaign (999 means the client was not previously contacted).

  • previous(numeric): Number of times the client was contacted before the current campaign.

  • poutcome(categorical): The result of the previous marketing campaign. Categories include- 'failure','nonexistent','success'

Output variable

  • y(categorical): whether the client has taken a term deposit. Categories include- yes, no

Identify the Machine Learning Problem

Before building any machine learning model, we need to understand exactly what problem we are solving. This is very important for preparing the dataset and developing appropriate models to get the best performance.

The output variable of our dataset represents the decision of the clients. It has two classes, yes and no, which makes this a binary classification problem: we need to predict one of the two classes from all the other independent attributes, or features, of the dataset.

The business aspect of solving this problem is also very important. A good model will help the bank make better decisions about its marketing plans and increase sales. The bank can find the most promising customers to approach, and it can also understand customer behavior across various groups based on age, education, job, etc.

So we now have a clear idea of the exact problem associated with the dataset. Now, let’s dive into building the models-

Develop Your First Machine Learning Project in Python

In this part, we will walk through building the complete machine learning project with our dataset. 

Let’s take an overview of what we are going to do-

  1. Install and Check the Required Python Libraries for Machine Learning
  2. Load the Dataset
  3. Understand the Data through Visualization
  4. Preprocess the Data
  5. Evaluate Some Machine Learning Algorithms
  6. Build and Test the Final Machine Learning Model
  7. Evaluate the Machine Learning Model
  8. Save and Load the Machine Learning Model

1. Install and Check the Required Python Libraries for Machine Learning

For this tutorial, we are using Python 3. If you have not installed Python yet, you should download it from here. After setting up the Python environment, you need to install a few libraries from the scientific Python (SciPy) ecosystem. The packages used in this project are-

  • pandas 

  • numpy 

  • matplotlib 

  • scikit-learn

  • seaborn (used later for one plot)

You can download and install them from the official SciPy site. They provide excellent installation guidelines for every operating system. You should be able to set up your environment quite easily. If you run into any problems during installation, search Stack Overflow for your error or ask a question there. You will most likely find a solution.

My personal recommendation is to use Anaconda for machine learning. It is an easy and powerful platform for almost any machine learning task. For this project, I also used the Anaconda platform and the Spyder IDE.
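Before moving on, it can help to confirm that every library is importable. The snippet below is a minimal sketch, not part of the original project code: it tries to import each package by its import name (scikit-learn is imported as sklearn) and prints the installed version, or a notice if the package is missing.

```python
# Report the installed version of each library used in this tutorial.
import importlib
import sys

# Import names of the packages (scikit-learn is imported as "sklearn").
libraries = ["pandas", "numpy", "matplotlib", "sklearn", "seaborn"]

versions = {}
for name in libraries:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[name] = None  # the package is not installed yet

print("Python:", sys.version.split()[0])
for name, version in versions.items():
    print("{}: {}".format(name, version if version else "NOT INSTALLED"))
```

If any line reports NOT INSTALLED, install that package before continuing.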

2. Load the Dataset

We need to import most of the basic libraries first-

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

We will use the pandas library for loading and reading the dataset. Our dataset is in CSV (comma-separated values) format. With the following code, you can easily read the dataset (adjust the file name to match your downloaded copy).

df = pd.read_csv("bank-marketing-dataset.csv")

This will load the dataset into your program.
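A quick sanity check after loading is worth the effort. Some copies of this dataset (including, to my knowledge, the one distributed by the UCI repository) separate fields with semicolons rather than commas. If df.shape reports a single column, passing sep=";" to read_csv is the usual fix. The sketch below shows this on three made-up rows held in memory, so the values and the name df_sample are illustrative assumptions only.

```python
import io
import pandas as pd

# Three made-up rows in a semicolon-separated layout (illustrative only).
sample = io.StringIO(
    '"age";"job";"balance";"y"\n'
    '30;"unemployed";1787;"no"\n'
    '33;"services";4789;"no"\n'
    '35;"management";1350;"yes"\n'
)

# For a real file you would pass its path instead of the in-memory buffer.
df_sample = pd.read_csv(sample, sep=";")

print(df_sample.shape)   # (rows, columns)
print(df_sample.head())  # first rows, to confirm the columns were split
print(df_sample.dtypes)  # numeric columns should not show up as "object"
```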

3. Understand the Data through Visualization

As mentioned earlier, it is best practice to take a detailed look at your data before building the model. This helps you understand the problem better and build a proper model.

Let's check how our data is shaped-

print(df.shape)
Output:
(45211, 17)

This means we have 17 columns or attributes and a total of 45211 rows of values.

Quickly check the statistical summary of the numerical variables.

print(df.describe())

The output will tell the different statistics of our data. Let's check and understand-

[Figure: statistical summary of the numerical variables]

This is quite handy for us as we can see the different measures of our data.

This is a classification task. So let's check out the class distribution of the data.

print(df.groupby('y').size()) 
Output:
no 39922
yes 5289
dtype: int64

Here you can see that the classes are not balanced in number which makes it an imbalanced classification problem. But for the sake of simplicity, this tutorial will proceed with general classification techniques.
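To see how strong the imbalance is, the counts can be turned into proportions. The sketch below rebuilds the label column from the two counts printed above, so it runs without the CSV file. The name y_labels is just an illustrative choice. On the real data you could call df['y'].value_counts(normalize=True) directly.

```python
import pandas as pd

# Rebuild the label column from the class counts printed above.
y_labels = pd.Series(["no"] * 39922 + ["yes"] * 5289, name="y")

# normalize=True returns shares instead of raw counts.
proportions = y_labels.value_counts(normalize=True)
print(proportions.round(3))

# Share of the majority class: the accuracy of always predicting that class.
majority_share = proportions.max()
print("Majority class share: {:.3f}".format(majority_share))
```

The majority share is a useful reference point later on, because any classifier should at least beat the accuracy of always predicting the majority class.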

Now we come to the visualization part! Here we will generate some histogram plots to understand our numerical variables-

subset = df['age']
plt.figure()
plt.xlabel('Age')
plt.ylabel('Frequency')
subset.hist(range=[15,70], rwidth=0.95, color='blue', label='age')
plt.legend()

The output will look like the following-

[Figure: histogram of client age]

This histogram plot describes the variation of the age of the customers. It is clear that most of the customers are between the ages of 30 and 40.

You should try to plot all the variables individually to understand and describe them better.

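subset = df['balance']
plt.figure()
plt.xlabel('Balance')
plt.ylabel('Frequency')
# No range argument here: the [15,70] window suits ages, not balances
subset.hist(bins=50, rwidth=0.95, color='green', label='balance')
plt.legend()

[Figure: histogram of balance]

Judge the shape of the distribution from this plot together with the summary statistics from df.describe(). Account balances are typically right-skewed, with most clients near the low end and a long tail of large balances, so do not assume a normal distribution without checking.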

You can plot histograms for other numerical variables to get a better understanding of them.
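Instead of writing one block per column, pandas can draw a histogram for every numerical column in one call. The following is a minimal sketch on a synthetic stand-in DataFrame (the name demo and all values are made up). On the real data you would apply the same two lines to df.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in Spyder or Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real DataFrame (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=500),
    "balance": rng.exponential(scale=1500, size=500),
    "duration": rng.exponential(scale=250, size=500),
    "job": rng.choice(["admin.", "student", "retired"], size=500),
})

# Keep only the numerical columns, then draw one histogram per column.
numeric_cols = demo.select_dtypes(include="number").columns
axes = demo[numeric_cols].hist(bins=30, figsize=(10, 6), rwidth=0.95)
plt.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```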

Let's look at some bar plots for the categorical variables.

subset = df['job'].value_counts() 
subset.plot.barh(color='magenta') 
plt.xlabel('Frequency') 
plt.ylabel('Job Type') 
plt.legend()

[Figure: bar plot of job types]

From the above plot, we can see that blue-collar is the most common job type among the customers, and that there are relatively few students and housemaids in the dataset.

4. Preprocess the Data

Now, we are going to do some actual operations on the dataset and make the actual model. Let's go!

First, divide the dataset into feature matrix and dependent variable vector-

X = df.iloc[:, 0:16].values
y = df.iloc[:, 16].values

Machine learning algorithms are built on mathematical functions, so they can work only with numbers. But our dataset contains categorical values that cannot be used directly in the models. To convert the categories into numbers, we will encode them. Ordinal variables (e.g. education) have a natural order, so mapping them to integers is reasonable. Nominal variables (e.g. job, marital status, city) have no such order and are normally converted to dummy variables after encoding. For simplicity, this tutorial treats all categorical columns as ordinal and just encodes them as integers.

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder 
ordinalencoder_X = OrdinalEncoder(categories='auto') 
X = ordinalencoder_X.fit_transform(X) 

labelencoder_y = LabelEncoder() 
y = labelencoder_y.fit_transform(y)
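If you later want to treat nominal columns properly, one common option in scikit-learn is to one-hot encode them with OneHotEncoder inside a ColumnTransformer, while passing the numerical columns through unchanged. This is an alternative to the approach used in this tutorial, not what the rest of the code relies on. The sketch below uses a tiny made-up table, and the names toy and encoder are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up sample with two nominal columns and one numeric column.
toy = pd.DataFrame({
    "job": ["admin.", "blue-collar", "student", "blue-collar"],
    "marital": ["married", "single", "single", "divorced"],
    "age": [30, 45, 22, 51],
})

encoder = ColumnTransformer(
    transformers=[
        # One dummy (0/1) column per category of each nominal variable.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["job", "marital"]),
    ],
    remainder="passthrough",  # numeric columns are copied through unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

X_toy = encoder.fit_transform(toy)
print(X_toy.shape)
print(X_toy)
```

Here job and marital have three categories each, so the three original columns become seven: six dummy columns followed by age.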

Now, we need to separate the dataset. As you know we need training data to train the model and test data to evaluate the model's performance. So, split the dataset into training and test sets.

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.2, random_state = 1000)
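Because our classes are imbalanced, a purely random split can leave the test set with a slightly different share of yes answers than the training set. train_test_split accepts a stratify argument that keeps the class proportions the same in both parts. The sketch below demonstrates it on made-up labels. Using it in this project would be an optional variation, and it would change the numbers reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 negatives and 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 90/10 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1000, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```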

There is one more thing to handle in our dataset. From the plots, we can see that the features vary widely in magnitude. Many machine learning models, such as KNN and SVC, are sensitive to the scale of their inputs, so features with large values can dominate the prediction. We therefore scale the features to a common range to give our models standardized inputs.

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # fit on the training set only
X_test = sc_X.transform(X_test)  # reuse the training mean and std

This transforms each feature to have a mean of 0 and a standard deviation of 1, as measured on the training set.
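To make the transformation concrete, the sketch below applies StandardScaler to a few made-up numbers and repeats the calculation by hand with z = (x - mean) / std, where the mean and standard deviation come from the training data only. The arrays and names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [6.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training set
test_scaled = scaler.transform(test)        # reuses the training mean and std

# The same transformation written out by hand: z = (x - mean) / std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(manual.ravel())
```

The test values are scaled with the training statistics, which is why the test set is only transformed and never fitted.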

5. Evaluate Some Machine Learning Algorithms

Machine learning models can perform differently on different datasets, and we cannot say in advance which model will perform best. For this reason, we need to apply a number of different algorithms and compare their performance before choosing the best one to build our final model. Here, we are going to use 5 different algorithms for evaluation. They are-

  • Logistic Regression (LR)

  • K-Nearest Neighbors (KNN)

  • Decision Tree (DT)

  • Gaussian Naive Bayes (NB)

  • Support Vector Classifier (SVC)

Let's put all these algorithms together and see!

from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import StratifiedKFold 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.naive_bayes import GaussianNB 
from sklearn.svm import SVC 

models, names = list(), list() 

models.append(LogisticRegression()) 
names.append('LR') 
models.append(KNeighborsClassifier()) 
names.append('KNN') 
models.append( DecisionTreeClassifier()) 
names.append('DT') 
models.append(GaussianNB()) 
names.append('NB') 
models.append(SVC(gamma='auto')) 
names.append('SVC') 

results = list() 

for model, name in zip(models, names) :
	result = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
	results.append(result)

Here we evaluated all the algorithms with 10-fold cross-validation. Cross-validation is a technique where the training data is split into k parts called folds. k-1 folds are fed to the model for training and the remaining fold is used to validate the model. This is repeated k times so that each fold serves as the validation set once. We used 10 folds, which is a common default.

Check this tutorial to learn deeply about Cross-Validation.
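The fold mechanics described above can be written out by hand. The sketch below uses StratifiedKFold to produce the train/validation indices, fits a fresh copy of the model on k-1 folds, scores it on the held-out fold, and then checks that cross_val_score returns the same numbers. It runs on synthetic data from make_classification with 5 folds to keep it fast. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for X_train and y_train.
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in folds.split(X_cv, y_cv):
    fold_model = clone(model)                         # fresh, unfitted copy per fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])  # train on k-1 folds
    # Validate on the held-out fold (score returns accuracy for classifiers).
    manual_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))

# The same procedure in one call.
auto_scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="accuracy")

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```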

Now, it's time to compare the performance of all those machine learning algorithms we used.

for result, name in zip(results, names):
	print("Accuracy of {}: {}".format(name, result.mean()))
Output:
Accuracy of LR: 0.8924185523254249
Accuracy of KNN: 0.8937456189552286
Accuracy of DT: 0.8736727880999509
Accuracy of NB: 0.8354347770239355
Accuracy of SVC: 0.9003536565897031

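The mean accuracies suggest that all five algorithms perform reasonably on our dataset. Read these numbers with the class imbalance in mind, though: 39922 of the 45211 clients (about 88%) said no, so a model that always predicts no would already reach roughly 88% accuracy. Logistic Regression, KNN, and SVC score above that baseline, while the Decision Tree and Naive Bayes fall below it. Among all of the classifiers, the Support Vector Classifier stands out with the highest accuracy. Let's analyze the accuracies further through visualization.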

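import seaborn as sns
sns.set(style="whitegrid")
# One column per model, one row per cross-validation fold
results_df = pd.DataFrame(dict(zip(names, results)))
ax = sns.boxplot(data=results_df)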


From the above plot, it is quite clear that SVC has the least variation in accuracy across the 10 cross-validation folds. So, we can choose SVC as the classifier to build our final model.
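When classes are imbalanced, it is useful to put a trivial baseline through the same cross-validation as the real models. scikit-learn provides DummyClassifier for this. With strategy="most_frequent" it always predicts the majority class. The sketch below runs it on made-up data with a 90/10 class split. In this project you could append it to the models list instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up imbalanced data: 180 negatives, 20 positives, random features.
rng = np.random.default_rng(0)
X_base = rng.normal(size=(200, 3))
y_base = np.array([0] * 180 + [1] * 20)

# Always predicts the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_base, y_base, cv=5, scoring="accuracy")

print("Baseline accuracy: {:.3f}".format(baseline_scores.mean()))
```

A real model is only useful if its accuracy is clearly above this number.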

6. Build and Test the Final Machine Learning Model

Now, we've come to the most important part of our tutorial. We choose SVC for our final predictive model, as it gave the best accuracy among the five algorithms we evaluated. So we will train an SVC classifier to make predictions on the bank marketing data.

model = SVC(gamma='auto')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Congratulations! We have built our final model, and now it is ready to make predictions.

7. Evaluate the Machine Learning Model

We should evaluate how the model performs on data it has not seen during training. There are many metrics for evaluating the performance of machine learning models, and it is always a good idea to use more than one metric. For our model, we used accuracy, F1 score, and ROC AUC score.

Check this tutorial to understand more about classification model evaluation.

from sklearn.metrics import accuracy_score 
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
print('Accuracy :       ', accuracy_score(y_test, y_pred))
print('F-Measure :      ', f1_score(y_test, y_pred, average = 'weighted'))
print('ROC Score :      ', roc_auc_score(y_test, y_pred))
Output:
Accuracy :        0.8942828707287405
F-Measure :       0.8744662781745784
ROC Score :       0.6225741327735979
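The gap between the accuracy and the ROC score is easier to interpret with a confusion matrix. Note also that roc_auc_score is normally given continuous scores (for an SVC, the output of model.decision_function(X_test) would be one option). When it receives hard 0/1 predictions, as above, the result equals the balanced accuracy. The sketch below shows both points on six made-up clients, so all values are illustrative.

```python
from sklearn.metrics import (
    balanced_accuracy_score,
    confusion_matrix,
    roc_auc_score,
)

# Made-up labels, hard predictions and continuous scores for six clients.
y_true = [0, 0, 0, 0, 1, 1]
y_hard = [0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]

cm = confusion_matrix(y_true, y_hard)               # rows: true class, columns: predicted class
auc_hard = roc_auc_score(y_true, y_hard)            # AUC computed from 0/1 predictions
bal_acc = balanced_accuracy_score(y_true, y_hard)   # mean of the per-class recalls
auc_score = roc_auc_score(y_true, y_score)          # AUC computed from continuous scores

print(cm)
print("AUC from hard predictions:", auc_hard)
print("Balanced accuracy:        ", bal_acc)
print("AUC from scores:          ", auc_score)
```

Read this way, a ROC score near 0.62 alongside an accuracy near 0.89 suggests that the model does much better on the majority class than on the yes class.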

8. Save and Load the Machine Learning Model

We have almost completed the project. Now, we need to save the model. Otherwise, every time we want to make new predictions, we have to run the whole code again, and training takes a lot of time. So, we should save the model so that we can use it without training it again and again.

Python offers many convenient ways to save a trained model and load it without training it every time. In this tutorial, we are going to use pickle, a Python module for object serialization and de-serialization. With the help of this module, we can save our model to the disk and load it anytime.

# save the model
import pickle
filename = 'bank-marketing-project.sav'
with open(filename, 'wb') as f:  # the file is closed automatically
	pickle.dump(model, f)

Now, you can load and make predictions from this file any time without training from the beginning.

# load the model from disk
with open(filename, 'rb') as f:
	loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print("Model Accuracy: {}".format(result))
Output:
Model Accuracy: 0.8942828707287405
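One thing to keep in mind: the saved SVC expects scaled inputs, so new data must go through the same fitted scaler before prediction. A common way to avoid losing track of the scaler is to bundle both steps in a scikit-learn pipeline and pickle that single object. The sketch below is one possible variation on the approach above, shown on synthetic data. The file name and variable names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the encoded, unscaled bank features.
X_raw, y_raw = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaler and classifier are fitted, saved and loaded as one object.
pipeline = make_pipeline(StandardScaler(), SVC(gamma="auto"))
pipeline.fit(X_raw, y_raw)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "bank-marketing-pipeline.sav")
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline scales and predicts exactly like the original.
same = np.array_equal(pipeline.predict(X_raw), restored.predict(X_raw))
print("Predictions identical after reload:", same)
```

Only load pickle files that you created yourself or that come from a source you trust, because unpickling can execute arbitrary code.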


Final Thoughts

Congratulations! If you have gone through all these steps, you have built quite a good machine learning project in Python. Now, you can dive deeper into other machine learning concepts and apply this knowledge to them. Try finding some new datasets and use this code to analyze them. This will greatly boost your understanding of machine learning projects. You should also tweak the above code to get a different perspective on your data. Don't be afraid to experiment.

Do you have ideas for making this tutorial more useful? Please let us know in the comment box.

Happy Machine Learning!




© aionlinecourse.com All rights reserved.