Build a Customer Churn Prediction Model using Decision Trees

Welcome to the customer churn prediction project! Churn prediction helps businesses keep their customers happy and engaged. By using machine learning, we can predict if a customer will leave. In this project, we'll dive into the power of decision trees to build a simple yet effective churn prediction model.

Project Overview

The objective of this project is to predict customer churn using machine learning techniques. First, we perform exploratory data analysis on a dataset containing many customer characteristics and prepare it for modeling. The primary algorithm is the Decision Tree Classifier, a widely used and easily interpreted algorithm for classification tasks. To see which of the two approaches performs better, we also train a Logistic Regression model for comparison.

We also use SMOTE (Synthetic Minority Over-sampling Technique) to address the class imbalance problem by generating synthetic samples of the minority class. The data is split into training and testing sets, and SMOTE is applied to the training set only, so the test set keeps its original class distribution. We assess the models with standard measures such as ROC-AUC, the confusion matrix, accuracy, precision, recall, and F1-score to make sure they are useful in predicting customer churn.
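
To make the idea of synthetic minority samples concrete, here is a simplified sketch of the interpolation step behind SMOTE, written in plain NumPy. It is not the imbalanced-learn implementation used later in this project; the function name smote_like_samples, the toy rows, and the choice of k=3 neighbours are assumptions made for illustration.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        # Pick a random minority row as the starting point.
        base = minority[rng.integers(len(minority))]
        # Find its k nearest minority neighbours; position 0 is the row itself
        # (distance 0, assuming no duplicate rows), so it is skipped.
        dists = np.linalg.norm(minority - base, axis=1)
        neighbour_idx = np.argsort(dists)[1:k + 1]
        neighbour = minority[rng.choice(neighbour_idx)]
        # Place the new row at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic[i] = base + gap * (neighbour - base)
    return synthetic

# Hypothetical minority-class rows with two features
minority_rows = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.9]])
new_rows = smote_like_samples(minority_rows, n_new=4)
print(new_rows.shape)
```

Because every new row lies between two existing minority rows, the synthetic samples stay inside the region the minority class already occupies rather than being exact copies.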

Prerequisites

You will need a few skills before undertaking this project. Here’s what you should ideally know:

Basic knowledge of Python for data analysis and manipulation.
Knowledge of libraries such as Pandas and NumPy for data manipulation and Matplotlib for data visualization.
Understanding of data preprocessing steps such as handling missing values, normalization, and scaling.
Familiarity with exploratory data analysis (EDA) to find patterns and trends in datasets.
Basic understanding of the Decision Tree algorithm and how predictive modeling works.
Experience with machine learning frameworks such as Scikit-Learn for building, training, and assessing models.

Approach

The first phase of this customer churn prediction project involves loading and analyzing the dataset to become familiar with it and to identify any inconsistencies or missing values. We then clean the dataset by handling the missing values and encoding the categorical features. Next, we split the data into training and testing sets so that model performance is evaluated on data the model has not seen. To balance the classes, we apply SMOTE to the training set, which generates synthetic samples of the underrepresented class. The Decision Tree Classifier is then trained on the training set to learn the patterns that predict customer churn, while a Logistic Regression model is trained for comparison. We assess how well each model performs using ROC-AUC, accuracy, precision, recall, and F1 scores. Once the results are available, we tune the hyperparameters to improve the model.

Workflow and Methodology

Workflow

Data Collection: Collect the dataset from the public dataset repository, and load it into a DataFrame in Pandas for further analysis.
Data Cleaning: Handle missing values, encode categorical data, and check that each column has the right data type so that the data is ready for modeling.
Train-Test Split: Split the dataset into training and testing sets to evaluate model performance.
Handling Imbalanced Data: Use SMOTE on the training set to generate synthetic samples for the minority class, balancing the classes.
Model Building: Train a Decision Tree model and a Logistic Regression model on the prepared data.
Model Evaluation: Evaluate the models using ROC-AUC, the confusion matrix, accuracy, precision, recall, and F1-score.
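
As a worked example of the evaluation step, the sketch below derives accuracy, precision, recall, and F1-score from the four cells of a confusion matrix. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not results from this project.

```python
# Hypothetical confusion-matrix counts for a churn classifier (positive class = churned)
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted churners, how many churned
recall = tp / (tp + fn)                     # of the actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.850 precision=0.800 recall=0.667 f1=0.727
```

Note how accuracy looks strong at 0.85 even though one third of the churners are missed, which is why recall and F1-score are reported alongside it on imbalanced data.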

Methodology

The procedure is sequential and begins with exploring and cleaning the data. After cleaning, the data is split and SMOTE is applied to balance the classes in the training set. A Decision Tree Classifier is then fitted and its performance is compared with that of Logistic Regression. Assessment criteria such as ROC-AUC and F1-score help determine how effective each model is in practice. In the end, the better-performing model is used to make predictions on new data.
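
The comparison described above can be sketched on synthetic data as follows. This is a minimal illustration, not the project's final code: the data comes from scikit-learn's make_classification rather than the churn dataset, the resampling step is left out to keep it short, and max_depth=4, max_iter=1000, and the stratified split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (roughly 15% positives)
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    pred = model.predict(X_test)               # hard 0/1 predictions
    results[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Both models are scored on the same held-out test set, so the numbers are directly comparable.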

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find datasets of this kind in public repositories such as Kaggle or the UCI Machine Learning Repository, or use company-specific data. We provide the dataset with this project so that you can follow along with the same data.

Data Preparation Workflow:

Load the Data: Load the dataset into a Pandas DataFrame for analysis.
Exploratory Data Analysis (EDA): Examine the data’s structure and distribution.
Handle Missing Values: Handle null values by either filling them in or dropping the affected rows to obtain a complete dataset.
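
The preparation steps above can be tried on a small example first. The sketch below uses a hypothetical four-row DataFrame (the column names mirror ones used later in this project, but the values are made up) and shows both options for missing values: dropping incomplete rows and filling with column medians.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with a few missing values
toy = pd.DataFrame({
    "age": [34, 45, np.nan, 29],
    "weekly_mins_watched": [120.0, np.nan, 300.0, 80.0],
    "churn": [0, 1, 0, 1],
})

print(toy.shape)         # number of rows and columns
print(toy.isna().sum())  # missing values per column

dropped = toy.dropna()                              # option 1: remove incomplete rows
filled = toy.fillna(toy.median(numeric_only=True))  # option 2: fill with column medians

print(dropped.shape)              # two complete rows remain
print(filled.isna().sum().sum())  # no missing values left
```

Dropping rows is the simpler choice and is what this project does; filling keeps more rows at the cost of introducing estimated values.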

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Installing Necessary Libraries

This code installs the libraries used for machine learning and data analysis: imbalanced-learn for handling imbalanced datasets, NumPy and Pandas for data processing, Matplotlib for plotting, and scikit-learn for modeling.

!pip install imbalanced-learn
!pip install numpy
!pip install matplotlib
!pip install pandas
!pip install scikit-learn

Importing Required Libraries

This snippet imports the libraries required for data analysis, model building, and evaluation: NumPy and Pandas for handling data, Seaborn and Matplotlib for plotting, and Scikit-learn and Imbalanced-learn for machine learning tasks such as classification, feature selection, and working with imbalanced data. It also imports the evaluation metrics, for example accuracy_score, roc_auc_score, and confusion_matrix.

# Standard library
import warnings
# Data handling and plotting
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# Resampling, feature selection, models, and metrics
from imblearn.over_sampling import SMOTE
from sklearn import tree
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

STEP 2:

Loading Data and Checking Dimensions:

This code loads the CSV file. After loading the dataset it prints the dataset’s shape to check the number of rows and columns. The %time magic command in the notebook records the time taken to perform the task.

# %time must prefix the statement it measures; on a line by itself it times nothing
%time Aionlinecourse = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_4/data_regression.csv")
print(Aionlinecourse.shape)

Previewing Data

This block of code displays the first few rows of the dataset to have a quick overview of the structure of the dataset.

Aionlinecourse.head()

This code provides a summary of the DataFrame Aionlinecourse, displaying the number of records, the column names and data types, the count of non-null values, and the memory usage.

Aionlinecourse.info()

STEP 3:

Data Visualization of Churn Analysis

This code creates a 2x2 grid of visualizations: the distribution of customer ages, a feature correlation matrix, a count plot of churn by customer gender, and a box plot of weekly minutes watched for customers who churned and those who did not. The layout is adjusted so that the plots fit the page without overlapping.

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Distribution of Age
sns.histplot(Aionlinecourse['age'], bins=20, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Customer Age')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
# Correlation Matrix Heatmap
# Select only numerical features for correlation calculation
numerical_df = Aionlinecourse.select_dtypes(include=np.number)
correlation_matrix = numerical_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", ax=axes[0, 1])
axes[0, 1].set_title('Correlation Matrix of Numerical Features')
# Gender vs Churn
sns.countplot(x='gender', hue='churn', data=Aionlinecourse, ax=axes[1, 0])
axes[1, 0].set_title('Churn Rate by Gender')
axes[1, 0].set_xlabel('Gender')
axes[1, 0].set_ylabel('Count')
# Weekly Minutes Watched vs. Churn
sns.boxplot(x='churn', y='weekly_mins_watched', data=Aionlinecourse, ax=axes[1, 1])
axes[1, 1].set_title('Weekly Minutes Watched vs. Churn')
axes[1, 1].set_xlabel('Churn')
axes[1, 1].set_ylabel('Weekly Minutes Watched')
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

Handling Class Imbalance with SMOTE

This function prepares the data for modeling and applies SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance. It selects the numeric features, excludes the target and any specified columns, splits the data into train and test sets, and then applies SMOTE to up-sample the minority class of the training set until both classes are equally represented.

# Synthetic Minority Oversampling Technique: generates new instances from existing minority cases.
def prepare_model_smote(df, class_col, cols_to_exclude):
    # Keep numeric columns only, then drop the target and any excluded columns
    cols = df.select_dtypes(include=np.number).columns.tolist()
    X = df[cols]
    X = X[X.columns.difference([class_col])]
    X = X[X.columns.difference(cols_to_exclude)]
    y = df[class_col]
    # Expose the splits as module-level globals
    global X_train, X_test, y_train, y_test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    # Oversample the minority class in the training set only (1.0 = equal class sizes)
    sm = SMOTE(random_state=0, sampling_strategy=1.0)
    X_train, y_train = sm.fit_resample(X_train, y_train)

Cleaning Data by Removing Missing Values

This code drops the rows that contain null (missing) values so that incomplete records are not used in analysis and modeling.

df = Aionlinecourse.dropna() # cleaning up null values

Checking Data Shape

This code prints the dimensions (number of rows and columns) of the cleaned DataFrame df.

print(df.shape)