
Credit Card Default Prediction Using Machine Learning Techniques

This project develops and evaluates machine learning models for predicting customer defaults, helping businesses assess the risk that a customer will fail to repay a loan or credit balance. The project covers data preparation, handling of imbalanced classes with techniques such as SMOTE oversampling and downsampling of the majority class, and feature engineering steps such as Box-Cox transformations to reduce skewness. To make the models more interpretable, we use SHAP and LIME to explain how they arrive at their predictions. Together, these steps are intended to improve both the performance and the transparency of the default prediction models.

Project Overview

The project begins with data preprocessing: handling missing values, converting categorical variables to numerical format with Label Encoding, and applying a Box-Cox transformation to skewed features where applicable. Next comes feature engineering, where additional features are created to capture relationships among existing ones; for example, the delinquency columns are combined and financial ratios are computed. A range of classification models is then implemented, including Random Forest, XGBoost, Logistic Regression, and LightGBM. Hyperparameter tuning is combined with class balancing techniques such as SMOTE and class weight adjustment to mitigate the effects of the imbalanced dataset. The models are assessed on key metrics including accuracy, precision, recall, F1 score, and AUC-ROC to determine how effective they are. To make the predictions easier to interpret, LIME and SHAP are used to show how individual features contribute to each prediction. Finally, the project serves as a case study in combining multiple algorithms and preprocessing techniques to predict customer defaults for the benefit of financial institutions.

Prerequisites

  • Python Programming: Ability to use basic Python programming skills to implement algorithms and manipulate data.
  • Machine Learning Basics: Basic concepts of classification algorithms, evaluation metrics, and how to prevent overfitting.
  • Pandas and NumPy: Skills in data manipulation and basic numerical computation.
  • Scikit-learn: Familiarity with this widely used machine learning library for data preprocessing, training, and evaluation.
  • SMOTE and Downsampling/Undersampling Techniques: Familiarity with techniques for handling imbalanced data to improve model performance.
  • SHAP and LIME: Knowledge of tools used to understand and explain predictions and feature contributions.
  • LightGBM, XGBoost, Random Forest, Logistic Regression: Familiarity with these widely used classification algorithms.

Approach

The procedure starts with Data Preprocessing, where the dataset is cleaned and prepared by handling missing values, encoding categorical variables with Label Encoding, and applying Box-Cox transformations to skewed numerical features so the data is ready for machine learning algorithms. During feature engineering, additional features are created from interactions among the most important existing features, such as combining the delinquency columns and computing financial metrics like Debt Ratio and Revolving Utilization.
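
Below is a minimal sketch of the Box-Cox step, assuming a pandas DataFrame named train that contains a skewed column such as 'MonthlyIncome' (the column choice is illustrative, not the project's exact code).

from scipy import stats
from scipy.stats import skew

income = train['MonthlyIncome'].dropna()
print('Skewness before:', skew(income))

# Box-Cox requires strictly positive values, so shift by 1 before transforming
transformed, fitted_lambda = stats.boxcox(income + 1)
print('Skewness after :', skew(transformed))
print('Fitted lambda  :', fitted_lambda)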

To address the class imbalance in the dataset, we oversample the minority class with SMOTE (Synthetic Minority Over-sampling Technique) and reduce the size of the majority class through down-sampling. Once the data is prepared, we select several machine learning algorithms, including Random Forest, XGBoost, Logistic Regression, and LightGBM, and train them on the processed data.
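
The snippet below is a minimal sketch of this balancing step using imbalanced-learn, assuming a numeric, fully imputed feature matrix X_train and labels y_train; the sampling ratios shown are illustrative rather than the project's final settings.

from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Oversample the minority class to 30% of the majority class,
# then undersample the majority class down to roughly a 2:1 ratio
over = SMOTE(sampling_strategy=0.3, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
resampler = Pipeline(steps=[('over', over), ('under', under)])

X_balanced, y_balanced = resampler.fit_resample(X_train, y_train)
print('Class counts after resampling:', Counter(y_balanced))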

The trained models are then assessed on a number of performance measures, including accuracy, precision, recall, F1 score, and AUC-ROC, to establish how well they predict the likelihood of a customer defaulting; the best-performing models are further improved through hyperparameter tuning, and the final model is chosen based on these results.
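
As a rough sketch of this evaluation step, the code below assumes a fitted classifier named model and a held-out split X_test / y_test; the metric choices match those listed above.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = model.predict(X_test)               # hard class predictions
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive (default) class

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))
print('AUC-ROC  :', roc_auc_score(y_test, y_proba))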

To keep the models transparent and interpretable, we use SHAP and LIME for model explainability, which helps show how different features affect the predictions. Finally, the results are reviewed and conclusions are drawn that provide practical recommendations an organization can use to mitigate the risk of customer default.
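
A minimal explainability sketch is shown below, assuming a fitted tree-based model (e.g., LightGBM or XGBoost) and pandas DataFrames X_train / X_test; this is not the project's exact code, and the class names are illustrative.

import shap
from lime.lime_tabular import LimeTabularExplainer

# SHAP: global feature importance for a tree-based model
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

# LIME: local explanation for a single prediction
lime_explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=['No Default', 'Default'],
    mode='classification'
)
explanation = lime_explainer.explain_instance(X_test.values[0], model.predict_proba)
explanation.show_in_notebook()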

Workflow and Methodology

  1. Data Collection and Preprocessing:
    • The project begins by gathering and preparing the dataset for analysis.
    • Missing values are handled appropriately, categorical variables are encoded using Label Encoding, and skewed features are transformed using the Box-Cox technique.
  2. Feature Engineering:
    • New features are engineered to capture meaningful interactions among the existing features, for example by combining the delinquency columns and deriving metrics such as the Debt Ratio and the Revolving Utilization.
    • Class imbalance is then mitigated through SMOTE (over-sampling the under-represented class) and random down-sampling (of the over-represented class).
  3. Model Selection and Training:
    • Several classifiers, including Random Forest, XGBoost, Logistic Regression, and LightGBM, are selected for the task.
    • Each model is trained on the preprocessed data with hyperparameters tuned for the best fit, while class balancing methods such as class weights or SMOTE are applied to improve predictions.
  4. Model Evaluation:
    • Finally, each fitted model is evaluated using standard metrics, namely accuracy, precision, recall, F1 score, and AUC-ROC, to gauge its effectiveness.
    • Cross-validation is also performed to check how robust the models are (a minimal sketch follows this list).
  5. Model Explainability:
    • The predictions and the importance of each feature are then explained using SHAP and LIME. This shows how different features influence the predictions, giving more clarity into the model's behavior.
  6. Performance Evaluation:
    • The models are empirically analyzed based on the evaluation measures, and the most accurate models are selected for implementation or enhancement.
  7. Conclusion and Recommendations:
    • The analysis gives a picture of which models and strategies seem to yield the best results for predicting customer defaults.
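
The following is a minimal sketch of the cross-validation check mentioned in step 4, assuming a classifier named model and training data X_train / y_train; the number of folds and the scoring metric are illustrative choices.

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds preserve the class ratio in each split, which matters for imbalanced data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=cv)
print('AUC per fold:', scores)
print('Mean AUC: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))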

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.

Data Preparation Workflow:

  1. Loading the Dataset: Downloading the dataset and examining its contents to gain insight into the types of data and descriptive statistics present.
  2. Handling Missing Values: Assess the extent of missing data and impute or remove it as appropriate, depending on the nature and distribution of the data (a minimal sketch of steps 2-4 follows this list).
  3. Handling Categorical Variables: Apply Label Encoding or One-Hot Encoding to represent categorical variables as numerical variables.
  4. Feature Engineering: Derive additional variables from existing ones, such as financial ratios deemed important.
  5. Addressing Class Imbalance: Apply SMOTE, downsampling, or class weight adjustments to remedy the problem of disproportionate classes.
  6. Feature Transformation: Adjust skewed distributions by normalizing or standardizing the data or by applying a Box-Cox transformation.
  7. Dataset Partitioning: Use either train_test_split or cross-validation to partition the available data into training datasets and test datasets.
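
A minimal sketch of steps 2-4 is shown below, assuming a DataFrame train with the columns used later in this project; the median imputation, the hypothetical 'EmploymentType' column, and the 'CombinedDefaults' feature name are illustrative choices, not the project's exact code.

from sklearn.preprocessing import LabelEncoder

# Step 2: impute missing values (the median is a common choice for skewed income data)
train['MonthlyIncome'] = train['MonthlyIncome'].fillna(train['MonthlyIncome'].median())
train['NumberOfDependents'] = train['NumberOfDependents'].fillna(train['NumberOfDependents'].median())

# Step 3: label-encode a categorical column if one exists ('EmploymentType' is hypothetical)
# encoder = LabelEncoder()
# train['EmploymentType'] = encoder.fit_transform(train['EmploymentType'])

# Step 4: derive a combined delinquency feature from the existing past-due columns
train['CombinedDefaults'] = (train['NumberOfTime30-59DaysPastDueNotWorse']
                             + train['NumberOfTime60-89DaysPastDueNotWorse']
                             + train['NumberOfTimes90DaysLate'])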

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Library Installation

This code installs the required libraries: SHAP for model-level explanations, LIME for explaining individual predictions, Keras for building deep learning models, XGBoost and LightGBM for gradient boosting, and Imbalanced-learn for handling imbalanced datasets.

!pip install shap
!pip install lime
!pip install keras
!pip install xgboost
!pip install lightgbm
!pip install imblearn

Ignore Warning

The filterwarnings('ignore') function prevents any warnings from being shown during the execution of the program. This can be useful when you don't want the warnings to clutter the output, but keep in mind that ignoring warnings can sometimes hide important information about potential issues in your code.

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')

Importing necessary Libraries

This code imports the libraries needed for the program: SHAP and LIME for explainability, math for calculations, Keras and TensorFlow for deep learning architectures, NumPy and Pandas for data handling, XGBoost and LightGBM for gradient boosting, scikit-learn and imbalanced-learn for preprocessing, resampling, modeling, and evaluation, and Matplotlib and Seaborn for plotting and presenting the data.

import shap
import math
import keras
import numpy as np
import pandas as pd
import xgboost as xgb
import seaborn as sns
import tensorflow as tf
import keras.backend as K
import matplotlib.pyplot as plt
from keras import models
from keras import layers
from sklearn.svm import SVC
from scipy.stats import skew
from matplotlib import pyplot
from collections import Counter
from scipy.stats import kurtosis
from scipy import stats, special
from xgboost import XGBClassifier
from sklearn.utils import resample
from lightgbm import LGBMClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from lime.lime_tabular import LimeTabularExplainer
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import KFold,StratifiedKFold
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, roc_curve, auc, classification_report, recall_score, precision_score, f1_score, roc_auc_score

STEP 2:

Loading Data and Checking Dimensions:

This code loads the CSV file. After loading the dataset it prints the dataset’s shape to check the number of rows and columns.

Aionlinecourse_data = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_5/Data/cs-training.csv")
print(Aionlinecourse_data.shape)

The purpose of this code is to provide a summary of the DataFrame Aionlinecourse_data by displaying the number of records, column names, column types, count of non-null values, and memory usage.

Aionlinecourse_data.info()

Calculating missing values percentage

This code calculates the percentage of missing values in each column of Aionlinecourse_data, rounded to two decimal places, making it easy to spot features with missing data.

round(Aionlinecourse_data.isnull().sum(axis=0)/len(Aionlinecourse_data)*100, 2)

Previewing Data

This block of code displays the first few rows of the dataset to have a quick overview of the structure of the dataset.

Aionlinecourse_data.head()

Checking Unique Borrowers

This code computes the ratio of unique borrower IDs (the 'Unnamed: 0' column) to the total number of records, confirming whether each row corresponds to a unique borrower.

# Checking the unique number of borrowers
Aionlinecourse_data['Unnamed: 0'].nunique()/len(Aionlinecourse_data)

Analyzing Target Variable

In this section, we analyze the target variable, the column 'SeriousDlqin2yrs', which indicates serious delinquency. The code below also prints the percentage of borrowers who fall into serious delinquency, giving a sense of the target class distribution.

# Target Variable
print(Aionlinecourse_data['SeriousDlqin2yrs'].unique())
print()
print('{:.2f}% of the borrowers fall into serious delinquency'.format((Aionlinecourse_data['SeriousDlqin2yrs'].sum()/len(Aionlinecourse_data))*100))

STEP 3:

Visualizing Target Variable

This snippet generates two visualizations of the target variable 'SeriousDlqin2yrs'. The first is a pie chart showing the share of serious delinquency, and the second is a count plot showing the distribution of the target classes in the sample.

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Pie chart
Aionlinecourse_data['SeriousDlqin2yrs'].value_counts().plot.pie(
    explode=[0, 0.1],
    autopct='%1.1f%%',
    ax=axes[0],
    colors=['skyblue', 'lightcoral']
)
axes[0].set_title('SeriousDlqin2yrs')
# Count plot
sns.countplot(
    x='SeriousDlqin2yrs',
    data=Aionlinecourse_data,
    ax=axes[1],
    palette=['skyblue', 'lightcoral'],
    hue='SeriousDlqin2yrs'
)
axes[1].set_title('SeriousDlqin2yrs')
# axes[1].legend_.remove()  # Removing legend if it's not needed
plt.show()

Counting Target Variable Occurrences

This code counts the occurrences of each unique value in the 'SeriousDlqin2yrs' column, showing how many borrowers fall into each category of serious delinquency.

Aionlinecourse_data['SeriousDlqin2yrs'].value_counts()

Descriptive Statistics

This code displays a summary of the numerical variables in the DataFrame, including the mean, standard deviation, minimum, maximum, and quartiles.

Aionlinecourse_data.describe()

Dividing Train and Test dataset

In this code, the target variable 'SeriousDlqin2yrs' is separated from the feature columns, and the dataset is split into training and testing sets for model training and evaluation (80% training, 20% testing). The shape of each resulting split is then displayed.

data = Aionlinecourse_data.drop(columns = ['SeriousDlqin2yrs'], axis=1)
y = Aionlinecourse_data['SeriousDlqin2yrs']
data_train, data_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=42)
data_train.shape, data_test.shape, y_train.shape, y_test.shape

Calculating Event Rate

This code determines the event rate (the proportion of borrowers with serious delinquency) for the training, test, and overall datasets, which helps compare the prevalence of the target variable across the splits.

print('Event rate in the training dataset : ',np.mean(y_train))
print()
print('Event rate in the test dataset : ',np.mean(y_test))
print()
print('Event rate in the entire dataset : ',np.mean(y))

Combining Features and Target for Training

In this snippet, we concatenate the target variable y_train with the feature set data_train to form the complete training dataset, ready for model training, and then print the shape of the combined dataset.

train = pd.concat([data_train, y_train], axis=1)
train.shape

Combining Features and Target for Test Dataset

In this snippet, we concatenate the target variable y_test with the feature set data_test to form the complete test dataset, ready for model evaluation, and then print the shape of the combined dataset.

test = pd.concat([data_test, y_test], axis=1)
test.shape

Creating Histogram and Box plot for Distribution

This function generates two plots for a specified column: a histogram to show the distribution and a boxplot to help identify outliers. It also computes and displays the skewness and kurtosis of the data, which describe the shape and tails of the distribution.

def plot_hist_boxplot(column):
    fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(train[train[column].notnull()][column], kde=True, ax=ax1)
    sns.boxplot(y=train[train[column].notnull()][column], ax=ax2)
    print("skewness : ", skew(train[train[column].notnull()][column]))
    print("kurtosis : ", kurtosis(train[train[column].notnull()][column]))
    plt.show()

Plotting Count and Boxplot for Categorical Data

This function produces two plots for a given categorical column: a countplot showing how often each category occurs, and a boxplot showing how the values are spread and highlighting any outliers. It also computes and displays skewness and kurtosis, which describe the asymmetry and tail heaviness of the distribution.

def plot_count_boxplot(column):
    fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 6))
    # pass the series as x explicitly; newer seaborn versions expect keyword arguments
    sns.countplot(x=train[train[column].notnull()][column], ax=ax1)
    sns.boxplot(y=train[train[column].notnull()][column], ax=ax2)
    print("skewness : ", skew(train[train[column].notnull()][column]))
    print("kurtosis : ", kurtosis(train[train[column].notnull()][column]))
    plt.show()

Creating Histograms and Boxplots for Several Columns

This code applies the plot_hist_boxplot function to several columns of the training dataset. For each column, a histogram and a boxplot are generated to observe the distribution and any outliers, and skewness and kurtosis are computed to describe the shape and tails of the distribution. These distributional properties help in understanding the features present in the data.

plot_hist_boxplot('RevolvingUtilizationOfUnsecuredLines')
plot_hist_boxplot('age')
plot_hist_boxplot('DebtRatio')
plot_hist_boxplot('MonthlyIncome')
plot_hist_boxplot('NumberOfOpenCreditLinesAndLoans')
plot_hist_boxplot('NumberRealEstateLoansOrLines')
plot_hist_boxplot('NumberOfDependents')
plot_hist_boxplot('NumberOfTimes90DaysLate')
plot_hist_boxplot('NumberOfTime30-59DaysPastDueNotWorse')

Calculating Skewness and Kurtosis for Multiple Columns

This code computes the skewness and kurtosis of several columns and places the results in a DataFrame, which is then sorted by skewness in descending order to highlight the most skewed features. This helps characterize the shape and tail behavior of the features in the dataset.

cols_for_stats = ['RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents']
skewness = []; kurt = []
for column in cols_for_stats:
    skewness.append(skew(train[train[column].notnull()][column]))
    kurt.append(kurtosis(train[train[column].notnull()][column]))
# use a distinct name so the scipy.stats import is not shadowed
stats_df = pd.DataFrame({'Skewness': skewness, 'Kurtosis': kurt}, index=cols_for_stats)
stats_df.sort_values(by=['Skewness'], ascending=False)

Analyzing Unique Values in Training Data

This code checks and prints the distinct values in the '30-59 Days', '60-89 Days', and '90 Days' columns, both when the '30-59 Days' value is >= 90 and when it is < 90. This helps in understanding how these delinquency columns behave in the training dataset.

print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                          ['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                       ['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                    ['NumberOfTimes90DaysLate']))
print("Unique values in '30-59 Days' values that are less than 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                          ['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                           ['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                        ['NumberOfTimes90DaysLate']))
print("Proportion of positive class with special 96/98 values:",
      round(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs'].sum()*100/
      len(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs']),2),'%')

Changing Delinquency Values and Validating Uniqueness of Values

The code below modifies the delinquency columns in the training set: values of '30-59 Days', '60-89 Days', and '90 Days' that are greater than or equal to 90 (the special 96/98 codes) are replaced with fixed values (12, 11, and 17 respectively). It then prints the unique values in these columns after the change to confirm the data is consistent for the analysis.

train.loc[train['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 12
train.loc[train['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
train.loc[train['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 17
print("Unique values in 30-59Days", np.unique(train['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(train['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(train['NumberOfTimes90DaysLate']))

Analyzing Unique Values in Test Data

This code checks and prints the distinct values in the '30-59 Days', '60-89 Days', and '90 Days' columns, both when the '30-59 Days' value is >= 90 and when it is < 90. This helps in understanding how these delinquency columns behave in the test dataset.

print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                          ['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                       ['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                    ['NumberOfTimes90DaysLate']))
print("Unique values in '30-59 Days' values that are less than 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                          ['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                           ['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                        ['NumberOfTimes90DaysLate']))

Changing Delinquency Values and Validating Uniqueness of Values

The code below modifies the delinquency columns in the test set: values of '30-59 Days', '60-89 Days', and '90 Days' that are greater than or equal to 90 are replaced with fixed values (13, 7, and 15 respectively). It then prints the unique values in these columns after the change to confirm the data is consistent for the analysis.

test.loc[test['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
test.loc[test['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 7
test.loc[test['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 15
print("Unique values in 30-59Days", np.unique(test['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(test['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(test['NumberOfTimes90DaysLate']))

Summarizing Key Features

This code generates summary statistics (mean, min, max, etc.) for 'DebtRatio' and 'RevolvingUtilizationOfUnsecuredLines' in the training dataset, which helps in understanding the range and distribution of these important variables before further analysis.

print('Debt Ratio: \n',train['DebtRatio'].describe())
print('\nRevolving Utilization of Unsecured Lines: \n',train['RevolvingUtilizationOfUnsecuredLines'].describe())

Analyzing High Debt Ratio Values

This code filters the training set to rows where 'DebtRatio' is at or above the 95th percentile and then prints summary statistics for 'SeriousDlqin2yrs' (the target variable) and 'MonthlyIncome'. This helps analyze the borrowers with the highest debt ratios in terms of their income and delinquency status.

train[train['DebtRatio'] >= train['DebtRatio'].quantile(0.95)][['SeriousDlqin2yrs','MonthlyIncome']].describe()

Counting Specific Conditions

This code counts the records where 'DebtRatio' exceeds the 95th percentile and 'SeriousDlqin2yrs' equals 'MonthlyIncome', a combination that is likely a data-entry error. Knowing how many such records exist helps identify inconsistencies in the dataset.

train[(train["DebtRatio"] > train["DebtRatio"].quantile(0.95)) & (train['SeriousDlqin2yrs'] == train['MonthlyIncome'])].shape[0]

Elimination of Certain Conditions from the Dataset

This code removes from the training dataset the records where 'DebtRatio' exceeds the 95th percentile and 'SeriousDlqin2yrs' equals 'MonthlyIncome', producing a cleaned dataset (new_train). It then prints the shape of the new dataset to confirm that the inconsistent records have been dropped.

new_train = train[~((train["DebtRatio"] > train["DebtRatio"].quantile(0.95)) & (train['SeriousDlqin2yrs'] == train['MonthlyIncome']))]
new_train.shape