Project Overview:
This project will guide the learner through creating a loan eligibility prediction model. The model uses income, credit score, loan amount, and applicant background as its data points. Machine learning is applied so that the model learns patterns in historical data and forecasts future eligibility correctly. This is done through data processing with recognized libraries such as Pandas and model building with Scikit-learn. Common steps throughout this work include data cleaning and feature selection, as well as model training and assessment. This approach brings order, reliability, and efficiency to the lending and borrowing process, benefiting both lenders and borrowers in the market.
This guide is your one-stop source for Loan Eligibility Prediction, explained simply and in a manner you can easily follow.
Prerequisites
Before starting with loan eligibility prediction, you should have or acquire the following knowledge and tools:
- Knowledge of basic Python programming, including its core data structures.
- An understanding of model training and of assessing model performance with different metrics.
- Familiarity with cleaning, filtering, and reshaping data with Pandas.
- Statistical knowledge such as averages, variance, and related measures.
- A Jupyter Notebook or Google Colab environment where coding and visualization will be done.
- All required packages, such as Pandas, NumPy, and Scikit-learn, installed (an install command sketch follows this list).
- Familiarity with Matplotlib or Seaborn for presenting the data analysis and highlighting noticeable trends.
- A basic understanding of how models use data to make predictions.
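If any of these packages are missing, you can install them from a notebook cell. The command below is a minimal sketch assuming a Colab or Jupyter environment; XGBoost and imbalanced-learn are included because they are used later in this project.
!pip install pandas numpy scikit-learn matplotlib seaborn xgboost imbalanced-learn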
Approach
When building a loan eligibility prediction system, following a detailed, step-wise process helps avoid biases and inaccuracies. First, data is collected and examined, looking for patterns and checking for missing values and outliers. The data is then cleaned by addressing null values and encoding categorical data as numerical representations for simplicity. Once cleaning is done, we carry out feature selection, keeping only significant predictors such as income, credit score, and loan amount that determine eligibility.
Then we create the training and testing datasets and assess the performance of the model. In this process, we train the model on the training set using machine learning algorithms. After the training phase is complete, the model is evaluated for accuracy. This improves the performance of the model as well as its ability to generalize to new data. Finally, deploying the model enables fast decision-making for lenders, offering both the institution and the applicant an easy interface.
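As a minimal sketch of this approach, assuming a cleaned DataFrame df with only numeric features and a binary target column named Approved (both are illustrative assumptions, not the project's actual dataset):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Separate predictors and the (hypothetical) Approved target
X = df.drop(columns=['Approved'])
y = df['Approved']
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a simple baseline classifier on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate on unseen data to gauge how well the model generalizes
print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))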
Workflow and Methodology
Here's a step-by-step workflow you'll follow to build a successful loan eligibility prediction model:
- Data Collection and Loading: Begin by collecting the loan eligibility dataset from Kaggle and loading it into a Pandas DataFrame.
- Data Cleaning: To improve data quality, detect and deal with missing values, ensure correct data types, and handle outliers.
- Exploratory Data Analysis (EDA): By applying EDA, you can understand the distribution of data and its prominent features.
- Data Preprocessing: You have to scale the numerical data and convert the categorical data into numeric data for better model training.
- Model Selection: Use classification models as this is a classification task.
- Model Training: Train the candidate models on the cleaned and prepared data.
- Model Evaluation: Compare the models using metrics such as precision, recall, F1-score, and the ROC curve with its AUC.
- Hyperparameter Tuning: Optimize model parameters to improve prediction accuracy (an evaluation and tuning sketch follows this list).
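As a hedged sketch of the evaluation and tuning steps, assuming X_train, X_test, y_train, and y_test already exist from an earlier split (the parameter grid values below are illustrative, not the project's final settings):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score
# Search a small, illustrative grid of hyperparameters with 5-fold cross-validation
param_grid = {'n_estimators': [100, 200], 'learning_rate': [0.05, 0.1], 'max_depth': [3, 5]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
search.fit(X_train, y_train)
# Report precision, recall, F1-score, and ROC AUC on the held-out test set
best_model = search.best_estimator_
print(search.best_params_)
print(classification_report(y_test, best_model.predict(X_test)))
print('ROC AUC:', roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))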
Data Collection and Preparation
Data collection
The Loan Eligibility dataset is available on Kaggle. You can conveniently and securely access a Kaggle dataset from within Google Colab after configuring your Kaggle credentials, which prevents exposing sensitive information. The notebook collects the user's Kaggle username and API key securely and assigns them as environment variables. This enables Kaggle's CLI commands, which authenticate the user and download the dataset straight into Colab.
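A minimal sketch of this setup in Colab is shown below; the dataset slug is a placeholder, not the exact Kaggle identifier used in this project.
import os
from getpass import getpass
# Collect the Kaggle username and API key without echoing them to the screen
os.environ['KAGGLE_USERNAME'] = getpass('Kaggle username: ')
os.environ['KAGGLE_KEY'] = getpass('Kaggle API key: ')
# Download and unzip the dataset with the Kaggle CLI (placeholder slug)
!kaggle datasets download -d <owner>/<loan-eligibility-dataset> --unzip -p /content/data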
Data Preparation
Data preprocessing refers to cleaning and formatting raw data so that it is ready for analysis and model development. This stage handles missing values, encodes categorical features, and scales numerical features to make the dataset ready for modeling.
Data preparation workflow
- Data Cleaning: Handle missing values with the median or mode, then convert columns to their correct data types.
- Outlier Management: Detect and treat outliers using statistical methods such as the IQR rule for better model performance.
- Feature Engineering: Transform categorical variables with label encoding or one-hot encoding. Create additional features if they can improve model performance.
- Scaling and Normalization: Use StandardScaler to standardize numeric columns.
- Data Splitting: Split the data into training and testing sets to prepare for model training (a combined sketch follows this list).
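A condensed sketch of this preparation workflow is shown below, assuming the cleaned DataFrame is named data and the label column is called Loan Status (an assumption about the dataset's target name):
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# 'Loan Status' is the assumed target column; adjust to the dataset's actual label
target = data['Loan Status']
features = data.drop(columns=['Loan Status'])
# One-hot encode categorical columns; label/ordinal encoding is an alternative
features = pd.get_dummies(features)
# Standardize the original numeric columns so they share a comparable scale
num_cols = data.drop(columns=['Loan Status']).select_dtypes(include='number').columns
features[num_cols] = StandardScaler().fit_transform(features[num_cols])
# Reserve 20% of the rows for testing, keeping the class balance with stratify
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, stratify=target)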
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
The code block below first suppresses warning messages with warnings.filterwarnings("ignore") to keep the output as clean as possible. It then imports several data manipulation (pandas, numpy), data visualization (matplotlib, seaborn), and machine learning (sklearn, xgboost, imblearn) libraries and modules, and patches sys.modules with six to resolve compatibility issues with older sklearn import paths. The %matplotlib inline magic renders the plots inside the Colab notebook rather than in a separate window.
import warnings
warnings.filterwarnings("ignore")
import six
import sys
sys.modules['sklearn.externals.six'] = six
import os
import joblib
import operator
import statistics
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from matplotlib import pyplot
import sklearn.neighbors._base
from scipy.stats import boxcox
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from sklearn import preprocessing
from xgboost import plot_importance
from sklearn.metrics import roc_curve
from sklearn.utils import _safe_indexing
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelBinarizer,StandardScaler,OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.linear_model import LogisticRegression,RidgeClassifier, PassiveAggressiveClassifier
%matplotlib inline
This ensures the smooth execution of code that relies on the older sklearn structure without modifying the source code.
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
sys.modules['sklearn.utils.safe_indexing'] = sklearn.utils._safe_indexing
STEP 2:
Load the dataset
This line of code loads data from Google Drive into the pandas DataFrame.
#Importing the datasets
data =pd.read_csv("/content/drive/MyDrive/Aionlinecourse_badhon/Project/Loan Eligibility Prediction using Gradient Boosting Classifier/LoansTrainingSetV2.csv")
This code shows the dataset structure and its first few rows (head() returns the first 5 by default). This helps you understand the dataset's overview and structure.
data_info = data.info()
data_head = data.head()
data_info, data_head
This code provides a statistical summary of the dataset: count, mean, standard deviation, and percentiles for the numerical columns, and value counts for the categorical ones. This helps in understanding the distribution and composition of the numerical and categorical attributes.
# Numerical columns
numerical_summary = data.describe()
# Categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns
categorical_summary = {col: data[col].value_counts() for col in categorical_columns}
numerical_summary, categorical_summary
This line removes the Loan ID and Customer ID columns from the dataset, because these identifiers carry no predictive value for the analysis or model training.
# Drop unnecessary columns (e.g., IDs if not useful for analysis)
data.drop(columns=['Loan ID', 'Customer ID'], inplace=True)
This line of code checks if there are any null values present in each feature.
# Data Cleaning
data.isnull().sum()
The code block fills missing values in the dataset with representative values, using the median for numerical columns and the most frequent value (mode) for categorical ones, preserving data integrity.
# Handling missing values
data['Credit Score'].fillna(data['Credit Score'].median(), inplace=True)
data['Annual Income'].fillna(data['Annual Income'].median(), inplace=True)
data['Bankruptcies'].fillna(data['Bankruptcies'].median(), inplace=True)
data['Months since last delinquent'].fillna(data['Months since last delinquent'].median(), inplace=True)
data['Years in current job'].fillna(data['Years in current job'].mode()[0], inplace=True)
data['Tax Liens'].fillna(data['Tax Liens'].median(), inplace=True)
The code ensures data consistency by converting the columns for Maximum Open Credit and Monthly Debt to numeric data types, making it easier to use them as numerical features in computations and analysis.
data['Monthly Debt'] = pd.to_numeric(data['Monthly Debt'], errors='coerce')
data['Maximum Open Credit'] = pd.to_numeric(data['Maximum Open Credit'], errors='coerce')
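Because errors='coerce' turns any unparseable strings into NaN, it is worth re-checking these two columns afterwards. The median imputation below is a hedged suggestion consistent with the earlier cleaning step.
# Count NaNs introduced by the coercion and fill them with the column medians
print(data[['Monthly Debt', 'Maximum Open Credit']].isnull().sum())
data['Monthly Debt'].fillna(data['Monthly Debt'].median(), inplace=True)
data['Maximum Open Credit'].fillna(data['Maximum Open Credit'].median(), inplace=True)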
STEP 3:
Univariate Column Analysis
Current Loan Amount
This line of code shows the statistical overview of the Current Loan Amount column.
data['Current Loan Amount'].describe()
The code creates a histogram with a KDE overlay to show the Current Loan Amount feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
# Distribution of Current Loan Amount
plt.figure(figsize=(10, 6))
sns.histplot(data['Current Loan Amount'], bins=30, kde=True)
plt.title('Distribution of Current Loan Amount')
plt.xlabel('Current Loan Amount')
plt.ylabel('Frequency')
plt.show()
The code generates a box plot of the Current Loan Amount to determine the outliers and illustrates the distribution and the median. This helps in finding out extreme values that could be causing some impact that should be reduced before training the models.
# Box plot to check for outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Current Loan Amount'])
plt.title('Box Plot of Current Loan Amount')
plt.show()
The code shows the use of the Interquartile Range (IQR) for outlier detection in the Current Loan Amount feature. It calculates the 25th and 75th percentiles of the data (Q1 and Q3), which bound the middle 50% of values. Using the IQR rule, it then defines limits beyond which values are considered outliers; anything falling outside these limits is treated as an extreme value. This technique is an effective method for identifying and addressing such values.
Q1 = data['Current Loan Amount'].quantile(0.25)
Q3 = data['Current Loan Amount'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
This expression quickly locates the extreme values in the dataset that fall outside the upper and lower limits. The tail() call then shows the last few of these outliers to give a picture of the extreme values for further analysis or possible removal.
data["Current Loan Amount"][((data["Current Loan Amount"] < (Q1 - 1.5 * IQR)) |(data["Current Loan Amount"] > (Q3 + 1.5 * IQR)))].tail()
This code addresses an extreme outlier: the placeholder value 99999999 is replaced with NaN, and the resulting missing values are then filled with the median of the Current Loan Amount column.
# Replace with NaN for imputation
data["Current Loan Amount"].replace(99999999, np.nan, inplace=True)
# Impute with median
data["Current Loan Amount"].fillna(data["Current Loan Amount"].median(), inplace=True)
The code creates a histogram with a KDE overlay to show the Current Loan Amount feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance after the preprocessing.
# Distribution of Current Loan Amount
plt.figure(figsize=(10, 6))
sns.histplot(data['Current Loan Amount'], bins=30, kde=True)
plt.title('Distribution of Current Loan Amount')
plt.xlabel('Current Loan Amount')
plt.ylabel('Frequency')
plt.show()
Credit Score
This line of code shows the statistical overview of the Credit Score column.
data['Credit Score'].describe()
The code creates a histogram with a KDE overlay to show the Credit Score feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
# Distribution of Credit Score
plt.figure(figsize=(10, 6))
sns.histplot(data['Credit Score'], bins=30, kde=True)
plt.xlabel('Credit Score')
plt.ylabel('Frequency')
plt.show()
The code creates a distribution plot with a KDE overlay to show the Credit Score feature. It allows for an assessment of the data's spread, central tendency, and overall shape. This plot helps identify patterns, skewness, and any potential outliers in the Credit Score data.
sns.histplot(data["Credit Score"], kde=True)  # histplot replaces the deprecated distplot
This code caps extreme values in the Credit Score column by setting limits at the 5th and 95th percentiles. Values above or below these thresholds are adjusted to the nearest bound, minimizing the impact of outliers. A box plot is then used to verify that the extreme values have been reduced.
# Define the upper and lower bounds for capping
upper_bound = data['Credit Score'].quantile(0.95)
lower_bound = data['Credit Score'].quantile(0.05)
# Cap the values in the 'Credit Score' column
data['Credit Score'] = data['Credit Score'].apply(lambda x: min(x, upper_bound))
data['Credit Score'] = data['Credit Score'].apply(lambda x: max(x, lower_bound))
# Verify the capping by checking the distribution again
plt.figure(figsize=(8, 6))
sns.boxplot(x=data['Credit Score'])
plt.title('Box Plot of Credit Score (After Capping)')
plt.show()
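As a design note, the same capping can be written more concisely with pandas' clip method; the line below is an equivalent one-liner, not a change to the approach.
# Equivalent capping at the 5th and 95th percentile bounds using Series.clip
data['Credit Score'] = data['Credit Score'].clip(lower=lower_bound, upper=upper_bound)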
Year In Current Job
This line displays all unique “Years in current job” column values. This helps analyze the dataset's job tenure values and discover inconsistencies or categories that may need further processing or encoding.
data['Years in current job'].unique()
This code converts the “Years in current job” column from text to numbers, assigning integers to represent employment durations. This change helps the model measure job tenure through numerical analysis.
data['Years in current job'] = data['Years in current job'].replace({'< 1 year': 0, '1 year': 1, '2 years': 2,
'3 years': 3, '4 years': 4, '5 years': 5,
'6 years': 6, '7 years': 7, '8 years': 8,
'9 years': 9, '10+ years': 10}) \
.astype(int)
The code creates a histogram showing the distribution of the Years in current job feature. It helps in spotting patterns, spreads, and shapes, and it points out unusual values for more analysis or to enhance model performance.
plt.figure(figsize=(10, 6))
sns.histplot(data['Years in current job'], bins=11, kde=False, color='skyblue')
plt.title('Distribution of Years in Current Job')
plt.xlabel('Years in Current Job')
plt.ylabel('Frequency')
plt.show()
Annual Income
This line of code shows the statistical overview of the Annual Income column.
data['Annual Income'].describe()
This code block provides a detailed visualization of the Annual Income feature to analyze its distribution and detect outliers.
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(data['Annual Income'], kde=True, color='skyblue')  # histplot replaces the deprecated distplot
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')
plt.grid(True)
plt.subplot(1, 2, 2)
sns.boxplot(y=data['Annual Income'], color='lightcoral')
plt.title('Box Plot of Annual Income')
plt.ylabel('Annual Income')
plt.grid(True)
plt.tight_layout()
plt.show()
This line finds several quantiles of the Annual Income column, showing how income is spread out, where it clusters, and which extreme values exist. This helps with spotting skewness and making decisions about handling outliers or normalizing the data.
data['Annual Income'].quantile([.2,0.5,0.75,0.90,.95,0.99,.999])
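Since these quantiles typically reveal a heavy right skew in income, a log transform is one common way to compress the tail before modeling. The snippet below is a hedged illustration; the project's imports also include scipy's boxcox, which serves a similar purpose.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# log1p compresses the long right tail while keeping zero incomes valid
log_income = np.log1p(data['Annual Income'])
plt.figure(figsize=(10, 6))
sns.histplot(log_income, bins=30, kde=True)
plt.title('Distribution of log(1 + Annual Income)')
plt.xlabel('log(1 + Annual Income)')
plt.ylabel('Frequency')
plt.show()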