XGBoost | Machine Learning

Written by: Aionlinecourse | Machine Learning Tutorials

XGBoost, short for eXtreme Gradient Boosting, is a widely used machine learning algorithm. It is an ensemble learning method that combines the predictions of many "weak" models (in this case, decision trees) into a single, more accurate "strong" model. XGBoost is known for its strong performance on predictive modeling tasks, fast execution speed, and scalability. This tutorial walks you through a practical classification problem, predicting customer churn, a kind of tabular-data task that XGBoost handles well.

You can find the full Python code for this tutorial in a runnable format in this Google Colab notebook.

A Deeper Dive into XGBoost

At its core, XGBoost uses a technique called gradient boosting. Ensemble learning means that instead of relying on a single, complex model, it combines the output of multiple simpler models to make a more accurate prediction. The "boosting" part of the name refers to an iterative process where each new model is trained to correct the errors of the combined previous models, much like a team of experts in which each expert learns from the mistakes of those who came before. XGBoost is considered "eXtreme" because it adds several enhancements to plain gradient boosting, such as regularization to reduce overfitting and parallel processing to speed up training, which make it efficient and robust.
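The boosting loop described above can be sketched in a few lines. This is a minimal illustration, not XGBoost's actual implementation: it assumes squared-error loss on a single feature, uses one-split "stumps" as the weak learners, and omits the regularization and parallelism mentioned above. The function names, the toy data, and the `learning_rate` value are all made up for the example.

```python
# Minimal gradient boosting sketch for squared error with one-split stumps.
# Illustrative only; XGBoost itself is far more elaborate.

def fit_stump(xs, residuals):
    """Find the threshold on a single feature that best splits the residuals."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, losses


# Hypothetical toy data with three rough "levels".
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
base, stumps, losses = boost(xs, ys)
print(f"MSE after round 1: {losses[0]:.3f}, after round 20: {losses[-1]:.3f}")
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks round by round. The learning rate scales down each stump's contribution, which is one simple guard against overfitting.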

XGBoost in Python Step 1: Data Preprocessing

Before we can build and train our model, we first need to get our data ready. This involves importing the necessary libraries and loading our dataset. We will then separate our feature variables (the data we use for prediction) from our target variable (the outcome we want to predict). Our goal here is to predict if a customer will leave the bank, which is a classic classification problem.

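# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values  # feature columns (indices 3 to 12)
y = dataset.iloc[:, 13].values    # target column: whether the customer left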


Most machine learning libraries, including the `XGBClassifier` used here with its default settings, expect numerical input. Our dataset, however, contains categorical data like 'Country' and 'Gender.' To handle this, we perform a crucial step called **categorical encoding**. We use a `ColumnTransformer` to apply `OneHotEncoder` to the 'Country' column, creating one binary column per country, and `OrdinalEncoder` to the 'Gender' column, converting its two categories into a single numeric column.

# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
# One-hot encode the Country column (index 1) and ordinal-encode Gender (index 2);
# all other columns pass through unchanged
ct = ColumnTransformer([("Country", OneHotEncoder(), [1]), ("Gender", OrdinalEncoder(), [2])], remainder = 'passthrough')
X = ct.fit_transform(X)
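To see what this transformer produces, here is a small sketch on made-up rows. The `toy` DataFrame and its values are hypothetical and only mirror the column positions used above (country at index 1, gender at index 2). `sparse_threshold=0` is added so that the result is always a dense array, which makes it easier to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy rows: a numeric score, a country, and a gender.
toy = pd.DataFrame({
    "CreditScore": [600, 720, 650, 580],
    "Country": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

ct_toy = ColumnTransformer(
    [("Country", OneHotEncoder(), [1]), ("Gender", OrdinalEncoder(), [2])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Encoded columns come first, in transformer order; passthrough columns come last.
encoded = ct_toy.fit_transform(toy.values).astype(float)
print(encoded)
```

The country becomes three binary columns (categories are sorted alphabetically: France, Germany, Spain), the gender becomes one 0/1 column, and the untouched credit score moves to the end. Note that the column order of the output differs from the input.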

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

XGBoost in Python Step 2: Model Training and Evaluation

With our data preprocessed, we can now train the XGBoost classifier. We import the `XGBClassifier` class, create an instance of our model, and use the `fit` method to train it on our prepared training data. This is where the magic of the boosting algorithm happens, as it iteratively builds trees to correct the errors of previous predictions.

# Fitting XGBoost to the Training set (default hyperparameters)
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)


Once the model is trained, we can use it to make predictions on our test set. To understand how well our model performed, we'll create a confusion matrix, which is a powerful tool for evaluating the performance of a classification model. The matrix helps us see the number of correct predictions (True Positives and True Negatives) and incorrect predictions (False Positives and False Negatives).

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)


Let's break down the confusion matrix. The matrix is a table that reports the number of True Positives ($TP$), True Negatives ($TN$), False Positives ($FP$), and False Negatives ($FN$). For a churn prediction problem, this would mean:

  • $TP$: The model correctly predicted a customer would churn.
  • $TN$: The model correctly predicted a customer would not churn.
  • $FP$: The model incorrectly predicted a customer would churn (a false alarm).
  • $FN$: The model incorrectly predicted a customer would not churn (a missed opportunity).
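As a small sketch of how these four counts are read out of scikit-learn's matrix and turned into metrics, consider the example below. The labels are made up for illustration (1 = churned, 0 = stayed). For binary labels, `confusion_matrix` puts true classes in rows and predicted classes in columns, so flattening it yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = churned, 0 = stayed.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted churners, how many really churned
recall = tp / (tp + fn)                     # of real churners, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision and recall are worth checking alongside accuracy in churn problems, because false alarms and missed churners usually carry different costs.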

The confusion matrix you would get from this code shows that the model made a total of 1729 correct predictions ($1521 + 208$) and 271 incorrect ones ($197 + 74$) out of 2000 test customers, for an accuracy of approximately 86% ($1729 / 2000 \approx 0.86$). A single accuracy score is useful, but it depends on which rows happened to land in the test set. For a more robust evaluation, we use k-Fold Cross-Validation.

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(accuracies.mean())
print(accuracies.std())


Cross-validation provides a more reliable estimate of a model's performance by splitting the training data into k folds (here, 10), training the model on k - 1 of them, and evaluating it on the remaining fold, repeating until every fold has served once as the validation set. Taking the mean of the fold accuracies gives a single, more dependable score, and the standard deviation shows how much the accuracy varies from fold to fold. This helps ensure our model isn't just lucky on a single test set. The output of this step confirms that the model achieves an average accuracy of around 86%.
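The mechanics of k-fold splitting can be sketched without any library. This is a simplified illustration under stated assumptions: it uses plain contiguous folds, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds. The labels are hypothetical, and a trivial "predict the majority class" rule stands in for the real model.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Hypothetical labels: 1 = churned, 0 = stayed.
labels = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

scores = []
for train, validation in k_fold_indices(len(labels), k=5):
    # "Train" a stand-in model: always predict the training fold's majority class.
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    # Score it on the held-out fold only.
    correct = sum(labels[i] == majority for i in validation)
    scores.append(correct / len(validation))

print("fold accuracies:", scores)
print("mean:", mean(scores), "std:", pstdev(scores))
```

Every sample is used for validation exactly once, and the per-fold scores differ even though the "model" never changes, which is why the mean and spread say more than any single split. `pstdev` is used because NumPy's `std()` also computes the population standard deviation by default.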