XGBoost | Machine Learning
XGBoost, short for eXtreme Gradient Boosting, is a widely used machine learning algorithm. It is an ensemble learning method that combines the predictions of many "weak" models (in this case, decision trees) into a single, more accurate "strong" model. XGBoost is known for its predictive performance, fast execution speed, and scalability. This tutorial walks you through a practical classification problem, predicting customer churn, a kind of task where XGBoost tends to do well.
You can find the full Python code for this tutorial in a runnable format in this Google Colab notebook.
A Deeper Dive into XGBoost
At its core, XGBoost is an ensemble learning method that uses a technique called gradient boosting. Ensemble learning means that instead of relying on a single, complex model, it combines the output of multiple simpler models to make a more accurate prediction. The "boosting" part of the name refers to an iterative process where each new model is trained to correct the errors of the combined previous models. This process is like a team of experts, where each expert learns from the mistakes of the previous ones. XGBoost is considered "eXtreme" because it includes several key enhancements, such as regularization to prevent overfitting and parallel processing to speed up training, making it highly efficient and robust.
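The "each new model corrects the errors of the previous ones" idea can be sketched in a few lines. This is a minimal illustration of plain gradient boosting with squared error on synthetic data, not XGBoost's actual implementation (which adds regularization, second-order gradient information, and parallelism). The data, the `learning_rate` value, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # Residuals are the errors of the ensemble built so far
    # (the negative gradient of the squared-error loss).
    residuals = y_demo - prediction
    # A shallow tree is the "weak" learner; it is fit to the residuals.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a damped version of the new tree's output to the ensemble.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE before boosting: {mse_history[0]:.4f}")
print(f"training MSE after 50 trees:  {mse_history[-1]:.4f}")
```

Each round lowers the training error a little, which is why boosted ensembles also need safeguards against overfitting, such as the regularization mentioned above.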
Before we can build and train our model, we first need to get our data ready. This involves importing the necessary libraries and loading our dataset. We will then separate our feature variables (the data we use for prediction) from our target variable (the outcome we want to predict). Our goal here is to predict if a customer will leave the bank, which is a classic classification problem.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
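To see what the two `iloc` slices select, here is a minimal sketch on a made-up 14-column frame. The column names `c0` to `c13` and the values are placeholders, not the real dataset. The slice `3:13` is end-exclusive, so it keeps columns 3 through 12 as features, and column 13 is the target.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: 5 rows, 14 columns named c0..c13.
demo = pd.DataFrame(
    np.arange(5 * 14).reshape(5, 14),
    columns=[f"c{i}" for i in range(14)],
)

feature_columns = list(demo.columns[3:13])  # c3 .. c12 (end index excluded)
X_demo = demo.iloc[:, 3:13].values          # feature matrix, shape (5, 10)
y_demo = demo.iloc[:, 13].values            # target vector, shape (5,)

print(feature_columns)
print(X_demo.shape, y_demo.shape)
```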
Most machine learning libraries, including XGBoost's scikit-learn interface in its default configuration, expect numerical input. Our dataset, however, contains categorical columns such as 'Country' and 'Gender'. To handle this, we perform a step called **categorical encoding**. We use a `ColumnTransformer` to apply `OneHotEncoder` to the 'Country' column, which creates one binary column per country, and `OrdinalEncoder` to the 'Gender' column, which maps each category to a numeric code. The remaining columns are passed through unchanged.
Encoding categorical data
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
# One-hot encode the Country column (index 1) and ordinal-encode the Gender column (index 2)
ct = ColumnTransformer([("Country", OneHotEncoder(), [1]), ("Gender", OrdinalEncoder(), [2])], remainder = 'passthrough')
X = ct.fit_transform(X)
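It can help to run the same kind of transformer on a few made-up rows to see what comes out. The sketch below uses invented values and hypothetical names (`X_demo`, `ct_demo`); `sparse_threshold=0` is added only so the result is always a dense array that prints cleanly. Note that the encoded columns come first in the output, followed by the passed-through columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows: [score, country, gender, age] (illustrative values only).
X_demo = np.array(
    [
        [600, "France", "Female", 40],
        [700, "Spain", "Male", 30],
        [650, "Germany", "Female", 50],
    ],
    dtype=object,
)

ct_demo = ColumnTransformer(
    [
        ("Country", OneHotEncoder(), [1]),
        ("Gender", OrdinalEncoder(), [2]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array, for easy printing
)

# Output column order: 3 one-hot country columns (France, Germany, Spain),
# 1 gender code (Female = 0, Male = 1), then the untouched score and age.
encoded = np.asarray(ct_demo.fit_transform(X_demo), dtype=float)
print(encoded)
```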
Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
With our data preprocessed, we can now train the XGBoost classifier. We import the `XGBClassifier` class, create an instance of our model, and use the `fit` method to train it on our prepared training data. This is where the magic of the boosting algorithm happens, as it iteratively builds trees to correct the errors of previous predictions.
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
Once the model is trained, we can use it to make predictions on our test set. To understand how well our model performed, we'll create a confusion matrix, which is a powerful tool for evaluating the performance of a classification model. The matrix helps us see the number of correct predictions (True Positives and True Negatives) and incorrect predictions (False Positives and False Negatives).
Predicting the Test set results
y_pred = classifier.predict(X_test)
Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Let's break down the confusion matrix. The matrix is a table that reports the number of True Positives ($TP$), True Negatives ($TN$), False Positives ($FP$), and False Negatives ($FN$). For a churn prediction problem, this would mean:
- $TP$: The model correctly predicted a customer would churn.
- $TN$: The model correctly predicted a customer would not churn.
- $FP$: The model incorrectly predicted a customer would churn (a false alarm).
- $FN$: The model incorrectly predicted a customer would not churn (a missed opportunity).
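The four counts in the list above can be read directly off scikit-learn's matrix. In `confusion_matrix`, rows are the true classes and columns are the predicted classes, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. Here is a small sketch on made-up labels (not the tutorial's data), with hypothetical variable names:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = churned, 0 = stayed.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of correct predictions among all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm_demo)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, accuracy={accuracy:.2f}")
```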
The confusion matrix you would get from this code shows that the model made 1729 correct predictions ($1521 + 208$) and 271 incorrect ones ($197 + 74$) out of 2000 test samples, an accuracy of $1729 / 2000 \approx 86\%$. A single accuracy score is useful, but it depends on one particular train/test split and can therefore be misleading. For a more robust evaluation, we use k-Fold Cross-Validation.
Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Mean accuracy: {:.2f} %".format(accuracies.mean() * 100))
print("Standard deviation: {:.2f} %".format(accuracies.std() * 100))
Cross-validation provides a more reliable estimate of a model's performance. With `cv = 10`, the training data is split into 10 folds; the model is trained on 9 of them and evaluated on the remaining one, and this is repeated so that each fold serves once as the validation set. The mean of the 10 accuracies is a more dependable score than the result of a single split, and their standard deviation shows how much the result varies from fold to fold. This helps ensure our model wasn't just lucky on a single test set. The output of this step confirms that the model consistently achieves an average accuracy of around 86%.
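To make the fold mechanics concrete, here is a rough sketch of what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data with stratified folds, fits a fresh copy of the model on each training portion, and scores it on the held-out fold. The synthetic data and the stand-in estimator (scikit-learn's `GradientBoostingClassifier` rather than `XGBClassifier`) are assumptions made so the example runs without the churn dataset.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the tutorial uses the churn training set instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in estimator (scikit-learn's gradient boosting, not XGBoost).
model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# An integer cv with a classifier resolves to unshuffled stratified folds.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should reproduce the same per-fold accuracies.
library_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print("manual mean/std: ", manual_scores.mean(), manual_scores.std())
print("library mean/std:", library_scores.mean(), library_scores.std())
```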