
Insurance Pricing Forecast Using XGBoost Regressor

The project Insurance Pricing Forecast Using XGBoost Regressor leverages machine learning to predict healthcare costs for insurance companies. Insurers need accurate forecasts of future expenses in order to set premiums profitably, and traditional methods often struggle with complex interactions in the data. Machine learning, and XGBoost in particular, offers a practical solution. This project develops a model that helps insurers establish rates from features such as age, BMI, and smoking status, ensuring profitability while providing fair coverage.

Project Overview

In this project, we build a machine learning model using the XGBoost Regressor to predict healthcare expenses from factors such as age, BMI, smoking status, and region. We also build a linear regression model as a baseline for comparison. By the end of the project, insurance companies will have a reliable tool for setting premiums based on predicted expenses, reducing reliance on manual calculations and improving profitability.


Prerequisites

Before starting this project, you should understand Python, statistics, and machine learning. Familiarity with libraries such as NumPy, Pandas, Matplotlib, and Scikit-learn is expected, as they handle data manipulation, visualization, and model building. You should also be familiar with the XGBoost Regressor, linear regression, and regression analysis in general; this knowledge will help you follow the modeling process.


Approach

We focus on building an XGBoost Regressor that predicts healthcare costs from several features, and we compare it against a linear regression model to evaluate its effectiveness. We select XGBoost for its ability to capture non-linear relationships and for its high predictive power and efficiency on complex datasets. Other machine learning techniques could be used, but XGBoost stands out for this task.


Workflow and Methodology

The overall workflow of this project includes:

  • Problem Definition: Predict healthcare expenses using various features like age and smoking status.

  • Data Collection: Gather data from healthcare records, including patient demographics and medical expenses.

  • Data Preparation: Clean, transform, and encode the data for modeling.

  • Modeling: Build a baseline linear regression model first. Then, use an XGBoost regressor to achieve better accuracy.

  • Evaluation: Assess model performance using Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE); a minimal metric sketch follows this list.

  • Conclusion: Analyze results and finalize the best model for predicting healthcare expenses.
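To make the modeling and evaluation steps concrete, here is a minimal sketch; it assumes X_train, X_test, y_train, and y_test from the 80:20 split described later, fits both models, and reports RMSE and MAPE.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from xgboost import XGBRegressor

# Fit the baseline and the XGBoost model, then score both on the test set.
for name, model in [("Linear Regression", LinearRegression()),
                    ("XGBoost", XGBRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))      # Root Mean Squared Error
    mape = mean_absolute_percentage_error(y_test, preds)   # Mean Absolute Percentage Error
    print(f"{name}: RMSE = {rmse:,.2f}, MAPE = {mape:.2%}")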

The methodology involves:

  • Exploratory Data Analysis (EDA): Understanding feature distributions, correlations, and trends in the data.

  • Data Preprocessing: Handling missing values, encoding categorical variables, transforming the target variable as needed, and ensuring the data is suitable for modeling.

  • Feature Engineering: Creating or refining features that improve model performance.

  • Hyperparameter Tuning: Using Bayesian Optimization to fine-tune the XGBoost Regressor for optimal results (a minimal sketch follows this list).

  • Model Comparison: Comparing the linear regression model with the XGBoost Regressor to determine which predicts healthcare costs more accurately.
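To illustrate the tuning step, here is a minimal sketch using BayesSearchCV from scikit-optimize; the search space shown is an assumption for illustration, not the project's final configuration, and X_train/y_train come from the split described later.

from skopt import BayesSearchCV
from skopt.space import Real, Integer
from xgboost import XGBRegressor

# Illustrative search space; these ranges are assumptions, not the project's final values.
search = BayesSearchCV(
    estimator=XGBRegressor(random_state=42),
    search_spaces={
        "n_estimators": Integer(100, 1000),
        "max_depth": Integer(2, 8),
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "subsample": Real(0.5, 1.0),
    },
    n_iter=30,                 # number of parameter settings sampled
    cv=5,                      # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
# search.fit(X_train, y_train)   # after fitting, search.best_params_ holds the tuned values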


Data Collection

We train the XGBoost Regressor on a dataset of healthcare records whose features include age, BMI, smoking status, region, and charges. The data represents real-world medical expenses across diverse health profiles. Our goal is to identify the features that drive costs and use that information to predict future expenses accurately.

Data Preparation Workflow

  • Data Cleaning: We start by checking for missing values and outliers in the dataset. This ensures that the data is clean and consistent for modeling.

  • Feature Encoding: We one-hot encode categorical variables like 'sex' and 'region.' This process converts them into numerical values.

  • Target Transformation: Healthcare costs often have a skewed distribution, so we apply a Yeo-Johnson transformation to make the target variable more normally distributed, which improves model performance (see the sketch after this list).

  • Data Splitting: The dataset is split into training and test sets, typically using a ratio of 80:20. This allows us to train the model on one portion of the data and evaluate it on the remaining portion.
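A minimal sketch of the encoding, transformation, and splitting steps, assuming the feature matrix X and target y defined in the code section below; the one-hot encoder comes from category_encoders and the Yeo-Johnson transform from scikit-learn's PowerTransformer.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
from category_encoders import OneHotEncoder

# One-hot encode the categorical columns (e.g. 'sex' and 'region').
encoder = OneHotEncoder(use_cat_names=True)
X_encoded = encoder.fit_transform(X)

# Yeo-Johnson transform to reduce the skew in the target.
pt = PowerTransformer(method="yeo-johnson")
y_transformed = pt.fit_transform(y.values.reshape(-1, 1)).ravel()

# 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_transformed, test_size=0.2, random_state=42)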


Code Explanation

STEP 1:

This block of code mounts your Google Drive in a Google Colab notebook, letting you access files saved in Drive directly from Colab so you can analyze your data or train models using those files.

from google.colab import drive
drive.mount('/content/drive')

Import required packages

We import essential libraries such as numpy, pandas, and matplotlib. We also include seaborn, plotly, and xgboost. These libraries help with data manipulation, visualization, and building machine learning models.

!pip install numpy
!pip install pandas
!pip install plotly
!pip install scikit-learn
!pip install scikit-optimize
!pip install statsmodels
!pip install category_encoders
!pip install xgboost
!pip install nbformat
!pip install matplotlib

Import libraries

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import sys
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import RFE

STEP 2:

Exploratory Data Analysis (EDA)

EDA stands for Exploratory Data Analysis: examining data through statistical summaries and visualizations to identify trends and patterns, spot outliers, and test assumptions. Its main goal is to let you explore and understand the data before forming any theories or hypotheses about it.

When creating a machine learning model, EDA is a crucial step. It shows how variables are distributed and how they relate to one another, and it helps identify which features are most useful for making predictions.


First, let's load the dataset, insurance_dataset.csv, from the project's data folder on Google Drive.


Load Dataset

data = pd.read_csv('/content/drive/MyDrive/Aionlinecourse/data/insurance_dataset.csv')
data.head()
data.info()

We have three numeric features: Age, BMI, and Children. Additionally, we have three categorical features: Sex, Smoker, and Region.


NOTE: there are no null values in any of the columns, so we won't need to impute values in the data preprocessing step. However, this is usually a step you'll need to consider when building a machine learning model.
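
As a quick sanity check, you can confirm this with a per-column null count:

# Count missing values per column; all zeros means no imputation is needed.
data.isnull().sum()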


The target variable, which we want to predict, is the `Charges` column. Now, let's split the dataset into features (X) and the target (y):

target = 'Charges'
X = data.drop(target, axis=1)
y = data[target]
X.shape, y.shape

Distributions

Let's examine the distribution of each feature by plotting a histogram for each. Additionally, here are key points to note about the distribution of each feature:

  • Age - Approximately uniformly distributed.

  • Sex - Approximately equal volume in each category.

  • BMI - Approximately normally distributed.

  • Children - Right skewed (i.e. higher volume in lower range).

  • Smoker - Significantly more volume in the no category vs the yes category.

  • Region - Approximately equal volume in each category.


The distribution of the target, Charges, is right skewed (i.e. higher volume in the lower range).

fig = px.histogram(data, x=target, nbins=50, title="Distribution of Charges")
fig.show()
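
The code above plots the target. For the per-feature histograms described at the start of this subsection, a minimal loop such as the following would work:

# Plot a histogram for every feature column to inspect its distribution.
for col in X.columns:
    fig = px.histogram(data, x=col, title=f"Distribution of {col}")
    fig.show()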

Univariate analysis (with respect to the target)

The next step is univariate analysis with respect to the target. In other words, we examine each feature and how it relates to the target.


How we do this depends on whether the feature is numeric or categorical. We'll use a scatterplot for numeric features and a boxplot for categorical features.


Numeric features

Points to note regarding each feature:
  • Age - As Age increases, Charges also tend to increase (although there is a large variance in Charges for a given Age).

  • BMI - There is no clear relationship, although a group of individuals with BMI > 30 tends to have charges above 30k. This group may become more apparent when we carry out our bivariate analysis later.

  • Children - No clear relationship (although Charges seem to decrease as Children increase). Since there are only 6 unique values, we will treat this feature as categorical for the univariate analysis.

numeric_features = X.select_dtypes(include=[np.number])
numeric_features.columns
# Correlation heatmap of the numeric features
plt.figure(figsize=(6, 4))
sns.heatmap(numeric_features.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
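
The heatmap above summarizes pairwise correlations among the numeric features. For the feature-vs-target scatterplots mentioned earlier, a minimal sketch is:

# Scatterplot of each numeric feature against the target.
for col in numeric_features.columns:
    fig = px.scatter(data, x=col, y=target, title=f"{target} vs {col}")
    fig.show()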

STEP 3:

Categorical features

categorical_features = X.select_dtypes(include=["object"]).columns
categorical_features

Things to keep in mind about each feature:

  • Sex - No significant differences in Charges between the categories.
  • Smoker - Charges for Smoker == 'yes' are generally much higher than when Smoker == 'no'.
  • Region - No significant differences in Charges between the categories.
  • Children - No significant differences in Charges appear between the categories, although the categories with 4 or more children skew towards lower Charges. This is likely due to low volumes in those categories; refer to the distributions section for more details.

# Create paired boxplots for each categorical feature against the target variable
for col in categorical_features:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=col, y=target, data=data)
    plt.title(f"Distribution of {target} by {col}")
    plt.show()
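
Note that select_dtypes(include=["object"]) excludes Children, which is numeric. To include it in the boxplots as discussed above, one option (an assumption, not part of the original loop, and presuming the column is named Children as in the text) is to cast it to a string first:

# Treat Children as categorical by casting it to string, then reuse the boxplot.
data["Children_cat"] = data["Children"].astype(str)
plt.figure(figsize=(6, 4))
sns.boxplot(x="Children_cat", y=target, data=data)
plt.title(f"Distribution of {target} by Children")
plt.show()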