Image

Insurance Pricing Forecast Using XGBoost Regressor

Our project's goal is to create a machine learning model that can predict how much health care will cost so that insurance rates can be set correctly. Two models will be made and compared: an XGBoost Regressor and a Linear Regression model. The study's overarching goal is to assist insurance companies in setting profitable rates by estimating healthcare expenses using age, BMI, and smoking status. Data preparation, feature engineering, model training, assessment, and success comparison are all steps in the process. The main focus is on telling all parties, technical and not, about the results.

Explanation All Code

Step 1:

You can mount your Google Drive in a Google Colab notebook with this piece of code. This creates it simple to see files saved in Google Drive in the Colab setting so that data can be changed, analyzed, and models can be trained.

from google.colab import drive
drive.mount('/content/drive')

Import required packages

!pip install numpy
!pip install pandas
!pip install plotly
!pip install scikit-learn
!pip install scikit-optimize
!pip install statsmodels
!pip install category_encoders
!pip install xgboost
!pip install nbformat
!pip install matplotlib

Step 2:

Import libraries

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import sys
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import RFE

Exploratory Data Analysis (EDA)


EDA stands for "Exploratory Data Analysis". It is a way to look at data with pictures. It includes looking for certain trends in data using statistics and visual methods. People use it to figure out trends in data, find outliers, test assumptions, and so on. The main goal of EDA is to help people look into the data before they come up with a theory about it. When you are making a machine learning model, EDA is a very important step. It helps us figure out how the variables are spread out, how they are connected, and which traits will be good at making predictions. First, let's read the information, which is in the folder called "input" and is named "insurance.csv".

data = pd.read_csv('/content/drive/MyDrive/Aionlinecourse/data/insurance_dataset.csv')
data.head()
data.info()

So we have three numeric features (AgeBmi and Children) and three categorical features (SexSmoker and Region).

NOTE: there are no null values in any of the columns, which means we won't need to impute values in the data preprocessing step. However, This is usually a step that you'll need to consider when building a machine learning model.

The target (i.e. the variable that we want to predict) is the charges column, so let's split the dataset into features (X) and the target (y):

target = 'Charges'
X = data.drop(target, axis=1)
y = data[target]
X.shape, y.shape

Distributions

Let's now look at the distribution of each feature by plotting a histogram for each Points to note regarding the distribution of each feature:


  • Age - Approximately uniformly distributed.
  • Sex - Approximately equal volume in each category.
  • Bmi - Approximately normally distributed.
  • Children - Right skewed (i.e. higher volume in lower range).
  • Smoker - Significantly more volume in the no category vs the yes category.
  • Region - Approximately equal volume in each category.
  • The distribution is right skewed (i.e. higher volume in the lower range).
fig = px.histogram(data, x=target, nbins=50, title="Distribution of Charges")
fig.show()

Univariate analysis (with respect to the target)

The next step is to use binary analysis on the target. In other words, we look at each trait and figure out how it fits with the goal.

How we do this changes based on whether the trait is a number or a set of words. A scatterplot will be used for numerical features and a boxplot will be used for category features.


Numeric features

Points to note regarding each feature:

  • Age - As Age increases, Charges also tends to increase (although there is a large variance in Charges for a given Age).
  • bmi - There is no clear relationship, although there seems to be a group of individuals with Bmi > 30 that have Charges > 30k.
  • This group may become more apparent when we carry out our bivariate analysis later.
  • Children - No clear relationship (although Charges seems to decrease as Children increases).
  • Since there are only 6 unique values for this feature, let's try treating it as a categorical feature for the purposes of univariate analysis.
numeric_features = X.select_dtypes(include=[np.number])
numeric_features.columns
# plot_heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(numeric_features.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

Step 3:

Categorical features

categorical_features = X.select_dtypes(include=["object"]).columns
categorical_features

Things to keep in mind about each feature:


  • Sex - No significant differences in Charges between the categories.
  • Smoker - Charges for Smoker == 'yes' are generally much higher than when Smoker == 'no'.
  • Region - No significant differences in Charges between the categories.
  • Children - No significant differences in Charges between the categories (Children >= 4 are skewed towards lower Charges, but this is likely due to the low volumes in those categories - see the Distributions section).
# Create paired boxplots for each categorical feature against the target variable
for col in categorical_features:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=col, y=target, data=data)
    plt.title(f"Distribution of {target} by {col}")
    plt.show()
# Create paired scatterplots for each pair of numeric features
for i, col1 in enumerate(numeric_features.columns):
    for j, col2 in enumerate(numeric_features.columns):
        if i < j:
            plt.figure(figsize=(6, 4))
            sns.scatterplot(x=col1, y=col2, data=data)
            plt.title(f"Scatterplot of {col1} vs. {col2}")
            plt.show()

Step 4:

Create scatter matrix and Chi squared Test.

sns.pairplot(data)
# Display the scatter matrix
plt.show()
px.imshow(X.select_dtypes(include=np.number).corr())
#Chi Squared Test
import scipy.stats as stats
# Loop through each categorical feature
for col in categorical_features:
    # Create a contingency table
    crosstab = pd.crosstab(data[col], data[target])
    # Perform the chi-squared test
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    # Print the results
    print(f"Chi-squared test for {col} and {target}:")
    print(f"- Chi-squared statistic: {chi2:.4f}")
    print(f"- P-value: {p:.4f}")
    print(f"- Degrees of freedom: {dof}")
    print(f"- Expected frequencies:\n{expected}")
    # Interpret the results
    if p < 0.05:
        print(f"There is a statistically significant association between {col} and {target}.")
    else:
        print(f"There is no statistically significant association between {col} and {target}.")

The Chi-squared test checks if there is a significant association between two categorical variables by comparing observed and expected frequencies in a contingency table. If the p-value is less than 0.05, it indicates a significant association, suggesting that the variables are related rather than independent.

chi2
Numeric-categorical feature pairs

ANOVA
We will use an ANOVA test for pairings of numerical and category features. Analysis of Variance, or ANOVA, tells us if there are changes between the means of two or more independent groups that are statistically significant.

Analysis of Variances, or ANOVA, looks at the differences between population means. There are two groups being tested to see if their means are different. At least one continuous variable and one category variable that divides your data into groups for comparison are needed to run ANOVA. What "analysis of variances" means is how the test looks at differences between the means to see if they are different.

It uses ANOVA to see how the variance of group means compared to the variance of the groups themselves. This process figures out if the groups are part of a larger population or if they are separate populations that can be separated in different ways.

Despite the fact that it looks at differences, it tries limits. This is the simplest type of ANOVA. This way of comparing groups goes beyond t tests and can be used with more than two groups. There is a possibility that at least one group has a different mean, which is called the alternative theory.

Watch full explanation for Introduction to ANOVA here.
!pip install pingouin
import pingouin as pg
X.head()
print(X.columns)
X_anova = pg.anova(dv='Age', between='Bmi', data=X)
X_anova

Step 5:

Data preprocessing
Splitting the dataset into training data and test data

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
cols_to_drop = [
    'Children',
    'Region',
    'Sex'
]
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)
) 
ohe = OneHotEncoder(use_cat_names=True)
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)
cols_to_drop = ['Smoker_no']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)
pt = PowerTransformer(method='yeo-johnson')
y_train_t = pt.fit_transform(y_train.values.reshape(-1, 1))[:, 0]
y_test_t = pt.transform(y_test.values.reshape(-1, 1))[:, 0]
pt = PowerTransformer(method='yeo-johnson')
y_train_t = pt.fit_transform(y_train.values.reshape(-1, 1))[:, 0]
y_test_t = pt.transform(y_test.values.reshape(-1, 1))[:, 0]
pd.Series(y_train_t).hist(figsize=(5, 3))
pd.Series(y_test_t).hist(figsize=(5, 3))

Model training 


With sampling weights, we can make the residuals more homoscedastic. When the model is trained, observations with higher "Charges" will be given more weight than observations with lower "Charges". This means that residuals are punished more for observations that have bigger "charges" than for observations that have smaller "Charges". We will use the sample weight from the target column and scale it by the lowest number in the "Charges" column, which is 1. This means that the lowest sample weight is 1.

sample_weight = y_train / y_train.min()
lr = LinearRegression()
lr.fit(
    X_train,
    y_train_t,
    sample_weight=sample_weight
)

Step 6:

Evaluation The Model


We can use our model to make predictions on both our training and test sets now that we've trained it. The code looks for metrics like MSE, RMSE, MAE, and R^2 to see how well a machine learning model that was taught with linear regression on training data is doing. The results are then shown.

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
import math
# Train evaluation
y_pred_train_t = lr.predict(X_train)
y_pred_train = pt.inverse_transform(y_pred_train_t.reshape(-1, 1))[:, 0]
y_train = pt.inverse_transform(y_train_t.reshape(-1, 1))[:, 0]
mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = math.sqrt(mse_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
print("Train Evaluation:")
print(f"MSE: {mse_train:.4f}")
print(f"RMSE: {rmse_train:.4f}")
print(f"MAPE: {mape_train:.4f}")
print(f"R^2: {r2_train:.4f}")

This code evaluates the performance of a machine learning model test with linear regression on test data, calculating metrics such as mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and R-squared (R^2) score, then prints the results.

# Test evaluation
y_pred_test_t = lr.predict(X_test)
y_pred_test = pt.inverse_transform(y_pred_test_t.reshape(-1, 1))[:, 0]
y_test = pt.inverse_transform(y_test_t.reshape(-1, 1))[:, 0]
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = math.sqrt(mse_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
print("Test Evaluation:")
print(f"MSE: {mse_test:.4f}")
print(f"RMSE: {rmse_test:.4f}")
print(f"MAPE: {mape_test:.4f}")
print(f"R^2: {r2_test:.4f}")
This code creates a table displaying evaluation metrics (MSE, RMSE, MAPE, R^2) for both training and test datasets, then plots predicted values against actual values for visualization.
# Create a table with the evaluation metrics
evaluation_metrics = pd.DataFrame({
    "Metric": ["MSE", "RMSE", "MAPE", "R^2"],
    "Train": [mse_train, rmse_train, mape_train, r2_train],
    "Test": [mse_test, rmse_test, mape_test, r2_test]
})
print(evaluation_metrics.to_string())
# Plot the predicted values against the actual values
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred_test)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.show()

Check normality of residuals


A QQ (quantile-quantile) plot can be used to see if the residuals are average. This shows the difference between the values of each real quantile (from the data) and the theoretical quantile (based on the idea of a normal distribution). You would expect the data points to be on a straight line if the data is perfectly normally distributed. We will also use a histogram to make the residuals easier to understand.

residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test
fig = sm.qqplot(
    residuals_train,
    fit=True,
    line='45'
)
fig = sm.qqplot(
    residuals_test,
    fit=True,
    line='45'
)
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot the residuals for the train set
sns.histplot(residuals_train, ax=ax1, kde=True, color='blue')
ax1.set_title('Train Residuals')
ax1.set_xlabel('Residuals')
ax1.set_ylabel('Frequency')
# Plot the residuals for the test set
sns.histplot(residuals_test, ax=ax2, kde=True, color='orange')
ax2.set_title('Test Residuals')
ax2.set_xlabel('Residuals')
ax2.set_ylabel('Frequency')
# Show the plot
plt.show()
Check homoscedasticity

We can check for homoscedasticity using a scatterplot, where the target is shown along the x-axis and the residuals are shown along the y-axis. We would expect the datapoints to be equally distributed across the y-axis as x (i.e. the target value) increases:
px.scatter(x=y_train, y=residuals_train)
px.scatter(x=y_test, y=residuals_test)

Step 7:

Improve on the baseline linear model

Let's train a non-linear model to see if it can do better than our linear regression model. In this case, "non-linear" means that the model can learn from the data connections that are not linear.


Data preprocessing

We need to make a new training set and test set because the ones we used for the baseline linear model have been changed. Note: We use the same "random_seed" number for both the training and test sets to make sure that the observations are the same as those in the baseline linear model's training and test sets:

Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)
ohe = OneHotEncoder(use_cat_names=True)
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)
rfe = RFE(estimator=XGBRegressor())
xgb = XGBRegressor()
steps = [
    ('rfe', rfe),
    ('xgb', xgb)
]
pipe = Pipeline(steps)
num_features = X_train.shape[1]
search_spaces = {
    'rfe__n_features_to_select': Integer(1, num_features), # Num features returned by RFE
    'xgb__n_estimators': Integer(1, 500), # Num trees built by XGBoost
    'xgb__max_depth': Integer(2, 8), # Max depth of trees built by XGBoost
    'xgb__reg_lambda': Integer(1, 200), # Regularisation term (lambda) used in XGBoost
    'xgb__learning_rate': Real(0, 1), # Learning rate used in XGBoost
    'xgb__gamma': Real(0, 2000) # Gamma used in XGBoost
}
xgb_bs_cv = BayesSearchCV(
    estimator=pipe, # Pipeline
    search_spaces=search_spaces, # Search spaces
    scoring='neg_root_mean_squared_error', # BayesSearchCV tries to maximise scoring metric, so negative RMSE used
    n_iter=100, # Num of optimisation iterations
    cv=3, # Number of folds
    n_jobs=-1, # Uses all available cores to compute
    verbose=1, # Show progress
    random_state=0 # Ensures reproducible results
)
xgb_bs_cv.fit(
    X_train,
    y_train,
)

Model evaluation

First, let's take a look at how each set of parameters did on each fold. Each record in the dataset is a set of parameters that were tried. We use rank_test_score to make sure that the best set of parameters is at the top:

cv_results = pd.DataFrame(xgb_bs_cv.cv_results_).sort_values('rank_test_score')
cv_results

Let's use the model we trained with our best parameters to make guesses on both our training and test sets:

y_pred_train_xgb = xgb_bs_cv.predict(X_train)
y_pred_test_xgb = xgb_bs_cv.predict(X_test)
# Train evaluation
mse_train_xgb = mean_squared_error(y_train, y_pred_train_xgb)
rmse_train_xgb = math.sqrt(mse_train_xgb)
mape_train_xgb = mean_absolute_percentage_error(y_train, y_pred_train_xgb)
r2_train_xgb = r2_score(y_train, y_pred_train_xgb)
print("Train Evaluation:")
print(f"Root Mean Squared Error: {rmse_train_xgb:.4f}")
print(f"Mean Squared Error: {mse_train_xgb:.4f}")
print(f"Mean Absolute Error: {rmse_train_xgb:.4f}")
print(f"Mean Absolute Percentage Error: {mape_train_xgb:.4f}")
print(f"R Squared: {r2_train_xgb:.4f}")
# Test evaluation
mse_test_xgb = mean_squared_error(y_test, y_pred_test_xgb)
rmse_test_xgb = math.sqrt(mse_test_xgb)
mape_test_xgb = mean_absolute_percentage_error(y_test, y_pred_test_xgb)
r2_test_xgb = r2_score(y_test, y_pred_test_xgb)
print("Test Evaluation:")
print(f"Root Mean Squared Error: {rmse_test_xgb:.4f}")
print(f"Mean Squared Error: {mse_test_xgb:.4f}")
print(f"Mean Absolute Error: {rmse_test_xgb:.4f}")
print(f"Mean Absolute Percentage Error: {mape_test_xgb:.4f}")
print(f"R Squared: {r2_test_xgb:.4f}")
#Create a table with the evaluation metrics
evaluation_metrics = pd.DataFrame({
    "Metric": ["Root Mean Squared Error", "Mean Squared Error", "Mean Absolute Error", "Mean Absolute Percentage Error", "R Squared"],
    "Linear Regression": [rmse_train, mse_train, rmse_train, mape_train, r2_train],
    "XGBoost": [rmse_train_xgb, mse_train_xgb, rmse_train_xgb, mape_train_xgb, r2_train_xgb],
})
print(evaluation_metrics.to_string())
# Plot the predicted values against the actual values
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred_test_xgb)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values (XGBoost)")
plt.show()

Comparison to the baseline model

# Create a table with the evaluation metrics for both models
evaluation_metrics = pd.DataFrame({
    "Metric": ["Root Mean Squared Error", "Mean Squared Error", "Mean Absolute Error", "Mean Absolute Percentage Error", "R Squared"],
    "Linear Regression": [rmse_train, mse_train, rmse_train, mape_train, r2_train],
    "XGBoost": [rmse_train_xgb, mse_train_xgb, rmse_train_xgb, mape_train_xgb, r2_train_xgb],
})
# Print the table
print(evaluation_metrics.to_string())
# Compare the evaluation metrics for both models
print("Comparison of Evaluation Metrics:")
print("- Root Mean Squared Error:")
print(f"  Linear Regression: {rmse_train:.4f}")
print(f"  XGBoost: {rmse_train_xgb:.4f}")
print("- Mean Squared Error:")
print(f"  Linear Regression: {mse_train:.4f}")
print(f"  XGBoost: {mse_train_xgb:.4f}")
print("- Mean Absolute Error:")
print(f"  Linear Regression: {rmse_train:.4f}")
print(f"  XGBoost: {rmse_train_xgb:.4f}")
print("- Mean Absolute Percentage Error:")
print(f"  Linear Regression: {mape_train:.4f}")
print(f"  XGBoost: {mape_train_xgb:.4f}")
print("- R Squared:")
print(f"  Linear Regression: {r2_train:.4f}")
print(f"  XGBoost: {r2_train_xgb:.4f}")
# Based on the comparison, choose the model with the better performance
#plots for Compare the evaluation metrics for both models
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot the RMSE for both models
ax1.bar(['Linear Regression', 'XGBoost'], [rmse_train, rmse_train_xgb])
ax1.set_title('Root Mean Squared Error')
ax1.set_xlabel('Model')
ax1.set_ylabel('RMSE')
# Plot the R^2 for both models
ax2.bar(['Linear Regression', 'XGBoost'], [r2_train, r2_train_xgb])
ax2.set_title('R Squared')
ax2.set_xlabel('Model')
ax2.set_ylabel('R^2')
# Show the plot
plt.show()

Giving the data to stakeholders who aren't technical



A lot of the time, data scientists have to tell people who aren't pros in the field how well a model works. This means that measures like RMSE aren't very useful because they're hard to understand.

Instead, let's show what percentage of the time our model's predictions are close to the real "charges" number. As an example, the portion of our model's test set predictions that are within $10,000 of the real "charges" number is:

# The percentage of our model's predictions (on the test set) that are within $10000 of the actual charges value is:
# Calculate the number of predictions within $2000 of the actual charges value
num_within_10000 = len(np.where((abs(y_test - y_pred_test_xgb) <= 10000))[0])
# Calculate the percentage of predictions within $2000 of the actual charges value
percentage_within_10000 = (num_within_10000 / len(y_test)) * 100
# Print the percentage
print(f"The percentage of predictions within $10000 of the actual charges value is: {percentage_within_10000:.2f}%")

We can also show what percentage of our model's predictions are within a certain percentage of the actual charges value.

For example, the percentage our model's predictions (on the test set) that are within 30% of the actual charges value is:

# The percentage our model's predictions (on the test set) that are within 30% of the actual charges value is:
# Calculate the number of predictions within 30% of the actual charges value
num_within_30_percent = len(np.where((abs(y_test - y_pred_test_xgb) / y_test) <= 0.30)[0])
# Calculate the percentage of predictions within 30% of the actual charges value
percentage_within_30_percent = (num_within_30_percent / len(y_test)) * 100
# Print the percentage
print(f"The percentage of predictions within 30% of the actual charges value is: {percentage_within_30_percent:.2f}%")

Conclusion

We understood what Linear Regression is based on and how to use and test it based on those assumptions. We learned a lot about the XGBoost algorithm and how to use it with the pipeline object in sklearn and bayesian optimization. We also learned how to tell non-technical users about how well a model is working.  

Next, we'll look at some real-life insurance industry uses of Machine Learning. The Insurance Industry and Machine Learning.


Machine Learning in Insurance Industry

  • Machine learning assists them in identifying probable fraudulent claims more quickly and correctly, and flagging them for inquiry. Given the variety of fraud types and the relatively low ratio of known frauds in typical samples, detecting insurance fraud is a difficult challenge. Machine learning techniques enable loss control units to achieve larger coverage with lower false positive rates by enhancing forecasting accuracy.
  • Consumers want personalized solutions, which are made feasible by machine learning algorithms that analyze their profiles and offer tailored items. A growing number of the insurance industry are using chatbots on messaging apps to answer easy questions and handle claims.
  • The insurance industry are using machine learning more and more to reach important goals like lowering costs, improving underwriting, and finding scams.
Code Editor