## Insurance Pricing Forecast Using XGBoost Regressor

This project builds a machine learning model that predicts healthcare costs so that insurance premiums can be set appropriately. Two models are built and compared: an XGBoost Regressor and a Linear Regression model. The overarching goal is to help insurance companies set profitable rates by estimating healthcare expenses from features such as age, BMI, and smoking status. The process covers data preparation, feature engineering, model training, evaluation, and model comparison, with a particular focus on communicating the results to both technical and non-technical stakeholders.

# Explanation of the Code

### Step 1:

This piece of code mounts your Google Drive in a Google Colab notebook. This makes it easy to access files saved in Google Drive from the Colab environment, so that data can be loaded, analyzed, and used to train models.

```python
from google.colab import drive

drive.mount('/content/drive')
```

Import required packages

```python
!pip install numpy
!pip install pandas
!pip install plotly
!pip install scikit-learn
!pip install scikit-optimize
!pip install statsmodels
!pip install category_encoders
!pip install xgboost
!pip install nbformat
!pip install matplotlib
```

### Step 2:

Import libraries

```python
import sys
import math

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import RFE
```

**Exploratory Data Analysis (EDA)**

EDA stands for Exploratory Data Analysis. It is the process of examining data using statistical summaries and visualizations to spot trends, find outliers, test assumptions, and so on. The main goal of EDA is to understand the data before formulating any hypotheses about it.
EDA is an important step when building a machine learning model: it helps us understand how the variables are distributed, how they relate to one another, and which features are likely to be good predictors.
First, let's read in the data, which is stored in the "input" folder as "insurance.csv".

```python
data = pd.read_csv('/content/drive/MyDrive/Aionlinecourse/data/insurance_dataset.csv')
data.head()
```

```python
data.info()
```

So we have three numeric features (Age, Bmi and Children) and three categorical features (Sex, Smoker and Region).

NOTE: there are no null values in any of the columns, which means we won't need to impute values in the data preprocessing step. However, this is a step you will usually need to consider when building a machine learning model.
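If the dataset did contain missing values, one common approach is median imputation, which is robust to skewed distributions like the ones in this data. A minimal sketch using scikit-learn's `SimpleImputer` on a toy frame (the values below are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing Bmi value (illustrative only)
df = pd.DataFrame({"Age": [19, 33, 45], "Bmi": [27.9, np.nan, 30.5]})

# Replace each missing entry with the column's median
imputer = SimpleImputer(strategy="median")
df[["Age", "Bmi"]] = imputer.fit_transform(df[["Age", "Bmi"]])
print(df)
```

In a real pipeline the imputer would be fit on the training set only and then applied to the test set, to avoid leaking test-set statistics into training.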

The target (i.e. the variable that we want to predict) is the charges column, so let's split the dataset into features (X) and the target (y):

```python
target = 'Charges'

X = data.drop(target, axis=1)
y = data[target]

X.shape, y.shape
```

**Distributions**

Let's now look at the distribution of each feature by plotting a histogram for each one. Points to note regarding the distribution of each feature:

- Age - Approximately uniformly distributed.
- Sex - Approximately equal volume in each category.
- Bmi - Approximately normally distributed.
- Children - Right skewed (i.e. higher volume in lower range).
- Smoker - Significantly more volume in the no category vs the yes category.
- Region - Approximately equal volume in each category.
- Charges (the target) - Right skewed (i.e. higher volume in the lower range).

```python
fig = px.histogram(data, x=target, nbins=50, title="Distribution of Charges")
fig.show()
```
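As a numeric complement to the histogram, the skew of the target can be quantified directly with pandas; a positive value confirms right skew. A small sketch on made-up, charges-like values:

```python
import pandas as pd

# Right-skewed toy sample mimicking the shape of the Charges column
charges = pd.Series([1100, 2300, 3500, 4200, 5100,
                     8800, 13000, 21000, 39000, 63000])

# Positive skewness => long right tail
print(f"skewness = {charges.skew():.2f}")
```

On the real data, `data[target].skew()` gives the same diagnostic without a plot.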

**Univariate analysis (with respect to the target)**

The next step is univariate analysis with respect to the target. In other words, we take each feature in turn and examine how it relates to the target.

How we do this depends on whether the feature is numeric or categorical: a scatterplot will be used for numeric features and a boxplot for categorical features.

**Numeric features**

Points to note regarding each feature:

- Age - As Age increases, Charges also tends to increase (although there is a large variance in Charges for a given Age).
- Bmi - There is no clear relationship, although there seems to be a group of individuals with Bmi > 30 that have Charges > 30k. This group may become more apparent when we carry out our bivariate analysis later.
- Children - No clear relationship (although Charges seems to decrease as Children increases). Since there are only 6 unique values for this feature, let's try treating it as a categorical feature for the purposes of univariate analysis.

```python
numeric_features = X.select_dtypes(include=[np.number])
numeric_features.columns
```

```python
# Plot a correlation heatmap for the numeric features
plt.figure(figsize=(6, 4))
sns.heatmap(numeric_features.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```

### Step 3:

**Categorical features**

```python
categorical_features = X.select_dtypes(include=["object"]).columns
categorical_features
```

**Things to keep in mind about each feature:**

- Sex - No significant differences in Charges between the categories.
- Smoker - Charges for Smoker == 'yes' are generally much higher than when Smoker == 'no'.
- Region - No significant differences in Charges between the categories.
- Children - No significant differences in Charges between the categories (Children >= 4 are skewed towards lower Charges, but this is likely due to the low volumes in those categories - see the Distributions section).

```python
# Create paired boxplots for each categorical feature against the target variable
for col in categorical_features:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=col, y=target, data=data)
    plt.title(f"Distribution of {target} by {col}")
    plt.show()
```

```python
# Create paired scatterplots for each pair of numeric features
for i, col1 in enumerate(numeric_features.columns):
    for j, col2 in enumerate(numeric_features.columns):
        if i < j:
            plt.figure(figsize=(6, 4))
            sns.scatterplot(x=col1, y=col2, data=data)
            plt.title(f"Scatterplot of {col1} vs. {col2}")
            plt.show()
```

### Step 4:

**Create a scatter matrix and run a Chi-squared test**

```python
sns.pairplot(data)  # Display the scatter matrix
plt.show()
```

```python
px.imshow(X.select_dtypes(include=np.number).corr())
```

```python
# Chi-squared test
import scipy.stats as stats

# Loop through each categorical feature
for col in categorical_features:
    # Create a contingency table
    crosstab = pd.crosstab(data[col], data[target])

    # Perform the chi-squared test
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)

    # Print the results
    print(f"Chi-squared test for {col} and {target}:")
    print(f"- Chi-squared statistic: {chi2:.4f}")
    print(f"- P-value: {p:.4f}")
    print(f"- Degrees of freedom: {dof}")
    print(f"- Expected frequencies:\n{expected}")

    # Interpret the results
    if p < 0.05:
        print(f"There is a statistically significant association between {col} and {target}.")
    else:
        print(f"There is no statistically significant association between {col} and {target}.")
```

The Chi-squared test checks if there is a significant association between two categorical variables by comparing observed and expected frequencies in a contingency table. If the p-value is less than 0.05, it indicates a significant association, suggesting that the variables are related rather than independent.
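Because Charges is continuous, the loop above effectively treats every distinct charge as its own category, so a test between two genuinely categorical variables is easier to interpret. As a sketch of the mechanics, here is the same test on a single hypothetical 2x2 contingency table (counts invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = Smoker yes/no, cols = Sex male/female
observed = np.array([[60,  40],
                     [500, 738]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
if p < 0.05:
    print("Reject independence at the 5% level")
```

With these made-up counts, smokers are over-represented among males, so the test reports a significant association.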

```python
chi2
```

**Numeric-categorical feature pairs**

**ANOVA**

```python
!pip install pingouin
import pingouin as pg
```

```python
X.head()
```

```python
print(X.columns)
```

```python
X_anova = pg.anova(dv='Age', between='Bmi', data=X)
X_anova
```
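The call above runs an ANOVA of Age across Bmi levels, which is unusual since Bmi is continuous. A more conventional numeric-versus-categorical check for this dataset would compare mean Charges between Smoker groups; a self-contained sketch with synthetic numbers, using scipy's `f_oneway` instead of pingouin:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for Charges of smokers vs non-smokers
charges_smoker = rng.normal(32000, 8000, 100)
charges_nonsmoker = rng.normal(8500, 4000, 100)

# One-way ANOVA: do the group means differ significantly?
f_stat, p_value = stats.f_oneway(charges_smoker, charges_nonsmoker)
print(f"F={f_stat:.1f}, p={p_value:.3g}")
```

On the real data, this would be `stats.f_oneway(*[g[target] for _, g in data.groupby('Smoker')])`.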

### Step 5:

**Data preprocessing**

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Drop features that showed little relationship with the target
cols_to_drop = ['Children', 'Region', 'Sex']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

# One-hot encode the remaining categorical feature (Smoker)
ohe = OneHotEncoder(use_cat_names=True)
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)

# Drop one dummy column to avoid redundancy
cols_to_drop = ['Smoker_no']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

# Transform the skewed target towards normality
pt = PowerTransformer(method='yeo-johnson')
y_train_t = pt.fit_transform(y_train.values.reshape(-1, 1))[:, 0]
y_test_t = pt.transform(y_test.values.reshape(-1, 1))[:, 0]

pd.Series(y_train_t).hist(figsize=(5, 3))
pd.Series(y_test_t).hist(figsize=(5, 3))
```

**Model training**

Sample weights can make the residuals more homoscedastic. During training, observations with higher Charges are given more weight than observations with lower Charges, so residuals are penalised more heavily for high-Charges observations than for low-Charges ones.
We derive the sample weight from the target column by dividing it by the minimum value of Charges, so the smallest sample weight is exactly 1.

```python
sample_weight = y_train / y_train.min()

lr = LinearRegression()
lr.fit(X_train, y_train_t, sample_weight=sample_weight)
```

### Step 6:

**Evaluating the Model**

Now that the model is trained, we can use it to make predictions on both the training and test sets. The code below computes metrics such as MSE, RMSE, MAPE, and R^2 to measure how well the linear regression model fits the training data, then prints the results.

```python
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
import math

# Train evaluation
y_pred_train_t = lr.predict(X_train)
y_pred_train = pt.inverse_transform(y_pred_train_t.reshape(-1, 1))[:, 0]
y_train = pt.inverse_transform(y_train_t.reshape(-1, 1))[:, 0]

mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = math.sqrt(mse_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)

print("Train Evaluation:")
print(f"MSE: {mse_train:.4f}")
print(f"RMSE: {rmse_train:.4f}")
print(f"MAPE: {mape_train:.4f}")
print(f"R^2: {r2_train:.4f}")
```

This code evaluates the linear regression model on the test data, calculating mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the R-squared (R^2) score, then prints the results.

```python
# Test evaluation
y_pred_test_t = lr.predict(X_test)
y_pred_test = pt.inverse_transform(y_pred_test_t.reshape(-1, 1))[:, 0]
y_test = pt.inverse_transform(y_test_t.reshape(-1, 1))[:, 0]

mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = math.sqrt(mse_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print("Test Evaluation:")
print(f"MSE: {mse_test:.4f}")
print(f"RMSE: {rmse_test:.4f}")
print(f"MAPE: {mape_test:.4f}")
print(f"R^2: {r2_test:.4f}")
```

```python
# Create a table with the evaluation metrics
evaluation_metrics = pd.DataFrame({
    "Metric": ["MSE", "RMSE", "MAPE", "R^2"],
    "Train": [mse_train, rmse_train, mape_train, r2_train],
    "Test": [mse_test, rmse_test, mape_test, r2_test]
})
print(evaluation_metrics.to_string())

# Plot the predicted values against the actual values
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred_test)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.show()
```

**Check normality of residuals**

A QQ (quantile-quantile) plot can be used to check whether the residuals are normally distributed. It plots each empirical quantile (from the data) against the corresponding theoretical quantile (from a normal distribution); if the data were perfectly normally distributed, the points would lie on a straight line.
We will also use histograms to make the residuals easier to interpret.

```python
residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test

fig = sm.qqplot(residuals_train, fit=True, line='45')
```

```python
fig = sm.qqplot(residuals_test, fit=True, line='45')
```

```python
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Plot the residuals for the train set
sns.histplot(residuals_train, ax=ax1, kde=True, color='blue')
ax1.set_title('Train Residuals')
ax1.set_xlabel('Residuals')
ax1.set_ylabel('Frequency')

# Plot the residuals for the test set
sns.histplot(residuals_test, ax=ax2, kde=True, color='orange')
ax2.set_title('Test Residuals')
ax2.set_xlabel('Residuals')
ax2.set_ylabel('Frequency')

plt.show()
```

**Check homoscedasticity**

```python
px.scatter(x=y_train, y=residuals_train)
```

```python
px.scatter(x=y_test, y=residuals_test)
```
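Beyond eyeballing these scatterplots, a quick numeric check of homoscedasticity is the correlation between the absolute residuals and the target: values near zero suggest constant variance. A sketch on synthetic data (the residual series here are invented to show both cases):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.uniform(1000, 50000, 500)                 # stand-in for actual charges
resid_homo = rng.normal(0, 2000, 500)             # constant-variance residuals
resid_hetero = rng.normal(0, 1, 500) * y * 0.1    # variance grows with y

corr_homo = np.corrcoef(y, np.abs(resid_homo))[0, 1]
corr_hetero = np.corrcoef(y, np.abs(resid_hetero))[0, 1]

print(f"homoscedastic:   |corr| = {abs(corr_homo):.3f}")
print(f"heteroscedastic: |corr| = {abs(corr_hetero):.3f}")
```

The same diagnostic on our model would use `np.corrcoef(y_test, np.abs(residuals_test))`.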

### Step 7:

**Improve on the baseline linear model**

Let's train a non-linear model to see if it can outperform our linear regression baseline. Here, "non-linear" means the model can learn relationships in the data that are not linear.

**Data preprocessing**

We need to create a new training set and test set because the ones used for the baseline linear model have been modified.
Note: we use the same random_state value so that the train/test observations match those used for the baseline linear model:

**Train/test split**

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

ohe = OneHotEncoder(use_cat_names=True)
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)

rfe = RFE(estimator=XGBRegressor())
xgb = XGBRegressor()

steps = [
    ('rfe', rfe),
    ('xgb', xgb)
]
pipe = Pipeline(steps)

num_features = X_train.shape[1]
search_spaces = {
    'rfe__n_features_to_select': Integer(1, num_features),  # Num features returned by RFE
    'xgb__n_estimators': Integer(1, 500),    # Num trees built by XGBoost
    'xgb__max_depth': Integer(2, 8),         # Max depth of trees built by XGBoost
    'xgb__reg_lambda': Integer(1, 200),      # Regularisation term (lambda) used in XGBoost
    'xgb__learning_rate': Real(0, 1),        # Learning rate used in XGBoost
    'xgb__gamma': Real(0, 2000)              # Gamma used in XGBoost
}

xgb_bs_cv = BayesSearchCV(
    estimator=pipe,                           # Pipeline
    search_spaces=search_spaces,              # Search spaces
    scoring='neg_root_mean_squared_error',    # BayesSearchCV maximises the scoring metric, so negative RMSE is used
    n_iter=100,                               # Num of optimisation iterations
    cv=3,                                     # Number of folds
    n_jobs=-1,                                # Uses all available cores to compute
    verbose=1,                                # Show progress
    random_state=0                            # Ensures reproducible results
)

xgb_bs_cv.fit(X_train, y_train)
```

**Model evaluation**

First, let's look at how each set of parameters performed across the folds. Each record in the results corresponds to one parameter set that was tried; sorting by rank_test_score puts the best parameter set at the top:

```python
cv_results = pd.DataFrame(xgb_bs_cv.cv_results_).sort_values('rank_test_score')
cv_results
```

**Let's use the model trained with the best parameters to make predictions on both our training and test sets:**

```python
y_pred_train_xgb = xgb_bs_cv.predict(X_train)
y_pred_test_xgb = xgb_bs_cv.predict(X_test)

# Train evaluation
mse_train_xgb = mean_squared_error(y_train, y_pred_train_xgb)
rmse_train_xgb = math.sqrt(mse_train_xgb)
mape_train_xgb = mean_absolute_percentage_error(y_train, y_pred_train_xgb)
r2_train_xgb = r2_score(y_train, y_pred_train_xgb)

print("Train Evaluation:")
print(f"Root Mean Squared Error: {rmse_train_xgb:.4f}")
print(f"Mean Squared Error: {mse_train_xgb:.4f}")
print(f"Mean Absolute Percentage Error: {mape_train_xgb:.4f}")
print(f"R Squared: {r2_train_xgb:.4f}")
```

```python
# Test evaluation
mse_test_xgb = mean_squared_error(y_test, y_pred_test_xgb)
rmse_test_xgb = math.sqrt(mse_test_xgb)
mape_test_xgb = mean_absolute_percentage_error(y_test, y_pred_test_xgb)
r2_test_xgb = r2_score(y_test, y_pred_test_xgb)

print("Test Evaluation:")
print(f"Root Mean Squared Error: {rmse_test_xgb:.4f}")
print(f"Mean Squared Error: {mse_test_xgb:.4f}")
print(f"Mean Absolute Percentage Error: {mape_test_xgb:.4f}")
print(f"R Squared: {r2_test_xgb:.4f}")
```

```python
# Create a table with the evaluation metrics
evaluation_metrics = pd.DataFrame({
    "Metric": ["Root Mean Squared Error", "Mean Squared Error",
               "Mean Absolute Percentage Error", "R Squared"],
    "Linear Regression": [rmse_train, mse_train, mape_train, r2_train],
    "XGBoost": [rmse_train_xgb, mse_train_xgb, mape_train_xgb, r2_train_xgb],
})
print(evaluation_metrics.to_string())
```

```python
# Plot the predicted values against the actual values
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred_test_xgb)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values (XGBoost)")
plt.show()
```

**Comparison to the baseline model**

```python
# Create a table with the evaluation metrics for both models
evaluation_metrics = pd.DataFrame({
    "Metric": ["Root Mean Squared Error", "Mean Squared Error",
               "Mean Absolute Percentage Error", "R Squared"],
    "Linear Regression": [rmse_train, mse_train, mape_train, r2_train],
    "XGBoost": [rmse_train_xgb, mse_train_xgb, mape_train_xgb, r2_train_xgb],
})

# Print the table
print(evaluation_metrics.to_string())

# Compare the evaluation metrics for both models
print("Comparison of Evaluation Metrics:")
print("- Root Mean Squared Error:")
print(f"  Linear Regression: {rmse_train:.4f}")
print(f"  XGBoost: {rmse_train_xgb:.4f}")
print("- Mean Squared Error:")
print(f"  Linear Regression: {mse_train:.4f}")
print(f"  XGBoost: {mse_train_xgb:.4f}")
print("- Mean Absolute Percentage Error:")
print(f"  Linear Regression: {mape_train:.4f}")
print(f"  XGBoost: {mape_train_xgb:.4f}")
print("- R Squared:")
print(f"  Linear Regression: {r2_train:.4f}")
print(f"  XGBoost: {r2_train_xgb:.4f}")

# Based on the comparison, choose the model with the better performance
```

```python
# Plot comparisons of the evaluation metrics for both models
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Plot the RMSE for both models
ax1.bar(['Linear Regression', 'XGBoost'], [rmse_train, rmse_train_xgb])
ax1.set_title('Root Mean Squared Error')
ax1.set_xlabel('Model')
ax1.set_ylabel('RMSE')

# Plot the R^2 for both models
ax2.bar(['Linear Regression', 'XGBoost'], [r2_train, r2_train_xgb])
ax2.set_title('R Squared')
ax2.set_xlabel('Model')
ax2.set_ylabel('R^2')

plt.show()
```

**Presenting results to non-technical stakeholders**

For non-technical stakeholders, raw error metrics are less intuitive than statements about how often the model gets close to the real answer. Instead, let's show what percentage of the time our model's predictions are close to the actual Charges value. For example, the portion of our model's test-set predictions that are within $10,000 of the actual Charges value is:

```python
# Calculate the number of predictions within $10,000 of the actual charges value
num_within_10000 = len(np.where(abs(y_test - y_pred_test_xgb) <= 10000)[0])

# Calculate the percentage of predictions within $10,000 of the actual charges value
percentage_within_10000 = (num_within_10000 / len(y_test)) * 100

# Print the percentage
print(f"The percentage of predictions within $10000 of the actual charges value is: {percentage_within_10000:.2f}%")
```

We can also report what percentage of the model's predictions fall within a certain percentage of the actual charges value. For example, the percentage of the model's test-set predictions that are within 30% of the actual charges value is:

```python
# Calculate the number of predictions within 30% of the actual charges value
num_within_30_percent = len(np.where((abs(y_test - y_pred_test_xgb) / y_test) <= 0.30)[0])

# Calculate the percentage of predictions within 30% of the actual charges value
percentage_within_30_percent = (num_within_30_percent / len(y_test)) * 100

# Print the percentage
print(f"The percentage of predictions within 30% of the actual charges value is: {percentage_within_30_percent:.2f}%")
```

**Conclusion**

We covered the assumptions behind Linear Regression and how to apply and test a model against them. We learned how to use the XGBoost algorithm with sklearn's Pipeline object and Bayesian optimization, and we saw how to communicate model performance to non-technical stakeholders.

Next, we'll look at some real-life uses of machine learning in the insurance industry.

**Machine Learning in the Insurance Industry**

- Machine learning helps insurers identify probably fraudulent claims more quickly and accurately and flag them for investigation. Detecting insurance fraud is a difficult challenge given the variety of fraud types and the relatively low ratio of known frauds in typical samples. By improving forecasting accuracy, machine learning techniques enable loss control units to achieve wider coverage with lower false positive rates.
- Consumers want personalized solutions, which machine learning algorithms make feasible by analyzing customer profiles and offering tailored products. A growing number of insurers are using chatbots on messaging apps to answer simple questions and handle claims.
- Insurers are increasingly using machine learning to meet key goals such as lowering costs, improving underwriting, and detecting fraud.