Machine Learning Project: Airline Tickets Price Prediction

Written by Aionlinecourse

This project is all about predicting airline ticket prices. The dataset contains columns such as Departure time, Arrival time, Date of journey, Duration, and so on. Through data preprocessing, data analysis, feature selection, and other techniques we built a machine learning model, and at the end we applied several ML algorithms to get good accuracy from our model.


Understand The Data

Dealing with missing values
  • For data manipulation, numerical computation, and visualization we imported the pandas, NumPy, seaborn, and matplotlib libraries
  • Reading data and saving it into the train_data variable.
  • Previewing data by calling head function.
import pandas as pd                #importing pandas for data manipulation and analysis
import numpy as np                 #importing numpy for numerical computation
import seaborn as sns              #importing seaborn for visualization purposes
import matplotlib.pyplot as plt    #importing matplotlib for additional visualization tasks
train_data=pd.read_excel('Data_Train.xlsx')   #reading data
train_data.head()                  #calling head to get preview of data

The shape of the entire data frame.

train_data.shape


Getting the number of missing values per column by calling the isna() and sum() functions.

train_data.isna().sum()

Dropping missing values using dropna() function and updating data frame by setting inplace=True

train_data.dropna(inplace=True)

Here, we cross-check whether the missing values were removed or not. In the output section, we can see there are no missing values left.

train_data.isna().sum()


Cleaning data for analysis and modeling purpose

Checking the datatype of every column available in the data.

train_data.dtypes

This function converts a column to datetime format. It takes col as a parameter; pd.to_datetime() converts the column to datetime, and the result is assigned back to train_data[col].

def change_into_datetime(col):
    train_data[col]=pd.to_datetime(train_data[col])

All columns of train_data in the form of a list.

train_data.columns

Putting the ‘Date_of_Journey’, ‘Dep_Time’, and ‘Arrival_Time’ columns into a list and looping through it, passing each column into the change_into_datetime function to convert it from object to datetime.

for i in ['Date_of_Journey','Dep_Time','Arrival_Time']:
    change_into_datetime(i)

Checking train_data’s column data type, whether it is converted into DateTime or not.

train_data.dtypes

Here, we split the ‘Date_of_Journey’ column of train_data into day and month, because when we pass a raw date into our machine learning model, the model will not understand which part is the day and which is the month. The results are assigned to new columns named journey_day and journey_month.

train_data['journey_day']=train_data['Date_of_Journey'].dt.day
train_data['journey_month']=train_data['Date_of_Journey'].dt.month

Calling head function to get a rough idea of our data frame named ‘train_data’.

train_data.head()


Now dropping the ‘Date_of_Journey’ column, as we have extracted everything we need from it. Setting inplace=True updates the data frame, and axis=1 means we are dropping a column rather than a row.

train_data.drop('Date_of_Journey',axis=1,inplace=True)

You can see there is no such column named ‘Date_of_Journey’.

train_data.head()


How to extract Derived features from data

We created functions to extract hours and minutes, plus a drop function to remove the column from which hours and minutes were extracted. The extract functions take two parameters, a data frame and a column, and save the extracted values into new columns derived from the column name. The drop function removes the given column and updates the data frame with inplace=True.

def extract_hour(df,col):
    df[col+'_hour']=df[col].dt.hour
def extract_min(df,col):
    df[col+'_minute']=df[col].dt.minute
def drop_column(df,col):
    df.drop(col,axis=1,inplace=True)

Extracting the hour and minute from the Dep_Time column by calling the extract_hour and extract_min functions, then dropping Dep_Time from the table by calling the drop_column function.

extract_hour(train_data,'Dep_Time')
extract_min(train_data,'Dep_Time')
drop_column(train_data,'Dep_Time')

You can see there is no Dep_Time column as we dropped it, and two columns were added, named ‘Dep_Time_hour’ and ‘Dep_Time_minute’.

train_data.head()


Now, extract hour and minute from the Arrival_Time column and drop this column by calling the same function.

extract_hour(train_data,'Arrival_Time')
extract_min(train_data,'Arrival_Time')
drop_column(train_data,'Arrival_Time')

The Arrival_Time column is dropped and two new columns are added, Arrival_Time_hour and Arrival_Time_minute.

train_data.head()


In the Duration column, we can see that for some values the minutes (or the hours) are missing. We fix this by appending '0m' when only hours are given and prepending '0h' when only minutes are given.

duration = list(train_data['Duration'])
for i in range(len(duration)):
    if len(duration[i].split(' '))==2:
        pass
    else:
        if 'h' in duration[i]:
            duration[i]=duration[i]+'0m'
        else:
            duration[i]='0h'+duration[i]

Saving the new values back into the Duration column.

train_data['Duration']=duration

We can see the Duration column is updated.

train_data.head()


Perform Data Pre-processing

We split our Duration column into Duration_hours and Duration_mins so that our machine learning model can understand the feature; the two new columns are added and the original Duration column is dropped.

def hour(x):
    return x.split(' ')[0][0:-1]     # 'Xh Ym' -> 'X'
def minute(x):
    return x.split(' ')[1][0:-1]     # 'Xh Ym' -> 'Y'

Here, we apply the hour and minute functions to our Duration column and save the results into new columns named ‘Duration_hours’ and ‘Duration_mins’.

train_data['Duration_hours']=train_data['Duration'].apply(hour)
train_data['Duration_mins']=train_data['Duration'].apply(minute)

Calling head to see the changes

train_data.head()

Removed the ‘Duration’ column by calling the drop_column function as it has no use currently.

drop_column(train_data,'Duration')

Checking the entire data columns after the operation.

train_data.head()

The datatype of train_data set.

train_data.dtypes

Converted ‘Duration_hours’ and ‘Duration_mins’ from object type to (int) type. Finally, updating it.

train_data['Duration_hours']=train_data['Duration_hours'].astype(int)
train_data['Duration_mins']=train_data['Duration_mins'].astype(int)

Cross-checked dtype.

train_data.dtypes

To separate categorical features from numerical/continuous features, we loop through every column of the dataset; if a column's data type is object, we consider it a categorical column.

cat_col=[col for col in train_data.columns if train_data[col].dtype=='O']
cat_col

Fetching continuous features.

cont_col=[col for col in train_data.columns if train_data[col].dtype!='O']
cont_col


Handle Categorical Data & Feature Encoding

Categorical data comes in two types: nominal and ordinal. Nominal data has no inherent order (for example, the name of a country), while ordinal data has some kind of hierarchy. So, for the nominal columns we perform one-hot encoding, and for the ordinal columns we apply label encoding.
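As a tiny illustration of the two approaches (the toy data frame below is made up purely for demonstration and is not part of the project dataset):

import pandas as pd

toy=pd.DataFrame({'City':['Delhi','Mumbai','Delhi'],          #'City' is nominal
                  'Stops':['non-stop','1 stop','2 stops']})   #'Stops' is ordinal

pd.get_dummies(toy['City'],drop_first=True)                   #nominal -> one-hot encoding (no order implied)
toy['Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2})       #ordinal -> map categories to ranked integers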

Passing cat_col into train_data to select all categorical columns and saving them into a new data frame.

categorical=train_data[cat_col]
categorical

Now, access the Airline column and count the occurrences of each category.

categorical['Airline'].value_counts()

Boxplot of Price for each Airline, with train_data sorted in descending order by the Price column.

plt.figure(figsize=(15,5))
sns.boxplot(x='Airline',y='Price',data=train_data.sort_values('Price',ascending=False))

Plotting Total_Stops against Price in the same way as before.

plt.figure(figsize=(15,5))
sns.boxplot(x='Total_Stops',y='Price',data=train_data.sort_values('Price',ascending=False))

Performed one-hot encoding on the Airline column, as our ML model doesn’t understand string values.

Airline=pd.get_dummies(categorical['Airline'],drop_first=True)

Overview of the Airline dummy columns.

Airline.head()

Source column value count.

categorical['Source'].value_counts()

Extracting the distribution of Source with respect to Price.

plt.figure(figsize=(15,5))
sns.boxplot(x='Source',y='Price',data=train_data.sort_values('Price',ascending=False))


Dummifying Source column.

Source=pd.get_dummies(categorical['Source'],drop_first=True)
Source.head()

Value counts of Destination.

categorical['Destination'].value_counts()

Extracting distribution of Destination with respect to Price.

plt.figure(figsize=(15,5))
sns.boxplot(x='Destination',y='Price',data=train_data.sort_values('Price',ascending=False))

Dummifying Destination column.

Destination=pd.get_dummies(categorical['Destination'],drop_first=True)
Destination.head()

How to Perform Label Encoding on the dataset

For label encoding, we access the Route column and split it into separate route columns.

categorical['Route_1']=categorical['Route'].str.split('→').str[0]
categorical['Route_2']=categorical['Route'].str.split('→').str[1]
categorical['Route_3']=categorical['Route'].str.split('→').str[2]
categorical['Route_4']=categorical['Route'].str.split('→').str[3]
categorical['Route_5']=categorical['Route'].str.split('→').str[4]

Here you can see all 5 route columns have been added.

categorical.head()

Dropping Route and checking null values in the categorical data frame.

drop_column(categorical,'Route')
categorical.isnull().sum()

Filling the null values in Route_3, Route_4, and Route_5 with 'None' and updating the data frame.

for i in ['Route_3','Route_4','Route_5']:
    categorical[i].fillna('None',inplace=True)

Cross-checking that no null values remain in the data frame.

categorical.isnull().sum()

Printing the number of categories in each column.

for feature in categorical.columns:
    print('{} has total {} categories \n'.format(feature,len(categorical[feature].value_counts())))

As we can see, Route has a lot of categories, so one-hot encoding would not be a good option; that's why we apply a label encoder to the Route columns. For that, we import the LabelEncoder class.

from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
categorical.columns

for i in ['Route_1', 'Route_2', 'Route_3', 'Route_4','Route_5']:
    categorical[i]=encoder.fit_transform(categorical[i])

Overview of the dataset.

categorical.head()

Additional_Info contains almost 80% 'No info' values, so we can drop this column.
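The roughly 80% share can be verified with value_counts; a quick check like the one below (not part of the original notebook) shows the proportion of each category:

categorical['Additional_Info'].value_counts(normalize=True)   #proportion of each category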

drop_column(categorical,'Additional_Info')

Checking value counts of Total_Stops column.

categorical['Total_Stops'].value_counts()

As Total_Stops is an ordinal categorical feature, we encode it by mapping each category to a corresponding numeric value.

stops_map={'non-stop':0, '1 stop':1, '2 stops':2, '3 stops':3, '4 stops':4}
categorical['Total_Stops']=categorical['Total_Stops'].map(stops_map)
categorical.head()

Concatenating the categorical data frame with the Airline, Source, and Destination dummies and the continuous columns.

data_train=pd.concat([categorical,Airline,Source,Destination,train_data[cont_col]],axis=1)
data_train.head()

Dropped the original Airline, Source, and Destination columns, since their dummy versions are already included.

drop_column(data_train,'Airline')
drop_column(data_train,'Source')
drop_column(data_train,'Destination')
data_train.head()

Setting pandas to display up to 35 columns.

pd.set_option('display.max_columns',35)
data_train.head()


Outliers Detection in Data

This function takes a data frame and a column as input, then draws a distribution plot and a boxplot of that column. Here we apply it to the Price column.

def plot(df,col):
    fig,(ax1,ax2)=plt.subplots(2,1)
    sns.distplot(df[col],ax=ax1)
    sns.boxplot(df[col],ax=ax2)
plt.figure(figsize=(30,20))
plot(data_train,'Price')

Here we deal with outliers by replacing any Price of 40,000 or more with the median price.

data_train['Price']=np.where(data_train['Price']>=40000,data_train['Price'].median(),data_train['Price'])
plt.figure(figsize=(30,20))
plot(data_train,'Price')

We separated our dependent and independent features into the X and y variables.

X=data_train.drop('Price',axis=1)
y=data_train['Price']

Select best Features using Feature Selection Technique

Finding the features that contribute most and have a strong relationship with the target variable. Why apply feature selection? To select the important features and avoid the curse of dimensionality, i.e. to get rid of redundant features.

We compute mutual information scores to learn about the relationship between each feature and the target.

Feature Selection using Information Gain.

from sklearn.feature_selection import mutual_info_classif
X.dtypes
mutual_info_classif(X,y)

Wrapping the scores in a data frame, naming the column 'importance', and sorting by it in descending order.

imp=pd.DataFrame(mutual_info_classif(X,y),index=X.columns)
imp.columns=['importance']
imp.sort_values(by='importance',ascending=False)
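A bar chart makes this ranking easier to read; a small sketch, assuming the imp data frame defined above:

imp.sort_values(by='importance',ascending=True).plot(kind='barh',figsize=(10,8))   #horizontal bar chart of the scores
plt.show()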

Apply Random Forest on Data & Automate your predictions

We split our data into train and test sets so that we can train and evaluate our model.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
from sklearn import metrics
## dump the model using pickle so that we can re-use it later
import pickle
def predict(ml_model,dump):
    model=ml_model.fit(X_train,y_train)
    print('Training score : {}'.format(model.score(X_train,y_train)))
    y_prediction=model.predict(X_test)
    print('predictions are: \n {}'.format(y_prediction))
    print('\n')
    r2_score=metrics.r2_score(y_test,y_prediction)
    print('r2 score: {}'.format(r2_score))
    print('MAE:',metrics.mean_absolute_error(y_test,y_prediction))
    print('MSE:',metrics.mean_squared_error(y_test,y_prediction))
    print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_prediction)))
    sns.distplot(y_test-y_prediction)
    if dump==1:
        # save the fitted model with pickle for later re-use
        # ('model.pkl' is an illustrative filename)
        with open('model.pkl','wb') as f:
            pickle.dump(model,f)

Importing random forest class.

from sklearn.ensemble import RandomForestRegressor
predict(RandomForestRegressor(),1)
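Since predict() saves the fitted model when dump is set to 1 ('model.pkl' in the sketch above is an assumed filename), it can be reloaded later and reused without retraining; a minimal sketch:

with open('model.pkl','rb') as f:      #reload the pickled model
    loaded_model=pickle.load(f)
loaded_model.predict(X_test)           #reuse it for predictions without retraining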

Play with multiple Algorithms & dump your model

Here, we applied several supervised learning algorithms to compare their performance and find the one with the best accuracy.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
predict(DecisionTreeRegressor(),0)
predict(LinearRegression(),0)
predict(KNeighborsRegressor(),0)

How to Cross Validate your model

We hyper-tuned our model. For this, we followed these steps:

1. Choose a method for hyperparameter tuning

     a. RandomizedSearchCV --> fast way to hyper-tune the model

     b. GridSearchCV --> slower, exhaustive way to hyper-tune the model (a sketch appears after the randomized search below)

2. Assign hyperparameters in form of a dictionary

3. Fit the model

4. Check best parameters and the best score

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators=[int(x) for x in np.linspace(start=100,stop=1200,num=6)]

# Number of features to consider at every split
max_features=['auto','sqrt']

# Maximum number of levels in tree
max_depth=[int(x) for x in np.linspace(5,30,num=4)]

# Minimum number of samples required to split a node
min_samples_split=[5,10,15,100]

Create the random grid

random_grid={
    'n_estimators':n_estimators,
    'max_features':max_features,
    'max_depth':max_depth,
    'min_samples_split':min_samples_split
}
random_grid

A random search of parameters, using 3 fold cross-validation

reg_rf=RandomForestRegressor()
rf_random=RandomizedSearchCV(estimator=reg_rf,param_distributions=random_grid,cv=3,verbose=2,n_jobs=-1)

Fitting our X_train, y_train dataset

rf_random.fit(X_train,y_train)
rf_random.best_params_
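The slower GridSearchCV option from step 1b works the same way; below is a minimal sketch with a reduced, purely illustrative grid (the exact values are assumptions, not part of the original notebook):

from sklearn.model_selection import GridSearchCV
param_grid={                                #smaller grid so the exhaustive search stays tractable
    'n_estimators':[300,600,900],
    'max_depth':[10,20,30],
    'min_samples_split':[5,10]
}
rf_grid=GridSearchCV(estimator=RandomForestRegressor(),param_grid=param_grid,cv=3,verbose=2,n_jobs=-1)
rf_grid.fit(X_train,y_train)
rf_grid.best_params_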

Predicting our X_test dataset.

prediction=rf_random.predict(X_test)

Visualizing the residuals (the difference between actual and predicted prices).

sns.distplot(y_test-prediction)

Evaluation metrics of our tuned model: r2 score, MAE, MSE, and RMSE.

metrics.r2_score(y_test,prediction)
metrics.mean_absolute_error(y_test,prediction)
metrics.mean_squared_error(y_test,prediction)
np.sqrt(metrics.mean_squared_error(y_test,prediction))


So, this is our model's accuracy. Thank you for reading this article.