Machine Learning Project: Hotel Booking Prediction [Part 2]

Written by Aionlinecourse

This is the second part of our Hotel Booking Prediction project. In this tutorial, we cover outlier handling, feature selection, supervised algorithms, and cross-validation.

How to Handle Outliers

Data points that lie far away from the rest of the data are called outliers. For example, if you have the ages of 100 people, almost all of them between 1 and 100, but a few entries show an age of 700 years, those entries are outliers. The models we are going to build can be badly affected by outliers, so we handle them first.
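
To make the idea concrete, here is a minimal sketch that flags outliers with the common 1.5 x IQR rule. The ages are made up for illustration, and the IQR rule is just one widely used convention, not necessarily the approach this project relies on.

```python
import numpy as np

# Hypothetical ages: mostly plausible values plus two impossible entries
ages = np.array([23, 35, 41, 29, 52, 67, 18, 44, 700, 38, 31, 700])

# Interquartile range: the spread of the middle 50% of the data
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers)
```

Only the two 700-year entries fall outside the computed bounds, so they are the values flagged.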

sas.distplot(dataframe['lead_time'])       #distribution plot of lead_time

[Figure: distribution plot of lead_time]

import numpy as np
def handle_outlier(col):            #log1p compresses large values, which reduces the skew of the column
    dataframe[col]=np.log1p(dataframe[col])
handle_outlier('lead_time')     #applying the log transform to lead_time
sas.distplot(dataframe['lead_time'])              #distribution plot of the log-transformed lead_time

[Figure: distribution plot of log-transformed lead_time]
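
The effect of the log transform can also be checked numerically instead of only by eye. The sketch below is an illustration on synthetic data: the log-normal sample is an assumption standing in for a right-skewed column such as lead_time, not the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed values (log-normal draw), not the real lead_time column
lead_time_demo = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

skew_before = lead_time_demo.skew()
# log1p(x) = log(1 + x), which stays defined when x is 0
skew_after = np.log1p(lead_time_demo).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 after the transform means the long right tail has been pulled in, which is what the second distribution plot shows visually.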

We handle outliers in our price feature, ‘adr’, the same way.

sas.distplot(dataframe['adr'])                      #distribution plot of adr

[Figure: distribution plot of adr]

handle_outlier('adr')                                         #applying the log transform to adr
sas.distplot(dataframe['adr'].dropna())                       #distribution plot of adr, with missing values dropped first

[Figure: distribution plot of log-transformed adr with missing values dropped]

Applying Techniques of Feature Importance

Our data has a large number of features, so we apply a feature importance technique to select the most important ones. Working with fewer, more relevant features helps us build more useful machine learning models.

First, we check for null values. The result shows only one missing value, in ‘adr’.

dataframe.isnull().sum()            #counting null values in each column

[Output: null-value count per column]
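
A missing value can appear after a log transform when a column contains a value below -1, because log1p is undefined there and NumPy returns NaN. Whether that is what happened to ‘adr’ here is an assumption. The sketch below only demonstrates the mechanism on made-up prices.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the negative entry stands in for a data-entry error
prices = pd.Series([75.0, 120.5, 0.0, -6.38, 98.0])

# log1p(x) = log(1 + x) is undefined for x < -1, so NumPy yields NaN there
with np.errstate(invalid="ignore"):      # silence the RuntimeWarning
    logged = np.log1p(prices)

n_missing = logged.isnull().sum()
print(n_missing)
```

If this is the cause, it is worth inspecting the raw column before transforming it, since dropping the row afterwards hides where the null came from.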

dataframe.dropna(inplace=True)                               #dropping rows with null values and updating dataframe
y=dataframe['is_canceled']                                   #target (dependent variable) -> is_canceled
x=dataframe.drop('is_canceled',axis=1)                       #independent features: everything except is_canceled
from sklearn.linear_model import Lasso                       #L1-regularized linear regression
from sklearn.feature_selection import SelectFromModel        #for selecting important features

Alpha is the penalty strength: the larger the value of alpha, the fewer features get selected.

feature_sel_model=SelectFromModel(Lasso(alpha=.005,random_state=0))                   #Lasso with a low alpha value; random_state=0 for reproducibility
feature_sel_model.fit(x,y)                                                            #fitting the selector on the data
feature_sel_model.get_support()                                                       #boolean mask: True for each selected feature
cols=x.columns          #all the columns
selected_feat=cols[feature_sel_model.get_support()]                                   #keeping only the selected columns
print('total features {}'.format(x.shape[1]))                                         #printing total features
print('Selected features {}'.format(len(selected_feat)))                              #printing the number of selected features

[Output: total and selected feature counts]

selected_feat                                            #displaying the names of the selected features

[Output: names of the selected features]

[Image: independent-feature dataframe update]
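
To see how alpha controls the number of selected features, here is a small sketch on synthetic data. The dataset from `make_classification` and the alpha values tried are assumptions chosen for illustration. They are not taken from the hotel booking data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in data: 20 features, only 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=0, random_state=0,
)

counts = {}
for alpha in (0.001, 0.01, 0.05, 0.1):
    selector = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    selector.fit(X_demo, y_demo)
    # get_support() is a boolean mask with True for each kept feature
    counts[alpha] = int(selector.get_support().sum())

print(counts)
```

A stronger penalty pushes more coefficients to exactly zero, so the count shrinks as alpha grows. A suitable alpha depends on the scale of the features and the target, so it is usually tuned rather than fixed in advance.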

Applying Logistic Regression on Data and Cross-Validating it

Logistic regression is a supervised learning algorithm and a statistical model used for classification. We apply it to our data and then cross-validate it.

from sklearn.model_selection import train_test_split                                        #for splitting data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=0)           #holding out 25% of the data for testing
from sklearn.linear_model import LogisticRegression                                         #importing logistic regression
logreg=LogisticRegression()                                                                 #creating the model with default settings
logreg.fit(X_train,y_train)                                                                 #fitting on the training data
y_pred=logreg.predict(X_test)                                                               #predicting on the test data
y_pred                                                                                      #displaying the prediction array

[Output: prediction array]

from sklearn.metrics import confusion_matrix                    #importing confusion matrix
confusion_matrix(y_test,y_pred)                                 #rows are true labels, columns are predicted labels

[Output: confusion matrix of the logistic regression model]
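
The confusion matrix contains everything needed to compute accuracy, precision, and recall by hand. The sketch below uses made-up labels (an assumption for illustration, with 1 meaning canceled) to show how the four cells map to those metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = canceled, 0 = not canceled
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted cancellations, how many were real
recall = tp / (tp + fn)      # of the real cancellations, how many were caught

print(accuracy, precision, recall)
```

Precision and recall are worth checking alongside accuracy when the two classes are not equally common, because accuracy alone can look good while one class is predicted poorly.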

from sklearn.metrics import accuracy_score                  #importing accuracy_score to check accuracy
accuracy_score(y_test,y_pred)                               #fraction of test predictions that are correct

[Output: accuracy score on the test set]

from sklearn.model_selection import cross_val_score           #importing cross validation
score=cross_val_score(logreg,x,y,cv=10)                          #10-fold cross validation for a more reliable accuracy estimate
score.mean()                                                    #averaging the accuracy over the 10 folds

[Output: mean cross-validated accuracy]
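
Looking at the individual fold scores, and not only their mean, shows how stable the model is across splits. The sketch below is an illustration on synthetic data. The dataset, the shuffled `StratifiedKFold` splitter, and `max_iter=1000` are assumptions and do not come from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the booking features and the is_canceled target
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print(scores)                                            # one accuracy value per fold
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A small standard deviation suggests that the mean is a trustworthy summary, while a large one means the score depends heavily on which rows end up in the test fold.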

Applying Multiple Algorithms on Data

Here we apply several supervised algorithms (logistic regression, naive Bayes, random forest, decision tree, and KNN) and compare their accuracy.

from sklearn.linear_model import LogisticRegression             #importing logisticregression from linear_model
from sklearn.naive_bayes import GaussianNB                      #importing Gaussian naive Bayes
from sklearn.neighbors import KNeighborsClassifier              #importing knn algorithm
from sklearn.ensemble import RandomForestClassifier             #importing random forest from ensemble
from sklearn.tree import DecisionTreeClassifier                 #importing decision tree classifier
from sklearn.metrics import confusion_matrix,accuracy_score     #importing the evaluation metrics
models=[]                                                       #this list holds (name, model) pairs

models.append(('LogisticRegression',LogisticRegression()))      #appending logisticregression and initializing it
models.append(('Naive Bayes',GaussianNB()))                     #appending Naive Bayes and initializing it
models.append(('RandomForest',RandomForestClassifier()))        #appending random forest and initializing it
models.append(('Decision Tree',DecisionTreeClassifier()))       #appending decision tree and initializing it
models.append(('KNN',KNeighborsClassifier()))                   #appending knn and initializing it

for name,model in models:                                       #iterating over models
    print(name)                                                 #printing name of model
    model.fit(X_train,y_train)                                  #fitting train set of x & y into model
    predictions=model.predict(X_test)                           #predicting test set of x

    print(confusion_matrix(y_test,predictions))                 #printing confusion matrix (true labels first)
    print('\n')
    print(accuracy_score(y_test,predictions))                   #printing accuracy_score
    print('\n')

[Output: confusion matrix and accuracy score for each model]
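
When several models are compared, collecting the scores in one place makes the ranking easier to read than scrolling through printed output. The sketch below is one possible way to do that on synthetic data. The dataset, the `demo_models` dictionary, and the choice of three models are assumptions made to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 75/25 like the tutorial's split
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Hypothetical subset of models, keyed by display name
demo_models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)                                   # train on the training split
    results[name] = accuracy_score(y_te, model.predict(X_te))  # score on the held-out split

# Sort from best to worst test accuracy
ranking = pd.Series(results).sort_values(ascending=False)
print(ranking)
```

A ranking from a single split can change with a different random_state, so it is best read as a rough comparison.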

These are the accuracy scores of our models. Thank you for reading this article.