Learn Time Series Analysis in Python: A Step-by-Step Guide Using the ARIMA Model

Written by Aionlinecourse

Time series are among the most common kinds of data you encounter every day. They are everywhere: daily weather updates, weekly stock prices, a company's annual sales, your daily calorie consumption, and so on. As a data scientist, you will work with time series data often, so analyzing it is a must-have skill.

A time series is a collection of data points recorded in time order (hourly, daily, monthly, yearly, etc.). The data depends on time, which means you have to consider time alongside the other variables. Weather forecasting is a great example of time series analysis: weather data is collected over a period of time, and the future weather is predicted from that data.

Whether you are a beginner or an aspiring data scientist looking for a comprehensive guide to working with time-series data, this tutorial is for you. You will get familiar with different aspects of time-series data and build a model to forecast a time series using the ARIMA model in Python. Let's dive in!

What is Time Series Analysis? 

Time series data calls for its own kind of analysis. Time series analysis means analyzing the data to find patterns or trends over a certain period of time; once the data has been analyzed, those patterns are used to forecast future values. Time series analysis resembles regression analysis, but here the data is time-dependent. Ordinary regression assumes that the observations are independent of each other, whereas in a time series each value usually depends on the values before it. So, plain regression analysis cannot be applied directly to time series data. You need different tools for the analysis.

Time Series Analysis in Python 

Python provides many libraries and APIs to work with time-series data. One of the most popular is the statsmodels package, which provides most of the classes and functions needed to work with time-series data. In this tutorial, we will use it alongside other essential packages including NumPy, pandas, and matplotlib.

Key Contents of the Tutorial

  1. The Components of Time Series
  2. Import Time Series Data in Python
  3. Visualize Time Series Data in Python
  4. What is Stationarity in Time Series?
  5. Make a Time Series Stationary in Python
  6. Build Time Series Models in Python
  7. Determine the Order of the ARIMA Model
  8. Apply ARIMA Model in Python

The Components of Time Series

Before starting the analysis, we first need to understand the characteristics of a time series. Recall that a time series is a special kind of data, and so it has some special components too. Generally, you will find four main components in a time series.

  • Seasonality This happens when the time series shows a repeating pattern over a fixed period such as a day, week, month, or season. For example, sales of ice cream increase in the summer and fall in the winter, and the pattern repeats every year.
  • Trend A trend is a sustained upward or downward movement over a stretch of the time series. For example, sales of a new iPhone model climb to a peak and then decline once the next model comes to market.
  • Cyclic Here, the data rises and falls in a repeating pattern, much like a sine curve; unlike seasonality, the length of a cycle is not necessarily tied to the calendar. An example is the consumption of electricity at your house, which most of the time shows a steady, repetitive pattern.
  • Irregularity This is simply noise or randomness in the data that does not fall into any of the above categories. A sudden, unpredictable rise or fall can be considered an irregularity. For example, the coronavirus pandemic caused a sudden rise in the sales of masks and sanitizers.
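
To make these components concrete, here is a minimal sketch that builds a synthetic daily series as the sum of a trend, a yearly seasonal pattern, and random noise. Everything in it (the names `trend`, `seasonal`, `noise`, the slope, the amplitude) is made up for illustration and has nothing to do with the Apple dataset used later.

```python
import numpy as np
import pandas as pd

# Illustrative only: three years of synthetic daily data
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2018-01-01", periods=3 * 365, freq="D")
t = np.arange(len(dates))

trend = 0.05 * t                                      # slow upward drift
seasonal = 10 * np.sin(2 * np.pi * t / 365)           # pattern repeating every 365 days
noise = rng.normal(loc=0.0, scale=2.0, size=len(t))   # irregular component

synthetic = pd.Series(trend + seasonal + noise, index=dates, name="synthetic")

# Yearly averages rise because of the trend, while the seasonal part averages out
yearly_mean = synthetic.groupby(synthetic.index.year).mean()
print(yearly_mean)
```

Plotting `synthetic` with `synthetic.plot()` shows the wave-like seasonal pattern riding on top of the upward trend, with the noise as small jitter around it.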

We have to keep these components in mind when working with time-series data. Later in the tutorial, we will see which kind of time series can be modeled and forecasted.

Import Time Series Data in Python

The pandas library in Python has many built-in functions for importing time series data. We will use those functions to load the data for our project. pandas also provides other useful functions for working with time series (e.g. visualization).

First, we need to load the essential libraries.

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from datetime import datetime  # the old pandas.datetime alias is not available in recent pandas versions

We are using the stock price dataset of Apple from 1980 to 2020. The dataset contains the high, low, open, and close prices of Apple's stock, as well as the trading volume. For simplicity, we only analyze the closing price. Let's read the dataset!

series = pd.read_csv('stockprices.csv')

Before going further, let's take a look at the data.

print(series.head())

[Figure: first rows of the stock price dataset]

This is what the data looks like. The format is not yet suitable to work with (time is a very important parameter in time series analysis, remember!). So, we need to convert the 'Date' column to a proper datetime type and make it the index for further analysis.

series['Date'] = pd.to_datetime(series['Date'], format='%Y-%m-%d') 
series = series.set_index('Date')

Now we have imported the dataset and prepared it for further analysis.
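
As an alternative, `pd.read_csv` can parse the dates and set the index in one step through its `parse_dates` and `index_col` parameters. The sketch below uses a tiny in-memory CSV with invented values in place of `stockprices.csv`, and it assumes the file has a `Date` column in `YYYY-MM-DD` format like the one above.

```python
import io
import pandas as pd

# Invented sample rows standing in for stockprices.csv
csv_text = """Date,Open,High,Low,Close,Volume
2020-01-02,10.0,10.5,9.8,10.2,1000
2020-01-03,10.2,10.6,10.1,10.4,1200
2020-01-06,10.4,10.9,10.3,10.8,900
"""

# parse_dates converts the column to datetimes, index_col makes it the index
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

print(type(sample.index).__name__)                          # DatetimeIndex
print(sample.loc[pd.Timestamp("2020-01-03"), "Close"])      # look up a row by its date
```

With a `DatetimeIndex` in place, rows can be selected by date and plots get a proper time axis automatically.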

Visualize Time Series Data in Python

Visualization is very important for understanding the various aspects of time series data. We need to plot a few key graphs to understand the time series visually. Let's plot some graphs and see what our data looks like!

# Keep only the closing price and plot it against the date index
df_close = series['Close'] 
plt.figure(figsize=(10,6)) 
df_close.plot() 
plt.ylabel('Closing Price') 
plt.legend()
plt.show()

[Figure: line plot of the closing price]

This simple line plot shows many important characteristics of our data. You can observe that the closing price trends upward over time. This is because Apple Inc. has grown in value over the years, so it is intuitive that the price of its stock has risen.

From the graph, we can say that the stock price shows a trend. A series with a trend is not stationary, and models such as AR, MA, and ARIMA assume a stationary input. So before applying these statistical models to time series data, we must make sure that the data is stationary.

What is Stationarity in Time Series?

In time series, stationarity means that the statistical properties of the series remain constant over time. These statistical properties have two aspects:

  • Firstly, the mean and variance of the time series remain constant over time.
  • Secondly, the autocorrelation between data points depends only on how far apart they are, not on when they occur. Autocorrelation is the degree of similarity between a variable's present value and its past values.
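
A quick way to get a feel for these two properties is to compare a series that is stationary by construction (white noise) with one that is not (a random walk). The sketch below is an illustration with simulated data; the helper `half_stats` is a hypothetical name introduced here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
steps = rng.normal(size=2000)

white_noise = steps              # stationary: same mean and variance throughout
random_walk = np.cumsum(steps)   # non-stationary: it wanders, and its variance grows with time

def half_stats(x):
    """Return (mean, variance) for the first and the second half of x."""
    first, second = x[: len(x) // 2], x[len(x) // 2 :]
    return (first.mean(), first.var()), (second.mean(), second.var())

print("white noise:", half_stats(white_noise))
print("random walk:", half_stats(random_walk))
```

For the white noise, both halves should report nearly the same mean and variance. For the random walk, the two halves typically differ a lot, which is the behavior the stationarity checks below are designed to detect.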

Most forecasting methods such as AR, MA, and ARIMA assume that the time series is stationary or can be converted to a stationary one before the method is applied. So, it is necessary to check whether the time series we are working on is stationary. To check that, we can use two different methods:

Rolling Statistics These are the rolling mean and rolling standard deviation of a time series. A rolling average differs from an ordinary average in that it replaces each data point with the average of the most recent n data points (the point itself and the n-1 before it). Here n is the window size, for example 10 days. If n is set to 10, every point of the rolling mean is the average of the latest 10 data points, and the first 9 points have no value because the window is not yet full. For a time series to be stationary, its rolling mean and rolling standard deviation must stay roughly constant.
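
The following small sketch shows how pandas computes rolling statistics on a toy series with a window of 3. The values are arbitrary and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Each entry is computed from the current value and the two before it (window of 3).
# The first two entries are NaN because the window is not yet full.
rolling_mean = values.rolling(window=3).mean()
rolling_std = values.rolling(window=3).std()

print(rolling_mean.tolist())  # [nan, nan, 4.0, 6.0, 8.0]
print(rolling_std.tolist())
```

The rolling mean climbs steadily here because the toy series has a trend, which is exactly the kind of signal we look for in the plot of the real data.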

Now, let's check the rolling statistics of our time series using a window of 365 observations (if the data contains only trading days, this covers somewhat more than one calendar year):

# Rolling mean and standard deviation over a window of 365 observations
rolling_mean = df_close.rolling(window = 365).mean() 
rolling_std = df_close.rolling(window = 365).std() 

plt.figure(figsize=(10,6)) 
plt.plot(df_close, color = 'green', label = 'Original') 
plt.plot(rolling_mean, color = 'red', label = 'Rolling Mean') 
plt.plot(rolling_std, color = 'magenta', label = 'Rolling Std') 
plt.title('Rolling Mean & Rolling Standard Deviation') 
plt.xlabel('Date') 
plt.ylabel('Closing Price') 
plt.legend(loc = 'best') 
plt.show()

[Figure: Rolling Mean & Rolling Standard Deviation]

The plot is easy to interpret: since the rolling mean and rolling standard deviation are not constant over time, we can conclude that the data is not stationary.

Augmented Dickey-Fuller Test (ADF Test) This is the most common statistical test for checking whether a time series is stationary. Its null hypothesis is that the series has a unit root, i.e. that it is not stationary. A small p-value is required to reject that null hypothesis; this tutorial uses a strict threshold of 0.005 (0.05 is the more conventional choice). The ADF statistic should also be lower (more negative) than the critical values. In Python, we can apply the ADF test using the statsmodels library.

from statsmodels.tsa.stattools import adfuller 
adf_test = adfuller(df_close) 
print('ADF Statistics: {}'.format(adf_test[0])) 
print('p-value: {}'.format(adf_test[1])) 
for key, value in adf_test[4].items(): 
    print('Critical Values Over {}: {}'.format(key, value)) 
Output:
ADF Statistics: 1.893973846826085 
p-value: 0.9985182618845982 
Critical Values Over 1%: -3.4310126482982626 
Critical Values Over 5%: -2.861832850708558 
Critical Values Over 10%: -2.5669258793000673

Here the p-value is much higher than the threshold (0.005), and the ADF statistic is well above the critical values. So, we cannot reject the null hypothesis and must conclude that the time series is not stationary.

Our time series is not stationary, which means we should not fit these statistical models to it directly. So, how can we make a time series stationary?

Before proceeding to the next section, let's wrap both tests in a function, since we will need them many times to test stationarity.

def check_stationarity(timeseries): 
    # rolling statistics of the series passed in (not of df_close)
    rolling_mean = timeseries.rolling(window = 365).mean() 
    rolling_std = timeseries.rolling(window = 365).std() 

    # rolling statistics plot 
    plt.figure(figsize=(10,6)) 
    plt.plot(timeseries, color = 'green', label = 'Original') 
    plt.plot(rolling_mean, color = 'red', label = 'Rolling Mean') 
    plt.plot(rolling_std, color = 'magenta', label = 'Rolling Std') 
    plt.title('Rolling Mean & Rolling Standard Deviation') 
    plt.xlabel('Date') 
    plt.ylabel('Closing Price') 
    plt.legend(loc = 'best') 
    plt.show()

    # Augmented Dickey-Fuller test
    adf_test = adfuller(timeseries) 
    print('ADF Statistic: {}'.format(adf_test[0])) 
    print('p-value: {}'.format(adf_test[1])) 
    print('Critical Values:') 
    for key, value in adf_test[4].items(): 
        print('Critical Values Over {}: {}'.format(key, value))

Make a Time Series Stationary in Python

To turn a non-stationary time series into a stationary one, we need to try various transformations of the data and check the stationarity each time with the rolling statistics and the ADF test. If a transformed series passes the tests, we can reject the null hypothesis and conclude that the series has become stationary. Here are some transformations we can apply to make our data stationary. Let's try them one by one.

Log Transformation Taking the log of the dependent variable (here the 'Close' column) compresses large values and so flattens the rolling average to some extent. Let's check that.

df_log = np.log(series['Close']) 
check_stationarity(df_log) 
Output:
ADF Statistic: 0.05924754204842996 
p-value: 0.9631660399368777 
Critical Values: 
Critical Values Over 1%: -3.431011308048762 
Critical Values Over 5%: -2.8618322584647857 
Critical Values Over 10%: -2.5669255640476747

[Figure: rolling statistics of the log-transformed series]

From both tests we can see that the p-value is too high to reject the null hypothesis and the ADF statistic is higher than the critical values. So, the log transformation alone won't make the time series stationary.
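
One way to see why the log transform helps but is not enough is to apply it to a series that grows by a fixed percentage. The sketch below uses a made-up series growing 5% per step: the log turns the ever-larger jumps into equal-sized steps, but the result still climbs, so a trend remains.

```python
import numpy as np

# Illustrative series growing by 5% per step (not real prices)
prices = 100 * 1.05 ** np.arange(10)

raw_steps = np.diff(prices)          # absolute increases keep getting larger
log_steps = np.diff(np.log(prices))  # log increases are all equal to log(1.05)

print(raw_steps.round(2))
print(log_steps.round(4))
```

In other words, the log stabilizes the size of the movements, and a further step such as subtracting a moving average or differencing is still needed to remove the trend.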

Subtracting the Rolling Mean from the Log Values To make the series stationary, we can subtract the rolling mean from the log values. Let's check what happens:

rolling_mean = df_log.rolling(window=365).mean() 
df_log_minus_mean = df_log - rolling_mean 
df_log_minus_mean.dropna(inplace=True) 
check_stationarity(df_log_minus_mean) 
Output:
ADF Statistic: -4.938568014985331 
p-value: 2.932168266538975e-05 
Critical Values: 
Critical Values Over 1%: -3.4310365815907327 
Critical Values Over 5%: -2.8618434265626393 
Critical Values Over 10%: -2.5669315088533877

[Figure: rolling statistics of the log values minus their rolling mean]

Here the ADF statistic is lower than the critical values and the p-value (about 0.00003) is below the threshold, so the ADF test by itself rejects the null hypothesis. However, the rolling mean and rolling standard deviation are still not constant, so we do not yet treat this as a stationary series and keep looking for a better transformation.

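Subtracting Exponential Decay from the Log Values Instead of a plain rolling mean, we can subtract an exponentially weighted moving average, in which older observations get exponentially smaller weights. This can also help to make the series stationary.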

mean_exp_decay = df_log.ewm(halflife=365, min_periods=0, adjust=True).mean() 
df_log_exp_decay = df_log - mean_exp_decay 
df_log_exp_decay.dropna(inplace=True) 
check_stationarity(df_log_exp_decay) 
Output:
ADF Statistic: -3.520728117215512 
p-value: 0.007470439194956383 
Critical Values: 
Critical Values Over 1%: -3.431011308048762 
Critical Values Over 5%: -2.8618322584647857 
Critical Values Over 10%: -2.5669255640476747

[Figure: rolling statistics of the log values minus their exponentially weighted mean]

The rolling mean and standard deviation have still not become constant, and the p-value (about 0.0075) is still above our 0.005 threshold. So, the series has not yet become stationary.
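
The `halflife` argument of `ewm` controls how quickly old observations fade. As a small illustration, the sketch below feeds a single spike into an exponentially weighted mean with `halflife=1`. Note that it uses `adjust=False` to show the simple recursive form, whereas the tutorial code above uses `adjust=True`; the toy input is an assumption made for clarity.

```python
import pandas as pd

impulse = pd.Series([1.0, 0.0, 0.0, 0.0])

# With halflife=1 the influence of an observation halves after each step.
# adjust=False uses the recursion y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
smoothed = impulse.ewm(halflife=1, adjust=False).mean()
print(smoothed.tolist())  # approximately [1.0, 0.5, 0.25, 0.125]
```

With `halflife=365`, as in the tutorial, the weight of an observation halves only after 365 rows, so the average reacts slowly and follows the long-run level of the series.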

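Subtracting the Shifted Log Values (Differencing) Another option is to subtract the previous log value from each log value, which is known as differencing. Let's check what happens:

df_log_shift = df_log - df_log.shift() 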
df_log_shift.dropna(inplace=True) 
check_stationarity(df_log_shift) 
Output: 
ADF Statistic: -22.602200726350297 
p-value: 0.0 
Critical Values: 
Critical Values Over 1%: -3.431011308048762 
Critical Values Over 5%: -2.8618322584647857 
Critical Values Over 10%: -2.5669255640476747

[Figure: rolling statistics of the differenced log values]

The rolling mean and standard deviation are now quite consistent, the ADF statistic is lower than the critical values, and the p-value is lower than the threshold. Excellent! The time series has become stationary, and now we can apply statistical models to it.
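
Keep in mind that a model fitted on the differenced log values forecasts differenced log values, not prices. The sketch below shows, on a handful of made-up prices, how the transformation can be undone with a cumulative sum followed by an exponential. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])  # made-up prices

log_prices = np.log(prices)
log_diff = (log_prices - log_prices.shift()).dropna()    # same transform as df_log_shift

# Undo the transform: cumulative sum of the differences, add back the
# first log price, then exponentiate
rebuilt = np.exp(log_prices.iloc[0] + log_diff.cumsum())

print(rebuilt.round(6).tolist())  # [102.0, 101.0, 105.0, 110.0]
```

The first price is lost in the differencing step, which is why it has to be supplied again when going back to the original scale.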

Build Time Series Models in Python

There are many statistical models available in Python for time series forecasting. In this tutorial, we use the popular ARIMA model. It combines two different models, AR and MA, and adds an integration (differencing) step. Before using it, let's take a look at these models.

Auto-Regressive or AR Model This is quite similar to a multiple linear regression model. The difference is that it uses a linear combination of the past values of the variable itself to forecast its next value. For example, to forecast today's stock price, the AR model uses a linear combination of the stock's past prices.

Moving Average or MA Model In a moving average model, the past values of the target variable are not used directly; instead, the errors from the previous forecasts are used to forecast the target variable.
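
To illustrate the difference between the two, here is a minimal simulation of an AR(1) and an MA(1) process driven by the same random shocks. The coefficients `phi` and `theta` are arbitrary illustrative values, not estimates from the stock data.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 500
errors = rng.normal(size=n)   # the random shocks e[t]

phi, theta = 0.8, 0.6         # illustrative coefficients, not estimated from data

ar1 = np.zeros(n)
ma1 = np.zeros(n)
ar1[0] = errors[0]
ma1[0] = errors[0]
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + errors[t]       # AR(1): depends on its own previous value
    ma1[t] = errors[t] + theta * errors[t - 1]  # MA(1): depends on the previous shock

def lag1_corr(x):
    """Correlation between x[t] and x[t-1]."""
    return np.corrcoef(x[1:], x[:-1])[0, 1]

print("AR(1) lag-1 correlation:", round(lag1_corr(ar1), 2))
print("MA(1) lag-1 correlation:", round(lag1_corr(ma1), 2))
```

In theory the lag-1 correlation is `phi` for the AR(1) process and `theta / (1 + theta**2)` for the MA(1) process, so the printed values should land near 0.8 and 0.44 respectively, up to sampling noise.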

Auto-Regressive Integrated Moving Average or ARIMA Model Also known as the Box-Jenkins model, it is basically the combination of the AR and MA models. A differencing step is used to make the time series stationary. A stationary time series that does not show any seasonal patterns or trends can be modeled with an ARIMA model. To work with an ARIMA model, we need to choose three parameters:

  • p is the order (number of lag terms) of the Auto-Regressive part of the model
  • q is the order (number of lagged error terms) of the Moving Average part of the model
  • d is the number of times the series is differenced
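
The role of d is easiest to see on series with a known trend. In the sketch below, which uses toy data, one round of differencing flattens a linear trend and two rounds flatten a quadratic one. In the model code later in this tutorial the three parameters are passed together as `order=(p, d, q)`.

```python
import numpy as np

t = np.arange(8)
linear = 3 * t + 1       # linear trend
quadratic = t ** 2       # quadratic trend

print(np.diff(linear, n=1))     # [3 3 3 3 3 3 3]
print(np.diff(quadratic, n=1))  # [ 1  3  5  7  9 11 13]
print(np.diff(quadratic, n=2))  # [2 2 2 2 2 2]
```

A series that becomes flat after one difference therefore calls for d = 1, and differencing more often than necessary only adds noise.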

Determine the Order of the ARIMA Model

When you build the ARIMA model in Python, you have to specify the p, d, and q parameters. These parameters strongly affect how well the model forecasts. You can determine them manually or automatically. Here are the different ways you can find the parameters:

Auto Correlation Function (ACF) This function gives the correlation of the series with its own lagged values. It tells us how strongly the present value is related to its past values. The ACF is commonly used to find the order of the MA part, q.

Partial Auto Correlation Function (PACF) This function gives the correlation between the series and one of its lags after removing the effect of the shorter lags in between. In other words, it measures only the direct relationship at that lag. It is used to find the order of the AR part, p.
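
To show what these two functions compute, here is a hand-written sketch using only NumPy on a simulated AR(1) series. The helpers `acf_at` and `pacf_at` are hypothetical names written for this illustration; in practice statsmodels offers `acf` and `pacf` in `statsmodels.tsa.stattools`, and `plot_acf` and `plot_pacf` in `statsmodels.graphics.tsaplots`.

```python
import numpy as np

def acf_at(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[lag:], x[:-lag]) / np.dot(x, x))

def pacf_at(x, lag):
    """Partial autocorrelation at a positive lag: the last coefficient when
    x[t] is regressed on x[t-1], ..., x[t-lag] by ordinary least squares."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    target = x[lag:]
    lagged = np.column_stack([x[lag - k : len(x) - k] for k in range(1, lag + 1)])
    coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return float(coefs[-1])

# Simulated AR(1) series with coefficient 0.7 (illustrative data)
rng = np.random.default_rng(seed=3)
series = np.zeros(2000)
for t in range(1, len(series)):
    series[t] = 0.7 * series[t - 1] + rng.normal()

for lag in (1, 2, 3):
    print(lag, round(acf_at(series, lag), 2), round(pacf_at(series, lag), 2))
```

For an AR(1) process the ACF decays gradually (roughly 0.7, 0.49, 0.34), while the PACF drops to about zero after lag 1. That cut-off in the PACF is the pattern that suggests p = 1.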

Auto ARIMA Model This approach is modeled on the R language's auto.arima() function, which searches for suitable values of p, d, and q automatically. There are packages available that implement it in Python.

Apply ARIMA Model in Python

Now, we will apply the ARIMA model in Python to forecast our time series. First, we decompose the series and fit an ARIMA model on the whole differenced series, then plot the fitted values against the series to see how closely they follow it.

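from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# recent statsmodels versions use `period` instead of the older `freq` argument
decomposition = seasonal_decompose(df_log_shift, period=100) 

# note: df_log_shift is already differenced once, so d=1 differences it again
model = ARIMA(df_log_shift, order=(2,1,2)) 
results = model.fit()  # the current ARIMA.fit() takes no `disp` argument

plt.figure(figsize=(10,6)) 
plt.title('Seasonal Decomposition') 
plt.xlabel('Date') 
plt.ylabel('Closing Price') 
plt.plot(df_log_shift, color='green') 
plt.plot(results.fittedvalues, color='magenta')
plt.show()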

[Figure: Seasonal Decomposition]

Now we will see how well the model forecasts data it has not seen:

from statsmodels.tsa.arima.model import ARIMA 
from sklearn.metrics import mean_squared_error 

X = df_log_shift.values 
size = int(len(X) * 0.7)   # first 70% for training, the rest for testing
train, test = X[0:size], X[size:len(X)] 
history = [x for x in train] 
predictions = list() 

# Walk-forward validation: refit on everything seen so far, forecast one step,
# then add the true value to the history (slow, the model is refit at every step)
for i in range(len(test)): 
    model = ARIMA(history, order=(2,1,2)) 
    model_fit = model.fit() 
    pred = model_fit.forecast()[0]   # one-step-ahead forecast
    predictions.append(pred) 
    exp = test[i] 
    history.append(exp) 
    print('predicted= {}, expected= {}'.format(pred, exp)) 

error = mean_squared_error(test, predictions) 
print('Test MSE: {}'.format(error)) 

# plot 
plt.plot(test, color='green') 
plt.plot(predictions, color='magenta') 
plt.show()
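
An MSE value is hard to judge on its own, so it can help to compare it with a naive baseline. The sketch below uses made-up numbers in place of the real `test` and `predictions` arrays and compares a hypothetical model against a "no change" forecast of zero, which is a natural baseline for differenced data. The helper `mse` is defined here for illustration.

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Made-up differenced values standing in for the `test` array above
test_values = np.array([0.01, -0.02, 0.015, 0.0, -0.005])
model_preds = np.array([0.005, -0.01, 0.01, 0.002, -0.001])  # hypothetical model output

zero_baseline = np.zeros_like(test_values)   # "no change" forecast
print("model MSE:   ", mse(test_values, model_preds))
print("baseline MSE:", mse(test_values, zero_baseline))
```

If the model's MSE on the real data is not clearly lower than the baseline's, the chosen order or the transformation is worth revisiting.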

Final Thoughts

If you have come this far, congratulations! You now have a clear idea of time series analysis and forecasting in Python. Time series analysis is a huge part of the data science domain, and you cannot learn everything from a single tutorial; you need plenty of practice. Review the whole article again and find a new dataset to try this code on. Feel free to change the code where necessary.

We hope this tutorial helped you learn time series analysis. If you have some cool ideas to make this project more effective, let us know.

You can find the source code of the tutorial in this GitHub repository.

Happy Machine Learning!