Build an Autoregressive and Moving Average Time Series Model

Welcome to time series analysis! In this project, we take a deep dive into understanding and predicting IoT sensor readings. The main goal is to investigate how the sensor data can be analyzed with Moving Average and Autoregressive models. These models can help us uncover hidden patterns and predict future readings.

Project Overview

This project starts by cleaning and preparing the IoT sensor data for analysis. We then build several models: Moving Average models (MA(1), MA(2)) followed by Autoregressive models (AR(1), AR(2), AR(3), AR(4)). With these models in hand, we can understand the relationship between past and future values.

We then measure how well each model predicts using Root Mean Squared Error (RMSE), a common way to gauge forecast accuracy. To make the analysis more insightful, we bring in visualizations such as autocorrelation plots and rolling average plots, which show how the sensor readings behave over time and how the different models perform. By the end of this project, we will have a good idea of which model fits the data best and can most confidently predict from historical sensor data.

Prerequisites

  • Knowledge of basic time series concepts, such as stationarity and autocorrelation.
  • Familiarity with Python and libraries including Pandas, NumPy, and Matplotlib.
  • Knowledge of time series models such as Moving Average (MA) and Autoregressive (AR) models.
  • Experience computing model performance metrics, including Root Mean Squared Error (RMSE).
  • Knowledge of how to clean and preprocess data for time series analysis.
  • Familiarity with visualization tools for time series, for example, autocorrelation plots and rolling averages.
  • Familiarity with the Augmented Dickey-Fuller (ADF) test for stationarity.

Approach

First, we clean up the IoT sensor data to prepare it for analysis and modeling. Then we dive into different time series models, starting with Moving Average (MA) models and moving on to Autoregressive (AR) models with varying lags. These models aim to capture how future values of the data depend on past ones. To check for stationarity we use the Augmented Dickey-Fuller (ADF) test, and to smooth the data we use rolling averages. For each model, we compute the Root Mean Squared Error (RMSE) so that we can evaluate the accuracy of its predictions. Autocorrelation plots show us how the data relates to its own past values. Finally, we pick the most accurate model by RMSE and use it to make future predictions, gaining valuable insights into IoT sensor behavior.
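
Before the step-by-step walkthrough, here is a minimal sketch of this modeling loop, assuming the sensor data is loaded into a DataFrame df with an IOT_Sensor_Reading column (the file path below is hypothetical; the project loads its actual dataset later). The lag range and the MA order are illustrative choices, not the project's final configuration.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

df = pd.read_csv('sensor_data.csv')  # hypothetical path; see the loading step later
series = df['IOT_Sensor_Reading']

# ADF test: a small p-value (e.g. below 0.05) suggests the series is stationary
adf_stat, p_value = adfuller(series)[:2]
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

# Fit AR(1) through AR(4) and compare their in-sample RMSE
for p in range(1, 5):
    ar_model = AutoReg(series, lags=p).fit()
    fitted = ar_model.fittedvalues  # defined only after the first p observations
    rmse = np.sqrt(mean_squared_error(series.iloc[p:], fitted))
    print(f"AR({p}) RMSE: {rmse:.4f}")

# An MA(q) model can be fit the same way through ARIMA with order (0, 0, q)
ma_model = ARIMA(series, order=(0, 0, 1)).fit()
ma_rmse = np.sqrt(mean_squared_error(series, ma_model.fittedvalues))
print(f"MA(1) RMSE: {ma_rmse:.4f}")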

Workflow and Methodology

  • Load and prepare the IoT sensor data for further analysis.
  • Apply the Augmented Dickey-Fuller (ADF) test for stationarity.
  • Construct various time series models (MA and AR) under different lags.
  • Train each of the models with the prepared data.
  • Compute the fitted values for each model.
  • Calculate the Root Mean Square Error (RMSE) for each model to evaluate its performance.
  • Visualize the autocorrelation plots and the rolling average of the data (see the sketch after this list).
  • Compare the RMSE values across models to identify the best performer.
  • Make predictions based on the selected model.
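
For the visualization steps in the list above, a minimal sketch could look like the following, again assuming df holds the sensor data as in the earlier sketch; the 24-observation rolling window and 40 lags are illustrative choices.

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Rolling average to smooth out short-term noise (window size is illustrative)
df['IOT_Sensor_Reading'].rolling(window=24).mean().plot(figsize=(20, 5), title="Rolling average of IOT_Sensor_Reading")
plt.show()

# Autocorrelation plot: how strongly readings relate to their own past values
plot_acf(df['IOT_Sensor_Reading'], lags=40)
plt.show()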

Data Collection and Preparation

Data Collection:

In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle, the UCI Machine Learning Repository, or company-specific data sources. We provide the dataset with this project so that you can work with the same data.

Data Preparation Workflow:

  • Import the dataset and inspect it for any missing values or inconsistencies.
  • Convert the date column to a proper timestamp format for time series analysis.
  • Set the time column as the index to allow for time-based operations.
  • Handle missing values by using methods like forward filling or backward filling (a short sketch follows this list).
  • Visualize the time series data to identify trends, seasonality, and noise.
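
A minimal sketch of these preparation steps, assuming the raw file has a time column in the '%d-%m-%Y %H:%M' format used later in this project (the file path is hypothetical, and the step-by-step code below keeps time as a regular column rather than the index):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sensor_data.csv')  # hypothetical path

# Parse the timestamp column and set it as the index for time-based operations
df['time'] = pd.to_datetime(df['time'], format='%d-%m-%Y %H:%M')
df = df.set_index('time')

# Forward fill, then backward fill any gaps left at the very start of the series
df = df.ffill().bfill()

# Quick visual check for trends, seasonality, and noise
df['IOT_Sensor_Reading'].plot(figsize=(20, 5), title="IOT_Sensor_Reading")
plt.show()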

Code Explanation

STEP 1:

Mounting of Google Drive

This code mounts your Google Drive into the Colab environment so that you can access files stored in your drive. Your Google Drive is made accessible under the /content/drive path.

from google.colab import drive
drive.mount('/content/drive')

Ignoring Warnings

This code will suppress all the warnings, thus preventing them from being displayed during execution. This ensures that the output is clean while running the program.

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

Required Library Installation

This code installs the required Python libraries: matplotlib for plotting, pandas for data manipulation, statsmodels for statistical modeling, seaborn for data visualization, scipy for scientific and mathematical computation, numpy for numerical work, and scikit-learn for machine learning utilities.

!pip install matplotlib
!pip install pandas
!pip install statsmodels
!pip install seaborn
!pip install scipy
!pip install numpy
!pip install scikit-learn

Importing Required Libraries for Time Series Analysis

This block imports all the libraries needed for time series analysis, including pandas, numpy, statsmodels, and matplotlib. Together they provide seasonal decomposition, statistical tests, ARIMA modeling, and plotting of autocorrelation functions for time series data.

#importing all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab
import scipy.stats
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
from pandas.plotting import autocorrelation_plot

STEP 2:

Loading Data and Checking Shape

This code loads the CSV file and makes a working copy of it. It then prints the dataset's shape to check the number of rows and columns.

#read the data
data = pd.read_csv('/content/drive/MyDrive/New 90 Projects/Project_15/Data-Chillers.csv')
df = data.copy()
df.shape

Previewing Data

This code displays the dataset's first few rows for a quick overview.

#checking first five rows of the data
df.head()

Dataset Info

The purpose of the given code is to provide a summary of the DataFrame df by displaying the number of records, names of the columns, types of columns, count of non-null values, and the size in memory.

# checking the structure of data
df.info()

Checking Missing Values

This code counts the null values in each column of the DataFrame df. This helps identify missing values for further processing of the data.

#checking missing values
df.isnull().sum()

Convert the Date Column to Timestamp Format

This code converts the time column of the DataFrame df to timestamp format using the date and time format '%d-%m-%Y %H:%M'. This makes date-based operations easier during time series analysis.

#converting date column into timestamp format
df.time = pd.to_datetime(df.time, format='%d-%m-%Y %H:%M')

This code extracts the maximum or latest date from the time column of the dataframe df so that one can infer the most recent record in the dataset.

# maximum (latest) date in the dataset
df['time'].max()

This code extracts the minimum or earliest date from the time column of the dataframe df so that one can infer the first record in the dataset.

# minimum (earliest) date in the dataset
df['time'].min()

The purpose of the given code is to provide a summary of the DataFrame df after the conversion by displaying the number of records, names of the columns, types of columns, count of non-null values, and the size in memory.

# checking the structure of data after converting datetime format
df.info()

Calculate Correlation with Other Features

This piece of code calculates the correlation matrix of the numeric columns of the DataFrame df to show the correlation between the numeric features. Using this information, one can see how strongly each feature is correlated with the others.

#correlation with other features (numeric_only avoids errors from the non-numeric time column in recent pandas versions)
df.corr(numeric_only=True)

STEP 3:

Plotting IoT Sensor Reading

This will plot the IOT_Sensor_Reading column from the DataFrame df as a time series. The plot is 20 inches wide by 5 inches tall and has a title for visual clarity.

df.IOT_Sensor_Reading.plot(figsize=(20,5), title="IOT Sensor_Reading")
plt.show()

Plotting Error Present

This code plots the Error_Present column from the DataFrame df as a time series. The plot size is set to 20x5, with a title to make the plot easier to read.

df.Error_Present.plot(figsize=(20,5), title="Error_Present")
plt.show()

This code plots the Sensor_2 column from the DataFrame df as a time series. The plot size is set to 20x5, with a title to make the plot easier to read.

df.Sensor_2.plot(figsize=(20,5), title="Sensor_2")
plt.show()

This code plots the Sensor_Value column from the DataFrame df as a time series. The plot size is set to 20x5, with a title to make the plot easier to read.

df.Sensor_Value.plot(figsize=(20,5), title="Sensor_Value")
plt.show()

QQ plot for IOT Sensor readings

This code is used to generate a QQ plot for IOT_Sensor_Reading to check if it is normally distributed using scipy.stats.probplot.

# The QQ plot
scipy.stats.probplot(df.IOT_Sensor_Reading, plot=pylab)
plt.title("QQ plot for IOT_Sensor_Reading")
pylab.show()

STEP 4:

Computing the Mean of IOT Sensor Readings

The mean value of the IOT_Sensor_Reading column in DataFrame df is computed by this code, which gives an idea of the sensor data's central tendency.

df['IOT_Sensor_Reading'].mean()

Computing the Minimum of IOT Sensor Readings

The minimum value of the IOT_Sensor_Reading column in DataFrame df is computed by this code, which gives an idea of the sensor data's lowest record.

df['IOT_Sensor_Reading'].min()

Computing the Maximum of IOT Sensor Readings

The maximum value of the IOT_Sensor_Reading column in DataFrame df is computed by this code, which gives an idea of the sensor data's highest record.

df['IOT_Sensor_Reading'].max()

Time of Day Categorization

This code extracts the hour from the time column of the DataFrame df and groups it into three labels: 'morning', 'noon', and 'evening'. The result is stored in a new column, time_of_day.

# Extract the hour of the day from the 'time' column
hour = df['time'].dt.hour
# Create a new column with labels for 'morning', 'noon', and 'evening'
# include_lowest=True keeps hour 0 (midnight) in the 'morning' bin instead of dropping it
df['time_of_day'] = pd.cut(hour, bins=[0, 11, 16, 23], labels=['morning', 'noon', 'evening'], include_lowest=True)
df.head()

Maximum IOT Sensor Readings across Various Times in a Day

This code groups the data by the time_of_day category and computes the highest IOT_Sensor_Reading for each period ('morning', 'noon', 'evening'). It indicates at which time of day the maximum sensor reading occurs.

max_IOT_Sensor_Reading = df.groupby('time_of_day')['IOT_Sensor_Reading'].max()
max_IOT_Sensor_Reading

Minimum IOT Sensor Readings across Various Times in a Day

This code groups the data by the time_of_day category and computes the lowest IOT_Sensor_Reading for each period ('morning', 'noon', 'evening'). It indicates at which time of day the minimum sensor reading occurs.

min_IOT_Sensor_Reading = df.groupby('time_of_day')['IOT_Sensor_Reading'].min()
min_IOT_Sensor_Reading

Average IOT Sensor Readings across Various Times in a Day

This code groups the data by the time_of_day category and computes the average IOT_Sensor_Reading for each period ('morning', 'noon', 'evening'). It helps to understand the typical sensor reading during each part of the day.

avg_IOT_Sensor_Reading = df.groupby('time_of_day')['IOT_Sensor_Reading'].mean()
avg_IOT_Sensor_Reading

Extracting Weekday

This code creates a new column called day_of_week in the DataFrame df by extracting the day name (for example, Monday or Tuesday) from the time column. This helps in analyzing the data across different days of the week.

df['day_of_week'] = df['time'].dt.day_name()

Finding Maximum IOT Sensor Readings by Day of the Week

This code groups by day_of_week and gets the maximum of the IOT_Sensor_Reading for each day. It helps to identify the highest reading of the sensor on each day of the week.

max_IOT_Sensor_Reading = df.groupby('day_of_week')['IOT_Sensor_Reading'].max()
max_IOT_Sensor_Reading

Finding Minimum IOT Sensor Readings by Day of the Week

This code groups by day_of_week and gets the minimum of the IOT_Sensor_Reading for each day. It helps to identify the lowest reading of the sensor on each day of the week.

min_IOT_Sensor_Reading = df.groupby('day_of_week')['IOT_Sensor_Reading'].min()
min_IOT_Sensor_Reading

Finding Average IOT Sensor Readings by Day of the Week

This code groups by day_of_week and gets the average of the IOT_Sensor_Reading for each day. It helps to understand the typical sensor reading on each day of the week.
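
avg_IOT_Sensor_Reading = df.groupby('day_of_week')['IOT_Sensor_Reading'].mean()
avg_IOT_Sensor_Reading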