Dimitrije Stankovic (Aug 2024)
Univariate Time Series Forecasting Models
In the previous installment of this series on traditional forecasting models (which you should definitely read before this post), I went over some of the conditions required to make a valid forecasting model, and checked these assumptions against some example data. I then fitted an autoregressive (AR) model to the data, and showed the results. This post will explain what exactly an AR model is, how it relates to the moving average (MA) model, and how these models are combined to model more complex processes.
Autoregression
The instinctual approach to forecasting, I think, is to predict a future value based on one or two of the most recent past values. If I see a line going up, I'm not going to bet it comes crashing down in the next step. This is the intuition behind autoregression, which literally means regression of a variable on itself (i.e. its past values). Now, of course, real-world data is rarely ever a straight line or perfect curve, so there is some noise we have to account for as well. The simplest autoregressive (AR) model translates these intuitions into something that looks like this:
$$y_t = c + \phi_1 y_{t-1} + \varepsilon_t$$

Basically, we're looking for an offset or intercept ($c$) and a coefficient on the lag ($\phi_1$), with the expectation that there will be some noise ($\varepsilon_t$). The model above is what we call an AR(1) model, since we only include one lag, but we can choose any number of lags. For p lags, where p is the standard notation for the autoregressive order, we construct an AR(p) model which looks like:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$$
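To make this a bit more concrete, here's a minimal sketch (my own illustration, not part of the original analysis) that simulates an AR(2) process with numpy; the intercept and lag coefficients are arbitrary values picked for the example.

"""
Simulating an AR(2) Process (illustrative sketch)
"""
import numpy as np

rng = np.random.default_rng(42)
c, phi1, phi2 = 5.0, 0.6, 0.3  # arbitrary intercept and lag coefficients
n = 200
y = np.zeros(n)

for t in range(2, n):
    # each value is built from the two previous values plus white noise
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal()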
In my post on statistical assumptions, we predicted the frequency of sunspot occurrences using an AR(2) model to great effect. In that post I did not explain my reasoning for choosing two lags in particular, but there is one more thing to cover before we get to that: the partial autocorrelation function (PACF). The autocorrelation function (ACF), to refresh your memory, is expressed as a function of the lag $h$ in the following way:

$$\rho(h) = \frac{\operatorname{Cov}(y_t, y_{t-h})}{\operatorname{Var}(y_t)}$$

The ACF, being measured in this way, indirectly includes the effect of each intermediate lag between $y_t$ and $y_{t-h}$. The PACF, on the other hand, is the correlation between $y_t$ and $y_{t-h}$, controlling for the intermediate lags (hence, partial). The PACF at lag $h$ is calculated in a bit of a funky way, but it basically amounts to fitting an AR(h) model and returning $\phi_h$, which controls for all intermediate lags and any noise in the time series process as well.
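To illustrate that last point, the sketch below (again my own, with made-up data) fits an AR(h) model with statsmodels' AutoReg, reads off the coefficient on the highest lag, and compares it to the value returned by the library's pacf function.

"""
PACF at Lag h via an AR(h) Fit (illustrative sketch)
"""
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import pacf

# simulate a simple AR(1) series to have something to measure
rng = np.random.default_rng(0)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + rng.normal()

h = 3
ar_fit = AutoReg(x, lags=h).fit()
phi_h = np.asarray(ar_fit.params)[-1]      # coefficient on the h-th lag
print(phi_h)
print(pacf(x, nlags=h, method='ols')[-1])  # statsmodels' PACF estimate at lag h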
Now that we know what the PACF is, as well as its ties to the AR model, we can use it to select our AR order. We'll take a look at the PACF of the time series from the last post:
"""
Data Import/Cleaning & PACF Plot
"""
%pip install pydataset
from pydataset import data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# yearly sunspot counts, 1700-1988
df_full = data('sunspot.year')
# keep observations up to 1800, matching the previous post
time_series = df_full[df_full['time'] <= 1800].copy()
# convert the year column to a datetime index with a yearly (year-start) frequency
time_series['time'] = pd.to_datetime(time_series['time'], format="%Y")
time_series.set_index('time', inplace=True)
time_series = time_series.asfreq('YS')
sm.graphics.tsa.plot_pacf(time_series, lags=30)
plt.show()
Here we see that the first two lags each show strong partial autocorrelation, so both should be included in an AR model, resulting in the AR(2) model constructed below (and in the previous post).
"""
Constructing the AR(2) Model
"""
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(time_series, order=(2,0,0))
results = model.fit()
print(results.summary())
SARIMAX Results
==============================================================================
Dep. Variable: sunspot.year No. Observations: 101
Model: ARIMA(2, 0, 0) Log Likelihood -421.234
Date: Tue, 27 Aug 2024 AIC 850.469
Time: 16:31:11 BIC 860.929
Sample: 01-01-1700 HQIC 854.703
- 01-01-1800
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 45.3837 5.752 7.890 0.000 34.109 56.658
ar.L1 1.3509 0.062 21.835 0.000 1.230 1.472
ar.L2 -0.6662 0.075 -8.921 0.000 -0.813 -0.520
sigma2 240.1278 31.631 7.592 0.000 178.133 302.123
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 12.69
Prob(Q): 1.00 Prob(JB): 0.00
Heteroskedasticity (H): 2.01 Skew: 0.57
Prob(H) (two-sided): 0.05 Kurtosis: 4.31
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
This is what the model looks like. Ignore the warning; it's just telling us that a particular method (the outer product of gradients, OPG) was used to calculate the covariance matrix. We'll talk more about model evaluation in the next installment of this series on forecasting, but looking at the P>|z| column (the p-values) for ar.L1 and ar.L2, we can see that both lags are indeed significant. The results of this model can be plotted like so:
"""
Plotting the AR(2) Model
"""
fitted_values = results.fittedvalues
plt.figure(figsize=(8, 5))
plt.plot(time_series, label='Actual', color='blue')
plt.plot(time_series.index, fitted_values, label='Predicted', color='orange')
plt.title('AR(2) Model Fit of Sunspot Occurrences')
plt.xlabel('Year')
plt.ylabel('Sunspot Occurrences')
plt.legend()
plt.show()
Easy enough, but sometimes the time series we have is structured a little differently.
Moving Averages
Many natural events like weather, and pseudo-natural events like changes in economic growth, are determined not (or at least not solely) by the previous value in the series, but by the shock incurred at the previous step. What better way to model these kinds of data than with the same structure as the AR(p) model, applied to the residuals at each time-step? This is exactly what a moving average (MA) model does. For q lags, we get the following equation:

$$y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$

where we add our shocks/residuals ($\varepsilon_{t-j}$), each with a coefficient ($\theta_j$) applied, to the mean of the series ($\mu$) to predict the next value of the series. Admittedly, most processes that we'd like to model or predict are not pure MA(q) processes, but let's look to the Great Lakes for an example. The dataset we'll look at is the annual water level (in feet) for Lake Huron from 1875 to 1972.
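As with the AR model, a quick simulation can help build intuition. This is a sketch of my own using statsmodels' ArmaProcess, which constructs a process from its AR and MA lag polynomials; for a pure MA(2) the AR side is just 1, and the MA coefficients here are arbitrary.

"""
Simulating an MA(2) Process (illustrative sketch)
"""
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# lag polynomials include the zero-lag term: AR side = 1, MA side = 1 + 0.6L + 0.3L^2
ma2 = ArmaProcess(ar=np.array([1.0]), ma=np.array([1.0, 0.6, 0.3]))
simulated = ma2.generate_sample(nsample=200)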
"""
Data Import/Cleaning
"""
# annual Lake Huron water level (feet), 1875-1972
ts_huron = data('LakeHuron')
# convert the year column to a datetime index with a yearly (year-start) frequency
ts_huron['time'] = pd.to_datetime(ts_huron['time'], format="%Y")
ts_huron.set_index('time', inplace=True)
ts_huron = ts_huron.asfreq('YS')
We checked the stationarity of the sunspot data in the previous post, but let's do that for the Lake Huron data as well:
"""
Running the ADF for the Huron Data
"""
from statsmodels.tsa.stattools import adfuller
adfuller(ts_huron)[1]  # second element of the result is the p-value
0.0020520736790757706
Now that we have the time series and know that it's stationary, how do we determine the order of our MA process? We used the PACF for the AR process because it strips out the effect of the intermediate lags, but the shocks carried through those intermediate lags are exactly what an MA model is built on. The ACF keeps those effects in, and for a pure MA(q) process it cuts off after lag q, so we use the ACF to determine the order of an MA process.
"""
ACF Plot for Huron Data
"""
sm.graphics.tsa.plot_acf(ts_huron, lags=30)
plt.show()
In the previous section, selecting an order from the PACF was fairly simple, with its clear, sharp peaks, but the ACF is a bit more fluid. Selecting the MA order is more art than science, at least until we discuss information criteria and grid searches. In this case, the order is best set at 2.
"""
Constructing the MA(2) Model
"""
model_huron = ARIMA(ts_huron, order=(0,0,2))
results_huron = model_huron.fit()
print(results_huron.summary())
SARIMAX Results
==============================================================================
Dep. Variable: LakeHuron No. Observations: 98
Model: ARIMA(0, 0, 2) Log Likelihood -111.465
Date: Wed, 28 Aug 2024 AIC 230.931
Time: 10:05:51 BIC 241.270
Sample: 01-01-1875 HQIC 235.113
- 01-01-1972
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 579.0130 0.195 2962.352 0.000 578.630 579.396
ma.L1 1.0174 0.089 11.450 0.000 0.843 1.192
ma.L2 0.5008 0.101 4.969 0.000 0.303 0.698
sigma2 0.5626 0.088 6.393 0.000 0.390 0.735
===================================================================================
Ljung-Box (L1) (Q): 2.44 Jarque-Bera (JB): 0.58
Prob(Q): 0.12 Prob(JB): 0.75
Heteroskedasticity (H): 1.14 Skew: -0.14
Prob(H) (two-sided): 0.70 Kurtosis: 2.75
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The constant (which in this instance is the mean of the series) and both lags are once again significant, and the model appears to be behaving well. So what does it look like?
"""
Plotting the MA(2) Model
"""
fitted_values_huron = results_huron.fittedvalues
plt.figure(figsize=(8, 5))
plt.plot(ts_huron, label='Actual', color='blue')
plt.plot(ts_huron.index, fitted_values_huron, label='Predicted', color='orange')
plt.title('MA(2) Model Fit of Annual Water Level in Lake Huron')
plt.xlabel('Year')
plt.ylabel('Water Level (ft)')
plt.legend()
plt.show()
This looks, superficially, like a worse fit than before, but remember that this model is defined only by the residuals of the previous two time-steps. Even so, this is not a pure MA process, so how can we improve the prediction?
The ARIMA Model
The AutoRegressive Integrated Moving Average (ARIMA) model combines the two models that we've discussed above, while also allowing for differencing (integration) to deal with trend or other non-stationarities. The model is a simple combination of these things, such that an ARIMA(p, d, q) process is defined as:

$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$

where $y'_t$ is the d-th difference of $y$ at time $t$. If the series is already stationary, we can set d to 0 and technically use an ARMA(p, q) model, but for our purposes, we'll continue calling it an ARIMA(p, 0, q) model.
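If differencing is needed, pandas handles it directly; here is a one-line sketch using the Lake Huron series we already have, purely as a stand-in (that series doesn't actually need differencing):

"""
Differencing a Series (illustrative sketch)
"""
# first difference (d = 1); call .diff() on the result again for d = 2, and so on
huron_diff = ts_huron.diff().dropna()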
Returning to our Huron example, if we look at the PACF we get this:
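The plot itself would come from the same call used for the sunspot data:

"""
PACF Plot for Huron Data
"""
sm.graphics.tsa.plot_pacf(ts_huron, lags=30)
plt.show()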
There is a clear partial autocorrelation at lag 1, so an autoregressive term should be added to the model. Since we know that the series is stationary, no differencing is needed. So, we construct an ARIMA(1, 0, 2) model:
"""
Constructing the ARIMA(1, 0, 2) Model
"""
model_huron_arima = ARIMA(ts_huron, order=(1,0,2))
results_huron_arima = model_huron_arima.fit()
print(results_huron_arima.summary())
SARIMAX Results
==============================================================================
Dep. Variable: LakeHuron No. Observations: 98
Model: ARIMA(1, 0, 2) Log Likelihood -103.232
Date: Wed, 28 Aug 2024 AIC 216.465
Time: 10:55:14 BIC 229.389
Sample: 01-01-1875 HQIC 221.692
- 01-01-1972
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 579.0522 0.356 1624.669 0.000 578.354 579.751
ar.L1 0.7304 0.123 5.918 0.000 0.489 0.972
ma.L1 0.3406 0.143 2.374 0.018 0.059 0.622
ma.L2 0.0273 0.139 0.196 0.845 -0.245 0.300
sigma2 0.4748 0.072 6.607 0.000 0.334 0.616
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 0.25
Prob(Q): 1.00 Prob(JB): 0.88
Heteroskedasticity (H): 1.56 Skew: 0.09
Prob(H) (two-sided): 0.21 Kurtosis: 2.83
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
One of the lags now appears to be individually insignificant, but we can test whether the coefficients are jointly significant with a Wald test:
"""
Wald Test for the ARIMA(1, 0, 2) Coefficients
"""
wald = results_huron_arima.wald_test(np.eye(5), scalar=True)  # identity matrix over all 5 estimated parameters
print(wald)
<Wald test (chi2): statistic=2813875.0424544676, p-value=0.0, df_denom=5>
This implies that the coefficients are jointly significant. With that out of the way, let's plot the model!
"""
Plotting the ARIMA(1, 0, 2) Model
"""
fitted_values_huron_arima = results_huron_arima.fittedvalues
plt.figure(figsize=(8, 5))
plt.plot(ts_huron, label='Actual', color='blue')
plt.plot(ts_huron.index, fitted_values_huron_arima, label='Predicted', color='orange')
plt.title('ARIMA(1,0,2) Model Fit of Annual Water Level in Lake Huron')
plt.xlabel('Year')
plt.ylabel('Water Level (ft)')
plt.legend()
plt.show()
This model performs much better, indeed. But this is a very clean, stationary dataset with wonderful behaviour in the residuals. What happens when we have data that's non-stationary by way of seasonality and trend? What if we want to capture the seasonality in our model?
The SARIMA Model
Seasonal ARIMA is the final form of univariate time series regression. The 'seasonal' part of this model is that we add seasonal lags to the equation, giving a SARIMA(p, d, q) (P, D, Q, s) model. In backshift-operator notation, where $B^k y_t = y_{t-k}$, the model is:

$$\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\, y_t = c + \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t$$

where P, D, and Q are the seasonal counterparts of p, d, and q, applied at lags that are multiples of the season length s. This is a bit arcane, I know, so let's take an example from my thesis on electricity demand.
The data we're looking at is monthly national electricity demand in megawatts in Great Britain, from January 2009 to May 2024. I will spare you the data cleaning, but you can get this data from National Grid ESO at their data portal if you want to follow along, or check out my code and thesis if you are particularly interested. I aggregated the national demand data to year-month means, which is what we'll be looking at. Beginning with the seasonal decomposition:
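The cleaned demand data isn't included in this post, so the sketch below just shows roughly how the decomposition plot would be produced; it assumes a monthly DataFrame named demand with the national demand in a column called ND (both names are my assumption).

"""
Seasonal Decomposition of the Demand Data (sketch; `demand` and 'ND' are assumed names)
"""
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(demand['ND'], model='additive', period=12)
decomposition.plot()
plt.show()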
Since it's monthly data, a season length of 12 (one year) made sense, and the decomposition plot confirms that this was a reasonable assumption. We also see a clear downward trend and a particularly large shock (residual) at the beginning of lockdown. Before we do anything else, we should run an ADF test for stationarity.
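The statistics below would come from running adfuller on the raw and first-differenced series, roughly like this (again using the assumed demand DataFrame):

"""
ADF Tests for the Demand Data (sketch; assumes the `demand` DataFrame from above)
"""
from statsmodels.tsa.stattools import adfuller

adf_raw = adfuller(demand['ND'])
print(f"ADF Statistic: {adf_raw[0]}")
print(f"p-value: {adf_raw[1]}")

adf_diff = adfuller(demand['ND'].diff().dropna())
print(f"ADF Statistic (post diff): {adf_diff[0]}")
print(f"p-value (post diff): {adf_diff[1]}")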
ADF Statistic: -0.6222989782285488
p-value: 0.8658633473639568
ADF Statistic (post diff): -8.570489233800481
p-value (post diff): 8.223096059429689e-14
So, we need integration of order 1 (d = 1). Now we have to figure out the seasonal and non-seasonal orders. Conveniently, both will show up on the ACF and PACF, along with the season length.
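The ACF and PACF described below would be plotted on the first-differenced series, along these lines (same assumed demand DataFrame):

"""
ACF/PACF Plots for the Differenced Demand Data (sketch)
"""
demand_diff = demand['ND'].diff().dropna()
sm.graphics.tsa.plot_acf(demand_diff, lags=36)
sm.graphics.tsa.plot_pacf(demand_diff, lags=36)
plt.show()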
The ACF shows clear(-ish) peaks up to lag 2, and the PACF seems to cut off at lag 1. Seasonally, we see peaks spaced 12 months apart, with roughly the same (seasonally lagged) pattern as the non-seasonal orders. Given these observations, the model should be SARIMA(1, 1, 2) (1, 1, 2, 12). Constructing this model gives the summary below.
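The thesis code isn't reproduced here, but the fit would look roughly like the sketch below, using statsmodels' SARIMAX (which is what the summary header suggests was used); the variable names are my own.

"""
Constructing the SARIMA(1, 1, 2)(1, 1, 2, 12) Model (sketch; assumes `demand`)
"""
from statsmodels.tsa.statespace.sarimax import SARIMAX

model_sarima = SARIMAX(demand['ND'], order=(1, 1, 2), seasonal_order=(1, 1, 2, 12))
results_sarima = model_sarima.fit()
print(results_sarima.summary())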
SARIMAX Results
==========================================================================================
Dep. Variable: ND No. Observations: 132
Model: SARIMAX(1, 1, 2)x(1, 1, 2, 12) Log Likelihood -1028.512
Date: Tue, 30 Jul 2024 AIC 2071.025
Time: 19:46:28 BIC 2090.478
Sample: 01-01-2011 HQIC 2078.924
- 12-01-2021
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.2721 0.804 0.338 0.735 -1.303 1.847
ma.L1 -0.4302 0.791 -0.544 0.587 -1.981 1.121
ma.L2 0.0038 0.128 0.030 0.976 -0.247 0.255
ar.S.L12 -0.5019 1.038 -0.484 0.629 -2.536 1.532
ma.S.L12 0.1674 1.074 0.156 0.876 -1.937 2.272
ma.S.L24 -0.2174 0.299 -0.727 0.467 -0.804 0.369
sigma2 1.661e+06 2.21e+05 7.521 0.000 1.23e+06 2.09e+06
===================================================================================
Ljung-Box (L1) (Q): 0.05 Jarque-Bera (JB): 3.52
Prob(Q): 0.83 Prob(JB): 0.17
Heteroskedasticity (H): 0.93 Skew: -0.35
Prob(H) (two-sided): 0.83 Kurtosis: 3.47
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
This looks absolutely terrible on the face of it, but in practice, complex models like these rarely have individually significant coefficients. Running a Wald test, we get:
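The test mirrors the earlier one, with the identity matrix sized to the seven coefficients in the summary (a sketch, using my assumed results_sarima name from above):

"""
Wald Test for the SARIMA Coefficients (sketch; assumes `results_sarima` from above)
"""
wald_sarima = results_sarima.wald_test(np.eye(7), scalar=True)  # 7 estimated parameters in the summary
print(wald_sarima)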
<Wald test (chi2): statistic=373.7053660747228, p-value=1.0325265818529919e-76, df_denom=7>
So, the coefficients are indeed jointly significant. No need to worry about that, then.
The resulting plot of the model -- which is taken directly from my thesis, and therefore looks nicer, with confidence intervals to boot -- looks like this:
This model turned out great! In the next installment of this series, we'll talk about what exactly 'great' means, beyond visually inspecting a plot.