A Primer
In this post, I’ll work through some of the elementary concepts in time series analysis. I’ll mostly focus on features of time series and popular (basic) modeling techniques. I’m doing this review to prepare for an interview, so I’ll reserve more advanced methods for a later post. I’ll mostly follow *Introduction to Time Series and Forecasting* by Brockwell and Davis.
A good place to start is with an explanation of what a time series is. In simple terms, a time series is a dataset where each observation is associated with some timepoint. Timepoints can be discrete or continuous, and they yield discrete time series and continuous time series, respectively.
We will use \(\{ x_t \}_{t = 0}^{n}\) to denote the observed values of a time series (where $n$ is permitted to be $\infty$). We assume that each $x_t$ is a realization of some random variable, $X_t$, and so the observed time series is a realization of the sequence \(\{ X_t \}_{t = 0}^m\) (where $m \geq n$ and is also permitted to be $\infty$).
In what follows, we’ll omit the indexing of time for sequences of random variables (i.e. \(\{ X_t \}\)), but we will usually include the indexing for samples. The time series should be assumed to be discrete unless context dictates otherwise. We’ll also do most of our work with univariate time series, with the extension to multiple dimensions being fairly straightforward.
We now move to characteristics of time series, namely stationarity and autocorrelation.
With these definitions, we come to stationarity. In words, a (weakly) stationary time series has a mean function that doesn’t change over time and an autocovariance function that depends only on the lag between observations, not on $t$ itself.
Similarly, a time series, \(\{ X_t \}\), is called strictly stationary if, for any $h, n \in \mathbb{Z}^+$, $(X_1, \dots, X_n)$ and $(X_{1 + h}, \dots, X_{n + h})$ have the same joint distribution.
A strictly stationary time series will also be weakly stationary if $\mathbb{E}[X_t^2] < \infty$ for all $t$. Furthermore, a strictly stationary time series will be a sequence of identically distributed random variables.
For a stationary time series, we can define the autocovariance function (ACVF), $\gamma_X(h) = \text{Cov}(X_{t + h}, X_t)$, and the autocorrelation function (ACF), $\rho_X(h) = \gamma_X(h) / \gamma_X(0)$. These functions describe the dependence (or lack thereof) between the random variables within the sequence.
The above functions have some nice properties for stationary processes, which we explore below.
Let $\{ X_t \}$ be a stationary process. Then:

1. $\gamma_X(0) \geq 0$,
2. $\lvert \gamma_X(h) \rvert \leq \gamma_X(0)$ for all $h$, and
3. $\gamma_X(h) = \gamma_X(-h)$ for all $h$ (i.e. $\gamma_X(\cdot)$ is even).
To show the first claim, note that:
\[\gamma_X(0) = \text{Cov}(X_{t}, X_{t}) = \text{Var}(X_t)\]which must be non-negative. To show the second, recall that correlations have absolute value at most $1$. Then see that:
\[\begin{aligned} \rho_X(h) &= \frac{\gamma_X(h)}{\gamma_X(0)} \\ \implies \gamma_X(0) \lvert \rho_X(h) \rvert &= \lvert \gamma_X(h) \rvert \\ \implies \gamma_X(0) &\geq \lvert \gamma_X(h) \rvert \end{aligned}\]The third claim follows from the following argument:
\[\begin{aligned} \gamma_X(h) &= \text{Cov}(X_{t + h}, X_t) \\ &= \text{Cov}(X_t, X_{t+h}) & (\text{symmetry of covariance}) \\ &= \gamma_X(-h) & (\text{def. of } \gamma_X \text{ at lag } -h) \end{aligned}\]We also have that the autocovariance function must be positive semi-definite. A real-valued function, $f: \mathbb{Z} \rightarrow \mathbb{R}$, is positive semi-definite if, for any $n \in \mathbb{Z}^+$ and any vector $\mathbf{a} \in \mathbb{R}^n$:
\[\begin{equation} \label{eq:non-negative-definite} \sum_{i = 1}^n \sum_{j = 1}^n \mathbf{a}_i f(i - j) \mathbf{a}_j \geq 0 \end{equation}\]A real-valued function, $f: \mathbb{Z} \rightarrow \mathbb{R}$, is the autocovariance function of some stationary time series if, and only if, it is even (i.e. $f(h) = f(-h)$ for all $h$) and positive semi-definite.
Fix $\mathbf{a} \in \mathbb{R}^n$, and let $\mathbf{X}_n = (X_1, \dots, X_n)^\top$. Define the covariance matrix of $\mathbf{X}_n$ as:
\[\Gamma_n = \mathbb{E}\left[ (\mathbf{X}_n - \mathbb{E}[\mathbf{X}_n])(\mathbf{X}_n - \mathbb{E}[\mathbf{X}_n])^\top \right]\]We then have:
\[\begin{aligned} \text{Var}(\mathbf{a}^\top \mathbf{X}_n) &= \mathbf{a}^\top \text{Var}(\mathbf{X}_n) \mathbf{a} \\ &= \mathbf{a}^\top \mathbb{E}\left[ (\mathbf{X}_n - \mathbb{E}[\mathbf{X}_n])(\mathbf{X}_n - \mathbb{E}[\mathbf{X}_n])^\top \right] \mathbf{a} \\ &= \mathbf{a}^\top \Gamma_n \mathbf{a} \geq 0 \end{aligned}\]where the inequality holds because a variance is always non-negative. Since stationarity gives $(\Gamma_n)_{ij} = \gamma_X(i - j)$, the quadratic form $\mathbf{a}^\top \Gamma_n \mathbf{a}$ is exactly the sum in Eq. \eqref{eq:non-negative-definite}, so $\gamma_X(\cdot)$ is positive semi-definite.
The converse is more involved and omitted.
The above properties extend to the autocorrelation function with the addition that $\rho_X(\cdot)$ also satisfies $\rho_X(0) = 1$.
The above functions were defined for sequences of random variables. We can define analogous functions for observations: the sample autocovariance function is \(\hat{\gamma}(h) = \frac{1}{n} \sum_{t = 1}^{n - \lvert h \rvert} (x_{t + \lvert h \rvert} - \bar{x})(x_t - \bar{x})\), and the sample autocorrelation function is \(\hat{\rho}(h) = \hat{\gamma}(h) / \hat{\gamma}(0)\), where \(\bar{x}\) is the sample mean.
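To make these concrete, here’s a minimal numpy sketch of the sample versions (the function names `sample_acvf` and `sample_acf` are my own, not the book’s):

```python
import numpy as np

def sample_acvf(x, h):
    """Sample autocovariance at lag h: mean-centered lagged products over n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = abs(h)
    xbar = x.mean()
    # Divide by n (not n - h): this keeps the sample autocovariance matrix
    # positive semi-definite, matching the convention in Brockwell & Davis.
    return np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n

def sample_acf(x, h):
    """Sample autocorrelation at lag h."""
    return sample_acvf(x, h) / sample_acvf(x, 0)

# Usage: the sample ACF of white noise should be near zero at nonzero lags.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=1000)
print([round(sample_acf(z, h), 3) for h in range(4)])
```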
We’ll introduce a few important examples from the book to help illustrate some of the concepts.
Since $\mathbb{E}\left[ X_t \right] = 0$ for all $t$, and since $\gamma_X(t+h, t) = 0$ for all $h > 0$ (and equal to $\sigma^2$ for $h = 0$) because distinct terms are uncorrelated, neither the mean nor the autocovariance depends on $t$, so white noise is a stationary time series.
Let $\{ X_t \}$ be i.i.d. noise with mean zero and variance $\sigma^2 < \infty$. The corresponding random walk, $S_t = X_1 + X_2 + \dots + X_t$ (with $S_0 = 0$), also has mean zero by the linearity of expectation. We also have:
\[\begin{aligned} \mathbb{E}\left[ S_t^2 \right] &= \mathbb{E}\left[ \left( \sum_{i = 1}^t X_i \right)^2 \right] \\ &= \sum_{i = 1}^t \sum_{j = 1}^t \mathbb{E}\left[ X_i X_j \right] \\ &= \sum_{i = 1}^t \mathbb{E}\left[ X_i^2 \right] & \left( X_i, X_j \text{ ind. for } i \neq j \right) \\ &= t \sigma^2 \end{aligned}\]However:
\[\begin{aligned} \gamma_S(t + h, t) &= \text{Cov}(S_{t+h}, S_t) \\ &= \text{Cov}\left(S_t + \sum_{i = t+1}^{t+h} X_i, S_t\right) \\ &= \text{Cov}(S_t, S_t) \\ &= t \sigma^2 \end{aligned}\]The autocovariance depends on $t$, so this random walk is not stationary.
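As a quick sanity check, here’s a small simulation of my own (the seed and sample sizes are arbitrary): the variance of $S_t$ across many independent walks should grow roughly linearly in $t$.

```python
import numpy as np

# Simulate many independent random walks S_t = X_1 + ... + X_t and check
# that the variance across walks grows (roughly) linearly in t.
rng = np.random.default_rng(0)
sigma = 1.0
n_walks, n_steps = 10_000, 100
steps = rng.normal(0.0, sigma, size=(n_walks, n_steps))
walks = np.cumsum(steps, axis=1)  # walks[:, t-1] holds S_t

for t in (10, 50, 100):
    print(t, round(walks[:, t - 1].var(), 2))  # should be close to t * sigma**2
```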
We now introduce several stationary processes that are very important to time series analysis.
An $MA(q)$ process, $X_t = Z_t + \theta_1 Z_{t - 1} + \dots + \theta_q Z_{t - q}$, will be stationary, and it will also be strictly stationary because $\{ Z_t \}$ is i.i.d. noise.
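As an illustration, here’s a hypothetical numpy sketch that simulates an MA(2) and computes its sample ACF, which should be near zero for lags $h > q$ (the coefficients and seed are arbitrary choices of mine):

```python
import numpy as np

def acf(x, h):
    """Sample autocorrelation at lag h (same convention as sample_acf above)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.sum(x[h:] * x[:n - h]) / np.sum(x * x)

# Simulate an MA(2): X_t = Z_t + theta_1 Z_{t-1} + theta_2 Z_{t-2}.
rng = np.random.default_rng(1)
theta = np.array([1.0, 0.6, -0.3])  # coefficients on Z_t, Z_{t-1}, Z_{t-2}
z = rng.normal(size=5000)
x = np.convolve(z, theta, mode="valid")  # each entry is one X_t

# The sample ACF should be clearly nonzero for h <= 2 and near zero beyond.
print([round(acf(x, h), 3) for h in range(5)])
```

This cutoff in the ACF after lag $q$ is a classic diagnostic for identifying the order of a moving average process.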
A linear process, $X_t = \sum_{j = -\infty}^\infty \psi_j Z_{t - j}$ with $\sum_{j} \lvert \psi_j \rvert < \infty$, can be written with the backward shift operator as:
\[\begin{aligned} \psi(B) &= \sum_{j = -\infty}^\infty \psi_j B^j \\ X_t &= \psi(B) Z_t \end{aligned}\]If $\psi_j = 0$ for all $j < 0$, then we call the linear process a moving average (or an MA($\infty$)).
Combining a moving average model with an autoregressive model (in which $X_t$ is regressed on its own past values) yields an autoregressive moving average model: an ARMA($p$, $q$) process satisfies $X_t - \phi_1 X_{t-1} - \dots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}$.
An ARMA process can also be written with a backward shift operator. If we define $B^jX_t = X_{t - j}$ and $B^j Z_t = Z_{t - j}$ for $j \in \mathbb{Z}$, then:
\[\phi(B) X_t = \theta(B) Z_t\]where $\phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q$. We can also define a special function for ARMA processes. The partial autocorrelation function (PACF) of an ARMA process is the function $\alpha(h) = \phi_{h, h}$ for $h \geq 1$ (with $\alpha(0) = 1$), where:
\[\phi_{h,h} = \left[ \begin{bmatrix} \gamma_X(1-1) & \gamma_X(1-2) & \dots & \gamma_X(1-h) \\ \gamma_X(2-1) & \gamma_X(2-2) & \dots & \gamma_X(2-h) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_X(h-1) & \gamma_X(h-2) & \dots & \gamma_X(h-h) \end{bmatrix}^{-1} \begin{bmatrix} \gamma_X(1) \\ \vdots \\ \gamma_X(h) \end{bmatrix} \right]_{h}\]that is, $\phi_{h,h}$ is the $h$-th entry of the solution to the linear system with the $h \times h$ autocovariance matrix on the left. A sample version can also be computed using the sample autocovariance function.
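Here’s a hypothetical sketch (the function name is mine) that computes $\phi_{h,h}$ directly from a given autocovariance sequence by solving the linear system above; in practice the Durbin-Levinson recursion is the standard, more efficient route:

```python
import numpy as np

def pacf_from_acvf(gamma, h):
    """Partial autocorrelation at lag h >= 1, given gamma(0), ..., gamma(h).

    Solves Gamma_h * phi = gamma_h and returns the last component, matching
    the definition above.
    """
    gamma = np.asarray(gamma, dtype=float)
    # (Gamma_h)_{ij} = gamma(|i - j|), using the evenness of the autocovariance.
    Gamma = np.array([[gamma[abs(i - j)] for j in range(h)] for i in range(h)])
    rhs = gamma[1 : h + 1]
    return np.linalg.solve(Gamma, rhs)[-1]

# Usage: for an MA(1) with theta = 0.5 and sigma^2 = 1, gamma(0) = 1 + theta^2,
# gamma(1) = theta, and gamma(h) = 0 for h > 1.
theta = 0.5
gamma = [1 + theta**2, theta, 0.0, 0.0, 0.0]
print([round(pacf_from_acvf(gamma, h), 4) for h in (1, 2, 3)])
```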
There are many different models that can be used for time series analysis. Here, we’ll cover some of the most foundational and common.
The non-seasonal model with trend has the form:
\[\begin{equation} \label{eq:non-seasonal-with-trend} X_t = m_t + Y_t \end{equation}\]Here, $m_t$ is the trend component, which is meant to capture long-term, slow-moving change in $X_t$; we can think of it as representing the overall direction of the sequence. $Y_t$ is weakly stationary random noise and captures anything in the sequence that is not captured by $m_t$.
We are often interested in estimating $m_t$ so that we can remove its effect, identify a reasonable (stationary!) noise distribution, and then use our knowledge of stationary processes for prediction and inference. There are several popular ways to do the initial estimation. We’ll assume we have some process, \(\{ X_t \}\), as described in Eq. \eqref{eq:non-seasonal-with-trend}.
One way to estimate $m_t$ is to use a moving average filter. For finite $q \in \mathbb{N}$, the two-sided moving average of \(\{ X_t \}\) is defined as:
\[W_t = \frac{1}{2q + 1} \sum_{j = - q}^q X_{t - j}\]Under the assumption that $m_t$ is approximately linear over $[t - q, t + q]$, and that the mean of the noise component over $[t - q, t + q]$ is approximately zero, we can estimate $m_t$ as $W_t$:
\[\begin{aligned} W_t &= \frac{1}{2q + 1} \sum_{j = - q}^q (m_{t - j} + Y_{t - j}) \approx m_t \\ \implies \hat{m}_t &= \sum_{j = -\infty}^\infty a_j X_{t - j} \hspace{10mm} \text{for } a_j = \frac{1}{2q + 1}; \hspace{2mm} -q \leq j \leq q \end{aligned}\](with $a_j = 0$ otherwise). One drawback is that this filter is two-sided, so it uses information from after time $t$ when estimating $m_t$. Additionally, our samples are usually finite (e.g. over time $[0, n]$), so the moving average estimate given above cannot be used for $t < q$ or $t > n - q$.
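A minimal numpy sketch of this filter (the function name and the NaN convention at the unusable endpoints are my own choices):

```python
import numpy as np

def moving_average_trend(x, q):
    """Two-sided moving average trend estimate with window 2q + 1.

    Returns an array with NaN at the endpoints where the filter cannot
    be applied.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    m_hat = np.full(n, np.nan)
    weights = np.ones(2 * q + 1) / (2 * q + 1)
    m_hat[q : n - q] = np.convolve(x, weights, mode="valid")
    return m_hat

# Usage: a linear trend plus noise; the filter should roughly recover the trend.
rng = np.random.default_rng(2)
t = np.arange(200)
x = 0.05 * t + rng.normal(0.0, 0.5, size=200)
print(np.round(moving_average_trend(x, q=5)[5:10], 2))
```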
Exponential smoothing provides a one-sided alternative to the moving average filter. One can fix $\alpha \in [0, 1]$ and estimate $m_t$ with:
\[\hat{m}_t = \alpha X_t + (1 - \alpha) \hat{m}_{t - 1} \hspace{10mm} \text{for } t = 2, \dots, n\]where we define $\hat{m}_1 = X_1$. The name comes from the fact that unrolling the recursion for $t \geq 2$ yields weights that decay exponentially with the lag:
\[\hat{m}_t = (1 - \alpha)^{t-1} X_1 + \sum_{j = 0}^{t - 2} \alpha(1 - \alpha)^j X_{t - j}\]
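Here’s a short sketch of the recursion (again, the function name is mine; $\alpha$ and the data are arbitrary):

```python
import numpy as np

def exponential_smoothing(x, alpha):
    """One-sided exponential smoothing: m_hat[0] = x[0], then the recursion."""
    x = np.asarray(x, dtype=float)
    m_hat = np.empty_like(x)
    m_hat[0] = x[0]
    for t in range(1, len(x)):
        m_hat[t] = alpha * x[t] + (1 - alpha) * m_hat[t - 1]
    return m_hat

# Usage: smaller alpha means heavier smoothing (older points keep more weight).
rng = np.random.default_rng(4)
x = np.linspace(0, 5, 50) + rng.normal(0.0, 0.5, size=50)
print(np.round(exponential_smoothing(x, alpha=0.3)[:5], 2))
```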
A simpler, though still useful, alternative is polynomial fitting, in which we fit a polynomial of degree $k$ of the form:
\[m_t = a_0 + a_1 t + a_2 t^2 + \dots + a_k t^k\]These polynomials are usually fit via least squares.
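For instance, with numpy’s built-in least-squares polynomial fit (the data here are synthetic):

```python
import numpy as np

# Least-squares fit of a degree-k polynomial trend m_t = a_0 + a_1 t + ... + a_k t^k.
rng = np.random.default_rng(3)
t = np.arange(100, dtype=float)
x = 2.0 + 0.1 * t - 0.002 * t**2 + rng.normal(0.0, 0.3, size=100)

coeffs = np.polyfit(t, x, deg=2)  # highest-degree coefficient first
m_hat = np.polyval(coeffs, t)     # fitted trend
print(np.round(coeffs[::-1], 3))  # roughly (2.0, 0.1, -0.002)
```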
To explain differencing, we first define the backward shift operator and the lag-$1$ difference operator, respectively:
\[\begin{aligned} B X_t &= X_{t-1} \\ \nabla X_t &= X_t - X_{t-1} = (1 - B) X_t \end{aligned}\]We can compound both to the $k$-th degree as:
\[\begin{aligned} \nabla^k X_t &= \nabla(\nabla^{k - 1} X_t ) = (1 - B)^k X_t\\ B^k X_t &= X_{t - k} \end{aligned}\]Under the assumption that the trend component is a degree-$k$ polynomial, we can apply the difference operator $k$ times to yield:
\[\begin{equation} \label{eq:differencing} \begin{aligned} \nabla^k X_t &= \nabla^k m_t + \nabla^k Y_t \\ &= \nabla^{k}\left[ \sum_{j = 0}^k c_j t^j \right] + \nabla^k Y_t \\ &= k! c_k + \nabla^k Y_t \end{aligned} \end{equation}\]If $X_t = m_t + Y_t$ and $Y_t$ is stationary with mean zero, then Eq. \eqref{eq:differencing} is stationary with mean $k! c_k$.
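A quick numerical check of this fact (synthetic data of my own; numpy’s `np.diff` applies $\nabla^k$ via its `n` argument):

```python
import numpy as np

# Differencing a quadratic trend (k = 2): after applying nabla twice, the
# deterministic part collapses to k! * c_k = 2 * c_2.
rng = np.random.default_rng(5)
t = np.arange(500, dtype=float)
c2 = 0.01
x = 1.0 + 0.5 * t + c2 * t**2 + rng.normal(0.0, 1.0, size=500)

d2 = np.diff(x, n=2)        # nabla^2 X_t
print(round(d2.mean(), 3))  # should be near 2 * c2 = 0.02
```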
The classical decomposition model has the form:
\[\begin{equation} \label{eq:classical-decomposition-model} X_t = m_t + s_t + Y_t \end{equation}\]The $m_t$ and $Y_t$ components are the same as in the non-seasonal model with trend, and $s_t$ is the seasonal component, which has known period $d$. The seasonal component captures patterns that occur repeatedly in the sequence and should satisfy $\sum_{j = 1}^d s_j = 0$. The period represents the amount of time between repetitions, so $s_{t + d} = s_t$.
Because there is an additional component in the classical decomposition model, we must do several estimation steps to remove both the trend and seasonality to form a noise process.
For example, one way is to first apply a filter to estimate the trend. For an odd period, we can use a two-sided moving average centered at $t$ with window length equal to the period. For an even period (e.g. $d = 2q$), we can instead use a slightly modified moving average:
\[\hat{m}_t = \frac{1}{d} \left(\frac{1}{2} x_{t - q} + x_{t - q + 1} + \dots + x_{t + q - 1} + \frac{1}{2} x_{t+q} \right) \hspace{10mm} \text{for } q < t \leq n - q\]We can compute the sequence of deviations \(\{ x_t - \hat{m}_t \}_{t = q + 1}^{n - q}\). We then average the deviations at each “position” in the period: for $k = 1, \dots, d$, let $w_k$ be the average of $\{ x_{k + jd} - \hat{m}_{k + jd} : q < k + jd \leq n - q \}$. We next estimate the seasonal component as:
\[\begin{aligned} \hat{s}_k &= w_k - \frac{1}{d} \sum_{i = 1}^d w_i \hspace{10mm} \text{for } k = 1, \dots, d \\ \hat{s}_k &= \hat{s}_{k - d} \hspace{10mm} \text{for } k > d \end{aligned}\]Removing the estimate for the seasonal component yields the deseasonalized data:
\[d_t = x_t - \hat{s}_t\]and we can use one of the methods from the previous section to estimate the trend for this non-seasonal data. The estimated noise series is:
\[\hat{Y}_t = x_t - \hat{m}_t - \hat{s}_t\]where $\hat{s}_t$ is the estimated seasonal component, and $\hat{m}_t$ is the estimated trend component for the non-seasonal data.
Similar to the non-seasonal model, we can use differencing for a model with a seasonal component. The lag-$d$ differencing operator is defined as:
\[\nabla_d X_t = X_t - X_{t - d} = X_t - B^d X_t\]Using this operator on the classical decomposition model where $s_t$ has a period of $d$ yields:
\[\nabla_d X_t = m_t - m_{t - d} + Y_t - Y_{t - d}\]The above is simply a non-seasonal model with trend component $m_t - m_{t-d}$ and noise $Y_t - Y_{t - d}$, and the methods from the previous section can be used.
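A small numerical check (synthetic data of my own with period $d = 12$): the seasonal component cancels exactly, leaving a non-seasonal model in the differences.

```python
import numpy as np

# Lag-d differencing removes a seasonal component with period d, since
# s_t - s_{t-d} = 0; what remains is the differenced trend plus noise.
rng = np.random.default_rng(7)
t = np.arange(240)
d = 12
x = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / d) + rng.normal(0.0, 0.5, size=240)

x_diff = x[d:] - x[:-d]       # nabla_d X_t = X_t - X_{t-d}
print(round(x_diff.mean(), 2))  # near d * 0.05 = 0.6, the differenced trend
```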