Every model we have built so far shares one fundamental characteristic: it is univariate. Whether it was STL decomposition, ETS, or ARIMA, we only looked inward — we used the history of y_t itself to forecast its future.
That approach has taken us surprisingly far. But it ignores the outside world entirely.
A series like mexretail reflects consumer purchasing power, employment, and credit conditions — none of which appear in the series itself. The question we are addressing now is: what if we gave our model access to external information?
The simplest case is the simple linear regression model:
y_t = \beta_0 + \beta_1 x_t + \varepsilon_t
where:

- y_t is the forecast variable (what we want to predict),
- x_t is the predictor variable (the external information),
- \beta_0 (intercept) and \beta_1 (slope) are unknown coefficients to be estimated,
- \varepsilon_t is a random error term.
How are the \beta’s estimated? Ordinary Least Squares (OLS)
We cannot observe the \beta’s directly — we estimate them from data. The Ordinary Least Squares (OLS) method chooses the estimates \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k that minimize the sum of squared residuals:
\min_{\hat{\boldsymbol{\beta}}} \sum_{t=1}^{T} \hat{\varepsilon}_t^2 = \sum_{t=1}^{T} \left(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{1t} - \cdots - \hat{\beta}_k x_{kt}\right)^2
For the simple case (k = 1), taking partial derivatives and setting them to zero yields closed-form solutions:
\hat{\beta}_1 = \frac{\sum x_t y_t - T\bar{x}\bar{y}}{\sum x_t^2 - T\bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
For the general case with k predictors, the solution extends naturally to matrix form:
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
where \mathbf{X} is the design matrix — a column of ones (for the intercept) plus one column per predictor — and \mathbf{y} is the vector of observed responses.
In practice, R handles all of this internally. But knowing the objective function matters: OLS penalizes large residuals more than small ones (because of the square), and under the classical assumptions the Gauss-Markov theorem guarantees the estimates are BLUE (Best Linear Unbiased Estimators) — regardless of whether we have one predictor or twenty.
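The matrix formula can be checked directly on simulated data; a minimal sketch (all names are my own) comparing the normal-equation solution with lm():

```r
# OLS by the normal equations: beta_hat = (X'X)^{-1} X'y
set.seed(42)
T_obs <- 100
x <- rnorm(T_obs)
y <- 2 + 0.5 * x + rnorm(T_obs, sd = 0.3)   # true beta0 = 2, beta1 = 0.5

X <- cbind(1, x)                            # design matrix: ones + predictor
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solve (X'X) b = X'y

# lm() minimizes the same sum of squared residuals,
# so the two sets of estimates agree to numerical precision
rbind(normal_eq = as.numeric(beta_hat), lm = unname(coef(lm(y ~ x))))
```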
Terminology: many names for y and x
The same concept appears under different names depending on the discipline:
| y (forecast variable) | x (predictor variables) |
|---|---|
| Dependent | Independent |
| Explained | Explanatory |
| Regressand | Regressor |
| Response | Stimulus / Covariate |
| Endogenous | Exogenous |
In time series forecasting, FPP3 uses forecast variable and predictor variables.
The origin of “regression” — Galton, 1886
The term was coined by Francis Galton while studying the relationship between parents’ height and their children’s height.
His finding: tall parents tend to have tall children, but on average their children are not as tall as them. Short parents tend to have children taller than themselves. There is a tendency to regress toward the mean.
Without this regression to the mean, the distribution of heights across generations would diverge — we would eventually have people of Hobbit stature and people of giant stature, with nothing in between.
Modern regression analysis generalizes this idea: studying the dependence of one variable on one or more others to predict its average value.
“A statistical relationship, however strong and suggestive, can never establish causal connexion: our ideas of causality must come from outside statistics, ultimately from some theory or other.”
— Kendall & Stuart (1961)
\text{Regression} \neq \text{Causality} \qquad \text{Correlation} \neq \text{Causality}
This is not a technicality — it is one of the most practically important ideas in data science:
Important
A regression model would happily fit a line through any of those charts. The model does not know the relationship is nonsense — you have to know. Causal direction requires theory, domain knowledge, or experimental design, not statistical strength alone.
Are these models linear?
Linearity in the variables — E(y \mid x) is a linear function of x. This is the restrictive sense: only straight lines.
Linearity in the parameters — E(y \mid x) is linear in the \beta’s.
The three models above are all nonlinear in x but linear in \beta. A linear regression model can produce a straight line, a parabola, an exponential curve, or a piecewise function, depending on the functional form chosen.
We will use us_change — quarterly percentage changes in US macroeconomic variables. Our forecast variable is Consumption.
Before fitting any model, we look at the data.
The scatter matrix shows correlations between all pairs of variables. Notice that some predictors are correlated with each other — we will return to this when we discuss multicollinearity.
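The scatter matrix itself can be drawn with ggpairs(); a sketch, assuming the GGally package is available alongside fpp3:

```r
library(fpp3)
library(GGally)  # assumed available; provides ggpairs()

# Pairwise scatterplots and correlations for all five variables
us_change |>
  as_tibble() |>
  select(-Quarter) |>
  ggpairs()
```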
We start with the simplest possible model: Consumption as a function of Income only.
y_{t, \text{Consumption}} = \beta_0 + \beta_1 x_{t, \text{Income}} + \varepsilon_t
In fable, time series linear models use TSLM():
The model is fitted directly on the us_change tsibble.
TSLM() fits a Time Series Linear Model using OLS. The formula syntax is identical to lm().
report() prints the full regression output.
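A sketch of the fitting call that produces the output shown next (the model label simple is my choice):

```r
library(fpp3)

# Simple linear regression of Consumption on Income, fitted by OLS
us_change_fit_simple <- us_change |>
  model(simple = TSLM(Consumption ~ Income))

us_change_fit_simple |> report()
```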
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-2.58236 -0.27777 0.01862 0.32330 1.42229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.54454 0.05403 10.079 < 2e-16 ***
Income 0.27183 0.04673 5.817 2.4e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08
The output has two main parts:
Individual significance — each coefficient has a t-test: H_0: \beta_i = 0 \qquad H_1: \beta_i \neq 0 The stars (*, **, ***) indicate p < 0.05, p < 0.01, p < 0.001.
Joint significance — the F-test asks whether any predictor is useful: H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 A significant F-test does not mean all individual coefficients are significant.
The estimated model is: \hat{y}_{t, \text{Consumption}} = 0.545 + 0.272 \cdot x_{t, \text{Income}}
For every 1 percentage point increase in income growth, consumption growth increases by 0.272 percentage points on average.
OLS produces BLUE estimators only if the classical assumptions hold. The one most frequently violated in time series is no autocorrelation in the errors:
\text{cov}(\varepsilon_t, \varepsilon_s) = 0 \quad \forall \; t \neq s
If residuals are autocorrelated, OLS estimates are still unbiased but no longer efficient — standard errors are wrong, p-values are wrong, and forecast intervals are wrong. This is why we always check the residuals.
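In fable the standard residual check is gg_tsresiduals(); a self-contained sketch refitting the simple model:

```r
library(fpp3)

# Time plot, ACF, and histogram of the residuals in one display
us_change |>
  model(TSLM(Consumption ~ Income)) |>
  gg_tsresiduals()
```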
The residuals show clear autocorrelation. The simple model is not capturing all the dynamics of consumption. Let’s improve it.
In practice, one predictor is rarely sufficient. The multiple linear regression model is:
y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_k x_{kt} + \varepsilon_t
For us_change, a natural extension is:
y_{t, \text{Cons.}} = \beta_0 + \beta_1 x_{t, \text{Inc.}} + \beta_2 x_{t, \text{Prod.}} + \beta_3 x_{t, \text{Sav.}} + \beta_4 x_{t, \text{Unemp.}} + \varepsilon_t
Additional predictors are added to the model formula with +.
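A sketch of the fitting call, stored under the name us_change_fit_mult that is used later for forecasting:

```r
library(fpp3)

# Multiple regression: four predictors of consumption growth
us_change_fit_mult <- us_change |>
  model(
    multiple = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
  )

us_change_fit_mult |> report()
```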
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-0.90555 -0.15821 -0.03608 0.13618 1.15471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
Income 0.740583 0.040115 18.461 < 2e-16 ***
Production 0.047173 0.023142 2.038 0.0429 *
Savings -0.052890 0.002924 -18.088 < 2e-16 ***
Unemployment -0.174685 0.095511 -1.829 0.0689 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
The multiple model substantially improves fit (\bar{R}^2 goes from ~0.15 to ~0.75) and the residuals look much closer to white noise.
The residuals vs. predictors plot is a useful diagnostic specific to multiple regression. If any panel shows a systematic pattern (curve, fan shape), it suggests a missing nonlinear term or interaction involving that predictor.
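One way to build this plot (a sketch; the reshaping approach is my own):

```r
library(fpp3)
library(tidyverse)

fit <- us_change |>
  model(TSLM(Consumption ~ Income + Production + Savings + Unemployment))

# Join residuals onto the predictors, then draw one panel per predictor
us_change |>
  left_join(residuals(fit), by = "Quarter") |>
  as_tibble() |>
  pivot_longer(c(Income, Production, Savings, Unemployment),
               names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value, y = .resid)) +
  geom_point() +
  facet_wrap(vars(predictor), scales = "free_x") +
  labs(x = "Predictor value", y = "Residuals")
```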
With k potential predictors, there are 2^k possible models. We need a principled way to choose among them.
Do not use in-sample R^2 for selection. Adding any predictor, even pure noise, never decreases in-sample R^2. We need measures that penalize model complexity.
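This is the motivation behind the adjusted R^2 penalty, shown here for reference (k predictors, T observations):

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{T - 1}{T - k - 1}
```

Adding a useless predictor raises R^2 slightly but also raises k, so \bar{R}^2 can fall.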
The three standard criteria available from glance() in fable are AIC, AICc, and BIC (reported alongside the adjusted R^2):
Fit all possible models and compare:
us_change_fit_select <- us_change |>
model(
m1 = TSLM(Consumption ~ Income),
m2 = TSLM(Consumption ~ Income + Production),
m3 = TSLM(Consumption ~ Income + Savings),
m4 = TSLM(Consumption ~ Income + Unemployment),
m5 = TSLM(Consumption ~ Income + Production + Savings),
m6 = TSLM(Consumption ~ Income + Production + Unemployment),
m7 = TSLM(Consumption ~ Income + Savings + Unemployment),
m8 = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
)
us_change_fit_select |>
glance() |>
select(.model, adj_r_squared, AIC, AICc, BIC) |>
arrange(AICc)

Start with the full model and remove predictors one at a time:
us_change |>
model(
full = TSLM(Consumption ~ Income + Production + Savings + Unemployment),
drop1 = TSLM(Consumption ~ Income + Production + Savings),
drop2 = TSLM(Consumption ~ Income + Production + Unemployment),
drop3 = TSLM(Consumption ~ Income + Savings + Unemployment),
drop4 = TSLM(Consumption ~ Production + Savings + Unemployment)
) |>
glance() |>
select(.model, adj_r_squared, AICc, BIC) |>
arrange(AICc)

Start with the intercept-only model and add predictors one at a time:
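Step 1, implied by the numbering below but not shown, fits each single-predictor model (a sketch; the model labels are my own):

```r
library(fpp3)

# Step 1: one model per candidate predictor, ranked by AICc
us_change |>
  model(
    s1 = TSLM(Consumption ~ Income),
    s2 = TSLM(Consumption ~ Production),
    s3 = TSLM(Consumption ~ Savings),
    s4 = TSLM(Consumption ~ Unemployment)
  ) |>
  glance() |>
  select(.model, AICc) |>
  arrange(AICc)
```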
# Step 2: add a second predictor to the best single-predictor model
us_change |>
model(
f1 = TSLM(Consumption ~ Savings),
f12 = TSLM(Consumption ~ Savings + Income),
f13 = TSLM(Consumption ~ Savings + Production),
f14 = TSLM(Consumption ~ Savings + Unemployment)
) |>
glance() |>
select(.model, AICc) |>
arrange(AICc)

AICc vs. BIC: which one to use?
Both penalize model complexity, but differently:
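For reference, with k predictors, T observations, and SSE the sum of squared residuals, FPP3 defines them as:

```latex
\text{AIC} = T \log\!\left(\frac{\text{SSE}}{T}\right) + 2(k+2), \qquad
\text{AIC}_c = \text{AIC} + \frac{2(k+2)(k+3)}{T-k-3}, \qquad
\text{BIC} = T \log\!\left(\frac{\text{SSE}}{T}\right) + (k+2)\log T
```

Because \log T > 2 for T > 7, BIC penalizes each extra parameter more heavily and tends to select smaller models.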
In forecasting contexts, AICc is generally preferred because we care more about predictive accuracy than model parsimony per se. When they disagree, consider whether interpretability or predictive accuracy is the primary goal.
Predictor selection and multicollinearity are closely linked. Even after running all-subsets or stepwise search, the selected model may still include predictors that are highly correlated with each other — and that creates problems that no information criterion will automatically flag.
When two or more predictors are highly correlated, we have multicollinearity. It does not bias the OLS estimates, but it inflates the variance of the coefficients, making them unstable and hard to interpret.
The Variance Inflation Factor (VIF) measures how much the variance of \hat{\beta}_j is inflated by its correlation with the other predictors:
\text{VIF}_j = \frac{1}{1 - R_j^2}
where R_j^2 is the R^2 obtained by regressing x_j on all the other predictors — a measure of how well one predictor can be linearly predicted by the rest.
extract_fit_engine() retrieves the underlying lm object from the fable mable, allowing us to use base-R or car diagnostics.
vif() from the car package computes the VIF for each predictor.
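car::vif() does this in one call; to make the definition concrete, here is a hand-rolled version built directly from the formula above (a sketch using base lm()):

```r
library(fpp3)

predictors <- c("Income", "Production", "Savings", "Unemployment")
dat <- as_tibble(us_change)

# For each predictor, regress it on the others and apply VIF_j = 1 / (1 - R_j^2)
vifs <- sapply(predictors, function(p) {
  f  <- reformulate(setdiff(predictors, p), response = p)
  r2 <- summary(lm(f, data = dat))$r.squared
  1 / (1 - r2)
})
round(vifs, 2)
```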
If VIF values are large, you have several options:
If you have a large number of potential predictors with substantial collinearity among them, consider dimensionality reduction (e.g., PCA) before fitting. That is outside this course’s scope, but it is a common preprocessing step in production forecasting systems.
Before we can forecast y_t with regression, we need to ask: what do we do with the predictors?
With univariate models, forecast(h = n) was sufficient — the model only needed its own history. Now the model needs future values of x_t to produce a forecast of y_t.
This is not a trivial problem. It creates three distinct forecasting scenarios:
Ex-ante forecasts use only information available at the time the forecast is made. The predictors must themselves be forecasted.
This is the “true” forecast — it is what you would actually produce in a real deployment.
# Step 1: forecast each predictor independently
us_change_fc_income <- us_change |>
model(ARIMA(Income)) |>
forecast(h = 8)
us_change_fc_prod <- us_change |>
model(ARIMA(Production)) |>
forecast(h = 8)
us_change_fc_sav <- us_change |>
model(ARIMA(Savings)) |>
forecast(h = 8)
us_change_fc_unemp <- us_change |>
model(ARIMA(Unemployment)) |>
forecast(h = 8)
# Step 2: construct future data with forecasted predictor values
us_change_future <- new_data(us_change, 8) |>
mutate(
Income = us_change_fc_income |> pull(.mean),
Production = us_change_fc_prod |> pull(.mean),
Savings = us_change_fc_sav |> pull(.mean),
Unemployment = us_change_fc_unemp |> pull(.mean)
)
# Step 3: forecast consumption using the projected predictors
us_change_fit_mult |>
forecast(new_data = us_change_future) |>
autoplot(us_change |> filter_index("2010 Q1" ~ .)) +
labs(title = "Ex-ante forecast: US Consumption",
y = "% change", x = NULL)

Ex-post forecasts use actual realized values of the predictors. The forecast variable y_t is still unknown, but we feed in the true future x_t values.
These are not “real” forecasts — they are used to evaluate the regression model in isolation, removing predictor forecast error from the evaluation.
# Use filter_index to create a train/test split
us_change_train <- us_change |>
filter_index(. ~ "2017 Q4")
us_change_test <- us_change |>
filter_index("2018 Q1" ~ .)
us_change_fit_expost <- us_change_train |>
model(
multiple = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
)
# Ex-post: we supply actual future predictor values
us_change_fit_expost |>
forecast(new_data = us_change_test) |>
autoplot(us_change |> filter_index("2014 Q1" ~ .)) +
labs(title = "Ex-post forecast: actual predictor values used",
y = "% change", x = NULL)

Scenario-based forecasts construct hypothetical future values for the predictors and ask: what would consumption look like under each scenario?
These are especially useful for stress-testing and strategic planning.
us_change_fit_scen <- us_change |>
model(TSLM(Consumption ~ Income + Savings + Unemployment))
us_change_scenarios <- scenarios(
optimistic = new_data(us_change, 4) |>
mutate(Income = c(0.5, 0.8, 0.6, 1.0),
Savings = c(0.1, -0.2, 0.1, -0.1),
Unemployment = -0.1),
pessimistic = new_data(us_change, 4) |>
mutate(Income = -0.5,
Savings = -0.4,
Unemployment = 0.2),
names_to = "Scenario"
)
us_change_fit_scen |>
forecast(new_data = us_change_scenarios) |>
autoplot(us_change |> filter_index("2014 Q1" ~ .)) +
labs(title = "Scenario-based forecast: US Consumption",
y = "% change", x = NULL)

TSLM() provides built-in terms that are particularly useful when working with time series:
- trend() — a deterministic linear trend (t = 1, 2, \ldots, T).
- season() — seasonal dummy variables (one per season, with one omitted as the baseline to avoid perfect collinearity).
- fourier(K) — Fourier terms for flexible seasonality (covered in Module 4).

We will illustrate with quarterly beer production in Australia.
trend() adds a linear time index; season() adds quarterly dummy variables.
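A sketch of the fit, assuming (as in FPP3) the aus_production data restricted to 1992 onward:

```r
library(fpp3)

# Quarterly Australian beer production from 1992 on (assumed window)
recent_beer <- aus_production |>
  filter(year(Quarter) >= 1992) |>
  select(Quarter, Beer)

beer_fit <- recent_beer |>
  model(TSLM(Beer ~ trend() + season()))

beer_fit |> report()
```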
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.9029 -7.5995 -0.4594 7.9908 21.7895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 441.80044 3.73353 118.333 < 2e-16 ***
trend() -0.34027 0.06657 -5.111 2.73e-06 ***
season()year2 -34.65973 3.96832 -8.734 9.10e-13 ***
season()year3 -17.82164 4.02249 -4.430 3.45e-05 ***
season()year4 72.79641 4.02305 18.095 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
Seasonal dummies are created automatically by season(), but it helps to see them explicitly:
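A sketch using base R's model.matrix() to expand the quarter labels into dummy columns (the helper names are my own):

```r
library(fpp3)

beer_quarters <- aus_production |>
  filter(year(Quarter) >= 1992) |>
  mutate(qtr = factor(quarter(Quarter)))

# model.matrix() drops one level (Q1) as the baseline automatically,
# leaving an intercept plus m - 1 = 3 dummy columns
mm <- model.matrix(~ qtr, data = beer_quarters)
head(mm)
```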
The dummy variable trap: if you create dummies for all m seasons and include an intercept, you have perfect multicollinearity. Always use m - 1 dummies. TSLM() handles this automatically.
There is a notable outlier in Q2 1994. We can handle it with an intervention variable.
Dummy variables are not limited to capturing seasonality. They can capture specific events:
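A sketch of how the two intervention variables in the output below could be constructed (the exact original construction may differ):

```r
library(fpp3)

recent_beer <- aus_production |>
  filter(year(Quarter) >= 1992) |>
  mutate(
    # one-off spike dummy: 1 only in 1994 Q2
    spike_Q2_1994 = as.integer(Quarter == yearquarter("1994 Q2")),
    # level-shift dummy: 1 from 2000 Q1 onward
    level_2000 = as.integer(year(Quarter) >= 2000)
  )

fit_interv <- recent_beer |>
  model(TSLM(Beer ~ trend() + season() + spike_Q2_1994 + level_2000))

fit_interv |> report()
```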
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.7196 -6.9594 -0.7748 8.0050 21.5292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 442.6530 3.8050 116.336 < 2e-16 ***
trend() -0.3730 0.1285 -2.902 0.00501 **
season()year2 -33.3328 3.9745 -8.387 4.84e-12 ***
season()year3 -17.8072 3.9716 -4.484 2.95e-05 ***
season()year4 72.8436 3.9773 18.315 < 2e-16 ***
spike_Q2_1994 -24.5904 12.5547 -1.959 0.05432 .
level_2000 0.6182 5.5264 0.112 0.91127
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.07 on 67 degrees of freedom
Multiple R-squared: 0.9284, Adjusted R-squared: 0.922
F-statistic: 144.9 on 6 and 67 DF, p-value: < 2.22e-16
Intervention variables in TSLM() are the same mechanism used in Module 4 for outlier handling — they give the model a way to account for known anomalies without distorting the other coefficient estimates.
We established earlier that linear in the parameters does not mean a straight line. Three practically useful forms: a linear trend, an exponential trend (obtained by regressing \log y_t on t), and a piecewise linear trend.
When the trend changes direction or slope at one or more points, a piecewise linear model (also called a broken stick) can be appropriate.
We use the Boston Marathon winning times as an example — a series with clear structural breaks.
boston_fit <- boston_men |>
model(
linear = TSLM(Minutes ~ trend()),
exponential = TSLM(log(Minutes) ~ trend()),
piecewise = TSLM(Minutes ~ trend(knots = c(1940, 1980)))
)
boston_fc <- boston_fit |>
forecast(h = 10)
boston_men |>
autoplot(Minutes, color = "grey60") +
geom_line(data = fitted(boston_fit),
aes(y = .fitted, color = .model)) +
autolayer(boston_fc, alpha = 0.4, level = 95) +
labs(title = "Boston Marathon: three functional forms",
y = "Minutes", x = NULL, color = "Model") +
theme(legend.position = "top")

The knots argument to trend() specifies the breakpoints where the slope is allowed to change.
The piecewise model fits the historical data well, but notice what happens to the forecasts — particularly the linear and piecewise models. Always inspect the forecast, not just the in-sample fit. A model can fit the past perfectly and produce absurd predictions for the future.
