Every model we have built so far shares one fundamental characteristic: it is univariate. Whether it was STL decomposition, ETS, or ARIMA, we only looked inward — we used the history of y_t itself to forecast its future.
That approach has taken us surprisingly far. But it ignores the outside world entirely.
mexretail) reflects consumer purchasing power, employment, and credit conditions — none of which appear in the series itself.The question we are addressing now is: what if we gave our model access to external information?
The simplest case is the simple linear regression model:
y_t = \beta_0 + \beta_1 x_t + \varepsilon_t
where:
How are the \beta’s estimated? Ordinary Least Squares (OLS)
We cannot observe the \beta’s directly — we estimate them from data. The Ordinary Least Squares (OLS) method chooses the estimates \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k that minimize the sum of squared residuals:
\min_{\hat{\boldsymbol{\beta}}} \sum_{t=1}^{T} \hat{\varepsilon}_t^2 = \sum_{t=1}^{T} \left(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{1t} - \cdots - \hat{\beta}_k x_{kt}\right)^2
For the simple case (k = 1), taking partial derivatives and setting them to zero yields closed-form solutions:
\hat{\beta}_1 = \frac{\sum x_t y_t - T\bar{x}\bar{y}}{\sum x_t^2 - T\bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
For the general case with k predictors, the solution extends naturally to matrix form:
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
where \mathbf{X} is the design matrix — a column of ones (for the intercept) plus one column per predictor — and \mathbf{y} is the vector of observed responses.
In practice, R handles all of this internally. But knowing the objective function matters: OLS penalizes large residuals more than small ones (because of the square), and under the classical assumptions the Gauss-Markov theorem guarantees the estimates are BLUE (Best Linear Unbiased Estimators) — regardless of whether we have one predictor or twenty.
Terminology: many names for y and x
The same concept appears under different names depending on the discipline:
| y (forecast variable) | x (predictor variables) |
|---|---|
| Dependent | Independent |
| Explained | Explanatory |
| Regressand | Regressor |
| Response | Stimulus / Covariate |
| Endogenous | Exogenous |
In time series forecasting, FPP3 uses forecast variable and predictor variables.
The origin of “regression” — Galton, 1886
The term was coined by Francis Galton while studying the relationship between parents’ height and their children’s height.
His finding: tall parents tend to have tall children, but on average their children are not as tall as them. Short parents tend to have children taller than themselves. There is a tendency to regress toward the mean.
Without this regression to the mean, the distribution of heights across generations would diverge — we would eventually have people of Hobbit stature and people of giant stature, with nothing in between.
Modern regression analysis generalizes this idea: studying the dependence of one variable on one or more others to predict its average value.
“A statistical relationship, however strong and suggestive, can never establish causal connexion: our ideas of causality must come from outside statistics, ultimately from some theory or other.”
— Kendall & Stuart (1961)
\text{Regression} \neq \text{Causality} \qquad \text{Correlation} \neq \text{Causality}
This is not a technicality — it is one of the most practically important ideas in data science:
Important
A regression model would happily fit a line through any of those charts. The model does not know the relationship is nonsense — you have to know. Causal direction requires theory, domain knowledge, or experimental design, not statistical strength alone.
Are these models linear?
Linearity in the variables — E(y \mid x) is a linear function of x. This is the restrictive sense: only straight lines.
Linearity in the parameters — E(y \mid x) is linear in the \beta’s.
The three models above are all nonlinear in x but linear in \beta. A linear regression model can produce a straight line, a parabola, an exponential curve, or a piecewise function, depending on the functional form chosen.
us_changeWe will use us_change — quarterly percentage changes in US macroeconomic variables. Our forecast variable is Consumption.
Before fitting any model, we look at the data.
The scatter matrix shows correlations between all pairs of variables. Notice that some predictors are correlated with each other — we will return to this when we discuss multicollinearity.
We start with the simplest possible model: Consumption as a function of Income only.
y_{t, \text{Consumption}} = \beta_0 + \beta_1 x_{t, \text{Income}} + \varepsilon_t
In fable, time series linear models use TSLM():
us_change tsibble.
TSLM() fits a Time Series Linear Model using OLS. The formula syntax is identical to lm().
report() prints the full regression output.
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-2.58236 -0.27777 0.01862 0.32330 1.42229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.54454 0.05403 10.079 < 2e-16 ***
Income 0.27183 0.04673 5.817 2.4e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08
The output has two main parts:
Individual significance — each coefficient has a t-test: H_0: \beta_i = 0 \qquad H_1: \beta_i \neq 0 The stars (*, **, ***) indicate p < 0.05, p < 0.01, p < 0.001.
Joint significance — the F-test asks whether any predictor is useful: H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 A significant F-test does not mean all individual coefficients are significant.
The estimated model is: \hat{y}_{t, \text{Consumption}} = 0.545 + 0.272 \cdot x_{t, \text{Income}}
For every 1 percentage point increase in income growth, consumption growth increases by 0.272 percentage points on average.
OLS produces BLUE estimators only if the classical assumptions hold. The one most frequently violated in time series is no autocorrelation in the errors:
\text{cov}(\varepsilon_t, \varepsilon_s) = 0 \quad \forall \; t \neq s
If residuals are autocorrelated, OLS estimates are still unbiased but no longer efficient — standard errors are wrong, p-values are wrong, and forecast intervals are wrong. This is why we always check the residuals.
The residuals show clear autocorrelation. The simple model is not capturing all the dynamics of consumption. Let’s improve it.
In practice, one predictor is rarely sufficient. The multiple linear regression model is:
y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_k x_{kt} + \varepsilon_t
For us_change, a natural extension is:
y_{t, \text{Cons.}} = \beta_0 + \beta_1 x_{t, \text{Inc.}} + \beta_2 x_{t, \text{Prod.}} + \beta_3 x_{t, \text{Sav.}} + \beta_4 x_{t, \text{Unemp.}} + \varepsilon_t
+.
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-0.90555 -0.15821 -0.03608 0.13618 1.15471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
Income 0.740583 0.040115 18.461 < 2e-16 ***
Production 0.047173 0.023142 2.038 0.0429 *
Savings -0.052890 0.002924 -18.088 < 2e-16 ***
Unemployment -0.174685 0.095511 -1.829 0.0689 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
The multiple model substantially improves fit (\bar{R}^2 goes from ~0.15 to ~0.75) and the residuals look much closer to white noise.
The residuals vs. predictors plot is a useful diagnostic specific to multiple regression. If any panel shows a systematic pattern (curve, fan shape), it suggests a missing nonlinear term or interaction involving that predictor.
With k potential predictors, there are 2^k possible models. We need a principled way to choose among them.
Do not use in-sample R^2 for selection. Adding any predictor, even a random one, increases R^2. We need measures that penalize model complexity.
The three standard criteria available from glance() in fable:
Fit all possible models and compare:
us_change_fit_select <- us_change |>
model(
m1 = TSLM(Consumption ~ Income),
m2 = TSLM(Consumption ~ Income + Production),
m3 = TSLM(Consumption ~ Income + Savings),
m4 = TSLM(Consumption ~ Income + Unemployment),
m5 = TSLM(Consumption ~ Income + Production + Savings),
m6 = TSLM(Consumption ~ Income + Production + Unemployment),
m7 = TSLM(Consumption ~ Income + Savings + Unemployment),
m8 = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
)
us_change_fit_select |>
glance() |>
select(.model, adj_r_squared, AIC, AICc, BIC) |>
arrange(AICc)Start with the full model and remove predictors one at a time:
us_change |>
model(
full = TSLM(Consumption ~ Income + Production + Savings + Unemployment),
drop1 = TSLM(Consumption ~ Income + Production + Savings),
drop2 = TSLM(Consumption ~ Income + Production + Unemployment),
drop3 = TSLM(Consumption ~ Income + Savings + Unemployment),
drop4 = TSLM(Consumption ~ Production + Savings + Unemployment)
) |>
glance() |>
select(.model, adj_r_squared, AICc, BIC) |>
arrange(AICc)Start with the intercept-only model and add predictors one at a time:
# Step 2: add a second predictor to the best single-predictor model
us_change |>
model(
f1 = TSLM(Consumption ~ Savings),
f12 = TSLM(Consumption ~ Savings + Income),
f13 = TSLM(Consumption ~ Savings + Production),
f14 = TSLM(Consumption ~ Savings + Unemployment)
) |>
glance() |>
select(.model, AICc) |>
arrange(AICc)AICc vs. BIC: which one to use?
Both penalize model complexity, but differently:
In forecasting contexts, AICc is generally preferred because we care more about predictive accuracy than model parsimony per se. When they disagree, consider whether interpretability or predictive accuracy is the primary goal.
Predictor selection and multicollinearity are closely linked. Even after running all-subsets or stepwise search, the selected model may still include predictors that are highly correlated with each other — and that creates problems that no information criterion will automatically flag.
When two or more predictors are highly correlated, we have multicollinearity. It does not bias the OLS estimates, but it inflates the variance of the coefficients, making them unstable and hard to interpret.
The Variance Inflation Factor (VIF) measures how much the variance of \hat{\beta}_j is inflated by its correlation with the other predictors:
\text{VIF}_j = \frac{1}{1 - R_j^2}
where R_j^2 is the R^2 obtained by regressing x_j on all the other predictors — a measure of how well one predictor can be linearly predicted by the rest.
A quick visual check: if the ggpairs scatter matrix shows strong correlations between predictors (not between a predictor and y), multicollinearity is likely present. The VIF quantifies exactly how severe that problem is.
The answer depends critically on your goal.
The one scenario where multicollinearity does hurt forecasting is if the correlation structure among predictors breaks down in the future — i.e., predictors that moved together historically start moving independently. In that case, the model never learned their individual effects well enough to adapt.
Before we can forecast y_t with regression, we need to ask: what do we do with the predictors?
With univariate models, forecast(h = n) was sufficient — the model only needed its own history. Now the model needs future values of x_t to produce a forecast of y_t.
What changes across the three forecasting types is precisely what values we assign to the predictors:
Ex-ante forecasts use only information available at the time the forecast is made. The predictors must themselves be forecasted.
This is the “true” forecast — it is what you would actually produce in a real deployment.
us_change_future <- us_change |>
pivot_longer(-c(Quarter, Consumption), names_to = "variable") |>
model(arima = ARIMA(value)) |>
forecast(h = 8) |>
as_tsibble() |>
select(-c(.model, value)) |>
pivot_wider(names_from = variable, values_from = .mean)
us_change_fit_mult |>
forecast(new_data = us_change_future) |>
autoplot(us_change |> filter_index("2010 Q1" ~ .)) +
labs(title = "Ex-ante forecast: US Consumption",
y = "% change", x = NULL)variable.
model() call fits an ARIMA to each predictor automatically.
forecast() produces 8-step-ahead forecasts for all predictors simultaneously.
forecast() when passed as new_data.
Ex-post forecasts use actual realized values of the predictors. The forecast variable y_t is still unknown, but we feed in the true future x_t values.
These are not “real” forecasts — they are used to evaluate the regression model in isolation, removing predictor forecast error from the evaluation.
# Use filter_index to create a train/test split
us_change_train <- us_change |>
filter_index(. ~ "2017 Q4")
us_change_test <- us_change |>
filter_index("2018 Q1" ~ .)
us_change_fit_expost <- us_change_train |>
model(
multiple = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
)
# Ex-post: we supply actual future predictor values
us_change_fit_expost |>
forecast(new_data = us_change_test) |>
autoplot(us_change |> filter_index("2014 Q1" ~ .)) +
labs(title = "Ex-post forecast: actual predictor values used",
y = "% change", x = NULL)Scenario-based forecasts construct hypothetical future values for the predictors and ask: what would consumption look like under each scenario?
These are especially useful for stress-testing and strategic planning.
us_change_fit_scen <- us_change |>
model(TSLM(Consumption ~ Income + Savings + Unemployment))
us_change_scenarios <- scenarios(
optimistic = new_data(us_change, 4) |>
mutate(Income = c(0.5, 0.8, 0.6, 1.0),
Savings = c(0.1, -0.2, 0.1, -0.1),
Unemployment = -0.1),
pessimistic = new_data(us_change, 4) |>
mutate(Income = -0.5,
Savings = -0.4,
Unemployment = 0.2),
names_to = "Scenario"
)
us_change |>
autoplot(Consumption) +
autolayer(forecast(us_change_fit_scen, new_data = us_change_scenarios)) +
labs(title = "Scenario-based forecast: US Consumption",
y = "% change", x = NULL)names_to labels each scenario in the output, making it easy to distinguish them in the plot.
autolayer() overlays the scenario forecasts on the historical series without requiring matching key structures.
We have seen regression with exogenous variables. But can TSLM() also handle the patterns we identified in Module 1 — trend and seasonality?
Yes — through specially constructed predictors built directly into the formula.
We will illustrate with quarterly beer production in Australia.
To capture seasonality in a regression, we use dummy variables — binary indicators that take the value 1 for a specific season and 0 otherwise.
This is analogous to a cross-sectional dummy: to distinguish male/female, you only need one binary variable — one category is always implicit in the intercept.
The dummy variable trap: if you create dummies for all m seasons and include an intercept, you have perfect multicollinearity. Always use m - 1 dummies. TSLM() handles this automatically with season().
TSLM() provides two built-in terms:
trend() — a deterministic linear trend: \beta_1 t where t = 1, 2, \ldots, T.season() — m - 1 seasonal dummy variables, with the first season as baseline.trend() adds a linear time index; season() adds quarterly dummies automatically.
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.9029 -7.5995 -0.4594 7.9908 21.7895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 441.80044 3.73353 118.333 < 2e-16 ***
trend() -0.34027 0.06657 -5.111 2.73e-06 ***
season()year2 -34.65973 3.96832 -8.734 9.10e-13 ***
season()year3 -17.82164 4.02249 -4.430 3.45e-05 ***
season()year4 72.79641 4.02305 18.095 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
What about more complex seasonality? Fourier terms
For series with multiple seasonal periods or very long seasons (daily, hourly data), season() becomes impractical — the number of dummies grows too large. An alternative is Fourier terms: pairs of sine and cosine waves that approximate the seasonal pattern with far fewer parameters.
TSLM() supports this via fourier(K), where K controls how many Fourier pairs are included. We will cover this properly in Module 4 when we deal with complex seasonality.
There is a notable outlier in Q2 1994. We can detect it formally and then handle it with an intervention variable.
Rather than eyeballing the residual plot, we can detect outliers formally — the same way we did in the practical issues document. The key idea is that applying the IQR rule directly to the raw series does not work in time series: a series with trend has naturally increasing values, so later observations always look “large” even when they are perfectly normal.
The fix: decompose first, then apply IQR to the remainder. For beer, which has strong known seasonality, we use a full STL decomposition — including the seasonal component — so only genuine anomalies end up in the remainder1.
beer_dcmp <- beer |>
model(
stl = STL(Beer ~ trend() + season(), robust = TRUE)
) |>
components()
beer_outliers <- beer_dcmp |>
select(Quarter, remainder) |>
mutate(
Q1 = quantile(remainder, 0.25),
Q3 = quantile(remainder, 0.75),
IQR = Q3 - Q1,
lower_15 = Q1 - 1.5 * IQR,
upper_15 = Q3 + 1.5 * IQR,
lower_3 = Q1 - 3 * IQR,
upper_3 = Q3 + 3 * IQR,
outlier_15 = remainder < lower_15 | remainder > upper_15,
outlier_3 = remainder < lower_3 | remainder > upper_3
)
beer_outliers |>
filter(outlier_15) |>
select(Quarter, remainder, outlier_15, outlier_3)beer, this prevents normal seasonal variation from being flagged as anomalies.
In regression, rather than removing the outlier and imputing, we model it explicitly with a dummy variable. This keeps all observations and lets the model account for the anomaly without distorting the other coefficients.
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.7196 -6.9594 -0.7748 8.0050 21.5292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 442.6530 3.8050 116.336 < 2e-16 ***
trend() -0.3730 0.1285 -2.902 0.00501 **
season()year2 -33.3328 3.9745 -8.387 4.84e-12 ***
season()year3 -17.8072 3.9716 -4.484 2.95e-05 ***
season()year4 72.8436 3.9773 18.315 < 2e-16 ***
spike_Q2_1994 -24.5904 12.5547 -1.959 0.05432 .
level_2000 0.6182 5.5264 0.112 0.91127
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.07 on 67 degrees of freedom
Multiple R-squared: 0.9284, Adjusted R-squared: 0.922
F-statistic: 144.9 on 6 and 67 DF, p-value: < 2.22e-16
We will revisit outlier handling in Module 4, where we cover robust methods and automatic outlier detection. Intervention variables are the regression-world equivalent of those techniques — explicit rather than automatic.
We established earlier that linear in the parameters does not mean a straight line. Here are the three most practically useful nonlinear forms — all estimable by OLS.
What changes relative to a standard linear regression is not the estimation method — it is the interpretation of the coefficients.
| Model | Interpretation of \hat{\beta}_1 |
|---|---|
| Lin-lin | 1-unit \uparrow x → \hat{\beta}_1-unit \Delta y |
| Log-log | 1% \uparrow x → \hat{\beta}_1% \Delta y (elasticity) |
| Log-lin | 1-unit \uparrow x → 100\hat{\beta}_1% \Delta y |
| Lin-log | 1% \uparrow x → \hat{\beta}_1/100-unit \Delta y |
What about other transformations?
These interpretations hold cleanly only for the natural logarithm (\ln). Two common alternatives:
Logarithm in a different base (e.g., \log_{10}): the coefficient is scaled by \ln(b) where b is the base. The elasticity interpretation is preserved in shape but the numeric value changes — an unnecessary complication. Stick with \ln.
Box-Cox (\lambda \neq 0, 1): the response is on the scale \frac{y^\lambda - 1}{\lambda}, which has no clean verbal interpretation. Box-Cox is better used as a pre-processing transformation to stabilize variance before modeling (as in Module 1) than inside a regression model where you need to interpret the coefficients.
When the trend changes slope at one or more points in time, a piecewise linear model fits separate linear segments joined at knots.
We use the Boston Marathon winning times — a series with clear structural breaks.
What does a single linear trend produce on this data?
boston_fit_linear <- boston_men |>
model(linear = TSLM(Minutes ~ trend()))
boston_men |>
autoplot(Minutes, color = "grey60") +
geom_line(
data = fitted(boston_fit_linear),
aes(y = .fitted), color = "#0072B2"
) +
autolayer(forecast(boston_fit_linear, h = 10),
alpha = 0.5, level = 95) +
labs(title = "Boston Marathon: linear trend",
y = "Minutes", x = NULL)The forecast predicts times will keep falling indefinitely — eventually reaching zero. A single linear trend is the wrong functional form here.
boston_fit <- boston_men |>
model(
linear = TSLM(Minutes ~ trend()),
exponential = TSLM(log(Minutes) ~ trend()),
piecewise = TSLM(Minutes ~ trend(knots = c(1940, 1980)))
)
boston_men |>
autoplot(Minutes, color = "grey60") +
geom_line(
data = fitted(boston_fit),
aes(y = .fitted, color = .model)
) +
autolayer(forecast(boston_fit, h = 10),
alpha = 0.4, level = 95) +
labs(title = "Boston Marathon: three functional forms",
y = "Minutes", x = NULL, color = "Model") +
theme(legend.position = "top")knots specify the years where the slope is allowed to change.
A key limitation of piecewise regression: we must choose the knots ourselves. Different choices can produce very different forecasts, even when in-sample fit looks similar.
boston_fit_knots <- boston_men |>
model(
k1 = TSLM(Minutes ~ trend(knots = c(1940, 1980))),
k2 = TSLM(Minutes ~ trend(knots = c(1915, 1950))),
k3 = TSLM(Minutes ~ trend(knots = c(1950, 1970, 1990)))
)
boston_men |>
autoplot(Minutes, color = "grey60") +
geom_line(
data = fitted(boston_fit_knots),
aes(y = .fitted, color = .model)
) +
autolayer(forecast(boston_fit_knots, h = 10),
alpha = 0.4, level = 95) +
labs(title = "Different knots → very different forecasts",
y = "Minutes", x = NULL, color = "Model") +
theme(legend.position = "top")These models fit the historical data similarly well — but their forecasts diverge dramatically. Always inspect the forecast, not just the in-sample fit. This is one motivation for Prophet (Module 4), which detects changepoints automatically from the data rather than requiring manual specification.
| Model | When to use |
|---|---|
| Simple TSLM | One predictor; baseline |
| Multiple TSLM | Several predictors; check VIF |
trend() + season() |
Deterministic patterns |
| Interventions | Known outliers / breaks |
| Log-log | Elasticity relationships |
| Log-lin | Exponential growth |
| Piecewise | Known structural breaks |
Key takeaways:
Coming up in 3.3: Dynamic regression — what happens when we let the error term \varepsilon_t follow an ARIMA process instead of assuming white noise. This directly addresses the residual autocorrelation we saw in the simple model.
When seasonality is weak or absent, season(period = 1) can be used to extract only the trend. See Practical Forecasting Issues for the full workflow including NA imputation.

Time Series Forecasting