Backtesting — overview
Case study — what does a polished backtest look like in practice?
In-sample tests — how do we know a predictor matters on the historical sample?
Out-of-sample forecasts — does the predictor still work outside the estimation window?
Backtesting is the standard quantitative-finance workflow for evaluating a strategy idea: take a historical dataset, simulate trading the strategy as if you had been deploying it in real time, and measure the resulting risk-adjusted return.
The fundamental challenge is data snooping. If you tweak the strategy parameters and re-run the backtest until performance looks good, you’re optimising on a single sample and the apparent edge will not survive in live trading. The whole field of empirical asset pricing has spent two decades developing tools to detect and correct for this problem; today’s lecture introduces the most basic and most important one — out-of-sample evaluation.
The three-part structure (case study → in-sample tests → out-of-sample tests) starts concrete and gets more abstract: the SRA case study shows what good backtesting output looks like; the in-sample slide shows the standard regression test; the out-of-sample slide shows why in-sample alone is not enough.
In-sample tests — general setup
\[
r_{t,t+h} \;=\; a + b\,X_t + u_t
\]
\(r\) — log excess return (\(=\) asset return − risk-free rate)
\(X\) — vector of predictor(s)
\(t\) — end of period; \(h\) — forecast horizon (e.g. 6 months)
Example.
\(X_t\) = Dividend yield as of 31.12.2020
\(r_{t,t+6}\) = excess return for 31.12.2020 → 30.06.2021
This is the workhorse predictive regression of empirical asset pricing: regress future returns on a predictor observed today. The slope \(b\) is the effect we care about — if it’s significantly non-zero (and the right sign), the predictor appears to forecast returns.
Two technical features that matter:
Excess returns (\(r\) = asset return − risk-free rate). The risk-free rate is what an investor could earn without bearing market risk; subtracting it isolates the compensation for risk. All academic predictability work uses excess returns; raw-return predictability is mostly mechanical (real interest rates correlate with macro variables that correlate with returns).
Forecast horizon \(h\). With \(h = 1\) month and monthly data you have \(T\) non-overlapping observations and standard t-statistics work. With \(h = 12\) months and monthly data you have heavily overlapping observations (the regressions at \(t\) and \(t+1\) share 11 months of returns), which inflates t-statistics enormously if you ignore the overlap. The bootstrap procedure on the next-but-one slide is the Welch and Goyal (2008) correction for this.
In practice the equation is run as lm(future_return ~ predictor) in R; the slope, t-statistic, and adjusted R² are the headline outputs. We’ll come back to this regression form repeatedly in Lectures 3 and 4 with progressively more sophisticated estimators.
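A minimal R sketch of that regression, with simulated data standing in for a real predictor and return series (the persistence parameter, sample size, and coefficient values below are all illustrative):

```r
set.seed(1)
n  <- 600                                        # months of data
dy <- as.numeric(arima.sim(list(ar = 0.98), n))  # persistent predictor, e.g. dividend yield
ret_fwd <- 0.01 * dy + rnorm(n, sd = 0.05)       # simulated h-period-ahead excess return,
                                                 # aligned so row t pairs X_t with r_{t,t+h}

fit <- lm(ret_fwd ~ dy)   # the predictive regression
summary(fit)              # slope b, t-statistic, adjusted R^2 (the headline outputs)
```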
In-sample tests — properties
In-sample OLS estimators have high power (when the model specification is valid).
Results often depend heavily on sample period (start and end dates).
Inference is tricky, especially with multi-period excess returns and persistent predictors.
“In-sample OLS estimators have high power” means: if a predictor genuinely affects returns, the in-sample regression will reliably detect it. The OLS coefficient is consistent and the standard t-statistic asymptotically delivers the right rejection rate.
The two caveats are what make in-sample tests dangerous in practice:
Sample-period sensitivity. Predictors that look strong on 1947–1990 may be weak on 1947–2024 (or vice versa). The empirical asset-pricing literature has repeatedly found that “anomalies” weaken or disappear after publication, partly because of arbitrage activity but also because the original sample may have been atypical. Always report what your result looks like on a hold-out sub-period.
Inference is tricky. When the predictor is highly persistent (e.g., dividend yield with autocorrelation > 0.95) and the dependent variable involves overlapping returns, the t-statistic from lm() is badly miscalibrated — it can suggest p < 0.01 when the true p-value is 0.20. The Welch-Goyal bootstrap (next slide) and the Newey-West / Hodrick standard errors are the standard corrections.
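A sketch of the standard-error correction, assuming the sandwich and lmtest packages: with \(h\)-month overlapping returns the errors are autocorrelated up to lag \(h-1\), so that is the usual lag choice for Newey-West. The simulated data and names below are illustrative:

```r
library(sandwich)   # NeweyWest(): HAC covariance estimator
library(lmtest)     # coeftest(): redo inference with a supplied covariance matrix

set.seed(1)
n <- 600; h <- 12
x  <- as.numeric(arima.sim(list(ar = 0.98), n))  # persistent predictor
r1 <- 0.01 * x + rnorm(n, sd = 0.05)             # one-month excess returns

# h-month overlapping return r_{t,t+h}: sum of the next h one-month returns
r_h <- sapply(seq_len(n - h), function(t) sum(r1[(t + 1):(t + h)]))
x_t <- x[seq_len(n - h)]

fit <- lm(r_h ~ x_t)
summary(fit)$coefficients   # naive OLS t-statistic (overstated under overlap)
coeftest(fit, vcov. = NeweyWest(fit, lag = h - 1, prewhite = FALSE))
```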
These issues are why a positive in-sample t-statistic is not the end of the story. The out-of-sample test on the slide after next is the much more credible robustness check.
Welch & Goyal (2008) — bootstrap inference
Bootstrap procedure in Welch and Goyal (2008) follows Mark (1995) and Kilian (1999).
Construct 10 000 bootstrapped time series by drawing residuals with replacement.
Initial observation: pick one date from the actual data at random.
The procedure preserves the autocorrelation structure of the predictor.
The Welch-Goyal bootstrap (Welch and Goyal 2008), building on Mark (1995) and Kilian (1999), is a residual-resampling procedure that simulates many counterfactual histories of the predictor-and-return system. The crucial design choice: instead of resampling observations independently (which would destroy the autocorrelation structure of a persistent predictor), the bootstrap resamples residuals and reconstructs the time series with the original AR dynamics. That preserves the persistence and gives correctly sized inference.
Operationally (a minimal R sketch follows the list):
Fit r_t = a + b X_t + u_t on the actual data; estimate a, b, residuals u_t.
Fit X_t = c + d X_{t-1} + v_t (or a higher AR order); estimate c, d, residuals v_t.
Under the null (\(b = 0\)): generate a counterfactual return series as r*_t = a + 0 × X_t + u*_t where u*_t is a random draw with replacement from the actual residuals; generate a counterfactual predictor from the AR equation and resampled v*_t.
Repeat 10 000 times. The distribution of bootstrapped t-statistics is the null distribution that the actual t-statistic should be compared to.
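A minimal sketch of those steps in R, simplified to an AR(1) predictor and non-overlapping one-period returns (WG08 also handle longer horizons); the data are simulated, and the replication count is reduced from 10 000 for speed:

```r
set.seed(1)
n <- 600
x <- as.numeric(arima.sim(list(ar = 0.95), n))   # persistent predictor
r <- rnorm(n, mean = 0.005, sd = 0.05)           # returns with no true predictability

# Steps 1-2: predictive regression and AR(1) for the predictor, on actual data
fit_r <- lm(r[-1] ~ x[-n]); a <- coef(fit_r)[1]; u <- resid(fit_r)
fit_x <- lm(x[-1] ~ x[-n]); cd <- coef(fit_x);   v <- resid(fit_x)
t_actual <- summary(fit_r)$coefficients[2, "t value"]

# Steps 3-4: rebuild counterfactual histories under the null b = 0
t_boot <- replicate(2000, {                  # WG08 use 10 000 draws
  idx <- sample.int(n - 1, replace = TRUE)   # residual draws, with replacement
  x_star <- numeric(n)
  x_star[1] <- x[sample.int(n, 1)]           # initial observation: a random actual date
  for (t in 2:n)                             # AR recursion preserves the persistence
    x_star[t] <- cd[1] + cd[2] * x_star[t - 1] + v[idx[t - 1]]
  r_star <- a + u[idx]                       # null returns: b = 0, resampled residuals
  xs <- x_star[-n]
  summary(lm(r_star ~ xs))$coefficients[2, "t value"]
})

# Bootstrap p-value: how extreme is the actual t-statistic under the null?
mean(abs(t_boot) >= abs(t_actual))
```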
The result: a much more conservative p-value than the OLS t-statistic. Many predictors that look highly significant in the standard regression turn out to have only marginal significance under the bootstrap — and a few that look marginal under OLS turn out to be significant. The next slide reports the bootstrap-corrected results from WG08.
In-sample results — Welch and Goyal (2008)
In-sample adjusted \(R^2\) over the full sample — significant predictors (* = p<0.10, ** = p<0.05, *** = p<0.01):
Predictor   Description             Sample       Adj. R² (%)
b/m         Book-to-market          1921–2005          3.20*
i/k         Investment / capital    1947–2005          6.63**
ntis        Net equity expansion    1927–2005          8.15***
eqis        Pct equity issuing      1927–2005          9.15***
all         Kitchen sink            1927–2005         13.81**
Many other predictors have negative adjusted R² — see Welch and Goyal (2008).
The headline numbers from Welch and Goyal’s exhaustive in-sample analysis (Welch and Goyal 2008). Five predictors stand out as having both economically meaningful and statistically significant in-sample adjusted R²:
b/m (book-to-market) — value-vs-growth signal.
i/k (investment / capital) — corporate investment rate as a state variable.
ntis (net equity expansion) — net new equity issuance scaled by market cap.
eqis (percent equity issuing activity) — closely related to ntis.
kitchen sink — all predictors stacked into one regression.
R² of 3–14 % over multi-year horizons is large by financial-economics standards (single-factor models on monthly returns explain 1–2 %). The kitchen-sink R² of 13.81 % is suggestively high — but it’s also where the multiple-testing concern is most acute. Putting many predictors in the same regression all but guarantees one will look significant by chance.
The footnote — “many other predictors have negative adjusted R²” — is the silent companion. The full WG08 sample includes roughly 14 predictors; only the five listed clear the bar, and the rest fail. This is the core “publication bias” concern: papers tend to publish the predictors that worked, leaving the failures unreported.
The next slide pulls the rug out from under most of these in-sample wins by showing the out-of-sample picture.
Out-of-sample tests — overview
OOS = robustness check for in-sample estimation.
Predictive performance is evaluated on data outside the training window.
Walk-forward: estimate parameters \(\hat a_{t-1}, \hat b_{t-1}\) using data through \(t-1\); forecast \(\hat r_t = \hat a_{t-1} + \hat b_{t-1} X_{t-1}\); advance one step; repeat.
Walk-forward (also called expanding-window or rolling-window) backtesting is the discipline of estimating the model only on data the strategy could plausibly have known at the time. At each forecast date \(t\), you re-fit the model on data ending at \(t-1\) and forecast \(\hat r_t\) using that just-fitted model. No future information leaks into past forecasts.
Why it matters: the alternative — fit the model on the full sample, then evaluate the predictions — uses information from the future to estimate the model parameters that produce the past forecasts. That’s look-ahead bias and it inflates apparent forecast accuracy. Empirical-finance papers that don’t use walk-forward (or an explicit train/test split) are routinely overconfident.
There are two flavours of walk-forward:
Expanding window — re-fit on all data through \(t-1\) each step. Uses more data but recent observations are diluted.
Rolling window — re-fit on a fixed-size window (e.g., last 60 months). More responsive to regime change; smaller sample per fit.
For the project, expanding window is the simpler default; if your indicator has time-varying behaviour you may want to switch to rolling.
Implementation in R: it’s mostly a loop over time indices, with lm() re-fit on each iteration (a minimal sketch of the plain-loop version follows). The slider and tsibble packages have helpers; for production work, the tidymodels framework’s rsample::rolling_origin() automates the window management.
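An expanding-window sketch in base R, with simulated data standing in for a real predictor series; the burn-in length and all parameter values are illustrative:

```r
set.seed(1)
n <- 600
x <- as.numeric(arima.sim(list(ar = 0.98), n))       # predictor observed at the end of t-1
r <- 0.005 + 0.01 * c(0, x[-n]) + rnorm(n, sd = 0.05) # return for period t, driven by x[t-1]
                                                      # (x_0 set to 0 for simplicity)
burn_in <- 120                        # reserve the first 10 years for estimation only
r_hat <- r_bar <- rep(NA_real_, n)

for (t in (burn_in + 1):n) {
  # Fit ONLY on data through t-1: pairs (X_{s-1}, r_s) for s <= t-1
  y_tr <- r[2:(t - 1)]
  x_tr <- x[1:(t - 2)]
  fit  <- lm(y_tr ~ x_tr)
  r_hat[t] <- coef(fit)[1] + coef(fit)[2] * x[t - 1]  # model forecast of r_t
  r_bar[t] <- mean(r[1:(t - 1)])                      # historical-mean benchmark
}
```

For the rolling-window flavour, replace the training indices with a fixed window, e.g. y_tr <- r[(t - 60):(t - 1)] and the matching x_tr <- x[(t - 61):(t - 2)].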
Out-of-sample R²
Compare model forecast \(\hat r_t\) vs realisation \(r_t\), and historical-mean forecast \(\bar r_t\) vs realisation:
\[
R^2_{OS} \;=\; 1 \;-\; \dfrac{\sum_{t=1}^T (r_t - \hat r_t)^2}{\sum_{t=1}^T (r_t - \bar r_t)^2}
\]
A positive \(R^2_{OS}\) means the model has lower prediction error than the historical mean.
The out-of-sample R² benchmarks the model forecast against the historical mean. The numerator is the sum of squared errors of the model forecast; the denominator is the sum of squared errors of the simplest possible alternative — predicting next period’s return as the average return seen so far.
Why this benchmark? Because the historical mean is almost impossible to beat in equity-premium prediction. Stock returns are heavily noise-dominated; a model that just guesses the long-run average gets you most of the way there. Beating that benchmark by even a small amount over thousands of periods is meaningful evidence of genuine forecasting ability.
Reading the formula: \(R^2_{OS} > 0\) means lower forecast error than the historical mean (good). \(R^2_{OS} < 0\) means worse than just predicting the average, which is a damning result — your model is not adding value, it is destroying it.
The \(R^2_{OS}\) measure is from Welch and Goyal (2008) and is now standard in the equity-premium-prediction literature. The strict comparison to a simple benchmark is what makes it a credible measure — easier benchmarks (zero, or last period’s value) would let weaker models look good.
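Continuing the walk-forward sketch above (r, r_hat, and r_bar from that loop), the formula is one line of R:

```r
ok <- !is.na(r_hat)   # evaluate only after the burn-in period
R2_OS <- 1 - sum((r[ok] - r_hat[ok])^2) / sum((r[ok] - r_bar[ok])^2)
R2_OS                 # positive: the model beats the historical-mean benchmark
```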
Out-of-sample results — Welch and Goyal (2008)
In WG08, only one predictor (eqis) remains significant out-of-sample:
Predictor   Description           OOS R² (%)
eqis        Pct equity issuing         2.04**
All other predictors deliver negative OOS R² — including the kitchen-sink combination at −139.03.
This is the single most important result in the equity-premium-prediction literature. Of the predictors that look statistically significant in-sample (previous slide), only one survives the out-of-sample test: eqis (percent equity issuing), with a modest 2.04 % R²_OS that is itself only marginally significant.
The kitchen-sink result — adjusted in-sample R² of 13.81, adjusted OOS R² of −139.03 — is the most damning. Stacking 14 predictors into one regression looks great in-sample (every parameter is fit to maximise in-sample fit) but is catastrophic out-of-sample (every parameter overfits to noise that doesn’t repeat). Negative R²_OS at this magnitude means the model’s predictions are dramatically worse than just predicting the average return.
The lesson is the foundational empirical-asset-pricing result: in-sample fit is almost no evidence of true predictive power. The literature has spent two decades absorbing this result; the modern best-practice empirical-finance paper now reports both IS and OOS R², with the OOS test treated as the credibility-defining number.
The implications for your project: when you build an indicator, the IS regression on its own tells you very little. The OOS R² (or equivalently, the walk-forward backtested return) is the number that matters. If your indicator has positive IS but negative OOS, you have not found a real signal.
Plots — IS vs OOS R² over time
IS line: cumulative squared demeaned equity premium − cumulative squared regression residual.
OOS line: cumulative squared error of the prevailing mean − cumulative squared error of the predictor’s regression.
Line rising ⇒ predictor gaining forecasting ability.
Line falling ⇒ historical mean predicts better.
Plotting the cumulative SSE difference over time (rather than just reporting a single R²_OS for the full sample) reveals time-variation in forecasting power that a single number hides. Three patterns to recognise:
Smoothly rising — consistent forecasting ability across the sample. Rare in equity-premium prediction; sometimes seen with macro variables in shorter samples.
Rising then falling — predictor worked historically but lost power in recent periods. Common pattern: many anomalies that worked in the 1970s–80s have weakened or vanished post-publication. The rising part of the curve flatters the full-sample R²_OS; the falling part is a warning sign about live deployment.
Mostly flat then a sudden jump — the predictor’s apparent forecasting power comes from one specific window (often a crisis or structural break). Remove that window and the predictor is useless. The next slide’s book-to-market example shows exactly this pattern around the 1974 oil shock.
For your project, plot the cumulative SSE difference rather than just reporting a single backtested return — the time-series view tells you much more about whether your indicator is robust.
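Continuing the same simulated walk-forward example (r, r_hat, r_bar), the OOS line from this slide is a running difference of cumulative squared errors; above zero and rising means the predictor is currently beating the prevailing mean:

```r
ok <- !is.na(r_hat)
sse_mean  <- cumsum((r[ok] - r_bar[ok])^2)   # prevailing-mean forecast errors
sse_model <- cumsum((r[ok] - r_hat[ok])^2)   # predictor-regression forecast errors

plot(sse_mean - sse_model, type = "l",
     xlab = "Evaluation period", ylab = "Cumulative SSE difference",
     main = "OOS: prevailing mean minus predictor model")
abline(h = 0, lty = 2)   # above the dashed line: predictor is forecasting better
```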
Useful predictors — requirements
A predictor is useful if it shows:
Both significant IS and reasonably good OOS performance over the entire sample.
A generally upward drift (irregular is fine).
Drift that is not confined to one short or unusual sample period (not just the two years around the Oil Shock).
Drift that remains positive over recent decades — predictors often lose forecasting power after publication.
The four criteria are increasingly demanding ways of guarding against the failure modes seen in the WG08 results.
(1) IS and OOS — both must work. IS-only is the trap that book-to-market and the kitchen-sink fell into.
(2) Generally upward drift — the cumulative SSE difference should be rising over the sample, even if not monotonically. A predictor that delivers most of its OOS R² in two specific years is fragile.
(3) Drift beyond one short or unusual period — relate this to (2). If the predictor only works during, say, the 2008 crisis or the 1974 oil shock, you’ve found a regime indicator, not a generic forecaster.
(4) Drift positive in recent decades — the strongest test, because it’s the closest thing to a live test you can do retrospectively. Many predictors that satisfy criteria 1–3 fail criterion 4 because the literature has caught up with them.
For your project’s indicator, demand all four. If your indicator only works on one sub-period, or only over the full historical sample but not the last decade, treat that as evidence the indicator isn’t real, not evidence the recent sample is “different”.
P-hacking / data mining
Multiple-testing fallacy — running enough backtests on a single dataset is certain to produce a result that meets any pre-specified statistical-significance threshold.
Solutions:
Paper performance — track strategy returns publicly online; this raises the bar against quietly running many strategies in parallel and reporting only the winner.
Real-money performance — gold standard; serious investors require ≥ 3 years of live, public performance (managed account or fund).
When you backtest five indicators and one “wins”, ask yourself how many silent losers it crowds out.
The multiple-testing problem in backtesting is the same statistical issue as multiple-testing in academic research, but more acute because backtests can be run cheaply: a typical practitioner can try hundreds of indicator variations in an afternoon. With α = 0.05 and 100 independent tests, you expect 5 to look “significant” by pure chance.
The standard academic correction (Bonferroni, FDR — see the Research in Finance Lecture 3 handout) is to require a tighter significance threshold when running many tests. In a practitioner setting, this typically means: if you tried 20 indicator variations to find one that works, the threshold for “this is real” should be much higher than the threshold for the first one you tried.
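A quick R illustration of both the false-positive arithmetic and the standard corrections (p.adjust is in base R); the strategies below are pure noise by construction:

```r
set.seed(1)
# 100 backtests of pure-noise "strategies": expect ~5 slopes with p < 0.05 by chance
p_vals <- replicate(100, summary(lm(rnorm(240) ~ rnorm(240)))$coefficients[2, 4])
sum(p_vals < 0.05)

# Multiple-testing corrections shrink the spurious winners back
sum(p.adjust(p_vals, method = "bonferroni") < 0.05)   # family-wise error control
sum(p.adjust(p_vals, method = "fdr") < 0.05)          # false-discovery-rate control
```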
The two solutions on the slide are the gold standards:
Paper performance — publish the strategy on a public-tracking website before deploying it. Each subsequent month of tracked performance is a genuine out-of-sample observation; after a year of strong tracked performance, the strategy has built credibility.
Real-money performance — actually deploy the strategy with capital, ideally as a managed account or fund with public reporting. This is the standard professional investors require because it forecloses the easy escape hatches (you can’t selectively report only the winning periods).
For your project, the project report is your “paper performance” — by writing up the indicator, the backtest, and the result, you commit to the result. Don’t try ten indicators and report only the one that worked; report what you actually tested and let the marker see the honest picture.