Lecture 1: Foundations
Course outline · Backtesting fundamentals
1.1 Course objectives
- 1.1 Course objectives
- 1.2 Aim & organisation
- 1.3 Backtesting fundamentals
- 1.4 Conclusion of Lecture 1
Welcome to Finance Project — Asset Management
- This is a project course: there is no central exam to register for. Sign up on the course Moodle page by 15 April 2026 so you receive announcements and the data link.
- Submit the project by 30 June 2026 as a single zip — name pattern:
Asset2026_surname1_surname2_surname3. Email it to oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de and your team-mates.
- Ask questions during or right after each session — that is the preferred channel.
- Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
- Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
- We also recommend the student advisory service.
Course Objective
Scope
We will:
- Build an end-to-end empirical pipeline in R: load, explore, model, back-test
- Cover the core ML toolbox for asset-management research: linear models, Ridge, Lasso, Elastic Net, cross-validation
- Apply it to a non-traditional asset class: prediction markets
- Develop your own indicator library and trading strategy in groups of three
We will NOT:
- Drift into deep-learning or reinforcement-learning methods
- Cover prediction markets in depth
- Provide a “ready-to-fork” backtest — the demo code is intentionally basic
Approach
Part I — Foundations
- L1: Motivation, organisation, backtesting fundamentals
- L2: Hands-on R intro — RStudio, live coding, etc.
- L3 + L4: Statistical learning — model accuracy, regularisation, resampling
Part II — Application
- L5: Prediction-markets primer + the Polymarket dataset + assignment briefing
- Project work in groups of three (≈ 7 weeks of self-organised work)
- Final session (1 July): 20-minute presentations per team
Course at a glance (1/2)
Foundations
Course outline · Backtesting fundamentals
- Course aim & organisation
- Backtesting overview & case study
- In-sample tests (Welch & Goyal 2008)
- Out-of-sample (walk-forward, R²_OS)
- Useful predictors & p-hacking
Introduction to R
RStudio · variables · vectors · data frames · live coding
- Why R for empirical asset-management research
- RStudio and the script editor
- Variables, vectors, matrices, data frames, lists
- Functions and loops
- Data import and export
Assessing model accuracy & Ridge regression
Statistical learning · MSE · bias-variance · linear model selection · Ridge
- Statistical learning: Y = f(X) + ε
- Quality of fit and the train/test MSE distinction
- Bias-variance trade-off and overfitting
- OLS limits: prediction accuracy & interpretability
- Ridge regression and the L2 penalty
Lasso, cross-validation & Elastic Net
Sparse regularisation · resampling for honest test error · choosing λ
- Lasso: L1 penalty and exact-zero coefficients
- Cross-validation: validation set, LOOCV, K-fold
- Choosing the optimal λ for Lasso
- OLS post-Lasso for cleaner coefficient inference
- Elastic Net — combining Ridge and Lasso
Prediction markets, the Polymarket Quant Bench & your project
From Welch-Goyal to event-resolved binary contracts
- Prediction markets — definition and Polymarket as the canonical venue
- How prices form: liquidity, resolution, mechanics
- The Polymarket Quant Bench dataset (HuggingFace): access and schema
- First look at the data in R
- Your project: indicator design, back-test, deliverables, R toolbox
Course at a glance (2/2)
Final presentations
Group presentations · Q&A · wrap-up
- Presentation order and time budget
- Q&A rules
- Closing thoughts and feedback
Assignments / Exams
Project (Code + Report) 50% of your grade
Rmd code + knitr-rendered PDF report. Build a library of indicators over the Polymarket Quant Bench dataset (curated OHLCV bars on HuggingFace, derived from Jon Becker’s polymarket-data dump), derive trade signals, back-test, and write a critical reflection.
Group of up to 3.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-1-project-report_surname1_surname2_…
30 June 2026
Final Presentation 50% of your grade
20-minute group presentation in class on 1 July 2026; submit slides as PDF together with the project zip.
Group of up to 3.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-2-final-presentation_surname1_surname2_…
1 July 2026
1.2 Aim & organisation
- 1.1 Course objectives
- 1.2 Aim & organisation
- 1.3 Backtesting fundamentals
- 1.4 Conclusion of Lecture 1
Aim of the Finance Project
- Students develop their own indicator-based trading strategies for prediction markets using machine-learning approaches in R.
- Combination of lectures (key concepts in Machine Learning, backtesting, basic R) plus hands-on empirical implementations of your own strategy.
- Ideally, you can run your strategy on a live dashboard.
Notes
The course is structured backwards from a deliverable: by the end of the term you will have built and back-tested an indicator-based trading strategy on prediction-market data. Everything in the lectures — R fundamentals, regression, regularisation, cross-validation — is the toolkit you need to make that strategy non-trivial.
Prediction markets are exchanges where contracts pay out based on the resolution of real-world events (will candidate X win the election? will GDP exceed 3% next year?). They have two properties that make them an interesting empirical playground: (a) prices have a clean economic interpretation as the market’s probability estimate of the event, so you can build indicators on probability dynamics rather than abstract returns; (b) the dataset has rich cross-sectional structure (thousands of markets resolving over different horizons), which means standard backtesting and machine-learning methods all apply but on data that is unfamiliar enough to force you to think carefully about what the indicators mean.
The “live dashboard” goal at the end is aspirational — most groups produce a static back-test in Quarto / Rmd. A small number push further into a Shiny dashboard that updates as new market data arrives; that is excellent thesis-level work and is encouraged but not required.
Course outline
- Motivation & Organisation (today)
- Backtesting Fundamentals (today)
- Statistical Learning with Applications in R
- Introduction to R — Lecture 2 (Oliver)
- Assessing Model Accuracy — Lecture 3 (Andre)
- Linear Model Selection & Regularisation — Lectures 3–4 (Andre)
- Resampling Methods — Lecture 4 (Andre)
- Practical Implementation — Lecture 5 (Oliver)
- Prediction-markets primer & the Polymarket Quant Bench dataset
- Your project: build indicators & back-test on prediction-market data
Notes
The four parts map onto a learn-then-apply rhythm:
- Today (Lecture 1) lays the conceptual ground: why we backtest, what makes a backtest credible (in-sample vs out-of-sample), and the canonical empirical benchmark — Welch and Goyal’s exhaustive in-sample-vs-out-of-sample comparison (Welch and Goyal 2008).
- Lecture 2 brings R fluency to a level where you can implement everything that follows. If you have R experience already (e.g., from the Research in Finance course), parts of L2 will be revision; if you don’t, this is the lecture to attend in person and run the code along with.
- Lectures 3–4 introduce the modelling toolkit: linear regression as the baseline; ridge and lasso as the regularisation extensions; cross-validation as the principled way to choose tuning parameters and estimate out-of-sample error. An Introduction to Statistical Learning (James et al. 2021) chapters 3, 5, and 6 are the textbook basis for these lectures.
- Lecture 5 introduces the Polymarket Quant Bench dataset and turns you loose on the project — by the end of L5 you should have an indicator and a working backtest skeleton; the rest is iteration with our consultation hours.
- Lecture 6 is final presentations.
The hand-off across instructors (Andre for the ML lectures, Oliver for the R + practical lectures) reflects who’s most fluent in each topic; either of us is the right contact for general questions.
General course information
- All lectures: Wednesdays at 12:15 in Helmholtzstraße 18, room E60, Ulm.
- See dates in the course overview or on Moodle (first lecture on 15 April 2026).
- After Lecture 5, the practical phase starts and we offer regular consultation hours.
- Important deadlines:
- Written project (Rmd + PDF + slides as zip): 30 June 2026, 18:00.
- Final presentations: 1 July 2026.
Reading & grading
An Introduction to Statistical Learning with Applications in R — James, Witten, Hastie, Tibshirani (free eBook).
- Companion videos: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos
- Companion site & data: https://www.statlearning.com
- Project report (10–15 pages, R Markdown → PDF): 50%
- Final presentation (~20 minutes, PDF slides): 50%
- Group-based: groups of 3 (we allocate if you don’t form one).
Notes
Mandatory reading. An Introduction to Statistical Learning (James et al. 2021) is the canonical textbook for the methods this course uses — the second edition is free as a PDF at https://www.statlearning.com, with R lab code and exercises at the end of each chapter. The chapters most directly relevant to this course:
- Chapter 2 — basic statistical learning concepts (bias-variance trade-off, training vs test error, classification vs regression).
- Chapter 3 — linear regression. Most of Lecture 3 is a focused walk through this chapter.
- Chapter 5 — resampling methods (cross-validation, bootstrap). Lecture 4.
- Chapter 6 — linear model selection and regularisation (best subset selection, ridge regression, lasso). Lectures 3–4.
You don’t need to read everything front-to-back, but the JWHT chapter on whatever lecture is happening next is the right preparation reading. The companion videos linked on the slide are useful when the textbook explanation is too dense — the authors recorded their own walkthroughs of every chapter.
Grading mechanics. The 50/50 split between report and presentation reflects that both deliverables matter. The report is graded on the rigour of the analysis (clean methodology, honest evaluation, defensible conclusions); the presentation on clarity (can you explain the strategy in 20 minutes to colleagues who haven’t read the report?). Forming groups of 3 early is high-leverage — solo work is allowed but rarely produces the best work given the project’s scope.
Slides, materials & contact
- Slides + teaching material: posted to Moodle and the course homepage.
- Dataset link: shared via Moodle ahead of Lecture 5 (do not start early — we may update the dataset).
- Lecture content — Andre Guettler (andre.guettler@uni-ulm.de)
- Practical implementations — Oliver Padmaperuma (oliver.padmaperuma@uni-ulm.de)
1.3 Backtesting fundamentals
- 1.1 Course objectives
- 1.2 Aim & organisation
- 1.3 Backtesting fundamentals
- 1.4 Conclusion of Lecture 1
Backtesting — overview
- Case study — what does a polished backtest look like in practice?
- In-sample tests — how do we know a predictor matters on the historical sample?
- Out-of-sample forecasts — does the predictor still work outside the estimation window?
Notes
Backtesting is the standard quantitative-finance workflow for evaluating a strategy idea: take a historical dataset, simulate trading the strategy as if you had been deploying it in real time, and measure the resulting risk-adjusted return.
The fundamental challenge is data snooping. If you tweak the strategy parameters and re-run the backtest until performance looks good, you’re optimising on a single sample and the apparent edge will not survive in live trading. The whole field of empirical asset-pricing has spent two decades developing tools to detect and correct for this problem; today’s lecture introduces the most basic and most important one — out-of-sample evaluation.
The three-part structure (case study → in-sample tests → out-of-sample tests) starts concrete and gets more abstract: the SRA case study shows what good backtesting output looks like; the in-sample slide shows the standard regression test; the out-of-sample slide shows why in-sample alone is not enough.
Case study — SRA Credit Spread Trading
- Forecasting horizon: 1 day
- Trading instruments: $HYG ETF
- Risk-free: US Treasury (1y)
- Benchmark: $HYG buy-and-hold
- Rebalancing: daily
- Trading costs: bid/ask + 2 bp ETF + 0.5 bp TSY
- Sharpe ratio: 1.38 (strategy) vs 0.40 (benchmark)
- Std. dev. (annualised): 8.3% vs 21.0%
- Max draw-down: −7.5% vs −33.5%
- Annualised return: 16.73% vs 7.99%
Notes
The example SRA credit-spread strategy is a polished demonstration of what a complete backtest looks like — every cell on this slide is a number you should be able to produce for your own project by the end of the term.
What the numbers tell you:
- Sharpe 1.38 vs 0.40 — the strategy delivers more than three times the risk-adjusted return of the buy-and-hold benchmark. Sharpe is the universal scale-free measure: \((\bar r - r_f) / \sigma\). A Sharpe above 1 is “good”; above 2 is “exceptional”; above 3 is “almost certainly overfit”.
- Standard deviation 8.3% vs 21.0% — the strategy is much less volatile than holding the underlying. The strategy spends meaningful time in cash (the risk-free asset), which mechanically reduces variance.
- Max drawdown −7.5% vs −33.5% — the worst peak-to-trough loss. Drawdown matters because it determines whether an investor would actually stay invested through the strategy’s losing periods. A 33.5% drawdown on a buy-and-hold (the 2008 low) is what makes investors capitulate; the strategy’s −7.5% is bearable.
- Annualised return 16.7% vs 8.0% — the strategy delivers higher absolute return and lower risk. That combination — positive on both axes — is the hallmark of a strategy with an actual edge, not just leverage.
The transaction-cost line (bid/ask + 2bp ETF + 0.5bp Treasury) is the most underweighted part of academic backtests. Many published “anomalies” disappear once realistic costs are baked in. Your project will be required to include realistic costs.
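For reference when you assemble your own backtest, here is a minimal base-R sketch of how these four headline metrics fall out of a daily return series; strat_ret and rf_daily below are hypothetical vectors of daily strategy and risk-free returns, and 252 trading days per year is assumed.

```r
# Hedged sketch: headline backtest metrics from daily returns.
# `strat_ret` and `rf_daily` are hypothetical, equally long daily-return vectors.
sharpe_annual <- function(ret, rf) {
  ex <- ret - rf                       # daily excess returns
  mean(ex) / sd(ex) * sqrt(252)        # annualised Sharpe ratio
}

vol_annual <- function(ret) sd(ret) * sqrt(252)   # annualised std. dev.

max_drawdown <- function(ret) {
  wealth <- cumprod(1 + ret)           # cumulative wealth curve
  min(wealth / cummax(wealth) - 1)     # worst peak-to-trough loss
}

ann_return <- function(ret) prod(1 + ret)^(252 / length(ret)) - 1

sharpe_annual(strat_ret, rf_daily); vol_annual(strat_ret)
max_drawdown(strat_ret);  ann_return(strat_ret)
```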
In-sample tests — general setup
\[ r_{t,t+h} \;=\; a + b\,X_t + u_t \]
- \(r\) — log excess return (\(=\) asset return − risk-free rate)
- \(X\) — vector of predictor(s)
- \(t\) — end of period; \(h\) — forecast horizon (e.g. 6 months)
Example.
- \(X_t\) = Dividend yield as of 31.12.2020
- \(r_{t,t+6}\) = excess return for 31.12.2020 → 30.06.2021
Notes
This is the workhorse predictive regression of empirical asset pricing: regress future returns on a predictor observed today. The slope \(b\) is the effect we care about — if it’s significantly non-zero (and the right sign), the predictor appears to forecast returns.
Two technical features that matter:
- Excess returns (\(r\) = asset return − risk-free rate). The risk-free rate is what an investor could earn without bearing market risk; subtracting it isolates the compensation for risk. All academic predictability work uses excess returns; raw-return predictability is mostly mechanical (real interest rates correlate with macro variables that correlate with returns).
- Forecast horizon \(h\). With \(h = 1\) month and monthly data you have \(T\) non-overlapping observations and standard t-statistics work. With \(h = 12\) months and monthly data you have heavily overlapping observations (the regression at \(t\) and \(t+1\) share 11 months of returns) which inflates t-statistics enormously if you ignore the overlap. The bootstrap procedure on the next-but-one slide is the (Welch and Goyal 2008) correction for this.
In practice the equation is run as lm(future_return ~ predictor) in R; the slope, t-statistic, and adjusted R² are the headline outputs. We’ll come back to this regression form repeatedly in Lectures 3 and 4 with progressively more sophisticated estimators.
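A minimal sketch of that call, assuming a data frame df with hypothetical columns excess_ret (the period-t excess return) and dy (the dividend yield at the end of t), shown here for h = 1:

```r
# Workhorse predictive regression r_{t,t+h} = a + b X_t + u_t, sketched for h = 1.
# `df`, `excess_ret` and `dy` are hypothetical names.
df$fwd_ret <- c(df$excess_ret[-1], NA)   # return over t..t+1, aligned with X_t
                                         # (for h > 1, cumulate returns over the horizon)
fit <- lm(fwd_ret ~ dy, data = df)
summary(fit)                             # slope b, its t-statistic, adjusted R²
```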
In-sample tests — properties
- In-sample OLS estimators have high power (when the model specification is valid).
- Results often depend heavily on sample period (start and end dates).
- Inference is tricky, especially with multi-period excess returns and persistent predictors.
Notes
“In-sample OLS estimators have high power” means: if a predictor genuinely affects returns, the in-sample regression will reliably detect it. The OLS coefficient is unbiased and the standard t-statistic asymptotically delivers the right rejection rate.
The two caveats are what make in-sample tests dangerous in practice:
- Sample-period sensitivity. Predictors that look strong on 1947–1990 may be weak on 1947–2024 (or vice versa). Empirical asset-pricing literature has repeatedly found that “anomalies” weaken or disappear after publication, partly because of arbitrage activity but also because the original sample may have been atypical. Always report what your result looks like on a hold-out sub-period.
- Inference is tricky. When the predictor is highly persistent (e.g., dividend yield with autocorrelation > 0.95) and the dependent variable involves overlapping returns, the t-statistic from lm() is badly miscalibrated — it can suggest p < 0.01 when the true p-value is 0.20. The Welch-Goyal bootstrap (next slide) and the Newey-West / Hodrick standard errors are the standard corrections.
These issues are why a positive in-sample t-statistic is not the end of the story. The out-of-sample test on the slide after next is the much more credible robustness check.
Welch & Goyal (2008) — bootstrap inference
- Bootstrap procedure in Welch and Goyal (2008) follows Mark (1995) and Kilian (1999).
- Construct 10 000 bootstrapped time series by drawing residuals with replacement.
- Initial observation: pick one date from the actual data at random.
- The procedure preserves the autocorrelation structure of the predictor.
Notes
The Welch-Goyal bootstrap (Welch and Goyal 2008), building on Mark (Mark 1995) and Kilian (Kilian 1999), is a residual-resampling procedure that simulates many counterfactual histories of the predictor-and-return system. The crucial design choice: instead of resampling observations independently (which would destroy the autocorrelation structure of a persistent predictor), the bootstrap resamples residuals and reconstructs the time series with the original AR dynamics. That preserves the persistence and gives correctly-sized inference.
Operationally:
- Fit r_t = a + b X_t + u_t on the actual data; estimate a, b, and the residuals u_t.
- Fit X_t = c + d X_{t-1} + v_t (or a higher AR order); estimate c, d, and the residuals v_t.
- Under the null (\(b = 0\)): generate a counterfactual return series as r*_t = a + 0 × X_t + u*_t, where u*_t is a random draw with replacement from the actual residuals; generate a counterfactual predictor from the AR equation and the resampled v*_t.
- Repeat 10 000 times. The distribution of bootstrapped t-statistics is the null distribution that the actual t-statistic should be compared to.
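The steps above translate almost line-for-line into base R. A compact, deliberately unoptimised sketch, assuming aligned numeric vectors r (excess returns) and x (the predictor); all names are hypothetical:

```r
# WG08-style residual bootstrap under the null b = 0 (sketch, hypothetical names).
T_len <- length(r)
fit_r <- lm(r[-1] ~ x[-T_len])                 # r_t = a + b x_{t-1} + u_t
fit_x <- lm(x[-1] ~ x[-T_len])                 # x_t = c + d x_{t-1} + v_t (AR(1))
a  <- coef(fit_r)[1]; u <- resid(fit_r)
c0 <- coef(fit_x)[1]; d <- coef(fit_x)[2]; v <- resid(fit_x)

t_boot <- replicate(10000, {
  u_star <- sample(u, T_len - 1, replace = TRUE)   # resample return residuals
  v_star <- sample(v, T_len, replace = TRUE)       # resample predictor residuals
  x_star <- numeric(T_len)
  x_star[1] <- sample(x, 1)                        # random initial observation
  for (t in 2:T_len)                               # rebuild x with its AR(1) dynamics
    x_star[t] <- c0 + d * x_star[t - 1] + v_star[t]
  r_star <- a + u_star                             # impose the null: b = 0
  coef(summary(lm(r_star ~ x_star[-T_len])))[2, "t value"]
})

t_actual <- coef(summary(fit_r))[2, "t value"]
mean(abs(t_boot) >= abs(t_actual))                 # bootstrap p-value
```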
The result: a much more conservative p-value than the OLS t-statistic. Many predictors that look highly significant in the standard regression turn out to have only marginal significance under the bootstrap — and a few that look marginal under OLS turn out to be significant. The next slide reports the bootstrap-corrected results from WG08.
In-sample results — Welch and Goyal (2008)
In-sample adjusted \(R^2\) over the full sample — significant predictors (* = p<0.10, ** = p<0.05, *** = p<0.01):
| Predictor | Sample | Adj. R² (%) |
|---|---|---|
| b/m Book-to-market | 1921–2005 | 3.20* |
| i/k Investment / capital | 1947–2005 | 6.63** |
| ntis Net equity expansion | 1927–2005 | 8.15*** |
| eqis Pct equity issuing | 1927–2005 | 9.15*** |
| all Kitchen sink | 1927–2005 | 13.81** |
Many other predictors have negative adjusted R² — see Welch and Goyal (2008).
Notes
The headline numbers from Welch and Goyal’s exhaustive in-sample analysis (Welch and Goyal 2008). Five predictors stand out as having both economically meaningful and statistically significant in-sample adjusted R²:
- b/m (book-to-market) — value-vs-growth signal.
- i/k (investment / capital) — corporate investment rate as a state variable.
- ntis (net equity expansion) — net new equity issuance scaled by market cap.
- eqis (percent equity issuing activity) — closely related to ntis.
- kitchen sink — all predictors stacked into one regression.
R² of 3–14% over multi-year horizons is large by financial-economics standards (single-factor models on monthly returns explain 1–2%). The kitchen-sink R² of 13.81% is suggestively high — but it’s also where the multiple-testing concern is most acute. Putting many predictors in the same regression all but guarantees one will look significant by chance.
The footnote — “many other predictors have negative adjusted R²” — is the silent companion. The full WG08 sample includes ~14 predictors; only the five listed clear the bar. The other 9 fail. This is the core “publication bias” concern: papers tend to publish the predictors that worked, leaving the failures unreported.
The next slide pulls the rug out from under most of these in-sample wins by showing the out-of-sample picture.
Out-of-sample tests — overview
- OOS = robustness check for in-sample estimation.
- Predictive performance is evaluated on data outside the training window.
- Walk-forward: estimate parameters \(\hat a_{t-1}, \hat b_{t-1}\) using data through \(t-1\); forecast \(\hat r_t = \hat a_{t-1} + \hat b_{t-1} X_{t-1}\); advance one step; repeat.
Notes
Walk-forward (also called expanding-window or rolling-window) backtesting is the discipline of estimating the model only on data the strategy could plausibly have known at the time. At each forecast date \(t\), you re-fit the model on data ending at \(t-1\) and forecast \(\hat r_t\) using that just-fitted model. No future information leaks into past forecasts.
Why it matters: the alternative — fit the model on the full sample, then evaluate the predictions — uses information from the future to estimate the model parameters that produce the past forecasts. That’s look-ahead bias and it inflates apparent forecast accuracy. Empirical-finance papers that don’t use walk-forward (or an explicit train/test split) are routinely overconfident.
There are two flavours of walk-forward:
- Expanding window — re-fit on all data through \(t-1\) each step. Uses more data but recent observations are diluted.
- Rolling window — re-fit on a fixed-size window (e.g., last 60 months). More responsive to regime change; smaller sample per fit.
For the project, expanding window is the simpler default; if your indicator has time-varying behaviour you may want to switch to rolling.
Implementation in R: it’s mostly a loop over time indices, with lm() re-fit on each iteration. The slider and tsibble packages have helpers; for production work, rsample::rolling_origin() from the tidymodels framework automates the window management.
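A minimal expanding-window sketch in base R, reusing the hypothetical vectors r and x from the regression sketch earlier; it also records the prevailing-mean benchmark forecast that the next slide’s R²_OS needs:

```r
# Expanding-window walk-forward (sketch): re-fit on data through t-1, forecast r_t.
burn <- 60                                   # initial estimation window
n <- length(r)
fcast   <- rep(NA_real_, n)                  # model forecasts
mean_fc <- rep(NA_real_, n)                  # prevailing-mean benchmark forecasts

for (t in (burn + 1):n) {
  fit <- lm(r[2:(t - 1)] ~ x[1:(t - 2)])     # only data through t - 1
  fcast[t]   <- coef(fit)[1] + coef(fit)[2] * x[t - 1]
  mean_fc[t] <- mean(r[1:(t - 1)])           # "just predict the average so far"
}
```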
Out-of-sample R²
Compare model forecast \(\hat r_t\) vs realisation \(r_t\), and historical-mean forecast \(\bar r_t\) vs realisation:
\[ R^2_{OS} \;=\; 1 \;-\; \dfrac{\sum_{t=1}^T (r_t - \hat r_t)^2}{\sum_{t=1}^T (r_t - \bar r_t)^2} \]
A positive \(R^2_{OS}\) means the model has lower prediction error than the historical mean.
Notes
The out-of-sample R² is constructed as a comparison against the historical-mean benchmark. The numerator is the sum of squared errors of the model forecast; the denominator is the sum of squared errors of the simplest possible alternative — predicting next period’s return as the average return seen so far.
Why this benchmark? Because the historical mean is almost impossible to beat in equity-premium prediction. Stock returns are heavily noise-dominated; a model that just guesses the long-run average gets you most of the way there. Beating that benchmark by even a small amount over thousands of periods is meaningful evidence of genuine forecasting ability.
Reading the formula: \(R^2_{OS} > 0\) means lower forecast error than the historical mean (good). \(R^2_{OS} < 0\) means worse than just predicting the average, which is a damning result — your model is not adding value, it is destroying it.
The \(R^2_{OS}\) measure is from Welch and Goyal (2008) and is now standard in the equity-premium-prediction literature. The strict comparison to a simple benchmark is what makes it a credible measure — easier benchmarks (zero, or last period’s value) would let weaker models look good.
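Continuing the walk-forward sketch from the previous slide’s notes (fcast and mean_fc are the recorded forecast vectors), the \(R^2_{OS}\) is a few lines:

```r
# Out-of-sample R²: model forecast errors vs prevailing-mean forecast errors.
ok <- !is.na(fcast)                          # drop the burn-in window
sse_model <- sum((r[ok] - fcast[ok])^2)
sse_mean  <- sum((r[ok] - mean_fc[ok])^2)
1 - sse_model / sse_mean                     # > 0: beats the historical mean
```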
Out-of-sample results — Welch and Goyal (2008)
In WG08, only one predictor (eqis) remains significant out-of-sample:
| Predictor | Adj. R²_OS (%) |
|---|---|
| eqis Pct equity issuing | 2.04** |
All other predictors deliver negative OOS R² — including the kitchen-sink combination at −139.03.
Notes
This is the single most important result in the equity-premium-prediction literature. Of the predictors that look statistically significant in-sample (previous slide), only one survives the out-of-sample test: eqis (percent equity issuing), with a modest 2.04% R²_OS that is itself only marginally significant.
The kitchen-sink result — adjusted in-sample R² of 13.81, adjusted OOS R² of −139.03 — is the most damning. Stacking 14 predictors into one regression looks great in-sample (every parameter is fit to maximise in-sample fit) but is catastrophic out-of-sample (every parameter overfits to noise that doesn’t repeat). Negative R²_OS at this magnitude means the model’s predictions are dramatically worse than just predicting the average return.
The lesson is the foundational empirical-asset-pricing result: in-sample fit is almost no evidence of true predictive power. The literature has spent two decades absorbing this result; the modern best-practice empirical-finance paper now reports both IS and OOS R², with the OOS test treated as the credibility-defining number.
The implications for your project: when you build an indicator, the IS regression on its own tells you very little. The OOS R² (or equivalently, the walk-forward backtested return) is the number that matters. If your indicator has positive IS but negative OOS, you have not found a real signal.
Plots — IS vs OOS R² over time
- IS line: cumulative squared demeaned equity premium − cumulative squared regression residual.
- OOS line: cumulative squared error of the prevailing mean − cumulative squared error of the predictor’s regression.
- Line rising ⇒ predictor gaining forecasting ability.
- Line falling ⇒ historical mean predicts better.
Notes
Plotting cumulative-SSE-difference over time (rather than just reporting a single R²_OS for the full sample) reveals time-variation in forecasting power that a single number hides. Three patterns to recognise:
- Smoothly rising — consistent forecasting ability across the sample. Rare in equity-premium prediction; sometimes seen with macro variables in shorter samples.
- Rising then falling — predictor worked historically but lost power in recent periods. Common pattern: many anomalies that worked in the 1970s–80s have weakened or vanished post-publication. The rising part of the curve flatters the full-sample R²_OS; the falling part is a warning sign about live deployment.
- Mostly flat then a sudden jump — the predictor’s apparent forecasting power comes from one specific window (often a crisis or structural break). Remove that window and the predictor is useless. The next slide’s book-to-market example shows exactly this pattern around the 1974 oil shock.
For your project, plot the cumulative SSE difference rather than just reporting a single backtested return — the time-series view tells you much more about whether your indicator is robust.
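Continuing the same walk-forward sketch, the cumulative SSE difference is one vectorised line plus a base plot:

```r
# OOS line: cumulative benchmark SSE minus cumulative model SSE over time.
d_sse <- (r - mean_fc)^2 - (r - fcast)^2     # per-period error difference
d_sse[is.na(d_sse)] <- 0                     # flat through the burn-in window
plot(cumsum(d_sse), type = "l",
     xlab = "Time", ylab = "Cumulative SSE difference")
abline(h = 0, lty = 2)                       # above zero: model beating the mean
```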
Example — Book-to-market
Positive IS, but negative OOS — fails the robustness check.
- IS prediction (dotted line) trends positive — the predictor looks useful in-sample.
- OOS prediction (solid black line) collapses post-1974 — the predictor loses forecasting power.
- Lesson: in-sample alpha alone is not enough. The same logic applies to your prediction-market indicators.
Notes
Book-to-market is the textbook example of a historically strong predictor that has lost power. The story:
- 1921–1974: the book-to-market ratio (book equity / market equity) appeared to predict future stock returns. Value stocks (high B/M) outperformed growth stocks (low B/M). The cumulative SSE difference rose steadily — the predictor was beating the historical mean.
- 1974: structural break, often associated with the first oil crisis and the resulting macro regime change.
- Post-1974: the predictor’s edge shrinks and eventually reverses. By the time the WG08 sample ends in 2005, the OOS R² is negative — book-to-market actively hurts forecasts compared to just predicting the average.
Two takeaways. First, the in-sample-only story would have been “book-to-market is a strong predictor of returns” — backed by decades of data. The OOS test reveals that the predictor’s apparent power was substantially driven by the pre-1974 sub-sample. Second, even if a predictor was genuinely useful in the past, publication and arbitrage erode its power. The Fama-French value factor is the academically prominent example; its excess return has compressed dramatically since the original 1993 paper. Any indicator your project finds “works” should be checked: was the strategy known? has it been arbitraged away?
Useful predictors — requirements
A predictor is useful if it shows:
- Both significant IS and reasonably good OOS performance over the entire sample.
- A generally upward drift (irregular is fine).
- Drift in more than one short or unusual sample period (not just two years around the Oil Shock).
- Drift that remains positive over recent decades — predictors often lose forecasting power after publication.
Notes
The four criteria are increasingly demanding ways of guarding against the failure modes seen in the WG08 results.
- (1) IS and OOS — both must work. IS-only is the trap that book-to-market and the kitchen-sink fell into.
- (2) Generally upward drift — the cumulative SSE difference should be rising over the sample, even if not monotonically. A predictor that delivers most of its OOS R² in two specific years is fragile.
- (3) Drift in more than one short or unusual period — relate this to (2). If the predictor only works during, say, the 2008 crisis or the 1974 oil shock, you’ve found a regime indicator, not a generic forecaster.
- (4) Drift positive in recent decades — the strongest test, because it’s the closest thing to a live test you can do retrospectively. Many predictors that satisfy criteria 1–3 fail criterion 4 because the literature has caught up with them.
For your project’s indicator, demand all four. If your indicator only works on one sub-period, or only over the full historical sample but not the last decade, treat that as evidence the indicator isn’t real, not evidence the recent sample is “different”.
P-hacking / data mining
- Multiple-testing fallacy — running enough backtests on a single dataset is certain to produce a result that meets any pre-specified statistical-significance threshold.
- Solutions:
- Paper performance — track strategy returns publicly online; raises the bar for parallel strategies.
- Real-money performance — gold standard; serious investors require ≥ 3 years of live, public performance (managed account or fund).
When you back-test five indicators and one “wins”, ask yourself how many silent losers it crowds out.
Notes
The multiple-testing problem in backtesting is the same statistical issue as multiple-testing in academic research, but more acute because backtests can be run cheaply: a typical practitioner can try hundreds of indicator variations in an afternoon. With α = 0.05 and 100 independent tests, you expect 5 to look “significant” by pure chance.
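A toy simulation makes the arithmetic concrete: test 100 pure-noise “indicators” against the same return series and count the “discoveries” (all names hypothetical; there is no true signal anywhere):

```r
# Multiple-testing illustration: ~5 of 100 noise indicators "work" at alpha = 0.05.
set.seed(1)
n_obs <- 240                               # e.g. 20 years of monthly data
r     <- rnorm(n_obs)                      # returns with no predictable component
pvals <- replicate(100, {
  x <- rnorm(n_obs)                        # a random indicator, pure noise
  summary(lm(r ~ x))$coefficients[2, 4]    # p-value on the slope
})
sum(pvals < 0.05)                          # expect about 5 false positives
```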
The standard academic correction (Bonferroni, FDR — see the Research in Finance Lecture 3 handout) is to require a tighter significance threshold when running many tests. In a practitioner setting, this typically means: if you tried 20 indicator variations to find one that works, the threshold for “this is real” should be much higher than the threshold for the first one you tried.
The two solutions on the slide are the gold standards:
- Paper performance — publish the strategy on a public-tracking website before deploying it. Each subsequent month of tracked performance is a genuine out-of-sample observation; after a year of strong paper performance, the strategy has built credibility.
- Real-money performance — actually deploy the strategy with capital, ideally as a managed account or fund with public reporting. This is the standard professional investors require because it forecloses the easy escape hatches (you can’t selectively report only the winning periods).
For your project, the project report is your “paper performance” — by writing the indicator, the back-test, and the result publicly, you commit to the result. Don’t try ten indicators and report only the one that worked; report what you actually tested and let the marker see the honest picture.
1.4 Conclusion of Lecture 1
- 1.1 Course objectives
- 1.2 Aim & organisation
- 1.3 Backtesting fundamentals
- 1.4 Conclusion of Lecture 1
Course at a glance (1/2)
Foundations
Course outline · Backtesting fundamentals
- Course aim & organisation
- Backtesting overview & case study
- In-sample tests (Welch & Goyal 2008)
- Out-of-sample (walk-forward, R²_OS)
- Useful predictors & p-hacking
Introduction to R
RStudio · variables · vectors · data frames · live coding
- Why R for empirical asset-management research
- RStudio and the script editor
- Variables, vectors, matrices, data frames, lists
- Functions and loops
- Data import and export
Assessing model accuracy & Ridge regression
Statistical learning · MSE · bias-variance · linear model selection · Ridge
- Statistical learning: Y = f(X) + ε
- Quality of fit and the train/test MSE distinction
- Bias-variance trade-off and overfitting
- OLS limits: prediction accuracy & interpretability
- Ridge regression and the L2 penalty
Lasso, cross-validation & Elastic Net
Sparse regularisation · resampling for honest test error · choosing λ
- Lasso: L1 penalty and exact-zero coefficients
- Cross-validation: validation set, LOOCV, K-fold
- Choosing the optimal λ for Lasso
- OLS post-Lasso for cleaner coefficient inference
- Elastic Net — combining Ridge and Lasso
Prediction markets, the Polymarket Quant Bench & your project
From Welch-Goyal to event-resolved binary contracts
- Prediction markets — definition and Polymarket as the canonical venue
- How prices form: liquidity, resolution, mechanics
- The Polymarket Quant Bench dataset (HuggingFace): access and schema
- First look at the data in R
- Your project: indicator design, back-test, deliverables, R toolbox
Course at a glance (2/2)
Final presentations
Group presentations · Q&A · wrap-up
- Presentation order and time budget
- Q&A rules
- Closing thoughts and feedback
Further reading
- James et al. (2021) — free PDF + exercises at https://www.statlearning.com.
- Welch and Goyal (2008) — the canonical IS-vs-OOS empirical benchmark we’ll keep coming back to.
- Mark (1995) and Kilian (1999) — long-horizon regressions and the bootstrap procedure WG08 build on.
Notes
- JWHT (James et al. 2021) — your reference textbook for the rest of the course. Read chapter 2 before Lecture 3.
- Welch and Goyal (Welch and Goyal 2008) — the empirical-asset-pricing paper that defines the IS-vs-OOS distinction we used today. The paper is exhaustive (~50 pages); skim the introduction, Table 1, and the conclusion to understand the headline result. The full appendix material is reference-grade — useful when you want to check whether someone else’s proposed predictor was already tested by WG08 (most have been).
- Mark (Mark 1995) and Kilian (Kilian 1999) — the methodological precursors WG08 builds on for long-horizon regressions and bootstrap inference. Read these only if you want to understand the technical underpinnings of the bootstrap procedure; for the project, the WG08 result is enough.
Prepare before next lecture
- Install R and RStudio from https://posit.co/download/rstudio-desktop.
- Make sure your installation works — open RStudio and run 1 + 1 in the Console.
- Bring a laptop to Lecture 2; we’ll do live coding.
Notes
Lecture 2 is the R fluency session — the only one where having R installed and working in advance is critical. If install or first-run issues come up, post in the course Moodle ahead of the lecture so we can sort it out before the live coding starts.
The check (“run 1 + 1”) is intentionally trivial: if R produces [1] 2 in the console, your install is fine. The common Windows pitfall is that R is installed but RStudio doesn’t see it — Tools → Global Options → General → R version lets you point RStudio at the right R binary.
Bring the laptop to Lecture 2 — typing along during live coding is the only way to internalise the syntax. Reading along is much less effective. If you don’t have a laptop available that day, pair up with a classmate.
See you next time
- Sign up on the course Moodle page so you receive announcements and the dataset link before Lecture 5.
- Lecture 2 (22 April 2026): Introduction to R — RStudio, variables, vectors, data frames, live coding.