Backtesting — overview
Case study — what does a polished backtest look like in practice?
In-sample tests — how do we know a predictor matters on the historical sample?
Out-of-sample forecasts — does the predictor still work outside the estimation window?
Backtesting is the standard quantitative-finance workflow for evaluating a strategy idea: take a historical dataset, simulate trading the strategy as if you had been deploying it in real time, and measure the resulting risk-adjusted return.
The fundamental challenge is data snooping. If you tweak the strategy parameters and re-run the backtest until performance looks good, you’re optimising on a single sample and the apparent edge will not survive in live trading. The whole field of empirical asset pricing has spent two decades developing tools to detect and correct for this problem; today’s lecture introduces the most basic and most important one — out-of-sample evaluation.
The three-part structure (case study → in-sample tests → out-of-sample tests) starts concrete and gets more abstract: the SRA case study shows what good backtesting output looks like; the in-sample slide shows the standard regression test; the out-of-sample slide shows why in-sample alone is not enough.
In-sample tests — general setup
\[
r_{t,t+h} \;=\; a + b\,X_t + u_t
\]
\(r\) — log excess return (\(=\) asset return − risk-free rate)
\(X\) — vector of predictor(s)
\(t\) — end of period; \(h\) — forecast horizon (e.g. 6 months)
Example.
\(X_t\) = Dividend yield as of 31.12.2020
\(r_{t,t+6}\) = excess return for 31.12.2020 → 30.06.2021
This is the workhorse predictive regression of empirical asset pricing: regress future returns on a predictor observed today. The slope \(b\) is the effect we care about — if it’s significantly non-zero (and the right sign), the predictor appears to forecast returns.
Two technical features that matter:
Excess returns (\(r\) = asset return − risk-free rate). The risk-free rate is what an investor could earn without bearing market risk; subtracting it isolates the compensation for risk. All academic predictability work uses excess returns; raw-return predictability is mostly mechanical (real interest rates correlate with macro variables that correlate with returns).
Forecast horizon \(h\). With \(h = 1\) month and monthly data you have \(T\) non-overlapping observations and standard t-statistics work. With \(h = 12\) months and monthly data you have heavily overlapping observations (the regressions at \(t\) and \(t+1\) share 11 months of returns), which inflates t-statistics enormously if you ignore the overlap. The bootstrap procedure on the next-but-one slide is the Welch and Goyal (2008) correction for this.
In practice the equation is run as lm(future_return ~ predictor) in R; the slope, t-statistic, and adjusted R² are the headline outputs. We’ll come back to this regression form repeatedly in Lectures 3 and 4 with progressively more sophisticated estimators.
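A minimal R sketch of that regression, with simulated data standing in for a real predictor and return series (the persistence parameter, sample size, and coefficient values below are all illustrative):

```r
set.seed(1)
n  <- 600                                        # months of data
dy <- as.numeric(arima.sim(list(ar = 0.98), n))  # persistent predictor, e.g. dividend yield
ret_fwd <- 0.01 * dy + rnorm(n, sd = 0.05)       # simulated h-period-ahead excess return,
                                                 # aligned so row t pairs X_t with r_{t,t+h}

fit <- lm(ret_fwd ~ dy)   # the predictive regression
summary(fit)              # slope b, t-statistic, adjusted R^2 (the headline outputs)
```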
In-sample tests — properties
In-sample OLS estimators have high power (when the model specification is valid).
Results often depend heavily on sample period (start and end dates).
Inference is tricky, especially with multi-period excess returns and persistent predictors.
“In-sample OLS estimators have high power” means: if a predictor genuinely affects returns, the in-sample regression will reliably detect it. The OLS coefficient is consistent and the standard t-statistic asymptotically delivers the right rejection rate.
The two caveats are what make in-sample tests dangerous in practice:
Sample-period sensitivity. Predictors that look strong on 1947–1990 may be weak on 1947–2024 (or vice versa). The empirical asset-pricing literature has repeatedly found that “anomalies” weaken or disappear after publication, partly because of arbitrage activity but also because the original sample may have been atypical. Always report what your result looks like on a hold-out sub-period.
Inference is tricky. When the predictor is highly persistent (e.g., dividend yield with autocorrelation > 0.95) and the dependent variable involves overlapping returns, the t-statistic from lm() is badly miscalibrated — it can suggest p < 0.01 when the true p-value is 0.20. The Welch-Goyal bootstrap (next slide) and the Newey-West / Hodrick standard errors are the standard corrections.
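A sketch of the standard-error correction, assuming the sandwich and lmtest packages: with \(h\)-month overlapping returns the errors are autocorrelated up to lag \(h-1\), so that is the usual lag choice for Newey-West. The simulated data and names below are illustrative:

```r
library(sandwich)   # NeweyWest(): HAC covariance estimator
library(lmtest)     # coeftest(): redo inference with a supplied covariance matrix

set.seed(1)
n <- 600; h <- 12
x  <- as.numeric(arima.sim(list(ar = 0.98), n))  # persistent predictor
r1 <- 0.01 * x + rnorm(n, sd = 0.05)             # one-month excess returns

# h-month overlapping return r_{t,t+h}: sum of the next h one-month returns
r_h <- sapply(seq_len(n - h), function(t) sum(r1[(t + 1):(t + h)]))
x_t <- x[seq_len(n - h)]

fit <- lm(r_h ~ x_t)
summary(fit)$coefficients   # naive OLS t-statistic (overstated under overlap)
coeftest(fit, vcov. = NeweyWest(fit, lag = h - 1, prewhite = FALSE))
```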
These issues are why a positive in-sample t-statistic is not the end of the story. The out-of-sample test on the slide after next is the much more credible robustness check.
Welch & Goyal (2008) — bootstrap inference
Bootstrap procedure in Welch and Goyal (2008) follows Mark (1995) and Kilian (1999).
Construct 10 000 bootstrapped time series by drawing residuals with replacement.
Initial observation: pick one date from the actual data at random.
The procedure preserves the autocorrelation structure of the predictor.
The Welch-Goyal bootstrap (Welch and Goyal 2008), building on Mark (1995) and Kilian (1999), is a residual-resampling procedure that simulates many counterfactual histories of the predictor-and-return system. The crucial design choice: instead of resampling observations independently (which would destroy the autocorrelation structure of a persistent predictor), the bootstrap resamples residuals and reconstructs the time series with the original AR dynamics. That preserves the persistence and gives correctly sized inference.
Operationally (a minimal R sketch follows the list):
Fit r_t = a + b X_t + u_t on the actual data; estimate a, b, residuals u_t.
Fit X_t = c + d X_{t-1} + v_t (or a higher AR order); estimate c, d, residuals v_t.
Under the null (\(b = 0\)): generate a counterfactual return series as r*_t = a + 0 × X_t + u*_t where u*_t is a random draw with replacement from the actual residuals; generate a counterfactual predictor from the AR equation and resampled v*_t.
Repeat 10 000 times. The distribution of bootstrapped t-statistics is the null distribution that the actual t-statistic should be compared to.
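A minimal sketch of those steps in R, simplified to an AR(1) predictor and non-overlapping one-period returns (WG08 also handle longer horizons); the data are simulated, and the replication count is reduced from 10 000 for speed:

```r
set.seed(1)
n <- 600
x <- as.numeric(arima.sim(list(ar = 0.95), n))   # persistent predictor
r <- rnorm(n, mean = 0.005, sd = 0.05)           # returns with no true predictability

# Steps 1-2: predictive regression and AR(1) for the predictor, on actual data
fit_r <- lm(r[-1] ~ x[-n]); a <- coef(fit_r)[1]; u <- resid(fit_r)
fit_x <- lm(x[-1] ~ x[-n]); cd <- coef(fit_x);   v <- resid(fit_x)
t_actual <- summary(fit_r)$coefficients[2, "t value"]

# Steps 3-4: rebuild counterfactual histories under the null b = 0
t_boot <- replicate(2000, {                  # WG08 use 10 000 draws
  idx <- sample.int(n - 1, replace = TRUE)   # residual draws, with replacement
  x_star <- numeric(n)
  x_star[1] <- x[sample.int(n, 1)]           # initial observation: a random actual date
  for (t in 2:n)                             # AR recursion preserves the persistence
    x_star[t] <- cd[1] + cd[2] * x_star[t - 1] + v[idx[t - 1]]
  r_star <- a + u[idx]                       # null returns: b = 0, resampled residuals
  xs <- x_star[-n]
  summary(lm(r_star ~ xs))$coefficients[2, "t value"]
})

# Bootstrap p-value: how extreme is the actual t-statistic under the null?
mean(abs(t_boot) >= abs(t_actual))
```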
The result: a much more conservative p-value than the OLS t-statistic. Many predictors that look highly significant in the standard regression turn out to have only marginal significance under the bootstrap — and a few that look marginal under OLS turn out to be significant. The next slide reports the bootstrap-corrected results from WG08.
In-sample results — Welch and Goyal (2008)
In-sample adjusted \(R^2\) over the full sample — significant predictors (* = p<0.10, ** = p<0.05, *** = p<0.01):
Predictor   Description             Sample       Adj. R² (%)
b/m         Book-to-market          1921–2005          3.20*
i/k         Investment / capital    1947–2005          6.63**
ntis        Net equity expansion    1927–2005          8.15***
eqis        Pct equity issuing      1927–2005          9.15***
all         Kitchen sink            1927–2005         13.81**
Many other predictors have negative adjusted R² — see Welch and Goyal (2008).
The headline numbers from Welch and Goyal’s exhaustive in-sample analysis (Welch and Goyal 2008). Five predictors stand out as having both economically meaningful and statistically significant in-sample adjusted R²:
b/m (book-to-market) — value-vs-growth signal.
i/k (investment / capital) — corporate investment rate as a state variable.
ntis (net equity expansion) — net new equity issuance scaled by market cap.
eqis (percent equity issuing activity) — closely related to ntis.
kitchen sink — all predictors stacked into one regression.
R² of 3–14 % over multi-year horizons is large by financial-economics standards (single-factor models on monthly returns explain 1–2 %). The kitchen-sink R² of 13.81 % is suggestively high — but it’s also where the multiple-testing concern is most acute. Putting many predictors in the same regression all but guarantees one will look significant by chance.
The footnote — “many other predictors have negative adjusted R²” — is the silent companion. The full WG08 sample includes roughly 14 predictors; only the five listed clear the bar, and the rest fail. This is the core “publication bias” concern: papers tend to publish the predictors that worked, leaving the failures unreported.
The next slide pulls the rug out from under most of these in-sample wins by showing the out-of-sample picture.
Out-of-sample tests — overview
OOS = robustness check for in-sample estimation.
Predictive performance is evaluated on data outside the training window.
Walk-forward: estimate parameters \(\hat a_{t-1}, \hat b_{t-1}\) using data through \(t-1\); forecast \(\hat r_t = \hat a_{t-1} + \hat b_{t-1} X_{t-1}\); advance one step; repeat.
Walk-forward (also called expanding-window or rolling-window) backtesting is the discipline of estimating the model only on data the strategy could plausibly have known at the time. At each forecast date \(t\), you re-fit the model on data ending at \(t-1\) and forecast \(\hat r_t\) using that just-fitted model. No future information leaks into past forecasts.
Why it matters: the alternative — fit the model on the full sample, then evaluate the predictions — uses information from the future to estimate the model parameters that produce the past forecasts. That’s look-ahead bias and it inflates apparent forecast accuracy. Empirical-finance papers that don’t use walk-forward (or an explicit train/test split) are routinely overconfident.
There are two flavours of walk-forward:
Expanding window — re-fit on all data through \(t-1\) each step. Uses more data but recent observations are diluted.
Rolling window — re-fit on a fixed-size window (e.g., last 60 months). More responsive to regime change; smaller sample per fit.
For the project, expanding window is the simpler default; if your indicator has time-varying behaviour you may want to switch to rolling.
Implementation in R: it’s mostly a loop over time indices, with lm() re-fit on each iteration (a minimal sketch of the plain-loop version follows). The slider and tsibble packages have helpers; for production work, the tidymodels framework’s rsample::rolling_origin() automates the window management.
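An expanding-window sketch in base R, with simulated data standing in for a real predictor series; the burn-in length and all parameter values are illustrative:

```r
set.seed(1)
n <- 600
x <- as.numeric(arima.sim(list(ar = 0.98), n))       # predictor observed at the end of t-1
r <- 0.005 + 0.01 * c(0, x[-n]) + rnorm(n, sd = 0.05) # return for period t, driven by x[t-1]
                                                      # (x_0 set to 0 for simplicity)
burn_in <- 120                        # reserve the first 10 years for estimation only
r_hat <- r_bar <- rep(NA_real_, n)

for (t in (burn_in + 1):n) {
  # Fit ONLY on data through t-1: pairs (X_{s-1}, r_s) for s <= t-1
  y_tr <- r[2:(t - 1)]
  x_tr <- x[1:(t - 2)]
  fit  <- lm(y_tr ~ x_tr)
  r_hat[t] <- coef(fit)[1] + coef(fit)[2] * x[t - 1]  # model forecast of r_t
  r_bar[t] <- mean(r[1:(t - 1)])                      # historical-mean benchmark
}
```

For the rolling-window flavour, replace the training indices with a fixed window, e.g. y_tr <- r[(t - 60):(t - 1)] and the matching x_tr <- x[(t - 61):(t - 2)].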
Out-of-sample R²
Compare model forecast \(\hat r_t\) vs realisation \(r_t\), and historical-mean forecast \(\bar r_t\) vs realisation:
\[
R^2_{OS} \;=\; 1 \;-\; \dfrac{\sum_{t=1}^T (r_t - \hat r_t)^2}{\sum_{t=1}^T (r_t - \bar r_t)^2}
\]
A positive \(R^2_{OS}\) means the model has lower prediction error than the historical mean.
The out-of-sample R² benchmarks the model forecast against the historical mean. The numerator is the sum of squared errors of the model forecast; the denominator is the sum of squared errors of the simplest possible alternative — predicting next period’s return as the average return seen so far.
Why this benchmark? Because the historical mean is almost impossible to beat in equity-premium prediction. Stock returns are heavily noise-dominated; a model that just guesses the long-run average gets you most of the way there. Beating that benchmark by even a small amount over thousands of periods is meaningful evidence of genuine forecasting ability.
Reading the formula: \(R^2_{OS} > 0\) means lower forecast error than the historical mean (good). \(R^2_{OS} < 0\) means worse than just predicting the average, which is a damning result — your model is not adding value, it is destroying it.
The \(R^2_{OS}\) measure is from Welch and Goyal (2008) and is now standard in the equity-premium-prediction literature. The strict comparison to a simple benchmark is what makes it a credible measure — easier benchmarks (zero, or last period’s value) would let weaker models look good.
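Continuing the walk-forward sketch above (r, r_hat, and r_bar from that loop), the formula is one line of R:

```r
ok <- !is.na(r_hat)   # evaluate only after the burn-in period
R2_OS <- 1 - sum((r[ok] - r_hat[ok])^2) / sum((r[ok] - r_bar[ok])^2)
R2_OS                 # positive: the model beats the historical-mean benchmark
```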
Out-of-sample results — Welch and Goyal (2008)
In WG08, only one predictor (eqis) remains significant out-of-sample:
Predictor   Description           OOS R² (%)
eqis        Pct equity issuing         2.04**
All other predictors deliver negative OOS R² — including the kitchen-sink combination at −139.03.
This is the single most important result in the equity-premium-prediction literature. Of the predictors that look statistically significant in-sample (previous slide), only one survives the out-of-sample test: eqis (percent equity issuing), with a modest 2.04 % R²_OS that is itself only marginally significant.
The kitchen-sink result — adjusted in-sample R² of 13.81, adjusted OOS R² of −139.03 — is the most damning. Stacking 14 predictors into one regression looks great in-sample (every parameter is fit to maximise in-sample fit) but is catastrophic out-of-sample (every parameter overfits to noise that doesn’t repeat). Negative R²_OS at this magnitude means the model’s predictions are dramatically worse than just predicting the average return.
The lesson is the foundational empirical-asset-pricing result: in-sample fit is almost no evidence of true predictive power. The literature has spent two decades absorbing this result; the modern best-practice empirical-finance paper now reports both IS and OOS R², with the OOS test treated as the credibility-defining number.
The implications for your project: when you build an indicator, the IS regression on its own tells you very little. The OOS R² (or equivalently, the walk-forward backtested return) is the number that matters. If your indicator has positive IS but negative OOS, you have not found a real signal.
Plots — IS vs OOS R² over time
IS line: cumulative squared demeaned equity premium − cumulative squared regression residual.
OOS line: cumulative squared error of the prevailing mean − cumulative squared error of the predictor’s regression.
Line rising ⇒ predictor gaining forecasting ability.
Line falling ⇒ historical mean predicts better.
Plotting the cumulative SSE difference over time (rather than just reporting a single R²_OS for the full sample) reveals time-variation in forecasting power that a single number hides. Three patterns to recognise:
Smoothly rising — consistent forecasting ability across the sample. Rare in equity-premium prediction; sometimes seen with macro variables in shorter samples.
Rising then falling — predictor worked historically but lost power in recent periods. Common pattern: many anomalies that worked in the 1970s–80s have weakened or vanished post-publication. The rising part of the curve flatters the full-sample R²_OS; the falling part is a warning sign about live deployment.
Mostly flat then a sudden jump — the predictor’s apparent forecasting power comes from one specific window (often a crisis or structural break). Remove that window and the predictor is useless. The next slide’s book-to-market example shows exactly this pattern around the 1974 oil shock.
For your project, plot the cumulative SSE difference rather than just reporting a single backtested return — the time-series view tells you much more about whether your indicator is robust.
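Continuing the same simulated walk-forward example (r, r_hat, r_bar), the OOS line from this slide is a running difference of cumulative squared errors; above zero and rising means the predictor is currently beating the prevailing mean:

```r
ok <- !is.na(r_hat)
sse_mean  <- cumsum((r[ok] - r_bar[ok])^2)   # prevailing-mean forecast errors
sse_model <- cumsum((r[ok] - r_hat[ok])^2)   # predictor-regression forecast errors

plot(sse_mean - sse_model, type = "l",
     xlab = "Evaluation period", ylab = "Cumulative SSE difference",
     main = "OOS: prevailing mean minus predictor model")
abline(h = 0, lty = 2)   # above the dashed line: predictor is forecasting better
```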
Useful predictors — requirements
A predictor is useful if it shows:
Both significant IS and reasonably good OOS performance over the entire sample.
A generally upward drift (irregular is fine).
Drift that is not confined to one short or unusual sample period (not just the two years around the Oil Shock).
Drift that remains positive over recent decades — predictors often lose forecasting power after publication.
The four criteria are increasingly demanding ways of guarding against the failure modes seen in the WG08 results.
(1) IS and OOS — both must work. IS-only is the trap that book-to-market and the kitchen-sink fell into.
(2) Generally upward drift — the cumulative SSE difference should be rising over the sample, even if not monotonically. A predictor that delivers most of its OOS R² in two specific years is fragile.
(3) Drift beyond one short or unusual period — relate this to (2). If the predictor only works during, say, the 2008 crisis or the 1974 oil shock, you’ve found a regime indicator, not a generic forecaster.
(4) Drift positive in recent decades — the strongest test, because it’s the closest thing to a live test you can do retrospectively. Many predictors that satisfy criteria 1–3 fail criterion 4 because the literature has caught up with them.
For your project’s indicator, demand all four. If your indicator only works on one sub-period, or only over the full historical sample but not the last decade, treat that as evidence the indicator isn’t real, not evidence the recent sample is “different”.
P-hacking / data mining
Multiple-testing fallacy — running enough backtests on a single dataset is certain to produce a result that meets any pre-specified statistical-significance threshold.
Solutions:
Paper performance — track strategy returns publicly online; this raises the bar against quietly running many strategies in parallel and reporting only the winner.
Real-money performance — gold standard; serious investors require ≥ 3 years of live, public performance (managed account or fund).
When you backtest five indicators and one “wins”, ask yourself how many silent losers it crowds out.
The multiple-testing problem in backtesting is the same statistical issue as multiple-testing in academic research, but more acute because backtests can be run cheaply: a typical practitioner can try hundreds of indicator variations in an afternoon. With α = 0.05 and 100 independent tests, you expect 5 to look “significant” by pure chance.
The standard academic correction (Bonferroni, FDR — see the Research in Finance Lecture 3 handout) is to require a tighter significance threshold when running many tests. In a practitioner setting, this typically means: if you tried 20 indicator variations to find one that works, the threshold for “this is real” should be much higher than the threshold for the first one you tried.
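A quick R illustration of both the false-positive arithmetic and the standard corrections (p.adjust is in base R); the strategies below are pure noise by construction:

```r
set.seed(1)
# 100 backtests of pure-noise "strategies": expect ~5 slopes with p < 0.05 by chance
p_vals <- replicate(100, summary(lm(rnorm(240) ~ rnorm(240)))$coefficients[2, 4])
sum(p_vals < 0.05)

# Multiple-testing corrections shrink the spurious winners back
sum(p.adjust(p_vals, method = "bonferroni") < 0.05)   # family-wise error control
sum(p.adjust(p_vals, method = "fdr") < 0.05)          # false-discovery-rate control
```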
The two solutions on the slide are the gold standards:
Paper performance — publish the strategy on a public-tracking website before deploying it. Each subsequent month of tracked performance is a genuine out-of-sample observation; after a year of strong tracked performance, the strategy has built credibility.
Real-money performance — actually deploy the strategy with capital, ideally as a managed account or fund with public reporting. This is the standard professional investors require because it forecloses the easy escape hatches (you can’t selectively report only the winning periods).
For your project, the project report is your “paper performance” — by writing up the indicator, the backtest, and the result, you commit to the result. Don’t try ten indicators and report only the one that worked; report what you actually tested and let the marker see the honest picture.