Lecture 3: Assessing model accuracy & Ridge regression

Statistical learning · MSE · bias-variance · linear model selection · Ridge

Prof. Dr. Andre Guettler, Director of the Institute
Helmholtzstraße 22, Room 205
andre.guettler@uni-ulm.de
+49 731 50 31 030

Oliver Padmaperuma, Doctoral Candidate
Helmholtzstraße 22, Room 203
oliver.padmaperuma@uni-ulm.de
+49 731 50 31 036

3.1 Course objectives


Welcome to Finance Project — Asset Management

  • This is a project course: there is no central exam to register for. Sign up on the course Moodle page by 15 April 2026 so you receive announcements and the data link.
  • Submit the project by 30 June 2026 as a single zip — name pattern: Asset2026_surname1_surname2_surname3. Email it to oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de and your team-mates.
  • Ask questions during or right after each session — that is the preferred channel.
  • Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
  • Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
  • We also recommend the student advisory service.

Course Objective

Scope

We will:

  • Build an end-to-end empirical pipeline in R: load, explore, model, back-test
  • Cover the core ML toolbox for asset-management research: linear models, Ridge, Lasso, Elastic Net, cross-validation
  • Apply it to a non-traditional asset class: prediction markets
  • Develop your own indicator library and trading strategy in groups of three

We will NOT:

  • Drift into deep-learning or reinforcement-learning methods
  • Cover prediction markets in depth
  • Provide a “ready-to-fork” backtest — the demo code is intentionally basic

Approach

Part I — Foundations

  • L1: Motivation, organisation, backtesting fundamentals
  • L2: Hands-on R intro — RStudio, live coding, etc.
  • L3 + L4: Statistical learning — model accuracy, regularisation, resampling

Part II — Application

  • L5: Prediction-markets primer + the Polymarket dataset + assignment briefing
  • Project work in groups of three (≈ 7 weeks of self-organised work)
  • Final session (1 July): 20-minute presentations per team

Course at a glance (1/2)

Foundations · Week 1 · 15.04.2026
Course outline · Backtesting fundamentals

  • Course aim & organisation
  • Backtesting overview & case study
  • In-sample tests (Welch & Goyal 2008)
  • Out-of-sample (walk-forward, R²_OS)
  • Useful predictors & p-hacking

Introduction to R · Week 2 · 22.04.2026
RStudio · variables · vectors · data frames · live coding

  • Why R for empirical asset-management research
  • RStudio and the script editor
  • Variables, vectors, matrices, data frames, lists
  • Functions and loops
  • Data import and export

Assessing model accuracy & Ridge regression · Week 3 · 29.04.2026
Statistical learning · MSE · bias-variance · linear model selection · Ridge

  • Statistical learning: Y = f(X) + ε
  • Quality of fit and the train/test MSE distinction
  • Bias-variance trade-off and overfitting
  • OLS limits: prediction accuracy & interpretability
  • Ridge regression and the L2 penalty

Lasso, cross-validation & Elastic Net · Week 4 · 06.05.2026
Sparse regularisation · resampling for honest test error · choosing λ

  • Lasso: L1 penalty and exact-zero coefficients
  • Cross-validation: validation set, LOOCV, K-fold
  • Choosing the optimal λ for Lasso
  • OLS post-Lasso for cleaner coefficient inference
  • Elastic Net — combining Ridge and Lasso

Prediction markets, the Polymarket Quant Bench & your project · Week 5 · 13.05.2026
From Welch-Goyal to event-resolved binary contracts

  • Prediction markets — definition and Polymarket as the canonical venue
  • How prices form: liquidity, resolution, mechanics
  • The Polymarket Quant Bench dataset (HuggingFace): access and schema
  • First look at the data in R
  • Your project: indicator design, back-test, deliverables, R toolbox

Course at a glance (2/2)

Final presentations · Week 13 · 01.07.2026
Group presentations · Q&A · wrap-up

  • Presentation order and time budget
  • Q&A rules
  • Closing thoughts and feedback

Assignments / Exams

Project (Code + Report) · 50% of your grade · due 30 June 2026

Rmd code + knitr-rendered PDF report. Build a library of indicators over the Polymarket Quant Bench dataset (curated OHLCV bars on HuggingFace, derived from Jon Becker’s polymarket-data dump), derive trade signals, back-test, and write a critical reflection.

Groups of up to 3. Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-1-project-report_surname1_surname2_…

Final Presentation · 50% of your grade · 1 July 2026

20-minute group presentation in class on 1 July 2026; submit the slides as a PDF together with the project zip.

Groups of up to 3. Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-2-final-presentation_surname1_surname2_…

3.2 Recap from Lectures 1 & 2


Where we are

  • L1: backtesting fundamentals — IS vs OOS, \(R^2_{OS}\), useful-predictor checklist, p-hacking.
  • L2: R fundamentals — RStudio, vectors, data frames, functions, loops, import/export.
  • Today: how to assess how good a model really is, and meet the first regularised regression — Ridge.

3.3 Assessing model accuracy


What is statistical learning?

We observe \(Y\) and \(X = (X_1, \ldots, X_p)\): \(p\) predictors measured on \(n\) observations.

We believe a relationship exists (e.g. excess return of S&P 500 vs dividend yield):

\[Y = f(X) + \varepsilon\]

  • \(f\) — unknown function
  • \(\varepsilon\) — random error term

Statistical learning is all about how to estimate \(f\). In this class we use predictors \(X\) to forecast \(Y\).

Measuring quality of fit — MSE

A common measure of accuracy in regression is mean squared error:

\[ MSE \;=\; \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2 \]

where \(\hat y_i\) is the prediction for observation \(i\) in the training data.
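
In R, the training MSE is one line once you have fitted values. A minimal sketch on simulated data (the sine truth and noise level here are illustrative assumptions, not from the course scripts):

set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)   # Y = f(X) + eps with f = sin
fit <- lm(y ~ x)
mean((y - fitted(fit))^2)          # training MSE: (1/n) * sum of squared residuals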

A problem

  • Methods are designed to minimise MSE on training data (e.g., OLS picks the line that does so).
  • What we really care about is performance on new data — we call this test data.
  • There is no guarantee that the smallest training MSE delivers the smallest test MSE.

Training vs test MSE

  • The more flexible a method, the lower its training MSE: flexible methods can generate richer shapes for \(f\) than restrictive ones (e.g. linear regression).
  • But its test MSE can still be higher than that of a simple approach like linear regression; the sketch below makes this concrete.
  • Less flexible also means easier to interpret: there is a trade-off between flexibility and interpretability.
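
A minimal sketch of this gap, assuming simulated data and polynomial degree as the flexibility knob (stand-ins for the course setup, not taken from it):

set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)
train <- sample(n, n / 2)                      # random 50/50 split

for (d in c(1, 3, 9, 15)) {                    # increasing flexibility
  fit    <- lm(y ~ poly(x, d), subset = train)
  mse_tr <- mean((y[train]  - predict(fit, data.frame(x = x[train])))^2)
  mse_te <- mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2)
  cat(sprintf("degree %2d: train MSE %.3f | test MSE %.3f\n", d, mse_tr, mse_te))
}

Training MSE keeps falling with the degree; test MSE eventually rises again.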

Example I — splines on a noisy curve

  • Reproduce with StatLearning.R (splines, OLS, train/test MSE loop).
  • Black = truth, orange = OLS, blue = smoothing spline (less flexible), green = smoothing spline (more flexible).
  • Higher flexibility hugs the training data more closely; track training and test MSE separately (a simulated stand-in for the figure follows below).
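
A rough, simulated stand-in for that figure (the real script is StatLearning.R; the truth, noise level and df values below are assumptions):

set.seed(7)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.4)

plot(x, y, col = "grey")
curve(sin(x), add = TRUE, lwd = 2)                           # black: truth
abline(lm(y ~ x), col = "orange", lwd = 2)                   # orange: OLS
lines(smooth.spline(x, y, df = 5),  col = "blue",  lwd = 2)  # blue: less flexible spline
lines(smooth.spline(x, y, df = 25), col = "green", lwd = 2)  # green: more flexible spline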

Example II — train vs test MSE curve

  • Grey = training MSE: declines monotonically with flexibility.
  • Red = test MSE: U-shape — falls, then rises.
  • Vertical dashed line marks the minimum test MSE, i.e. the optimal flexibility; the loop sketched below traces such a curve.
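
The U-shape can be traced with a short loop; this sketch reuses the simulated setup above rather than the data in StatLearning.R:

set.seed(7)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.4)
train <- sample(n, n / 2)

dfs <- 2:30                                          # flexibility grid
mse <- sapply(dfs, function(df) {
  fit <- smooth.spline(x[train], y[train], df = df)
  c(train = mean((y[train]  - predict(fit, x[train])$y)^2),
    test  = mean((y[-train] - predict(fit, x[-train])$y)^2))
})
matplot(dfs, t(mse), type = "l", lty = 1, col = c("grey", "red"),
        xlab = "Flexibility (df)", ylab = "MSE")
abline(v = dfs[which.min(mse["test", ])], lty = 2)   # dashed line: minimum test MSE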

Bias-variance trade-off

The previous figure illustrates the trade-off that governs every choice of statistical learning method:

There are always two competing forces — bias and variance.

Bias of learning methods

  • Approximating a complicated real-life relationship by a simpler model introduces error called bias.
  • Linear regression assumes the relationship between \(Y\) and \(X\) is linear; in reality it is rarely exactly linear, so some bias is present.
  • The more flexible / complex a method, the less bias it generally has.

Variance of learning methods

  • Variance measures how much your estimate for \(f\) would change with a different training data set.
  • Generally, the more flexible a method, the more variance it has.

The trade-off — formula

For any given \(X = x_0\), the expected test MSE of a new response \(y_0\) at \(x_0\) decomposes as:

\[ E\bigl(y_0 - \hat f(x_0)\bigr)^2 \;=\; \mathrm{Bias}\bigl(\hat f(x_0)\bigr)^2 \;+\; \mathrm{Var}\bigl(\hat f(x_0)\bigr) \;+\; \underbrace{\sigma^2}_{\text{irreducible error}} \]

As complexity rises, bias falls and variance grows — but expected test MSE may go either way.
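
The decomposition can be checked by simulation: draw many training sets, refit, and compare the two sides of the formula at one point. A sketch with a hypothetical \(f\), \(x_0\) and noise level:

set.seed(1)
f <- function(x) sin(x); x0 <- 2; sigma <- 0.5
fhat <- replicate(2000, {                      # 2000 independent training sets
  x <- runif(50, 0, 10)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, 3)), data.frame(x = x0))
})
bias2 <- (mean(fhat) - f(x0))^2
vr    <- var(fhat)
c(decomposition = bias2 + vr + sigma^2,        # Bias^2 + Var + irreducible error
  direct = mean((f(x0) + rnorm(2000, sd = sigma) - fhat)^2))  # E(y0 - fhat(x0))^2

The two numbers agree up to simulation noise.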

Over- vs underfitting

  • Ideal (low bias, low variance): tight cluster on the bull’s-eye.
  • Overfitting (low bias, high variance): scattered around the centre.
  • Underfitting (high bias, low variance): tight cluster off-centre.
  • Worst (high bias, high variance): scattered and off-centre.

A fundamental picture

  • Training error: monotonically declines with complexity.
  • Test error: declines first (bias dominates), then rises (variance dominates).
  • More flexible / complicated is not always better — keep this picture in mind when choosing a learning method.

3.4 Linear model selection & regularisation


Starting point — OLS

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \]

  • \(\beta_0\) — intercept (mean of \(Y\) when all \(X\)’s are zero).
  • \(\beta_j\) — average increase in \(Y\) when \(X_j\) increases by 1, holding other \(X\)’s constant.
  • Closed form (matrix notation):

\[ \hat\beta = (X'X)^{-1} X' y \]

If you need to refresh OLS, read Chapter 3 of the textbook (ISLR).
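
The closed form is easy to verify against lm(); a minimal sketch with simulated data (names and dimensions are illustrative):

set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n, 2))     # intercept column + 2 predictors
y <- drop(X %*% c(1, 2, -3) + rnorm(n))

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ X[, -1])))        # identical to lm()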

Why might we improve on OLS?

We may be able to improve on OLS by replacing plain least-squares fitting with an alternative estimation procedure. Two reasons to consider alternatives:

  1. Prediction accuracy
  2. Model interpretability

1 · Prediction accuracy

  • OLS estimates have low bias and low variance when the true relationship is approximately linear and \(n \gg p\).
  • When \(n \approx p\), OLS has high variance: it tends to overfit and predicts poorly on unseen data.
  • When \(n < p\), OLS fails completely: there is no unique solution and the variance is infinite (a sketch follows below).
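
A quick simulated sketch of the \(n < p\) breakdown: lm() can only identify as many coefficients as the design has rank and returns NA for the rest.

set.seed(1)
n <- 20; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
fit <- lm(y ~ X)
sum(is.na(coef(fit)))   # coefficients lm() could not estimate: no unique solution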

2 · Model interpretability

  • With a large number of predictors, many often have little or no effect on \(Y\).
  • Leaving them in obscures the important variables.
  • Removing them (setting coefficients to zero) makes the model easier to interpret.
  • Simpler models also imply lower information costs and faster run times.

Solutions — three families

  1. Subset selection — identify a subset of predictors \(X\) believed to relate to \(Y\), then fit on that subset (best subset, stepwise — covered in ISLR §6.1).
  2. Shrinkage (Ridge and Lasso — our focus) — shrink coefficient estimates towards zero to reduce variance; under the Lasso penalty some coefficients become exactly zero, performing variable selection.
  3. Dimension reduction — e.g. principal-components regression (PCR).

Ridge regression — the equation

OLS minimises the residual sum of squares:

\[ \mathrm{RSS} \;=\; \sum_{i=1}^n \Bigl( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Bigr)^2 \]

Ridge regression adds a penalty on the coefficients:

\[ \sum_{i=1}^n \Bigl( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Bigr)^2 \;+\; \boxed{\lambda \sum_{j=1}^p \beta_j^2} \]

Ridge — what the penalty does

  • Tuning parameter \(\lambda \ge 0\).
  • The penalty shrinks large \(|\beta_j|\) towards zero.
  • The intercept is not penalised.
  • Shrinking the coefficients reduces their variance, which is why the constraint can improve test fit.
  • When \(\lambda = 0\), Ridge collapses back to OLS.

Manual calculation of betas

\[ \hat\beta^{\mathrm{ridge}} \;=\; (X'X + \lambda I)^{-1} X' y \]

  • Penalty term \(\lambda I\): \(\lambda\) times the \(p \times p\) identity matrix, so the dimensions match the coefficient vector \((\beta_1, \ldots, \beta_p)\).
  • If the predictors are centred (mean zero), \(\hat\beta_0 = \bar Y\), so the intercept need not appear in the matrix formula.

Live demo — Ridge by hand

library(ISLR);  library(glmnet)
Hitters <- na.omit(Hitters)                    # drop rows with missing Salary

# simplified to two predictors
x  <- as.matrix(data.frame(Hitters$AtBat, Hitters$Hits))
y  <- Hitters$Salary
xs <- scale(x, center = TRUE, scale = FALSE)   # centre predictors (no scaling)
n      <- nrow(x)
sd_y   <- sqrt(var(y) * (n - 1) / n)[1]        # population sd of y, used by glmnet internally
iden   <- diag(2)                              # p x p identity (p = 2)

# lambda = 0: should recover OLS
lam <- 0
ridge.mod <- glmnet(xs, y, alpha = 0,          # alpha = 0 selects Ridge
                    lambda = lam * sd_y / n,   # rescale lambda to glmnet's convention
                    standardize = FALSE, thresh = 1e-20)
ridge.man <- solve(t(xs) %*% xs + lam * iden) %*% t(xs) %*% y   # closed-form Ridge
beta_0    <- mean(y)                           # intercept when predictors are centred

cbind(coef(ridge.mod),
      coef(lm(y ~ xs)),
      c(beta_0, ridge.man))                    # all three columns match

  • We compute Ridge three ways at \(\lambda = 0\) (glmnet, base lm, and the closed-form formula) to verify they coincide.
  • glmnet’s lambda is on a different scale than the textbook formula; multiply by sd_y / n to align (see stats.stackexchange).
  • thresh = 1e-20 tightens glmnet’s convergence so the comparison is numerically tight.
  • Centring the predictors removes the need to include the intercept in the matrix formula.

Hitters data — coefficient paths vs λ

  • Reproduce with Ridge_figures.R (loop over 0–1000 λ values, plot standardised coefficients); a shortcut sketch using glmnet’s built-in plot follows after this list.
  • As \(\lambda\) increases, standardised coefficients shrink towards zero.
  • Bar at the bottom: flexibility decreases as \(\lambda\) grows.
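
glmnet can draw the same picture directly; a shortcut sketch on the full Hitters design (the course version, Ridge_figures.R, builds the loop by hand):

library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # design matrix without intercept column
y <- Hitters$Salary

ridge <- glmnet(x, y, alpha = 0)               # alpha = 0 selects the Ridge penalty
plot(ridge, xvar = "lambda", label = TRUE)     # coefficient paths as lambda grows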

Why shrinking towards zero helps

  • OLS estimates have low bias but can be highly variable, especially when \(n \approx p\).
  • The penalty makes Ridge estimates biased, but substantially reduces variance.
  • Net effect: a bias / variance trade-off that often improves test MSE.

Ridge bias / variance trade-off

  • Bias² (black) rises with \(\lambda\).
  • Variance (green) falls with \(\lambda\).
  • Test MSE (purple) is U-shaped with a clear minimum — pick the \(\lambda\) that minimises it.
  • Ridge wins most when OLS estimates have high variance.

Computational advantages of Ridge

  • For large \(p\), best-subset selection would search through \(2^p\) models — combinatorially expensive.
  • With Ridge, for any given \(\lambda\), fit one model — the computations are very simple.
  • Ridge even works when \(p > n\), where OLS fails completely (a sketch follows below).
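
A sketch of that last point, reusing the simulated \(n < p\) setup from the OLS slide: glmnet fits a Ridge model without complaint.

library(glmnet)
set.seed(1)
n <- 20; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1] - 2 * X[, 2] + rnorm(n))

fit <- glmnet(X, y, alpha = 0, lambda = 1)     # one fit per lambda: cheap
length(coef(fit))                              # all p + 1 coefficients are estimated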

3.M Conclusion of Lecture 3


Course at a glance

The week-by-week overview is the same as in Section 3.1; see “Course at a glance (1/2)” and “(2/2)” above.

Further reading

  • James et al. (2021) — Chapter 2 (statistical learning), Chapter 6 (linear model selection & regularisation).
  • Welch and Goyal (2008) — bias / variance arguments mirror the IS-vs-OOS results we saw in Lecture 1.

Prepare before next lecture

  1. Run StatLearning.R locally — confirm you can reproduce Figure 2.9.
  2. Run Ridge_comparison.R and Ridge_figures.R — verify all three Ridge implementations agree at \(\lambda = 0\).
  3. Read ISLR §2.2 (assessing model accuracy) and §6.2 (Ridge & Lasso).

See you next time

Reminder

  • Lecture 4 (6 May 2026): Lasso, Elastic Net, cross-validation — selecting the optimal \(\lambda\) honestly.

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. New York, NY: Springer. https://www.statlearning.com/.
Scheuch, Christoph, Stefan Voigt, and Patrick Weiss. 2023. Tidy Finance with R. Chapman & Hall/CRC. https://www.tidy-finance.org/r/.
Welch, Ivo, and Amit Goyal. 2008. “A Comprehensive Look at the Empirical Performance of Equity Premium Prediction.” Review of Financial Studies 21 (4): 1455–1508. https://doi.org/10.1093/rfs/hhm014.