Lecture 3: Assessing model accuracy & Ridge regression
Statistical learning · MSE · bias-variance · linear model selection · Ridge
3.1 Course objectives
- 3.1 Course objectives
- 3.2 Recap from Lectures 1 & 2
- 3.3 Assessing model accuracy
- 3.4 Linear model selection & regularisation
- 3.M Conclusion of Lecture 3
Welcome to Finance Project — Asset Management
- This is a project course: there is no central exam to register for. Sign up on the course Moodle page by 15 April 2026 so you receive announcements and the data link.
- Submit the project by 30 June 2026 as a single zip — name pattern:
Asset2026_surname1_surname2_surname3. Email it to oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de and your team-mates.
- Ask questions during or right after each session — that is the preferred channel.
- Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
- Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
- We also recommend the student advisory service.
Course Objective
Scope
We will:
- Build an end-to-end empirical pipeline in R: load, explore, model, back-test
- Cover the core ML toolbox for asset-management research: linear models, Ridge, Lasso, Elastic Net, cross-validation
- Apply it to a non-traditional asset class: prediction markets
- Develop your own indicator library and trading strategy in groups of three
We will NOT:
- Drift into deep-learning or reinforcement-learning methods
- Cover prediction markets in depth
- Provide a “ready-to-fork” backtest — the demo code is intentionally basic
Approach
Part I — Foundations
- L1: Motivation, organisation, backtesting fundamentals
- L2: Hands-on R intro — RStudio, live coding, etc.
- L3 + L4: Statistical learning — model accuracy, regularisation, resampling
Part II — Application
- L5: Prediction-markets primer + the Polymarket dataset + assignment briefing
- Project work in groups of three (≈ 7 weeks of self-organised work)
- Final session (1 July): 20-minute presentations per team
Course at a glance (1/2)
Foundations
Course outline · Backtesting fundamentals
- Course aim & organisation
- Backtesting overview & case study
- In-sample tests (Welch & Goyal 2008)
- Out-of-sample (walk-forward, R²_OS)
- Useful predictors & p-hacking
Introduction to R
RStudio · variables · vectors · data frames · live coding
- Why R for empirical asset-management research
- RStudio and the script editor
- Variables, vectors, matrices, data frames, lists
- Functions and loops
- Data import and export
Assessing model accuracy & Ridge regression
Statistical learning · MSE · bias-variance · linear model selection · Ridge
- Statistical learning: Y = f(X) + ε
- Quality of fit and the train/test MSE distinction
- Bias-variance trade-off and overfitting
- OLS limits: prediction accuracy & interpretability
- Ridge regression and the L2 penalty
Lasso, cross-validation & Elastic Net
Sparse regularisation · resampling for honest test error · choosing λ
- Lasso: L1 penalty and exact-zero coefficients
- Cross-validation: validation set, LOOCV, K-fold
- Choosing the optimal λ for Lasso
- OLS post-Lasso for cleaner coefficient inference
- Elastic Net — combining Ridge and Lasso
Prediction markets, the Polymarket Quant Bench & your project
From Welch-Goyal to event-resolved binary contracts
- Prediction markets — definition and Polymarket as the canonical venue
- How prices form: liquidity, resolution, mechanics
- The Polymarket Quant Bench dataset (HuggingFace): access and schema
- First look at the data in R
- Your project: indicator design, back-test, deliverables, R toolbox
Course at a glance (2/2)
Final presentations
Group presentations · Q&A · wrap-up
- Presentation order and time budget
- Q&A rules
- Closing thoughts and feedback
Assignments / Exams
Project (Code + Report) 50% of your grade
Rmd code + knitr-rendered PDF report. Build a library of indicators over the Polymarket Quant Bench dataset (curated OHLCV bars on HuggingFace, derived from Jon Becker’s polymarket-data dump), derive trade signals, back-test, and write a critical reflection.
Group of up to 3.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-1-project-report_surname1_surname2_…
30 June 2026
Final Presentation 50% of your grade
20-minute group presentation in class on 1 July 2026; submit slides as PDF together with the project zip.
Group of up to 3.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-2-final-presentation_surname1_surname2_…
1 July 2026
3.2 Recap from Lectures 1 & 2
- 3.1 Course objectives
- 3.2 Recap from Lectures 1 & 2
- 3.3 Assessing model accuracy
- 3.4 Linear model selection & regularisation
- 3.M Conclusion of Lecture 3
Where we are
- L1: backtesting fundamentals — IS vs OOS, \(R^2_{OS}\), useful-predictor checklist, p-hacking.
- L2: R fundamentals — RStudio, vectors, data frames, functions, loops, import/export.
- Today: how to assess how good a model really is, and meet the first regularised regression — Ridge.
Notes
Today is the conceptual core of the course. Two ideas to internalise:
- The training MSE is a misleading measure of model quality. Models that minimise training MSE often perform poorly on new data. The bias-variance decomposition (covered in the third part of today’s slides) makes this precise.
- Ridge regression introduces a penalty for large coefficients that trades a small amount of bias for a meaningful reduction in variance. It is the simplest example of regularisation — the principle that constraining a model’s complexity often improves out-of-sample performance.
Both ideas extend directly to lasso (Lecture 4) and to almost any modern machine-learning method (gradient-boosted trees, neural networks). The lecture is long but the concepts compound — getting the bias-variance picture in your head is the most valuable thing you’ll take away from the course.
3.3 Assessing model accuracy
- 3.1 Course objectives
- 3.2 Recap from Lectures 1 & 2
- 3.3 Assessing model accuracy
- 3.4 Linear model selection & regularisation
- 3.M Conclusion of Lecture 3
What is statistical learning?
We observe \(Y\) and \(X = (X_1, \ldots, X_p)\) for \(p\) predictors and \(n\) observations \(i = 1, \ldots, n\).
We believe a relationship exists (e.g. excess return of S&P 500 vs dividend yield):
\[Y = f(X) + \varepsilon\]
- \(f\) — unknown function
- \(\varepsilon\) — random error term
Statistical learning is all about how to estimate \(f\). In this class we use predictors \(X\) to forecast \(Y\).
Notes
The decomposition \(Y = f(X) + \varepsilon\) is the foundational equation of supervised learning. It says any outcome we want to predict can be split into two parts:
- A systematic component \(f(X)\) — the part of \(Y\) that is determined by the observed predictors. This is what statistical learning estimates.
- A noise component \(\varepsilon\) — the part of \(Y\) that is genuinely unpredictable from \(X\), even with the best possible model. This is the irreducible error.
The distinction matters because it sets a hard ceiling on prediction accuracy. No matter how clever your method, you cannot reduce error below the variance of \(\varepsilon\). In equity-premium prediction, \(\varepsilon\) is huge (returns are dominated by noise that doesn’t depend on observable predictors), which is why even the best predictors deliver tiny R²s.
The course’s project goal — predicting prediction-market dynamics — has the same structure: there is some systematic relationship between observable indicators and future market behaviour (\(f\)), buried in a lot of noise (\(\varepsilon\)). Your indicator’s job is to recover as much of \(f\) as possible. JWHT (James et al. 2021) chapter 2.1 is the classical exposition.
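A minimal R sketch (not from the course scripts; the true f and noise level below are invented) makes the ceiling concrete: even the true \(f\) cannot push test MSE below \(\mathrm{Var}(\varepsilon)\).
set.seed(1)
n   <- 500
x   <- runif(n, 0, 10)
f   <- function(x) 0.5 * sin(x) + 0.05 * x   # assumed "true" f, for illustration only
eps <- rnorm(n, sd = 1)                      # irreducible noise, Var = 1
y   <- f(x) + eps
mean((y - f(x))^2)                           # ~1: even perfect knowledge of f hits the noise floor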
Measuring quality of fit — MSE
A common measure of accuracy in regression is mean squared error:
\[ MSE \;=\; \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2 \]
where \(\hat y_i\) is the prediction for observation \(i\) in the training data.
Notes
Mean squared error is the workhorse loss function for regression problems. It measures the average squared distance between predictions and actual values. Squaring (rather than taking the absolute value) does two things: (a) penalises large errors more heavily than small ones (fitting an “important to get right” criterion), (b) makes the analytical math tractable (the derivative is linear, leading to closed-form solutions for OLS).
Two related metrics you’ll see:
- RMSE — root mean squared error, \(\sqrt{\text{MSE}}\). Same units as \(Y\), so easier to interpret. An RMSE of 5 on a return series in percentage points means typical errors are about 5 percentage points.
- MAE — mean absolute error. Less sensitive to outliers than MSE; some applications prefer it.
For Ridge and most classical methods, MSE is the natural target. Forecasting research occasionally uses out-of-sample R² (Lecture 1) which is a normalised MSE comparison against a benchmark — same underlying loss function, different normalisation.
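As a quick illustration (the numbers are made up, not course data), each metric is one line of R:
y     <- c(2.0, -1.5, 0.7, 3.2)    # actual values
y_hat <- c(1.6, -0.9, 1.0, 2.5)    # model predictions
mse   <- mean((y - y_hat)^2)
rmse  <- sqrt(mse)                 # same units as y
mae   <- mean(abs(y - y_hat))      # less sensitive to outliers
c(MSE = mse, RMSE = rmse, MAE = mae)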
A problem
- Methods are designed to minimise MSE on training data (e.g., OLS picks the line that does so).
- What we really care about is performance on new data — we call this test data.
- There is no guarantee that the smallest training MSE delivers the smallest test MSE.
Notes
This is the deepest insight of statistical learning: optimising a model for the data you trained on does not optimise its performance on data you haven’t seen yet. A flexible enough method can drive training MSE arbitrarily close to zero (just memorise the training labels) while having atrocious test MSE (the memorised labels don’t apply to new data).
The disconnect is universal — it shows up in linear regression with too many predictors, in polynomial fits with too many degrees, in random forests with too many splits, and in deep learning with too many parameters. The solution is also universal: separate the data into training and test sets and evaluate on data the model never saw during fitting.
For your project: when you tune your indicator, never use the same data to both fit the indicator and evaluate it. Either hold out a test sub-period, or use cross-validation (Lecture 4 covers it formally), or use the walk-forward backtest from Lecture 1. Reporting backtest results computed on data that was used to design the indicator is the most common pitfall in undergraduate / master’s quant projects.
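A minimal sketch of that discipline, using an invented data-generating process and a simple 70/30 hold-out split (cross-validation and the walk-forward backtest are the more rigorous versions):
set.seed(42)
n   <- 200
x   <- rnorm(n)
y   <- 1 + 2 * x + rnorm(n)                  # invented DGP for illustration
dat <- data.frame(x = x, y = y)
idx   <- sample(n, size = 0.7 * n)           # 70 % of rows for training
train <- dat[idx, ]
test  <- dat[-idx, ]
fit <- lm(y ~ x, data = train)               # fit on training data only
mse_train <- mean((train$y - predict(fit, train))^2)
mse_test  <- mean((test$y  - predict(fit, test))^2)
c(train = mse_train, test = mse_test)        # report the test number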
Training vs test MSE
- The more flexible a method, the lower its training MSE — flexible methods can generate richer shapes for \(f\) than restrictive ones (e.g. linear regression).
- But test MSE may rise for a more flexible method than for a simple approach like linear regression.
- Less flexible ⇒ easier to interpret. Trade-off: flexibility vs interpretability.
Notes
Flexibility is loosely the model’s capacity to fit complex shapes. A linear regression with one predictor is very inflexible (the relationship is forced to be a straight line). A polynomial regression of degree 10 is highly flexible (it can fit any wiggly curve). A neural network with millions of parameters is even more flexible.
The empirical regularity from this and the next slide:
- Training MSE always decreases with flexibility. More flexibility means the model can hug the training data more tightly.
- Test MSE has a U-shape — it decreases initially (the model captures real signal) then rises (the model starts capturing noise that doesn’t generalise).
The “sweet spot” — the flexibility level that minimises test MSE — is what you actually want, but you can’t see test MSE during training (by definition). The standard tools to estimate it from training data alone are cross-validation (Lecture 4) and regularisation (Ridge/Lasso, today and Lecture 4).
The interpretability trade-off is also real: a regression with 5 coefficients is publishable in a paper because you can describe each coefficient’s economic meaning. A random forest with 100 trees is uninterpretable in the same way. For most empirical-finance applications, interpretable + slightly worse is preferred to uninterpretable + slightly better — because reviewers and readers want to know why the model makes the predictions it makes.
Example I — splines on a noisy curve
- Reproduce with StatLearning.R (splines, OLS, train/test MSE loop).
- Black = truth, orange = OLS, blue = smoothing spline (less flexible), green = smoothing spline (more flexible).
- Higher flexibility hugs the data closer — but track training vs test MSE separately.
Notes
JWHT Figure 2.9 (James et al. 2021) is the canonical visual demonstration of overfitting. The setup: data generated from a known smooth black curve plus noise; three models fit:
- Orange (OLS) — too inflexible. The straight line misses the curvature in the true relationship.
- Blue (low-flex spline) — about right. Captures the broad curve without chasing local noise.
- Green (high-flex spline) — too flexible. Wiggles to pass through almost every point, including the noise.
A casual eye would say green is the “best fit” — it’s closest to every training observation. But on new data drawn from the same process, the green spline’s wiggles make terrible predictions. The blue spline’s smoother estimate generalises much better.
Because we know the truth in this synthetic example, we can compute test MSE by drawing fresh data — the next slide does exactly that.
Example II — train vs test MSE curve
- Grey = training MSE: declines monotonically with flexibility.
- Red = test MSE: U-shape — falls, then rises.
- Vertical dashed line marks the minimum test MSE — the optimal flexibility.
Notes
This figure is the empirical demonstration of the U-shape claim. As you increase model flexibility from very low (left) to very high (right), training MSE drops monotonically — every additional degree of freedom lets the model fit the training points more tightly. Test MSE drops too, but only up to a point; past the optimum, every additional degree of freedom buys you noise-fitting that hurts on new data.
The optimal flexibility (the dashed line in JWHT Fig. 2.9 right panel, around 7 degrees of freedom for the example) is exactly the flexibility level where the model has captured the systematic signal but not yet started chasing noise. To find it on real data — where you don’t have the luxury of drawing fresh test data — you use cross-validation (Lecture 4).
The vertical gap between training MSE (always low) and test MSE (U-shaped) widens as flexibility grows. That gap is essentially the variance penalty for over-flexibility. Ridge regression, today’s main technique, reduces that gap by penalising large coefficients.
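The sketch below is not the course's StatLearning.R, but under an assumed sine-curve truth it reproduces the same qualitative picture: training MSE falls monotonically with polynomial degree while test MSE traces the U-shape.
set.seed(7)
f    <- function(x) sin(1.5 * x)             # assumed smooth truth
x_tr <- runif(100, 0, 6); y_tr <- f(x_tr) + rnorm(100, sd = 0.4)
x_te <- runif(100, 0, 6); y_te <- f(x_te) + rnorm(100, sd = 0.4)
degs <- 1:15                                 # flexibility = polynomial degree
mse  <- t(sapply(degs, function(d) {
  fit <- lm(y_tr ~ poly(x_tr, d))
  c(train = mean((y_tr - fitted(fit))^2),
    test  = mean((y_te - predict(fit, newdata = data.frame(x_tr = x_te)))^2))
}))
matplot(degs, mse, type = "l", lty = 1,
        xlab = "polynomial degree", ylab = "MSE")  # training declines, test is U-shaped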
Bias-variance trade-off
The previous figure illustrates the trade-off that governs every choice of statistical learning method:
There are always two competing forces — bias and variance.
Notes
The U-shape is not a contingent feature of the spline example — it’s a consequence of two competing forces that are always present whenever a model is fit to noisy data. The next two slides define each force; the slide after that gives the formal decomposition.
Bias of learning methods
- Modelling complicated real-life problems may induce error called bias.
- Linear regression assumes the relationship between \(Y\) and \(X\) is linear; in reality it is rarely exactly linear, so some bias is present.
- The more flexible / complex a method, the less bias it generally has.
Notes
Bias is the error introduced by the model’s structural assumptions. If you fit a straight line to a curve, no matter how much data you collect, your fitted line will systematically miss the curvature — that systematic miss is bias. Linear regression has high bias whenever the true relationship has substantial nonlinearity.
The relationship between flexibility and bias: more flexible methods can represent more shapes, so they have less structural limitation and consequently less bias. A degree-10 polynomial has lower bias than a straight line because it can represent quadratic, cubic, etc., relationships exactly. A neural network has very low bias because it can approximate essentially any continuous function.
For your project, “bias” shows up as: choosing a linear regression when the true indicator-vs-outcome relationship is, say, threshold-dependent. The fitted line averages across the threshold and loses the structure that actually matters.
Variance of learning methods
- Variance measures how much your estimate for \(f\) would change with a different training data set.
- Generally, the more flexible a method, the more variance it has.
Notes
Variance is the sensitivity of the fitted model to the specific training data. A method with high variance produces wildly different fits when trained on slightly different data; a method with low variance produces nearly the same fit regardless of which sample you give it.
A flexible model has many parameters, each of which is fit to the noise in the training set as well as the signal. Different training sets have different noise, so the parameters move around — high variance. An inflexible model has few parameters and ignores most of the noise, producing stable estimates — low variance.
For your project: variance is what makes a “great” backtest on one historical sub-period turn into a disappointing live performance. The model’s parameters were tuned to the specific noise of the training period; on new data, that tuning isn’t useful and may actively hurt.
The trade-off — formula
For any given \(X = x_0\), the expected test MSE on a new \(Y\) at \(x_0\) is:
\[ \text{Expected test MSE} \;=\; E\!\left[\bigl(Y - \hat f(x_0)\bigr)^2\right] \;=\; \mathrm{Bias}\bigl(\hat f(x_0)\bigr)^2 + \mathrm{Var}\bigl(\hat f(x_0)\bigr) + \underbrace{\sigma^2}_{\text{Irreducible Error}} \]
As complexity rises, bias falls and variance grows — but expected test MSE may go either way.
Notes
The decomposition is the precise mathematical statement of the trade-off:
- \(\text{Bias}^2\) — the systematic offset between the average fit and the true function. Decreases with model flexibility.
- \(\text{Var}\) — how much the fit jumps around when trained on different samples. Increases with model flexibility.
- \(\sigma^2\) (irreducible error) — the variance of \(\varepsilon\), fundamentally unpredictable. Sets a floor on test MSE that no method can beat.
The total expected test MSE is the sum. Reducing bias requires more flexibility; reducing variance requires less flexibility. The optimal model balances them — total MSE is minimised at the flexibility level where the marginal reduction in bias equals the marginal increase in variance.
The irreducible-error floor is important to remember when interpreting your project’s backtest. If the OOS R² of your indicator is 2 %, that’s not necessarily a bad model — financial returns are dominated by noise, and 98 % irreducible error is normal. The right comparison is to the no-information benchmark (zero R²_OS) and to the best-known predictors in the literature, not to a notional 100 % R².
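A small Monte-Carlo sketch (invented truth and noise level, not course material) shows how each term can be estimated by refitting the same model on many simulated training sets and predicting at a fixed \(x_0\):
set.seed(123)
f     <- function(x) sin(2 * x)              # assumed truth
sigma <- 0.5                                 # sd of the irreducible noise
x0    <- 1.0
preds <- replicate(2000, {
  x   <- runif(50, 0, 3)
  y   <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)                           # deliberately inflexible: a straight line
  predict(fit, newdata = data.frame(x = x0))
})
bias2 <- (mean(preds) - f(x0))^2             # systematic offset of the average fit
varf  <- var(preds)                          # spread of the fit across training sets
c(bias2 = bias2, variance = varf, irreducible = sigma^2,
  expected_test_mse = bias2 + varf + sigma^2)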
Over- vs underfitting
- Ideal (low bias, low variance): tight cluster on the bull’s-eye.
- Overfitting (low bias, high variance): scattered around the centre.
- Underfitting (high bias, low variance): tight cluster off-centre.
- Worst (high bias, high variance): scattered and off-centre.
Notes
The bull’s-eye visualisation makes the bias / variance distinction concrete. Imagine running the fitting procedure many times on different samples and plotting each fitted prediction as a dart on the target. The true value is the bull’s-eye:
- Low bias, low variance — darts cluster on the centre. Ideal.
- Low bias, high variance — darts scattered around the centre. Average is right but any single fit is unreliable. Overfitting.
- High bias, low variance — darts cluster off-centre. Reliable but systematically wrong. Underfitting.
- High bias, high variance — darts scattered and off-centre. Worst of both worlds.
For a real project, you only get one shot at the data — but the bull’s-eye still describes the population from which your one shot was drawn. Choosing between methods is choosing between bull’s-eye targets.
A fundamental picture
- Training error: monotonically declines with complexity.
- Test error: declines first (bias dominates), then rises (variance dominates).
- More flexible / complicated is not always better — keep this picture in mind when choosing a learning method.
Notes
This is the picture to internalise from the entire lecture. Whenever you face a model-selection choice — degree of polynomial, depth of decision tree, number of features in regression, regularisation strength — ask:
- Where am I on the complexity axis?
- Is training error still close to zero, suggesting I’m in the overfitting region?
- What method would let me move complexity slightly down or up to find the test-error minimum?
The training-error curve always slopes down; the test-error curve has a U-shape. Your job is to find the dashed-vertical-line spot where test error is minimised. Cross-validation (Lecture 4) is the principled way to estimate the test-error curve from training data alone. Regularisation (Ridge today, Lasso next week) is the principled way to move along the curve toward the optimum.
3.4 Linear model selection & regularisation
- 3.1 Course objectives
- 3.2 Recap from Lectures 1 & 2
- 3.3 Assessing model accuracy
- 3.4 Linear model selection & regularisation
- 3.M Conclusion of Lecture 3
Starting point — OLS
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \]
- \(\beta_0\) — intercept (mean of \(Y\) when all \(X\)’s are zero).
- \(\beta_j\) — average increase in \(Y\) when \(X_j\) increases by 1, holding other \(X\)’s constant.
- Closed form (matrix notation):
\[ \beta = (X'X)^{-1} X' y \]
If you need to refresh OLS, read Chapter 3 of the textbook (ISLR).
Notes
OLS is the standard linear regression — the slopes are chosen to minimise the residual sum of squares, which in matrix form has the closed-form solution \(\beta = (X'X)^{-1}X'y\). JWHT chapter 3 (James et al. 2021) derives this; if your linear-algebra refresher has lapsed, work through that chapter before Lecture 4.
Two technical reminders:
- The closed form requires \(X'X\) to be invertible. When \(p \geq n\) or when predictors are perfectly collinear, \(X'X\) is singular and OLS has no unique solution. Real-world high-dimensional data (p large, n small) hits this constantly — Ridge’s \(\lambda I\) penalty term solves it directly by ensuring \(X'X + \lambda I\) is always invertible.
- OLS is BLUE (Best Linear Unbiased Estimator) under the Gauss-Markov assumptions: among linear unbiased estimators, OLS has the lowest variance. Ridge gives up unbiasedness in exchange for further variance reduction — that’s the central trade-off to come.
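A quick sanity check of the closed form on simulated data (nothing course-specific) confirms that \((X'X)^{-1}X'y\) matches lm():
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))            # intercept column plus two predictors
beta_true <- c(0.5, 2, -1)
y <- drop(X %*% beta_true) + rnorm(n)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^{-1} X'y
round(cbind(closed_form = as.vector(beta_hat),
            lm_fit      = coef(lm(y ~ X[, 2] + X[, 3]))), 4)  # identical up to numerical precision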
Why might we improve on OLS?
We want to improve OLS by replacing least-squares fitting with an alternative procedure. Two reasons to consider alternatives:
- Prediction accuracy
- Model interpretability
Notes
The next two slides develop each motivation — prediction accuracy is about reducing variance (variance is what makes OLS perform badly when \(p\) is large), interpretability is about identifying which predictors matter. Both are concrete, compelling reasons to consider Ridge / Lasso / similar techniques.
For the project: prediction accuracy is the primary driver — your indicator’s value comes from how well it predicts future market behaviour, and Ridge / Lasso typically deliver better OOS R² than OLS when you have many candidate predictors. Interpretability matters too: Lasso’s tendency to set some coefficients to exactly zero (next lecture) gives you a built-in feature-selection mechanism that produces a more publishable indicator.
1 · Prediction accuracy
- OLS estimates have low bias and low variability when the relationship between \(Y\) and \(X\) is approximately linear and \(n \gg p\).
- When \(n \approx p\), OLS has high variance — possible overfitting and poor estimates on unseen data.
- When \(n < p\), OLS fails completely: no unique solution; variance is infinite.
Notes
Three regimes mapped to the bias-variance picture:
- \(n \gg p\) (lots of data, few predictors) — OLS is the right answer. Low bias (linear model assumed), low variance (each parameter estimated from many observations). Reaching for Ridge buys little.
- \(n \approx p\) — the danger zone. OLS still works mathematically but the variance of each \(\beta\) estimate is large. Ridge’s shrinkage delivers meaningful variance reduction.
- \(n < p\) (more predictors than observations) — OLS has no unique solution because \(X'X\) is singular. Ridge is one of the few tools that handles this gracefully (\(X'X + \lambda I\) is always invertible for \(\lambda > 0\)). Modern empirical-finance work often hits this regime when using many candidate predictors on relatively short return histories.
Tidy Finance with R (Scheuch, Voigt, and Weiss 2023) has worked examples of OLS-vs-Ridge comparisons in the equity-premium-prediction context.
2 · Model interpretability
- With a large number of predictors, many often have little or no effect on \(Y\).
- Leaving them in obscures the important variables.
- Removing them (setting coefficients to zero) makes the model easier to interpret.
- Simpler models also imply lower information costs and faster run times.
Notes
A model with 30 predictors of which 3 are economically meaningful has at best the same prediction accuracy as a model with just those 3, but is much harder to interpret. The 27 noise predictors have non-zero estimated coefficients (finite-sample estimates of a true zero coefficient are essentially never exactly zero) and visually compete for the reader’s attention.
The “drop the irrelevant predictors” goal is what motivates Lasso (next lecture) — Lasso’s L1 penalty has the property of forcing some coefficients to exactly zero, automating the variable-selection step. Ridge’s L2 penalty shrinks but doesn’t zero out, so it improves prediction but doesn’t help with interpretability in the same way.
For your project, both effects matter: Ridge / Lasso typically deliver better OOS predictions than OLS and a Lasso-fit model with 5 active coefficients is easier to write up than an OLS model with 30 estimated coefficients.
Solutions — three families
- Subset selection — identify a subset of predictors \(X\) believed to relate to \(Y\), then fit on that subset (best subset, stepwise — covered in ISLR §6.1).
- Shrinkage (Ridge and Lasso — our focus) — shrink coefficient estimates towards zero to reduce variance; some may shrink to exactly zero, performing variable selection.
- Dimension reduction — e.g. principal-components regression (PCR).
Notes
Subset selection — exhaustive search over all \(2^p\) subsets of predictors. Best subset is conceptually clean but computationally infeasible for \(p > 20\). Stepwise (forward / backward) is a greedy approximation that’s tractable but can miss the true best subset.
Shrinkage — fit on all predictors but penalise large coefficients. Ridge (today) and Lasso (next lecture) are the two canonical shrinkage methods. Computationally cheap (one optimisation per \(\lambda\) value) and works well even when \(p > n\). This is what we focus on for the project.
Dimension reduction — replace the \(p\) predictors with a smaller number of constructed features (e.g., principal components of \(X\)), then regress on the constructed features. PCR is the most common version. Useful when the predictors have highly correlated structure that compresses well; less common in finance because the constructed components have no economic interpretation.
For the project, Ridge and Lasso are the two methods you should know how to fit, tune, and report. Subset selection is intellectually interesting but slow; PCR is occasionally useful but harder to interpret.
Ridge regression — the equation
OLS minimises the residual sum of squares:
\[ \mathrm{RSS} \;=\; \sum_{i=1}^n \Bigl( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Bigr)^2 \]
Ridge regression adds a penalty on the coefficients:
\[ \sum_{i=1}^n \Bigl( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Bigr)^2 \;+\; \boxed{\lambda \sum_{j=1}^p \beta_j^2} \]
Notes
The Ridge objective adds an L2 penalty (\(\sum_j \beta_j^2\)) to the OLS RSS. The penalty term is small when the \(\beta\)s are small and large when any \(\beta\) is large, so the optimiser, in choosing \(\beta\) to minimise the total, has an incentive to keep the \(\beta\)s small.
The trade-off parameter is \(\lambda \geq 0\):
- \(\lambda = 0\) — no penalty; Ridge degenerates to OLS.
- \(\lambda \to \infty\) — infinite penalty; all \(\beta\)s shrink to zero and the model becomes the intercept-only “predict-the-mean” benchmark.
- Intermediate \(\lambda\) — a balance between fitting the data and keeping the \(\beta\)s small.
Choosing \(\lambda\) is a tuning-parameter problem — the next lecture covers cross-validation as the principled way to pick the optimal \(\lambda\) from data. For now, just understand that \(\lambda\) controls model complexity and slides you along the bias-variance trade-off curve.
The L2 form specifically (squared \(\beta\)s, not absolute values) is what makes Ridge “Ridge”. The L1 alternative (\(\sum_j |\beta_j|\)) is Lasso, covered next lecture; it has the additional property of zeroing some \(\beta\)s entirely.
Ridge — what the penalty does
- Tuning parameter \(\lambda > 0\).
- The penalty shrinks large \(|\beta|\) towards zero.
- The intercept is not penalised.
- The constraint should improve out-of-sample fit because shrinking coefficients reduces their variance.
- When \(\lambda = 0\), Ridge collapses back to OLS.
Notes
Two implementation details worth noting:
- The intercept is not penalised. The intercept is the predicted value when all predictors are zero — shrinking it to zero would force the regression line through the origin, which has nothing to do with controlling overfitting. Standard implementations (glmnet in R, Ridge in scikit-learn, MASS::lm.ridge) handle this internally; you don’t need to do anything special.
- Standardise predictors before fitting. Ridge penalises \(\sum_j \beta_j^2\), which is sensitive to the scale of each \(X_j\). A predictor measured in millions has tiny \(\beta\)s; a predictor measured in unit fractions has large \(\beta\)s — purely because of measurement scale, not economic importance. Scaling all predictors to unit variance makes the penalty fair across them. Again, glmnet does this by default (standardize = TRUE).
The “shrinking reduces variance” claim is the empirical content of Ridge: by constraining each \(\\beta\) to be small, you reduce the influence of any single noisy observation on the parameter estimate. The cost is bias — small \(\\beta\)s underestimate the true effect of strongly-relevant predictors. The bias-variance trade-off picture from earlier in the lecture explains exactly when this trade is favourable.
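A toy illustration of the scaling point (invented predictors with deliberately mismatched units, not course data): with glmnet's default standardize = TRUE the penalty treats both predictors on an equal footing, while standardize = FALSE lets measurement scale drive the shrinkage.
library(glmnet)
set.seed(2)
n  <- 200
x1 <- rnorm(n, sd = 1e6)                      # measured in "millions"
x2 <- rnorm(n, sd = 1e-3)                     # measured in tiny units
y  <- 1e-6 * x1 + 1000 * x2 + rnorm(n)        # both contribute equally to y
X  <- cbind(x1, x2)
fit_std   <- glmnet(X, y, alpha = 0, lambda = 0.1)                       # standardises internally
fit_nostd <- glmnet(X, y, alpha = 0, lambda = 0.1, standardize = FALSE)  # penalises raw-scale betas
cbind(standardised = as.vector(coef(fit_std)),
      raw_scale    = as.vector(coef(fit_nostd)))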
Manual calculation of betas
\[ \beta \;=\; (X'X + \lambda I)^{-1} X' y \]
- Penalty term: \(\lambda I\) — \(\lambda\) times the identity matrix, so dimensions match the \(\beta\) vector (\(\beta_1, \ldots, \beta_p\)).
- If predictors are centered (mean zero), \(\beta_0 = \bar Y\) — no need to include the intercept in the equation.
Notes
This is the closed-form Ridge solution — exactly the OLS formula with \(\lambda I\) added inside the inverse. The \(\lambda I\) term has two effects:
- It makes \(X'X + \lambda I\) invertible even when \(X'X\) is singular (which happens whenever \(p > n\) or when predictors are perfectly collinear). This is why Ridge works in regimes where OLS fails.
- It shrinks the resulting \(\beta\)s towards zero. Larger \(\lambda\) means a “stronger” identity matrix is added, which dominates the \(X'X\) structure and pushes coefficients down.
The closed form is computationally cheap (just a matrix inverse, \(O(p^3)\) for a \(p\)-dimensional system). In practice we use glmnet::glmnet with alpha = 0 (alpha is the L1/L2 mixing parameter; alpha = 0 is pure Ridge), which is even faster because it uses a coordinate-descent algorithm optimised for computing the solution at many \(\lambda\) values simultaneously.
Live demo — Ridge by hand
library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
# simplified to two predictors
x <- as.matrix(data.frame(Hitters$AtBat, Hitters$Hits))
y <- Hitters$Salary
xs <- scale(x, center = TRUE, scale = FALSE) # centre predictors
n <- nrow(x)
sd_y <- sqrt(var(y) * (n - 1) / n)[1]
iden <- diag(2)
# lambda = 0 — should recover OLS
lam <- 0
ridge.mod <- glmnet(xs, y, alpha = 0,
lambda = lam * sd_y / n,
standardize = FALSE, thresh = 1e-20)
ridge.man <- solve(t(xs) %*% xs + lam * iden) %*% t(xs) %*% y
beta_0 <- mean(y)
cbind(coef(ridge.mod),
coef(lm(y ~ xs)),
c(beta_0, ridge.man))                 # all three columns match
- We compute Ridge three ways at \(\lambda = 0\) — glmnet, base lm, and the closed-form formula — to verify they coincide.
- glmnet’s lambda is on a different scale than the textbook formula — multiply by sd_y / n to align (see stats.stackexchange).
- thresh = 1e-20 tightens convergence so the comparison is numerically tight.
- Centering the predictors removes the need to include the intercept in the matrix formula.
Notes
The point of running Ridge three different ways and verifying they agree is to demystify the algorithm. glmnet is a black box for many users; computing the same answer by hand from the closed-form formula confirms there is no magic — Ridge is just OLS with one extra matrix term.
A few practical pointers from this code:
- glmnet’s lambda parameter is on a different scale than the textbook formula. glmnet minimises \(\frac{1}{2n}\,\text{RSS} + \lambda \sum_j \beta_j^2\) rather than \(\text{RSS} + \lambda \sum_j \beta_j^2\). To get a textbook-equivalent fit, multiply the textbook \(\lambda\) by sd_y / n before passing it to glmnet (exactly what the demo’s lambda = lam * sd_y / n does). This catches almost everyone the first time.
- thresh = 1e-20 tightens convergence — useful when you need numerically tight comparisons, but it slows the algorithm. For production fits the default thresh = 1e-7 is fine.
- The Hitters dataset (from ISLR) is a baseball-salary panel that JWHT uses throughout chapter 6 for examples. It’s not a finance dataset, but it is the canonical pedagogical reference for Ridge / Lasso. Once you understand the workflow on Hitters, applying it to the project’s prediction-market data is mechanical.
Hitters data — coefficient paths vs λ
- Reproduce with Ridge_figures.R (loop over 0–1000 λ values, plot standardised coefficients).
- As \(\lambda\) increases, standardised coefficients shrink towards zero.
- Bar at the bottom: flexibility decreases as \(\lambda\) grows.
Notes
Coefficient paths are the fundamental visualisation for Ridge / Lasso. For each \(\lambda\) value (x-axis), plot each \(\beta_j\) as a separate line (y-axis). Watching how the lines move as \(\lambda\) grows tells you how the model is changing.
Ridge paths have a characteristic shape: all coefficients shrink smoothly toward zero as \(\lambda\) grows, but never reach exactly zero. The relative ordering of coefficients (which is biggest, which is smallest) is roughly preserved.
Lasso paths (next lecture) look different — coefficients reach zero at finite \(\lambda\) values and stay there. Comparing the two visualisations side by side is the most direct way to internalise the L1-vs-L2 distinction.
The “flexibility decreases” annotation at the bottom of the chart maps to the bias-variance picture from earlier: high \(\lambda\) = simpler model = lower variance, higher bias. The optimal \(\lambda\) (selected by cross-validation in the next lecture) sits at the bias-variance sweet spot.
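A sketch of how such a path plot can be produced with glmnet on the full Hitters predictor set; this is not the course's Ridge_figures.R, and the \(\lambda\) grid is an arbitrary choice.
library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # predictor matrix, intercept column dropped
y <- Hitters$Salary
grid <- 10^seq(5, -2, length = 100)            # lambda from very large to very small
ridge_path <- glmnet(x, y, alpha = 0, lambda = grid)
plot(ridge_path, xvar = "lambda", label = TRUE)  # one line per coefficient, shrinking with lambda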
Why shrinking towards zero helps
- OLS estimates have low bias but can be highly variable, especially when \(n \approx p\).
- The penalty makes Ridge estimates biased, but substantially reduces variance.
- Net effect: a bias / variance trade-off that often improves test MSE.
Notes
The mechanism is precisely the bias-variance picture we built earlier in the lecture, applied to OLS-vs-Ridge specifically:
- OLS is unbiased but has high variance when \(p\) is large or predictors are correlated. Each \(\beta\) wobbles considerably from one training sample to the next.
- Ridge introduces a small amount of bias (every \(\beta\) is shrunk slightly toward zero) in exchange for substantially lower variance.
When OLS variance is the dominant component of test MSE, the Ridge trade is favourable — the small bias added is far outweighed by the variance reduction. This is exactly the regime where Ridge wins: many predictors, modest sample size, predictors correlated with each other.
When OLS variance is already low (lots of data, few well-conditioned predictors), Ridge’s bias hurts more than its variance reduction helps, and OLS wins. The next slide shows this trade-off graphically.
Ridge bias / variance trade-off
- Bias² (black) rises with \(\lambda\).
- Variance (green) falls with \(\lambda\).
- Test MSE (purple) is U-shaped with a clear minimum — pick the \(\lambda\) that minimises it.
- Ridge wins most when OLS estimates have high variance.
Notes
Same picture as the earlier flexibility-vs-MSE chart, now specifically over \(\lambda\):
- Bias² (black) starts at zero (Ridge with \(\lambda = 0\) is OLS, which is unbiased) and grows as \(\lambda\) increases.
- Variance (green) starts high (OLS variance) and drops as \(\lambda\) increases (smaller \(\beta\)s have lower variance).
- Test MSE (purple) is the sum, plus the irreducible error floor. U-shape: drops as variance reduction outweighs bias growth, then rises as bias growth dominates.
The optimal \(\lambda\) is where the U-shape bottoms out. In real applications you don’t see this curve directly — you estimate it via cross-validation (Lecture 4) and pick the \(\lambda\) that minimises CV-MSE.
Ridge’s biggest gains come in regimes where the OLS variance curve starts very high — i.e., when \(p\) is large, predictors are correlated, or sample size is small. In those regimes the U-shape is deep and the gain from Ridge over OLS is large. When OLS variance is already small, the U-shape barely dips below the OLS point and Ridge offers little.
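As a rough stand-in for the cross-validation you will meet next lecture, the sketch below traces hold-out MSE over a \(\lambda\) grid on Hitters; the split and grid are arbitrary choices, not from the course scripts.
library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
set.seed(1)
train <- sample(nrow(x), nrow(x) / 2)          # half the rows for fitting
grid  <- 10^seq(5, -2, length = 100)
fit   <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
test_mse <- apply(predict(fit, newx = x[-train, ]), 2,
                  function(p) mean((y[-train] - p)^2))   # one MSE per lambda
plot(log(grid), test_mse, type = "l",
     xlab = "log(lambda)", ylab = "hold-out MSE")        # look for the minimum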
Computational advantages of Ridge
- For large \(p\), best-subset selection would search through \(2^p\) models — combinatorially expensive.
- With Ridge, for any given \(\lambda\), fit one model — the computations are very simple.
- Ridge even works when \(p > n\), where OLS fails completely.
Notes
The computational story is straightforward but worth noting because it makes Ridge production-feasible in regimes where best-subset selection isn’t:
- Best-subset searches all \(2^p\) models — at \(p = 30\), that’s a billion fits. Even with greedy stepwise alternatives, the search is expensive and can miss the true best subset.
- Ridge is one matrix inversion per \(\lambda\) value. Modern implementations (glmnet’s coordinate descent) compute the full \(\lambda\) path at roughly the cost of one OLS fit.
The “\(p > n\)” capability is the killer feature for modern empirical work. With 100 candidate predictors and 60 monthly observations, OLS literally cannot fit. Ridge happily fits — the \(\lambda I\) regularisation makes \(X'X + \lambda I\) invertible regardless. This is why Ridge / Lasso are workhorses of the empirical-finance toolkit; classical OLS no longer covers the cases that come up.
For your project, the dataset will not necessarily be \(p > n\), but the principle still applies: if you have many candidate predictors, Ridge / Lasso are the right tools. OLS works but will likely overfit.
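A simulated illustration of the \(p > n\) point (invented data, not the project dataset): lm() returns NA for the coefficients it cannot identify, while Ridge via glmnet returns a full, finite coefficient vector.
library(glmnet)
set.seed(3)
n <- 60; p <- 100                              # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))            # only 5 predictors truly matter
y <- drop(X %*% beta) + rnorm(n)
fit_ols <- lm(y ~ X)
sum(is.na(coef(fit_ols)))                      # dozens of NA coefficients: no unique OLS solution
fit_ridge <- glmnet(X, y, alpha = 0, lambda = 1)  # X'X + lambda*I is invertible
nrow(coef(fit_ridge))                          # p + 1 finite coefficients, no NAs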
3.M Conclusion of Lecture 3
- 3.1 Course objectives
- 3.2 Recap from Lectures 1 & 2
- 3.3 Assessing model accuracy
- 3.4 Linear model selection & regularisation
- 3.M Conclusion of Lecture 3
Course at a glance (1/2)
Foundations
Course outline · Backtesting fundamentals
- Course aim & organisation
- Backtesting overview & case study
- In-sample tests (Welch & Goyal 2008)
- Out-of-sample (walk-forward, R²_OS)
- Useful predictors & p-hacking
Introduction to R
RStudio · variables · vectors · data frames · live coding
- Why R for empirical asset-management research
- RStudio and the script editor
- Variables, vectors, matrices, data frames, lists
- Functions and loops
- Data import and export
Assessing model accuracy & Ridge regression
Statistical learning · MSE · bias-variance · linear model selection · Ridge
- Statistical learning: Y = f(X) + ε
- Quality of fit and the train/test MSE distinction
- Bias-variance trade-off and overfitting
- OLS limits: prediction accuracy & interpretability
- Ridge regression and the L2 penalty
Lasso, cross-validation & Elastic Net
Sparse regularisation · resampling for honest test error · choosing λ
- Lasso: L1 penalty and exact-zero coefficients
- Cross-validation: validation set, LOOCV, K-fold
- Choosing the optimal λ for Lasso
- OLS post-Lasso for cleaner coefficient inference
- Elastic Net — combining Ridge and Lasso
Prediction markets, the Polymarket Quant Bench & your project
From Welch-Goyal to event-resolved binary contracts
- Prediction markets — definition and Polymarket as the canonical venue
- How prices form: liquidity, resolution, mechanics
- The Polymarket Quant Bench dataset (HuggingFace): access and schema
- First look at the data in R
- Your project: indicator design, back-test, deliverables, R toolbox
Course at a glance (2/2)
Final presentations
Group presentations · Q&A · wrap-up
- Presentation order and time budget
- Q&A rules
- Closing thoughts and feedback
Further reading
- James et al. (2021) — Chapter 2 (statistical learning), Chapter 6 (linear model selection & regularisation).
- Welch and Goyal (2008) — bias / variance arguments mirror the IS-vs-OOS results we saw in Lecture 1.
Notes
JWHT chapters 2 and 6 (James et al. 2021) are the textbook reading for today’s lecture. Chapter 2 covers the conceptual material — statistical learning, MSE, bias-variance trade-off. Chapter 6 covers the methods — best-subset selection, Ridge, Lasso. Each chapter ends with R labs that walk through the implementation.
The connection to Welch and Goyal (2008) is conceptual: their finding that in-sample fit doesn’t survive out-of-sample testing is the empirical-finance manifestation of the bias-variance picture today. The kitchen-sink regression delivered the best in-sample fit (lots of flexibility) but catastrophic test performance (overfit to noise) — exactly the right tail of the U-shape.
Prepare before next lecture
- Run StatLearning.R locally — confirm you can reproduce Figure 2.9.
- Run Ridge_comparison.R and Ridge_figures.R — verify all three Ridge implementations agree at \(\lambda = 0\).
- Read ISLR §2.2 (assessing model accuracy) and §6.2 (Ridge & Lasso).
Notes
The scripts above (posted on Moodle) are the reference implementations of today’s pictures. Running them locally confirms two things: (a) your R install can fit Ridge with glmnet, (b) the textbook formulas reduce to OLS at \(\lambda = 0\). Both are sanity checks before next lecture’s harder material.
JWHT §6.2 is the Ridge / Lasso section — the first half is recap of today, the second half introduces Lasso, which is Lecture 4’s main topic. Reading it once before the lecture means the live coding sessions don’t have to pause for definitional explanation.
See you next time
- Lecture 4 (6 May 2026): Lasso, Elastic Net, cross-validation — selecting the optimal \(\lambda\) honestly.