Lecture 4: Lasso, cross-validation & Elastic Net

Sparse regularisation · resampling for honest test error · choosing λ

Prof. Dr. Andre Guettler — Institute of Strategic Management and Finance, Ulm University
Oliver Padmaperuma — Institute of Strategic Management and Finance, Ulm University

Published: May 6, 2026

4.1 Course objectives

  • 4.1 Course objectives
  • 4.2 Recap from Lecture 3
  • 4.3 The Lasso
  • 4.4 Resampling methods
  • 4.5 Selecting λ for Lasso
  • 4.6 Refinements
  • 4.M Conclusion of Lecture 4

Welcome to Finance Project — Asset Management

  • This is a project course: there is no central exam to register for. Sign up on the course Moodle page by 15 April 2026 so you receive announcements and the data link.
  • Submit the project by 30 June 2026 as a single zip — name pattern: Asset2026_surname1_surname2_surname3. Email it to oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de and your team-mates.
  • Ask questions during or right after each session — that is the preferred channel.
  • Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
  • Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
  • We also recommend the student advisory service.

Course Objective

Scope

We will:

  • Build an end-to-end empirical pipeline in R: load, explore, model, back-test
  • Cover the core ML toolbox for asset-management research: linear models, Ridge, Lasso, Elastic Net, cross-validation
  • Apply it to a non-traditional asset class: prediction markets
  • Develop your own indicator library and trading strategy in groups of three

We will NOT:

  • Drift into deep-learning or reinforcement-learning methods
  • Cover prediction markets in depth
  • Provide a “ready-to-fork” backtest — the demo code is intentionally basic

Approach

Part I — Foundations

  • L1: Motivation, organisation, backtesting fundamentals
  • L2: Hands-on R intro — RStudio, live coding, etc.
  • L3 + L4: Statistical learning — model accuracy, regularisation, resampling

Part II — Application

  • L5: Prediction-markets primer + the Polymarket dataset + assignment briefing
  • Project work in groups of three (≈ 7 weeks of self-organised work)
  • Final session (1 July): 20-minute presentations per team

Course at a glance (1/2)

Foundations

Week 1

15.04.2026

Course outline · Backtesting fundamentals

  • Course aim & organisation
  • Backtesting overview & case study
  • In-sample tests (Welch & Goyal 2008)
  • Out-of-sample (walk-forward, R²_OS)
  • Useful predictors & p-hacking

Introduction to R

Week 2

22.04.2026

RStudio · variables · vectors · data frames · live coding

  • Why R for empirical asset-management research
  • RStudio and the script editor
  • Variables, vectors, matrices, data frames, lists
  • Functions and loops
  • Data import and export

Assessing model accuracy & Ridge regression

Week 3

29.04.2026

Statistical learning · MSE · bias-variance · linear model selection · Ridge

  • Statistical learning: Y = f(X) + ε
  • Quality of fit and the train/test MSE distinction
  • Bias-variance trade-off and overfitting
  • OLS limits: prediction accuracy & interpretability
  • Ridge regression and the L2 penalty

Lasso, cross-validation & Elastic Net

Week 4

06.05.2026

Sparse regularisation · resampling for honest test error · choosing λ

  • Lasso: L1 penalty and exact-zero coefficients
  • Cross-validation: validation set, LOOCV, K-fold
  • Choosing the optimal λ for Lasso
  • OLS post-Lasso for cleaner coefficient inference
  • Elastic Net — combining Ridge and Lasso

Prediction markets, the Polymarket Quant Bench & your project

Week 5

13.05.2026

From Welch-Goyal to event-resolved binary contracts

  • Prediction markets — definition and Polymarket as the canonical venue
  • How prices form: liquidity, resolution, mechanics
  • The Polymarket Quant Bench dataset (HuggingFace): access and schema
  • First look at the data in R
  • Your project: indicator design, back-test, deliverables, R toolbox

Course at a glance (2/2)

Final presentations

Week 13

01.07.2026

Group presentations · Q&A · wrap-up

  • Presentation order and time budget
  • Q&A rules
  • Closing thoughts and feedback

Assignments / Exams

Project (Code + Report) — 50% of your grade

Rmd code + knitr-rendered PDF report. Build a library of indicators over the Polymarket Quant Bench dataset (curated OHLCV bars on HuggingFace, derived from Jon Becker’s polymarket-data dump), derive trade signals, back-test, and write a critical reflection.

Group of up to 3.

Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-1-project-report_surname1_surname2_…

Deadline: 30 June 2026

Final Presentation — 50% of your grade

20-minute group presentation in class on 1 July 2026; submit slides as PDF together with the project zip.

Group of up to 3.

Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-2-final-presentation_surname1_surname2_…

Deadline: 1 July 2026

4.2 Recap from Lecture 3


Where we are

  • Statistical learning as estimation of \(f\) from \(Y = f(X) + \varepsilon\).
  • MSE: training vs test MSE — they diverge as flexibility grows.
  • Bias / variance trade-off: more flexible ⇒ less bias, more variance; expected test MSE has a minimum.
  • Ridge regression: L2 penalty \(\lambda \sum \beta_j^2\) shrinks coefficients but never to exactly zero.

Notes

Today completes the regularisation toolkit started last week. Three concepts:

  1. Lasso — Ridge’s L1 cousin, with the additional feature of zeroing out some coefficients (variable selection).
  2. Cross-validation — the principled way to choose the tuning parameter \(\lambda\) from data alone, without a separate test set.
  3. Elastic net — a hybrid of Ridge and Lasso, combining L1 and L2 penalties.

By the end of the lecture you’ll have the pieces to build the project’s modelling pipeline: candidate predictors → CV-tuned Lasso (or Elastic Net) → walk-forward backtest → reported OOS R².

4.3 The Lasso


Why move past Ridge?

  • Ridge’s penalty never forces any coefficient to be exactly zero.
  • The final model always includes all variables — harder to interpret with many predictors.
  • A more modern alternative is the Lasso (Tibshirani, 1996).
  • Lasso works similarly to Ridge — but with a different penalty.

Notes

Ridge does prediction well but does not do variable selection — every predictor stays in the model with a non-zero (if shrunken) coefficient. For empirical-finance work where interpretation matters and where you want to be able to say “the strategy uses these 5 predictors”, Ridge is incomplete.

The Lasso (Tibshirani 1996) — Least Absolute Shrinkage and Selection Operator — is Ridge’s variable-selecting cousin. The acronym is awkward but accurately summarises the two effects: shrinkage (like Ridge) plus selection (zeroing some coefficients). Tibshirani’s 1996 paper is one of the most-cited statistics papers of the past 30 years; the method has become a default tool across machine learning, biostatistics, and empirical economics.

Lasso’s penalty term

Ridge minimises:

\[ \sum_{i=1}^n \Bigl(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Bigr)^2 + \boxed{\lambda \sum_{j=1}^p \beta_j^2} \]

Lasso minimises:

\[ \sum_{i=1}^n \Bigl(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Bigr)^2 + \boxed{\lambda \sum_{j=1}^p |\beta_j|} \]

L2 vs L1: squared coefficients (Ridge) vs absolute values (Lasso).

Notes

The penalty is the only thing that changes. Both methods minimise (RSS + penalty); the penalty differs in functional form but the optimisation problem is structurally identical.

Why does the L1 penalty do something fundamentally different from L2? Geometrically, the L1 penalty’s contour lines (regions of constant penalty cost in \(\beta\)-space) are diamond-shaped, with corners on the coordinate axes. The L2 contour lines are circles, smooth everywhere. When the optimisation finds the point of contact between the data’s RSS contour and the penalty contour, the L2 contour’s smoothness means the contact point is generically off-axis (some \(\beta\) component is small but non-zero). The L1 diamond’s corners pull the contact point onto an axis (some \(\beta\) component is exactly zero) for many RSS shapes.

JWHT (James et al. 2021) §6.2.2 has an excellent diagram of this geometric argument; if you want to internalise why L1 zeros and L2 doesn’t, that figure is worth two minutes.
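
If you want to see the zeroing effect without the geometry, here is a minimal simulated sketch (the data-generating process and the two fixed \(\lambda\) values are illustrative, not from the slides):

library(glmnet)
set.seed(42)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, rep(0, 8))               # only two predictors carry signal
y <- as.numeric(X %*% beta_true + rnorm(n))

ridge <- glmnet(X, y, alpha = 0, lambda = 1)   # L2 penalty
lasso <- glmnet(X, y, alpha = 1, lambda = 0.5) # L1 penalty
sum(coef(ridge)[-1] == 0)                      # typically 0: everything survives, shrunken
sum(coef(lasso)[-1] == 0)                      # several exact zeros: noise drops out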

What’s the big deal?

  • Looks like a tiny change — but the L1 penalty can drive some coefficients to exactly zero.
  • Lasso therefore yields a model that has high predictive power and is simple to interpret (variable selection!).
  • Drawback: there is no closed-form solution like Ridge’s \((X'X + \lambda I)^{-1} X'y\) — numerical optimisation only.

Notes

The variable-selection effect is the headline feature: a sparse Lasso fit (most coefficients zero) gives you both a predictor and a story about which features matter. For the project, this is operationally valuable — your report can have a short “the model uses these N indicators, here’s the economic interpretation of each” section rather than a 30-row coefficient table.

The lack of a closed-form solution is the engineering trade-off. Ridge’s \((X'X + \lambda I)^{-1}X'y\) is one matrix inverse — fast and predictable. Lasso requires numerical optimisation: typically coordinate descent, which glmnet implements very efficiently in C. For small problems (a few hundred predictors, a few thousand observations) the speed difference is invisible; for very large problems Ridge can be substantially faster.

In practice: glmnet(x, y, alpha = 0) for Ridge, glmnet(x, y, alpha = 1) for Lasso, glmnet(x, y, alpha = 0.5) for elastic net — the same R interface for all three, you just toggle one argument.

Hitters data — Lasso coefficient paths

  • Reproduce with Lasso.R — glmnet with alpha = 1.
  • Note that — unlike Ridge — coefficients hit exactly zero at finite \(\lambda\).
  • Question: how do we pick the optimal \(\lambda\)?

➡ Answer next: cross-validation.

Notes

Lasso coefficient paths look distinctively different from Ridge’s. Reading from right to left as \(\lambda\) shrinks:

  • Far right (large \(\lambda\)): all coefficients are zero. The model is the intercept-only “predict the mean” benchmark.
  • Moving left (decreasing \(\lambda\)): coefficients enter the model one (or a few) at a time, going from exactly zero to non-zero. The order in which they enter tells you which predictors Lasso considers most informative.
  • Far left (small \(\lambda\)): almost all coefficients are non-zero, and the fit approaches OLS.

The top axis shows the count of non-zero coefficients at each \(\lambda\) — for Hitters, dropping from 19 (full OLS) at small \(\lambda\) down to 0 at large \(\lambda\). The “right” \(\lambda\) — the one that minimises out-of-sample MSE — sits somewhere in the middle, with a moderate number of selected variables. Picking it is exactly the cross-validation problem of the next section.

4.4 Resampling methods


What are resampling methods?

Tools that involve repeatedly drawing samples from a training set and refitting a model on each sample, to obtain more information about the fitted model.

  • Model assessment — estimate test error rates.
  • Model selection — pick the appropriate level of model flexibility (e.g. \(\lambda\)).

Drawback: resampling is computationally expensive.

In this course we use cross-validation (we skip bootstrapping).

Notes

Resampling methods are statistical techniques that draw multiple samples from the data and refit the model on each. The motivation is simple: a single fit on a single training set gives you one estimate of model performance; many fits on many resamples give you a distribution of estimates, from which you can quantify uncertainty and make better decisions about model selection.

The two main families:

  • Cross-validation (today’s focus) — partition the data into folds; fit on most, test on the held-out fold; rotate. Standard tool for tuning hyperparameters like Lasso’s \(\lambda\).
  • Bootstrap — sample with replacement to create resamples of the same size as the original. Used for standard-error estimation and quantifying parameter uncertainty. We don’t cover it in this course; JWHT chapter 5 has it for reference.

The “computationally expensive” caveat is real but increasingly less binding — modern hardware fits hundreds of CV folds per second for typical empirical-finance problems. For your project, CV is fast enough not to be a bottleneck.

Three types of cross-validation

We cover:

  1. The Validation Set Approach
  2. Leave-One-Out Cross-Validation (LOOCV)
  3. K-fold Cross-Validation

The Validation Set Approach

  • Suppose we want the variable set with the lowest test (not training) error rate.
  • With a large data set: randomly split into training and validation parts.
  • Fit each candidate model on the training set.
  • Pick the model with the lowest validation error.

Notes

The simplest resampling method: a single random split into training and validation. Common splits are 70/30 or 80/20 in favour of training. Fit on training, evaluate on validation, pick the model with the lowest validation MSE.

Pros: very simple, fast (one fit per candidate model).

Cons (developed on the next two slides):

  • The validation MSE is noisy — a different random split would have given a different MSE, and the variance across splits can be substantial.
  • You’re throwing away half (or more) of your data when fitting — models trained on smaller samples are weaker than they could be.

Despite these limitations the validation-set approach is a useful conceptual baseline. Cross-validation methods on the subsequent slides solve both problems by averaging across many splits.

Example — Auto data, validation set

  • Predict mpg from horsepower.
  • Two candidate models:
    • \(\mathrm{mpg} \sim \mathrm{horsepower}\)
    • \(\mathrm{mpg} \sim \mathrm{horsepower} + \mathrm{horsepower}^2\) (and higher-order polynomials)
  • Randomly split 392 obs into 196 training + 196 validation.
  • Fit both models on training; evaluate test MSE on the validation half.
  • Lowest test MSE wins.

Notes

The Auto dataset (cars: mpg, horsepower, weight, …) is a JWHT teaching example. The candidate models are polynomial fits of mpg on horsepower of degree 1, 2, 3, etc. — increasing flexibility. The right degree to choose is the one with the lowest test MSE.

Why polynomials? Because mpg decreases with horsepower in a non-linear way: steeply at low-to-mid horsepower, flattening out at high horsepower. A linear fit underfits; a degree-10 polynomial overfits; some degree in between is the sweet spot. Validation-set is the simplest tool to pick that sweet spot.

Live demo — validation-set approach

library(ISLR);  attach(Auto)

# 10 random splits × 10 polynomial degrees → matrix of test MSEs
mse <- matrix(0, 10, 10)
for (i in 1:10) {
  set.seed(i)
  train <- sample(392, 196)
  for (j in 1:10) {
    lm.fit    <- lm(mpg ~ poly(horsepower, j), data = Auto, subset = train)
    mse[i, j] <- mean((mpg - predict(lm.fit, Auto))[-train]^2)
  }
}

plot(mse[1, ], type = "l", col = 1, xlab = "Flexibility", ylab = "MSE",
     ylim = c(15, 30))
for (j in 2:10) lines(mse[j, ], col = j)
  • Outer loop: 10 different random train/test splits.
  • Inner loop: polynomial degrees 1–10.
  • Resulting plot shows a lot of variability between splits — hence the validation MSE itself is unreliable.
  • Note: avoid for loops in your project where possible — prefer vectorised / apply-family code (see https://www.datacamp.com/community/tutorials/r-tutorial-apply-family).

Notes

The 10×10 matrix (10 splits × 10 polynomial degrees) demonstrates the variance across random splits that motivates more robust methods. Each row of the plot is one split’s MSE-vs-flexibility curve; the rows visibly disagree about which degree is optimal. Some splits prefer degree 2; others prefer degree 5 or 6. Without averaging, you couldn’t honestly say “degree X is best”.

The for-loop note is a stylistic rather than substantive caution. R’s for-loops work but are slower than vectorised alternatives; for production code or larger datasets, prefer sapply, purrr::map_dbl, or vectorised primitives. For pedagogical clarity, the loops are fine — and for the project’s typical sample sizes, the speed difference is invisible.
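
For reference, a vectorised sketch of the same 10 × 10 grid using the apply family (identical computation to the demo above, no explicit for loops):

library(ISLR)
mse <- sapply(1:10, function(j) {    # columns: polynomial degrees 1-10
  sapply(1:10, function(i) {         # rows: random splits 1-10
    set.seed(i)
    train <- sample(392, 196)
    fit   <- lm(mpg ~ poly(horsepower, j), data = Auto, subset = train)
    mean((Auto$mpg - predict(fit, Auto))[-train]^2)
  })
})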

Validation set — pros & cons

  • Simple to think about.
  • Easy to implement.
  • The validation MSE is highly variable between random splits.
  • Only a subset of observations is used to fit — methods perform worse with fewer training observations.

Notes

The two disadvantages compound. The high variance in validation MSE means that with just one split you might confidently pick the wrong model. The reduced training-set size means each candidate model is fit on less data than it could have been; if more data would have changed the relative ordering (which it sometimes does), the validation-set comparison is misleading.

Both problems are solved by averaging over many splits — the next two slides introduce LOOCV (the maximally fine-grained version) and K-fold CV (the practical compromise).

Leave-One-Out Cross-Validation (LOOCV)

  • For each candidate model:
    • Split the data of size \(n\) into training (size \(n-1\)) and validation (size 1).
    • Fit the model on the training set.
    • Compute the squared error for the held-out observation.
    • Repeat \(n\) times.
  • \(\mathrm{CV}_{(n)} = \dfrac{1}{n}\sum_{i=1}^n \mathrm{MSE}_i\)

Notes

LOOCV is the limiting case of K-fold CV at K = n. Conceptually elegant: every observation gets its turn as the held-out test point, the rest of the data fits the model. The CV estimate of test MSE is the average squared error across all n leave-one-out predictions.

The compute cost is the obvious downside — n model fits per candidate hyperparameter, vs 5 or 10 for K-fold. For small samples (n < a few hundred) and fast-fitting models (linear regression, Ridge), LOOCV is feasible. For larger samples or slower models, K-fold is the practical choice.

There’s a clever computational trick that makes LOOCV essentially free for OLS specifically: the leave-one-out errors can be computed from a single fit via the hat-matrix diagonal, \(\mathrm{CV}_{(n)} = \dfrac{1}{n}\sum_{i=1}^n \bigl( \tfrac{y_i - \hat{y}_i}{1 - h_i} \bigr)^2\). Note that boot::cv.glm does not use this shortcut; it re-fits the model for each fold. For Lasso (no closed form) no such shortcut exists, and LOOCV is genuinely n times slower than a single fit.
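
A minimal sketch of that shortcut on the Auto fit: one lm call, no re-fitting (compare against the explicit loop in the "LOOCV by hand" demo below):

library(ISLR)
fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
h   <- hatvalues(fit)                    # diagonal of the hat matrix
mean((residuals(fit) / (1 - h))^2)       # LOOCV MSE from a single fit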

LOOCV vs validation set

  • LOOCV has less bias — almost the entire data set is used to fit each model.
  • LOOCV produces a more stable MSE — the validation approach gives different MSEs each time due to randomness in splitting; LOOCV always returns the same answer.
  • LOOCV is computationally intensive — fit each model \(n\) times.

Notes

  • Less bias than validation-set: each model is trained on n − 1 observations (essentially the full dataset) rather than n/2, so the estimated test error is closer to the true test error you’d see if you trained on the full sample and evaluated on a fresh sample.
  • More stable than validation-set: there’s no random split, so every run of LOOCV on the same data gives the same answer. The validation-set approach gives different MSEs each time (visible in the previous slide’s plot of 10 random splits).
  • Slow: factor of n. For n = 10000 and a 1-second model fit, that’s 3 hours per hyperparameter value.

For your project, LOOCV is rarely the right choice — K-fold is faster and statistically better in most settings (lower variance of the CV estimate). LOOCV’s main role is conceptual: it’s the limit of K-fold and is the right starting point for understanding why CV works.

Live demo — LOOCV by hand

library(ISLR);  attach(Auto)

# Manual LOOCV across 10 polynomial degrees
mse <- matrix(0, 392, 10)
for (j in 1:10) {
  for (i in 1:392) {
    lm.fit    <- lm(mpg ~ poly(horsepower, j), data = Auto[-i, ])
    mse[i, j] <- (mpg - predict(lm.fit, Auto))[i]^2
  }
}

mse_loocv <- colMeans(mse)
plot(mse_loocv, type = "l", xlab = "Flexibility", ylab = "MSE",
     ylim = c(15, 30))
  • Outer loop over polynomial degrees (1–10), inner loop over 392 observations.
  • This is the slow, didactic LOOCV. In practice use boot::cv.glm for a one-line version — shown after the notes below.
  • Compare against 5-fold CV (5fold_CV.R) — same shape, much faster.

Notes

The didactic implementation makes the LOOCV mechanic concrete: outer loop over polynomial degree, inner loop over the 392 observations being held out one at a time. The output is a 392×10 matrix of squared prediction errors; column means give the LOOCV estimate of test MSE per polynomial degree.

In production code you’d never write the inner loop yourself — boot::cv.glm(Auto, glm.fit) does the same thing in one line (its K argument defaults to n, i.e. LOOCV), though it still re-fits the model n times rather than using the least-squares shortcut mentioned earlier. The point of writing it explicitly here is to demonstrate that there’s no magic — LOOCV is just a structured loop over one-at-a-time train/test splits.
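
The one-line version for reference (cv.glm needs a glm fit rather than lm; delta[1] holds the raw LOOCV estimate):

library(ISLR); library(boot)
glm.fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)   # glm() so cv.glm can use it
cv.err  <- cv.glm(Auto, glm.fit)                         # K defaults to n: LOOCV
cv.err$delta[1]                                          # LOOCV estimate of test MSE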

K-fold Cross-Validation

LOOCV is computationally heavy, so we run K-fold CV instead:

  1. Divide the data into K different parts (e.g. \(K = 5\) or \(K = 10\)).
  2. Remove the first part; fit the model on the remaining \(K-1\) parts; compute MSE on the omitted part.
  3. Repeat \(K\) times — taking out a different part each round.
  4. Average the \(K\) MSEs:

\[ \mathrm{CV}_{(K)} \;=\; \dfrac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k \]

Notes

K-fold CV is the practical compromise between LOOCV (most accurate but slowest) and validation-set (fastest but noisiest). The standard recipe:

  1. Split data into K equally-sized folds (K = 5 or K = 10 are conventional).
  2. For each fold k = 1, …, K: hold out fold k as the validation set, fit the model on the remaining K − 1 folds, compute MSE on fold k.
  3. Average the K MSEs.

Why K = 5 or 10? It’s empirical — Kohavi (1995) and subsequent work found that K in this range gives the best balance of bias (each training set is large enough that the fitted model is close to the one fit on the full data) and variance (averaging across K folds reduces variance relative to a single validation split; with K too large the training sets overlap almost completely, the fold-level errors become highly correlated, and averaging buys less variance reduction).

cv.glmnet from the glmnet package handles this automatically for Ridge / Lasso / Elastic Net — nfolds = 10 by default. For your project, just call cv.glmnet(x, y, alpha = 1) and use $lambda.min to pick the best \(\lambda\).

Auto data — LOOCV vs K-fold

  • Left: LOOCV error curve — single, deterministic.
  • Right: 5-fold CV run 10 times — curves coincide tightly (much less spread than the validation-set approach).
  • LOOCV is a special case of K-fold with \(K = n\).
  • Both stable; LOOCV more compute-heavy.

Notes

This is the empirical comparison: 5-fold CV (right panel) gives nearly the same answer as LOOCV (left panel) at a fraction of the computational cost. Even rerunning 5-fold 10 times — visible as 10 nearly-overlapping curves — produces estimates so similar to the LOOCV curve that the practical loss from using K-fold is negligible.

The contrast with the validation-set approach (way back, with 10 different splits producing 10 visibly distinct curves) is what motivates K-fold over single-split: K-fold’s averaging across folds delivers stability that single splits cannot.

For empirical-finance work, 5-fold or 10-fold CV is the universal default. Anything more elaborate (LOOCV, time-series-aware CV, blocked CV) requires a specific reason to deviate.

What do we do in practice?

  • We tend to use K-fold CV with \(K = 5\) or \(K = 10\).
  • Empirically these yield test-error estimates that suffer neither from excessively high bias nor very high variance — best balance.

Notes

One important caveat for time-series data (which your project has): standard K-fold CV assigns observations to folds at random, ignoring the temporal ordering. For predicting future returns from past predictors this is wrong — the model gets to use future information when "forecasting" past values, leaking signal across folds and making the CV estimate of test error optimistic.

The right CV for time-series data is walk-forward (a.k.a. time-series CV) — same as the walk-forward backtest from Lecture 1. Each fold uses only past observations to predict the next; advance one step; repeat. The tsibble and slider packages provide the helpers; rsample::rolling_origin automates it.

For the project, when CV-tuning \(\lambda\) for Lasso on time-series indicator data, use walk-forward CV, not random K-fold. Otherwise your CV-selected \(\lambda\) is optimistic and your reported OOS performance will overstate the real edge. A sketch follows below.
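
A minimal walk-forward sketch for tuning \(\lambda\); here x and y are placeholders for your own time-ordered indicator matrix and return vector, and the window length is illustrative:

library(glmnet)
# x, y assumed: time-ordered predictor matrix and target vector
grid   <- 10^seq(2, -4, length = 50)
origin <- 120                                    # initial training window
errs   <- matrix(NA, nrow(x) - origin, length(grid))
for (t in origin:(nrow(x) - 1)) {
  fit  <- glmnet(x[1:t, ], y[1:t], alpha = 1, lambda = grid)
  pred <- predict(fit, newx = x[t + 1, , drop = FALSE], s = grid)
  errs[t - origin + 1, ] <- (pred - y[t + 1])^2  # one-step-ahead squared errors
}
grid[which.min(colMeans(errs))]                  # lambda with lowest walk-forward MSE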

4.5 Selecting λ for Lasso


Lasso — selecting the tuning parameter λ

  • Pick a grid of candidate \(\lambda\) values.
  • Use cross-validation to estimate test error for each.
  • Choose the \(\lambda\) giving the smallest test error.
  • In this example, the CV-minimising \(\lambda \approx 9.3\) (\(\log\lambda \approx 2.2\)); only 10 of 19 coefficients remain — Lasso has shrunk 9 to zero.

Notes

This is the operational answer to “what \(\lambda\) should I use?” — the one that minimises CV-MSE. The chart shows:

  • CV-MSE on the y-axis (lower is better) as a function of log(\(\lambda\)).
  • A clear minimum at log(\(\lambda\)) ≈ 2.2, corresponding to \(\lambda\) ≈ 9.3.
  • The non-zero-coefficient count on the top axis drops from 19 (left, small \(\lambda\), OLS-like) to 0 (right, large \(\lambda\), intercept-only).

At the optimal \(\lambda\), Lasso selects 10 out of 19 candidate variables — half are zeroed out, and we get the sparser, more interpretable model the variable-selection penalty is designed to deliver.

A useful refinement: cv.glmnet returns two recommended \(\lambda\) values, lambda.min and lambda.1se. The first is the strict CV minimum; the second is the largest \(\lambda\) within one standard error of the minimum, giving a sparser model that’s barely worse on CV. For a tie-break in favour of interpretability, lambda.1se is often the better choice.
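
In code (a self-contained sketch on Hitters; the demo on the next slide uses the same objects):

library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
set.seed(1)
cv.out <- cv.glmnet(x, y, alpha = 1)
c(lambda.min = cv.out$lambda.min, lambda.1se = cv.out$lambda.1se)
coef(cv.out, s = "lambda.1se")   # sparser model within one SE of the CV minimum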

Live demo — CV-tuned Lasso

library(ISLR);  library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

grid <- 10^seq(5, -5, length = 100)
set.seed(1)
train  <- sample(1:nrow(x), nrow(x) / 2)
test   <- -train
y.test <- y[test]

# Lasso path on the training half
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod, xvar = "lambda")

# 10-fold CV — pick the best λ
cv.out  <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min

# Test MSE
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)

# Refit on full sample, list non-zero coefficients
out        <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef[lasso.coef != 0]
  • cv.glmnet builds the CV folds and returns lambda.min.
  • The final glmnet is fit on the full sample at the chosen \(\lambda\).
  • lasso.coef[lasso.coef != 0] lists the surviving variables — your sparse model.

Notes

This is the canonical Lasso workflow in R, end-to-end. The pattern generalises to your project:

  1. Build the design matrix with model.matrix(). The [, -1] drops the intercept column (glmnet adds its own).
  2. Train/test split for honest evaluation of the final model.
  3. Compute the Lasso path with glmnet(..., alpha = 1, lambda = grid). The grid is a sequence of \(\lambda\) values to evaluate.
  4. Pick optimal \(\lambda\) via CV with cv.glmnet. lambda.min is the CV-MSE minimiser; lambda.1se is the sparser-but-similar alternative.
  5. Predict on test to get an honest test-MSE estimate.
  6. Refit on the full sample at the chosen \(\lambda\) for the final reported model.
  7. Inspect the surviving coefficients to interpret the model.

The sequence of model.matrix → glmnet → cv.glmnet → predict → refit-on-full is the recipe to memorise. Substitute alpha = 0 for Ridge or alpha = 0.5 for Elastic Net — the rest of the workflow is identical.

4.6 Refinements


OLS post-Lasso

  • Lasso’s penalty mitigates overfitting and yields a sparse solution …
  • … but it also tends to shrink coefficients of selected variables too much.

Recipe — OLS post-Lasso:

  1. Use Lasso to reduce the dimension of the model.
  2. Re-estimate the coefficients of the selected predictors with plain OLS — bias-corrected.
  3. Standard errors need adjusting (not naïve OLS standard errors).

Notes

Lasso’s penalty deliberately biases coefficients toward zero, which is great for prediction but hurts coefficient interpretation: the magnitudes you read off the Lasso fit are systematically too small. OLS post-Lasso fixes this by using Lasso just for variable selection, then refitting plain OLS on the selected variables — the OLS coefficients on the selected subset are unbiased estimates (assuming the selection itself was right).

The post-selection inference issue is technical and underappreciated: when you select variables with Lasso and then run OLS on the selected subset, the standard OLS p-values don’t account for the fact that you peeked at the data to pick which variables to include. The naïve p-values from summary(fitPostLasso) are too small. Belloni and Chernozhukov (2013) propose corrections; for the project, the simpler workaround is to acknowledge this and present the OLS coefficients as descriptive (not for hypothesis testing).

For prediction (which is what your project cares about), this is a non-issue — OLS post-Lasso predictions are nearly as good as Lasso predictions and have the cleaner, unbiased coefficient interpretation as a bonus.

Live demo — OLS post-Lasso

library(ISLR);  library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)

# Step 1: CV-tuned Lasso to find non-zero coefficients
cv.out  <- cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam <- cv.out$lambda.min
out     <- glmnet(x, y, alpha = 1, lambda = 10^seq(5, -5, length = 100))
lasso.coef <- predict(out, type = "coefficients", s = bestlam)[2:20, ]
indexLasso <- which(lasso.coef != 0)

# Step 2: re-fit OLS on the surviving columns
fitPostLasso <- lm(y ~ x[, indexLasso])
summary(fitPostLasso)
  • Step 1 picks the non-zero set via Lasso + CV.
  • Step 2 runs plain lm restricted to that set.
  • Caveat: the printed standard errors ignore the selection step; correcting inference (e.g. via post-selection inference (Belloni and Chernozhukov 2013)) is a separate research topic.

Notes

The two-step recipe in code form. Step 1 fits CV-tuned Lasso and pulls out the indices of non-zero coefficients (excluding the intercept). Step 2 re-fits OLS using only those columns. The resulting coefficients are unbiased estimates (under selection-as-given), with the standard-error caveat from the previous slide.

For your project, OLS post-Lasso is a good final-model choice when you want both prediction and interpretable coefficients. The reporting story: “Lasso selects \(k\) out of \(p\) candidate indicators (Table X); OLS re-fit on the selected set yields the model in Table Y, with predictions evaluated by walk-forward backtest in Figure Z.”

Belloni and Chernozhukov (2013) develop the asymptotic theory for valid post-selection inference; their hdm R package implements the corrections. Worth knowing about for thesis-level work; beyond the scope of the project deliverable.
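
If you want the packaged version, a hedged sketch using hdm’s rlasso (x and y as in the demo above; check ?rlasso before relying on the interface):

library(hdm)
fit_post <- rlasso(x, y, post = TRUE)   # post = TRUE: OLS re-fit on the Lasso-selected set
summary(fit_post)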

Elastic Net — combining Ridge and Lasso

  • Elastic Net combines Ridge (\(\alpha = 0\)) and Lasso (\(\alpha = 1\)).
  • In glmnet, the parameter \(\alpha\) defines the mix between the two penalties.
  • Useful framing:
    • \(\alpha\) controls the mixing between the L2 and L1 penalties.
    • \(\lambda\) controls the amount of penalisation.
  • For the Hitters data set, \(\alpha = 0\) (pure Ridge) yielded the lowest test MSE.

Notes

Elastic Net’s penalty is \(\lambda \bigl[ (1 - \alpha) \|\beta\|_2^2 / 2 + \alpha \|\beta\|_1 \bigr]\) — a convex combination of L2 (Ridge) and L1 (Lasso). \(\alpha\) controls the mix; \(\lambda\) controls the overall strength.

When does each end of the spectrum win?

  • Pure Ridge (\(\alpha = 0\)) — when many predictors are correlated and all carry some signal. Ridge keeps them all and lets the redundant correlation reduce variance.
  • Pure Lasso (\(\alpha = 1\)) — when only a subset of predictors are truly informative and the rest are noise. Lasso zeroes out the noise.
  • Mid-range Elastic Net (\(\alpha \approx 0.5\)) — when there are correlated groups of informative predictors. Pure Lasso tends to arbitrarily pick one from each correlated group; Elastic Net’s L2 component encourages the whole group to enter together.

For the Hitters dataset, the Lasso-vs-Ridge comparison favours Ridge — most predictors carry some signal and Lasso’s aggressive zeroing throws away useful information. Whether the same holds for your prediction-market data is an empirical question; sweep \(\alpha\) from 0 to 1 and let CV pick.

Live demo — tuning Elastic Net

library(ISLR);  library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(1)
train  <- sample(1:nrow(x), nrow(x) / 2)
test   <- -train
y.test <- y[test]

# Sweep α from 0 to 1 in 0.1 steps; tune λ via 10-fold CV at each α
mse <- matrix(0, 11, 2)
for (i in 1:11) {
  alpha_i  <- (i - 1) / 10              # alpha = 0, 0.1, ..., 1
  cv.out   <- cv.glmnet(x[train, ], y[train], alpha = alpha_i)
  bestlam  <- cv.out$lambda.min
  pred     <- predict(cv.out, s = bestlam, newx = x[test, ])
  mse[i, ] <- c(alpha_i, mean((pred - y.test)^2))
}
mse
  • Outer sweep: \(\alpha \in \{0, 0.1, \ldots, 1\}\).
  • Inner: 10-fold CV picks the best \(\lambda\) for each \(\alpha\).
  • Read the resulting MSE matrix to find the best (\(\alpha, \lambda\)) pair.
  • For Hitters, the minimum lands at \(\alpha = 0\) — pure Ridge.

Notes

The two-loop sweep is straightforward: outer over candidate \(\alpha\) values, inner is CV-tuning of \(\lambda\) at each \(\alpha\). The resulting matrix has one row per \(\alpha\) with the corresponding test-MSE; pick the row with the lowest MSE.

For larger problems, a finer \(\alpha\) grid (e.g. 0, 0.05, 0.10, …) and a longer \(\lambda\) grid is typical. The caret and tidymodels R frameworks provide more elaborate machinery for hyperparameter sweeps with full reporting; for a 2-deep grid like this, raw glmnet calls are fine.
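
One detail worth fixing in the sweep above: each cv.glmnet call draws its own random folds, so the \(\alpha\) comparison is slightly noisy. The glmnet documentation recommends passing a common foldid, as sketched here:

set.seed(1)
foldid <- sample(rep(1:10, length.out = length(train)))  # one fold assignment, reused
cv0  <- cv.glmnet(x[train, ], y[train], alpha = 0,   foldid = foldid)
cv05 <- cv.glmnet(x[train, ], y[train], alpha = 0.5, foldid = foldid)
cv1  <- cv.glmnet(x[train, ], y[train], alpha = 1,   foldid = foldid)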

For your project, start with pure Lasso (\(\alpha = 1\)) — it’s the most interpretable and a sensible baseline. If your CV-MSE is meaningfully better with Ridge or Elastic Net, switch — but don’t over-engineer the model; a clean Lasso baseline is preferable to a marginally-better Elastic Net at the cost of an extra tuning parameter.

4.M Conclusion of Lecture 4



Further reading

  • James et al. (2021) — Chapter 5 (resampling), Chapter 6 (Lasso, Ridge, Elastic Net).
  • Tibshirani (1996) — the original Lasso paper.
  • Belloni and Chernozhukov (2013) — formal post-selection inference for OLS post-Lasso.

Notes

  • JWHT chapters 5 & 6 (James et al. 2021) are the textbook reading for today. Chapter 5 covers resampling (CV, bootstrap); chapter 6 covers Lasso, Ridge, Elastic Net. Both chapters have R labs that mirror today’s live demos.
  • Tibshirani’s original Lasso paper (Tibshirani 1996) is the historical artefact that introduced the L1 penalty. Worth reading once for the geometric intuition; everything else has been said better in JWHT since.
  • Belloni and Chernozhukov (2013) for valid post-selection inference — the “I want OLS coefficients on Lasso-selected variables and proper p-values” problem. Useful for thesis-level work; beyond the project.

Prepare before next lecture

  1. Run Lasso.R, OLS_Post_Lasso.R, EN.R, and 5fold_CV.R locally.
  2. Compare CV-selected \(\lambda\) across multiple seeds — how stable is it?
  3. Read ISLR §6.2 (Ridge & Lasso) and §5.1 (CV).

Notes

The seed-stability check (point 2) is a useful exercise in CV’s own variability. CV with a fresh random seed produces slightly different fold partitions and hence a slightly different \(\lambda^*\). If \(\lambda^*\) varies wildly across seeds, the optimum is shallow and the exact choice of \(\lambda\) matters less than you might think; if \(\lambda^*\) is highly stable, you can be confident in the picked value.
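
A minimal sketch of that check on Hitters (20 seeds; the count is arbitrary):

library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
lams <- sapply(1:20, function(s) {
  set.seed(s)
  cv.glmnet(x, y, alpha = 1)$lambda.min
})
summary(lams)   # tight range: stable optimum; wide range: shallow optimum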

For Lecture 5 (the project briefing), come with the Lecture 1 backtesting concepts and today’s regularisation toolkit fresh. We’ll wire them together with the Polymarket dataset and turn you loose on the indicator-design phase.

See you next time

Reminder
  • Lecture 5 (13 May 2026): prediction-markets primer + the Polymarket dataset + your project briefing. Bring questions!

References

Belloni, Alexandre, and Victor Chernozhukov. 2013. “Least Squares After Model Selection in High-Dimensional Sparse Models.” Bernoulli 19 (2): 521–47. https://doi.org/10.3150/11-BEJ410.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. New York, NY: Springer. https://www.statlearning.com/.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.