From Welch-Goyal to event-resolved binary contracts
Scope
We will:
We will NOT:
Approach
Part I — Foundations
Part II — Application
Week 1 · 15.04.2026 · Foundations: Course outline · Backtesting fundamentals
Week 2 · 22.04.2026 · Introduction to R: RStudio · variables · vectors · data frames · live coding
Week 3 · 29.04.2026 · Assessing model accuracy & Ridge regression: Statistical learning · MSE · bias-variance · linear model selection · Ridge
Week 4 · 06.05.2026 · Lasso, cross-validation & Elastic Net: Sparse regularisation · resampling for honest test error · choosing λ
Week 5 · 13.05.2026 · Prediction markets, the Polymarket Quant Bench & your project: From Welch-Goyal to event-resolved binary contracts
Week 13 · 01.07.2026 · Final presentations: Group presentations · Q&A · wrap-up
Project (Code + Report): 50% of your grade
Rmd code + knitr-rendered PDF report. Build a library of indicators over the Polymarket Quant Bench dataset (curated OHLCV bars on HuggingFace, derived from Jon Becker’s polymarket-data dump), derive trade signals, back-test, and write a critical reflection.
Group of up to 3.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-1-project-report_surname1_surname2_…
Due: 30 June 2026
Final Presentation: 50% of your grade
20-minute group presentation in class on 1 July 2026; submit slides as PDF together with the project zip.
Group of up to 3.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Finance Project — Asset Management_assignment-2-final-presentation_surname1_surname2_…
Due: 1 July 2026
Today: meet the asset class that will host all of this — prediction markets.
Many of the tools from Lectures 1–4 transfer; the interpretation of risk and return changes. There is no Sharpe-ratio analogue without first defining a notion of “return” on a binary contract.
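To make that concrete, here is one possible definition, sketched in R (our own illustration, not a dataset convention): buy the YES token at price p, receive 1 if the event resolves YES and 0 otherwise.
# Hypothetical helper: simple return on a binary contract held to resolution.
# entry_price: price paid for the YES token (implied probability in (0, 1)).
# resolved_yes: TRUE if the market resolved YES.
binary_return <- function(entry_price, resolved_yes) {
  payout <- ifelse(resolved_yes, 1, 0)
  (payout - entry_price) / entry_price
}
binary_return(0.25, TRUE)  # +3.00: a 300% gain if the event happens
binary_return(0.25, FALSE) # -1.00: total loss otherwise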
The Polymarket Quant Bench starts from Jon Becker’s prediction-market-analysis raw on-chain dump (Becker 2025); we resample to clean hourly and daily bars and pre-filter to liquid resolved markets (≥ $100k cumulative volume and ≥ 200 on-chain trade fills).
Configs: markets (36,831 rows, one per market), bars_hourly (~12.66M rows), bars_daily (~1.46M rows).
Dataset tag: smf2026polymarketquantbench (Strategic Management and Finance 2026).
# 1. Install the HuggingFace CLI (one-time, per machine).
# Comes with the official `huggingface_hub` Python package.
pip install huggingface_hub
# 2. Set up your group's project folder. Recommended layout:
#
# asset-group-NN/
# ├── asset-group-NN.Rproj ← RStudio project file
# ├── asset-group-NN.Rmd ← your analysis lives here
# └── data/
#     └── polymarket/ ← dataset lands here (~603 MB)
# 3. Download the dataset. The CLI shows a live progress bar.
hf download smf-ulm/polymarket-quant-bench \
--repo-type dataset \
  --local-dir data/polymarket

Once huggingface_hub is installed, you only need to run the CLI command once. --local-dir data/polymarket writes real file copies into your project tree. Add data/ to .gitignore so the 603 MB never enters your repo. pip ships with Python ≥ 3.4; tick “Add Python to PATH” during install on Windows.

Troubleshooting:
pip: command not found → use the Python-launcher fallback:
py -m pip install huggingface_hub        # Windows
python3 -m pip install huggingface_hub   # macOS / Linux
hf: command not found after a successful install → close and reopen your terminal (PATH refresh). If still missing, fall back to the module form python -m huggingface_hub download … (same flags as before).
huggingface-cli is deprecated, use hf instead → just replace huggingface-cli with hf in your command. The flags are identical.
Behind a proxy? Set HTTPS_PROXY=http://<proxy-host>:<port> (and HTTP_PROXY similarly) before running the CLI. Ask IT for the right address if you’re not sure.

# install.packages(c("arrow", "dplyr", "ggplot2", "lubridate"))
library(arrow) # parquet + lazy datasets
library(dplyr) # wrangle
library(ggplot2) # plot
library(lubridate) # dates
# Path that matches the --local-dir from the CLI download.
local_path <- "data/polymarket"
# Open each config lazily — 1,418 parquet shards across three folders;
# arrow stitches them as one logical table.
markets <- arrow::open_dataset(file.path(local_path, "markets")) |> collect()
bars_daily <- arrow::open_dataset(file.path(local_path, "bars_daily")) |> collect()
bars_hourly <- arrow::open_dataset(file.path(local_path, "bars_hourly")) |> collect()

arrow::open_dataset() |> collect() loads each config into RAM as a regular tibble. Peak memory across all three is ~1.5 GB — comfortable on a 16 GB laptop.
|> (pipe) is base-R syntax sugar — same as collect(open_dataset(...)). No magic, just chains the calls left-to-right.
arrow::open_dataset() stitches the parquet shards transparently — bars_hourly is hundreds of separate files but you see one tibble.
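If RAM is tight you don’t have to collect() a whole config first: arrow datasets accept dplyr verbs lazily and push filters down to the parquet scan. A minimal sketch (the token id is a placeholder, not a real value):
# Materialise only one token's hourly bars; arrow filters while scanning
# the shards, so only matching rows ever reach R's memory.
one_token <- "1234567890" # placeholder token_id (see clob_token_ids below)
bars_one <- arrow::open_dataset(file.path(local_path, "bars_hourly")) |>
  filter(token_id == one_token) |>
  collect()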
markets (one row per market, 36,831 rows):
id, condition_id (parent / child link), question, slug, category
outcomes, outcome_prices (JSON arrays — winner closes near 1.0)
clob_token_ids (JSON [yes_token_id, no_token_id] — pairs market to bars)
volume, liquidity, created_at, end_date

bars_daily / bars_hourly (one row per token × period):
token_id (YES or NO — not the market!)
period_start, period_end (UTC)
open, high, low, close, vwap — all in [0, 1] (implied probability)
volume_usd, n_trades, n_buys, n_sells

Three gotchas: (1) bars are per token, so YES and NO are separate series — pair via clob_token_ids if you want a mid. (2) Bars are sparse (no row for periods with zero trades) — use tidyr::fill() to forward-fill. (3) liquidity in markets is a snapshot when the data was collected, not a time series — use volume_usd in bars for time-varying liquidity.
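Gotchas (1) and (2) in practice: a minimal sketch (our own illustration) that pairs the YES and NO series of one market into a mid, using the schema above:
# Build a YES/NO mid price for a single market (illustrative only).
library(jsonlite)
library(tidyr)
mkt <- markets |> slice(1)          # any one market row
ids <- fromJSON(mkt$clob_token_ids) # c(yes_token_id, no_token_id)
mid_series <- bars_daily |>
  filter(token_id %in% ids) |>
  mutate(side = if_else(token_id == ids[1], "yes", "no")) |>
  select(side, period_start, close) |>
  pivot_wider(names_from = side, values_from = close) |>
  arrange(period_start) |>
  fill(yes, no, .direction = "down") |> # gotcha (2): bars are sparse
  mutate(mid = (yes + (1 - no)) / 2)    # NO price implies P(YES) = 1 - no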
# Row counts per config — sanity check after download.
nrow(markets) # 36,831
nrow(bars_daily) # 1,462,282
nrow(bars_hourly) # 12,655,266
# Distribution of markets by category.
markets |>
count(category, sort = TRUE) |>
ggplot(aes(reorder(category, n), n)) +
geom_col() + coord_flip() +
labs(x = NULL, y = "Markets", title = "Resolved markets by category")
# Histogram of cumulative volume per market (log scale).
markets |>
filter(volume > 0) |>
ggplot(aes(volume)) +
geom_histogram(bins = 60) + scale_x_log10() +
labs(x = "Cumulative volume (USDC, log)", y = "Markets",
title = "Volume distribution across resolved markets")category is best-effort upstream labelling (Politics, Sports, Crypto, …). Treat as a hint, not a contract — verify by sampling.library(jsonlite)
library(jsonlite)
library(slider)
library(patchwork)
# Pick the most heavily traded market, parse out its YES token id.
top_mkt <- markets |> slice_max(volume, n = 1)
yes_id <- fromJSON(top_mkt$clob_token_ids)[1]
# Pull its full daily bar history (one row per calendar day).
mkt_bars <- bars_daily |>
filter(token_id == yes_id) |>
arrange(period_start) |>
mutate(sma_20 = slide_dbl(close, mean, .before = 19, .complete = TRUE))
p_price <- ggplot(mkt_bars, aes(period_start)) +
geom_line(aes(y = close), colour = "steelblue", linewidth = 0.4) +
geom_line(aes(y = sma_20), colour = "darkorange", linewidth = 0.6) +
labs(title = top_mkt$question, y = "Implied probability", x = NULL)
p_vol <- ggplot(mkt_bars, aes(period_start, volume_usd)) +
geom_col(width = 1, fill = "grey40") +
labs(y = "Volume (USDC)", x = NULL)
p_price / p_vol # patchwork: stacks the two panels vertically

slice_max(volume, n = 1) returns the single most-traded market — a US-election or crypto-price market in most snapshots.
clob_token_ids is a JSON-stringified [yes, no] array — fromJSON() parses it to a length-2 character vector; take [1] for YES.
slider::slide_dbl() is the modern, vectorised rolling-window primitive — .before = 19 plus the current row gives a 20-day window.
patchwork’s / operator stacks two ggplot panels vertically, sharing the x-axis.
bars_daily is keyed on token_id. Always filter to the YES (or NO) side first; otherwise you’re mixing two anti-correlated series.
Remember the sparsity gotcha: run tidyr::fill(close, .direction = "down") after arrange(period_start) and before any rolling computation.
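In code, that sparsity fix could look like this (a sketch; tidyr::complete() re-inserts the missing calendar days before the forward-fill, then the SMA is recomputed on the regular grid):
# Regularise the sparse daily series, then redo rolling stats on it.
library(tidyr)
mkt_bars_filled <- mkt_bars |>
  select(period_start, close) |>
  complete(period_start = seq(min(period_start), max(period_start), by = "1 day")) |>
  arrange(period_start) |>
  fill(close, .direction = "down") |>
  mutate(sma_20 = slide_dbl(close, mean, .before = 19, .complete = TRUE))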
Project goal
In groups of three, design a small library of indicators on the Polymarket subset, derive trading signals from them, back-test a strategy on the price history, and write a critical reflection on what works and what doesn’t.
Optional but encouraged: bring external data (Google Trends, news, related markets, sports / political odds, weather, …) to enrich your indicators.
First steps: glimpse, summary, basic plots; pick the market category you’ll focus on.

Recommended packages:
arrow — read parquet (incl. sharded open_dataset())
tidyverse (dplyr, readr, tidyr)
lubridate — dates
data.table — large data
jsonlite — JSON dumps
gh — GitHub API
xts, zoo — time-series objects
tsibble, slider — rolling windows in tidy form
TTR — classic technical indicators (SMA, EMA, RSI, Bollinger…)
quantmod, tidyquant — quant wrappers
forecast — ARIMA, ETS if needed
glmnet — Ridge / Lasso / EN
caret or tidymodels — CV pipelines
PerformanceAnalytics — Sharpe-like metrics
ggplot2 — figures
rmarkdown, knitr, kableExtra — your deliverable

Avoid for loops for vectorisable computations — use slider::slide_dbl, dplyr::mutate(across(...)), or data.table syntax instead (see the sketch below). Marked down at grading.
Reproducibility: set.seed() everywhere, lock package versions if you can (renv is overkill but worth knowing).
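As a sketch of the vectorised style we mean (the 20-day SMA is just a placeholder indicator): a grouped mutate with slider replaces the explicit loop over tokens:
# Rolling 20-day SMA per token, loop-free: grouped mutate + slider.
library(dplyr)
library(slider)
bars_with_sma <- bars_daily |>
  arrange(token_id, period_start) |>
  group_by(token_id) |>
  mutate(sma_20 = slide_dbl(close, mean, .before = 19, .complete = TRUE)) |>
  ungroup()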
Summary
Submit your assignment by 30 June 2026, 18:00 in a single zip-folder named Asset2026_surname1_surname2_surname3 containing your .Rmd source, the knitr-rendered PDF report, and your presentation slides (PDF).
Email the zip to oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de and your team-mates. Subject line follows the same pattern as the zip name.
Institute of Strategic Management and Finance · Ulm University