Descriptive · inferential · modelling — applied in R
Scope
We will:
We will NOT:
Approach
Part I — Learn the Basics
Part II — Apply what you have learned
Basics
Week 1
29.10.2025
Course objectives, schedule, assignments · Introduction to R · Live coding
Data Handling & Visualization
Week 2
05.11.2025
API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
Statistical Analysis
Week 3
12.11.2025
Descriptive · inferential · modelling — applied in R
Academic Publishing & Refereeing
Week 4
19.11.2025
What makes a great empirical paper · publication process · how to write a referee report
Brown Bag Seminar
Week 13
20.01.2026
Engage with doctoral research and prepare your referee report
Assignment I — Problem Set (50% of your grade)
Documented .R script + PDF write-up (Overleaf)
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…
Deadline: 19 January 2026
Assignment II — Referee Report (50% of your grade)
2.5–3 page referee report on a Brown-Bag presentation
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…
Deadline: 3 February 2026
| Topic | What is it? | What for? | Example syntax |
|---|---|---|---|
| API access | Load remote data via authenticated queries | Fetching CFTC time series | Quandl.api_key("…"); Quandl.datatable("QDL/LFON", …) |
| Import & cleanse | Load files (CSV) and initial transforms | Date formatting | read_csv() %>% mutate(date = as.Date(date)) |
| Merge & append | Join by keys (merge) or stack (bind_rows) | Multi-asset views | merge(d1, d2, by="date"), bind_rows(...) |
| Filter & mutate | Subset rows, derive columns | Post-2021 windows, net longs | filter(date > "2021-01-01") %>% mutate(net = longs - shorts) |
| Group & summarise | Group categories, aggregate | Annual means / SD per asset | group_by(Asset, Year) %>% summarise(...) |
| Pivot | Reshape wide ⇄ long | Correlation matrices | pivot_wider(), pivot_longer() |
| Visualization | Layered plotting system | Publication-ready figures | ggplot() + geom_line() + theme_minimal() |
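The verbs in the table chain together with the pipe. A minimal, self-contained sketch on hypothetical positioning data (column names and values are illustrative, not real CFTC data):

```r
library(dplyr)
library(tidyr)

# Hypothetical positioning data (illustrative values, not real CFTC data)
d <- tibble::tibble(
  date   = rep(as.Date(c("2022-01-07", "2022-01-14")), each = 2),
  Asset  = rep(c("Gold", "Bitcoin"), times = 2),
  longs  = c(100, 40, 110, 35),
  shorts = c(60, 50, 55, 45)
)

# filter / mutate / group / summarise in one chain
summary_tbl <- d %>%
  mutate(net = longs - shorts) %>%            # derive a column
  filter(date > as.Date("2022-01-01")) %>%    # subset rows
  group_by(Asset) %>%                         # aggregate per asset
  summarise(mean_net = mean(net), .groups = "drop")

# pivot: one column per asset, one row per date
wide <- d %>%
  mutate(net = longs - shorts) %>%
  select(date, Asset, net) %>%
  pivot_wider(names_from = Asset, values_from = net)
```

The same chain pattern carries over unchanged once the toy tibble is swapped for the real merged data.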
Practical advice
Always start with a thorough descriptive analysis to deeply understand your data before turning to advanced methods.
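That first descriptive pass needs nothing beyond base R. A quick sketch on a hypothetical weekly series (values are made up):

```r
# Hypothetical weekly net-long series (values are made up)
x <- c(12, 15, 9, 22, 18, 30, 25, 11)

summary(x)                 # min, quartiles, median, mean, max at a glance
sd(x)                      # dispersion
quantile(x, c(0.1, 0.9))   # tail behaviour
hist(x, main = "First look at the distribution")  # shape before modelling
```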
library(dplyr)
library(tidyr)
# Compact summary statistics per asset and per variable
desc_stats <- combined_clean %>%
select(Asset, Net_Longs, market_participation) %>%
pivot_longer(cols = -Asset, names_to = "Variable", values_to = "Value") %>%
mutate(Variable = case_when(
Variable == "Net_Longs" ~ "Net Longs",
Variable == "market_participation" ~ "Market Part."
)) %>%
group_by(Asset, Variable) %>%
summarise(
Mean = round(mean(Value, na.rm = TRUE), 0),
Std = round(sd(Value, na.rm = TRUE), 0),
Min = round(min(Value, na.rm = TRUE), 0),
P10 = round(quantile(Value, 0.1, na.rm = TRUE), 0),
P50 = round(median(Value, na.rm = TRUE), 0),
P90 = round(quantile(Value, 0.9, na.rm = TRUE), 0),
Max = round(max(Value, na.rm = TRUE), 0),
.groups = "drop"
)
readr::write_csv(desc_stats, "desc_stats_clean.csv")
- group_by(Asset, Variable) → summarise per cell.
- case_when() cleans variable labels for the published version.
- .groups = "drop" quiets the grouping warning and produces an ungrouped result.
library(ggplot2)
long_data <- combined_clean %>%
select(Asset, Net_Longs, market_participation,
non_commercial_shorts, non_commercial_longs) %>%
pivot_longer(-Asset, names_to = "Variable", values_to = "Value")
ggplot(long_data, aes(x = Value, color = Asset, fill = Asset)) +
geom_density(alpha = 0.5) + # transparent overlays
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
labs(title = "Faceted Density Plots: Distributions by Variable",
x = "Value", y = "Density")
ggsave("distribution_plot.pdf", width = 6, height = 4)
- geom_density() is a smoothed histogram — better for comparing shapes than raw bars.
- alpha = 0.5 lets overlapping curves stay readable.
- facet_wrap(~ Variable, scales = "free") lets each variable use its own y-range.
library(dplyr); library(tidyr); library(tibble); library(ggplot2)
cor_matrix <- combined_clean %>%
select(date, Asset, Net_Longs) %>%
pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
select(-date) %>%
cor(use = "pairwise.complete.obs", method = "pearson") %>%
as.data.frame() %>%
rownames_to_column(var = "Asset1") %>%
pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
mutate(Correlation = round(Correlation, 2)) %>%
filter(!is.na(Correlation))
ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Correlation), color = "black",
size = 3, fontface = "bold") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Matrix of Net Longs (Assets)",
subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
x = "Asset", y = "Asset")
- |r| > 0.5 ≈ strong; negative = inverse (hedging potential).
- pairwise.complete.obs ignores NAs pairwise — usable when the panel isn’t perfectly balanced.
gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]
# Full-dataset test
full_test <- cor.test(gold_net, btc_net)
print(full_test)
# e.g., r = -0.14, p = 0.03 — weak negative correlation, statistically significant
# Crisis subset (2022)
crisis_gold <- combined_clean$Net_Longs[
combined_clean$Asset == "Gold" & combined_clean$Year == 2022]
crisis_btc <- combined_clean$Net_Longs[
combined_clean$Asset == "Bitcoin" & combined_clean$Year == 2022]
crisis_test <- cor.test(crisis_gold, crisis_btc)
print(crisis_test)
# e.g., r = -0.06, p = 0.68 — not statistically significant
- |r| > 0.5 strong; p < 0.05 rejects H₀ of no association.
- For non-normal data, use a rank correlation instead (method = "spearman").
- Report r, p, and the sample size alongside the test in your write-up.
library(broom)
trad_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Gold", "Silver")]
crypto_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Bitcoin", "Ethereum")]
# Full dataset
full_test <- t.test(trad_net, crypto_net)
print(full_test %>% tidy()) # tidy = clean table
# p < 0.05 → crypto means differ from traditional means
# Crisis subset (2022)
crisis_data <- combined_clean[combined_clean$Year == 2022, ]
non_crisis_data <- combined_clean[combined_clean$Year != 2022, ]
crisis_test <- t.test(crisis_data$Net_Longs, non_crisis_data$Net_Longs)
print(crisis_test %>% tidy())
- A small p-value (< 0.05) rejects H₀ of equal means.
- broom::tidy() turns the test object into a single-row tibble — easy to bind into a results table.
library(broom)
shapiro_gold <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Gold"])
shapiro_btc <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
print(shapiro_gold %>% tidy()) # e.g., p < 0.05 → non-normal
print(shapiro_btc %>% tidy()) # e.g., p > 0.05 → normal
# Q-Q plot for visual check
qqnorm(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
qqline(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
# Density comparison
gold_btc_data <- combined_clean[combined_clean$Asset %in% c("Gold", "Bitcoin"), ]
ggplot(gold_btc_data, aes(x = Net_Longs, fill = Asset)) +
geom_density(alpha = 0.7) +
facet_wrap(~ Asset, scales = "free") +
theme_minimal() +
labs(title = "Net Longs Distributions: Gold vs Bitcoin (Faceted)",
x = "Net Longs", y = "Density")
- If normality is rejected, consider a transformation (e.g., log) or non-parametric tests.
gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]
ks_test <- ks.test(gold_net, btc_net)
print(ks_test %>% tidy()) # e.g., D = 0.25, p < 0.05 → different shapes
# ECDF visualisation
ggplot(gold_btc_data, aes(x = Net_Longs, color = Asset)) +
stat_ecdf() +
facet_wrap(~ Asset, scales = "free") +
theme_minimal() +
labs(title = "ECDF Plot: Gold vs Bitcoin (Faceted)",
x = "Net Longs", y = "Cumulative Probability")
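When the normality checks above fail, a rank-based test sidesteps the distributional assumption entirely. A sketch using simulated stand-ins for the gold_net and btc_net vectors built earlier (values are illustrative, not real CFTC data):

```r
# Rank-based alternative to the t-test: no normality assumption needed.
# gold_net / btc_net come from earlier steps; simulated here so the
# snippet is self-contained (illustrative values, not real CFTC data).
set.seed(42)
gold_net <- rnorm(100, mean = 150000, sd = 30000)
btc_net  <- rnorm(100, mean = 5000,   sd = 2000)

wilcox_res <- wilcox.test(gold_net, btc_net)  # Wilcoxon rank-sum (Mann-Whitney)
print(wilcox_res)
```

A small p-value here says the two distributions differ in location, judged on ranks rather than means.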
library(broom); library(dplyr)
# Step 0: hedging net bias
combined_clean <- combined_clean %>%
mutate(net_commercial = commercial_longs - commercial_shorts)
# Step 1: simple LM
simple_lm <- lm(Net_Longs ~ net_commercial, data = combined_clean)
print(simple_lm %>% tidy())
# Step 2: + asset fixed effects
fe_asset <- lm(Net_Longs ~ net_commercial + factor(Asset), data = combined_clean)
print(fe_asset %>% tidy())
# Step 3: + asset & year fixed effects
fe_asset_year <- lm(Net_Longs ~ net_commercial + factor(Asset) + factor(Year),
data = combined_clean)
print(fe_asset_year %>% tidy())
- For asset-specific time effects, interact the dummies: Year * Asset.
library(sandwich); library(lmtest)
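The Year * Asset extension interacts the two dummy sets, giving every Year-Asset cell its own intercept. A self-contained sketch with simulated data standing in for combined_clean (names follow the code above, values are illustrative):

```r
# Sketch: interacted fixed effects (Year x Asset), giving every Year-Asset
# cell its own intercept. combined_clean is built in earlier steps; simulated
# here so the snippet runs on its own (values are illustrative).
set.seed(1)
combined_clean <- data.frame(
  Net_Longs      = rnorm(120),
  net_commercial = rnorm(120),
  Asset          = rep(c("Gold", "Silver", "Bitcoin", "Ethereum"), 30),
  Year           = rep(2021:2023, each = 40)
)

# factor(Year) * factor(Asset) expands to both main effects plus all interactions
fe_interacted <- lm(Net_Longs ~ net_commercial + factor(Year) * factor(Asset),
                    data = combined_clean)
summary(fe_interacted)
```

With 3 years and 4 assets this estimates 13 coefficients (intercept, slope, 2 year dummies, 3 asset dummies, 6 interactions), so check that every Year-Asset cell has observations before fitting.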
# Clustered by Year (HC1) — robust to heteroskedasticity & intra-cluster correlation
# Note: vcovHC() has no cluster argument; clustering needs sandwich::vcovCL()
clustered_se <- coeftest(simple_lm, vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
print(clustered_se)
# Residuals plot — check linearity
plot(residuals(simple_lm) ~ fitted(simple_lm))
stargazer
library(stargazer)
# Compute clustered SEs for each model
clustered_se_simple <- coeftest(simple_lm,
  vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset <- coeftest(fe_asset,
  vcov = vcovCL(fe_asset, cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset_year <- coeftest(fe_asset_year,
  vcov = vcovCL(fe_asset_year, cluster = ~ Year, type = "HC1"))
# Build LaTeX table
stargazer(
simple_lm, fe_asset, fe_asset_year,
type = "latex",
out = "regression_table.tex",
dep.var.caption = "Effects on Net Longs (Post-2021 Data)",
label = "tab:regression_net",
dep.var.labels = "Non-Commercial Longs",
covariate.labels = c("Net Commercial Position"),
omit.stat = c("ser", "f"),
star.char = c("*", "**", "***"),
star.cutoffs = c(0.1, 0.05, 0.01),
add.lines = list(
c("Asset FE", "No", "Yes", "Yes"),
c("Year FE", "No", "No", "Yes")
),
omit = c("factor"),
header = FALSE, model.numbers = TRUE,
table.placement = "H", digits = 3,
se = list(clustered_se_simple[, 2],
clustered_se_fe_asset[, 2],
clustered_se_fe_asset_year[, 2])
)
- stargazer produces clean LaTeX/HTML/ASCII regression tables — multiple models side by side.
- add.lines documents which fixed effects each model includes — important for replicability.
- star.cutoffs sets the significance thresholds; report them in the table notes.
Problem-set workflow:
- Set up a .R script, load libraries, set the API key, and load the Quandl futures data (gold, silver, BTC, ETH). Merge by date into merged_*, drop NAs, sort ascending by date.
- Stack the merged data into one data frame, combined, with an Asset variable.
- Clean into combined_clean. Derive Net Long Position per trader type, Weekly Change, and Year/Quarter/Month indicators.
- Interpret the ggplot plots with 4–5 sentences each.
- Present results with stargazer (academic style) and min. 2 captioned plots.
- Use the Overleaf template on Moodle. Page minima per section: Intro 0.5, Lit 0.5, Method 0.5, Data 1.5, Results 1.5, Conclusion 0.5.
Solve the problem set posted on Moodle, building on Lectures 1–3.
Submit ONLY two files:
- .R script — well-commented, self-explanatory, efficient (meaningful names, functions for repetition, avoid for-loops). Include text/comments answering Exercise 1 and 2.
- PDF write-up — results from the .R script (plots, calculations), stargazer LaTeX tables, captioned plots, written text. 11 pt Times New Roman, 1.5 spaced.
File name pattern: RiF_ProblemSet_surname1_surname2_...
Institute of Strategic Management and Finance · Ulm University