Lecture 3: Statistical Analysis
Descriptive · inferential · modelling — applied in R
3.1 Course objectives
- 3.1 Course objectives
- 3.2 Recap from Lecture 2
- 3.3 Live Coding Session 3
- 3.4 Discussion of Assignment I
- 3.5 Conclusion of Lecture 3
Welcome to Research in Finance
- Register for “exam” 13337 in campusonline by 30 November 2025. The registration is what binds you to the course requirements; without it you cannot submit. If you are registered but don’t submit, you receive a fail grade (5.0).
- Ask questions during or right after each session — that is the preferred channel.
- Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
- Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
- We also recommend the student advisory service.
Course Objective
Scope
We will:
- Prepare Master students for their empirical thesis
- Hands-on R intro for data management, visualization, cleaning, basic modelling
- Writing tips for theses, including LaTeX & Overleaf
- Referee reviews on research presentations for empirical critique skills
We will NOT:
- Deep dive into advanced stats or ML methods
- Specific finance topics (asset pricing, etc.)
- Full thesis writing / research design training
Approach
Part I — Learn the Basics
- Hands-on R intro: a widely used language for statistical computing
- Manage, visualize and clean data; run and interpret statistical models
- Solve a real empirical problem set in R, in groups
Part II — Apply your learnings
- Mandatory participation in the institute’s Brown Bag Seminar
- Two assignments (group work and individual referee report) — see Assignments / Exams
Course at a glance
Basics
Course objectives, schedule, assignments · Introduction to R · Live coding
- Course objectives, schedule and assignments
- Introduction to R and RStudio
- Live coding: variables, vectors, matrices, data frames, lists, functions, loops
- Data import and export
Data Handling & Visualization
API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
- API access (Nasdaq Data Link / Quandl, FRED, Yahoo, Coingecko, Polygon)
- Import and cleanse: read_csv, mutate, types
- Merge and append data (merge, bind_rows)
- Filter and mutate (dplyr): subset rows, derive variables
- Group by and summarise
- Pivot wide / long
- Data visualization with ggplot2 (six-step pipeline)
- Introduction to LaTeX and Overleaf
Statistical Analysis
Descriptive · inferential · modelling — applied in R
- Descriptive statistics in R
- Correlation matrix and Pearson correlation test
- t-Test and Wilcoxon test
- Shapiro-Wilk and Kolmogorov-Smirnov tests
- Linear regression with fixed effects
- Clustered standard errors
- Exporting regression tables with stargazer
- Discussion of Assignment I (Problem Set)
Academic Publishing & Refereeing
What makes a great empirical paper · publication process · how to write a referee report
- What makes a good empirical paper (contribution, identification, write-up)
- The publication process step by step
- Top finance and economics journals
- Bad outcome vs revise & resubmit
- Referee Reports — summary, major issues, minor issues
- Referee checklist (question, identification, data, econometrics, results)
- Discussion of Assignment II (Referee Report)
Brown Bag Seminar
Engage with doctoral research and prepare your referee report
- Doctoral research presentations
- Apply empirical / writing tips for the referee report
- Group discussion and Q&A
Assignments / Exams
Assignment I — Problem Set · 50% of your grade
Documented .R script + PDF write-up (Overleaf)
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…
19 January 2026
Assignment II — Referee Report · 50% of your grade
2.5–3 page referee report on a Brown-Bag presentation
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…
3 February 2026
3.2 Recap from Lecture 2
- 3.1 Course objectives
- 3.2 Recap from Lecture 2
- 3.3 Live Coding Session 3
- 3.4 Discussion of Assignment I
- 3.5 Conclusion of Lecture 3
What we covered
| Topic | What is it? | What for? | Example syntax |
|---|---|---|---|
| API access | Load remote data via authenticated queries | Fetching CFTC time series | Quandl.api_key("…"); Quandl.datatable("QDL/LFON", …) |
| Import & cleanse | Load files (CSV) and initial transforms | Date formatting | read_csv() %>% mutate(date = as.Date(date)) |
| Merge & append | Join by keys (merge) or stack (bind_rows) | Multi-asset views | merge(d1, d2, by="date"), bind_rows(...) |
| Filter & mutate | Subset rows, derive columns | Post-2021 windows, net longs | filter(date > "2021-01-01") %>% mutate(net = longs - shorts) |
| Group & summarise | Group categories, aggregate | Annual means / SD per asset | group_by(Asset, Year) %>% summarise(...) |
| Pivot | Reshape wide ⇄ long | Correlation matrices | pivot_wider(), pivot_longer() |
| Visualization | Layered plotting system | Publication-ready figures | ggplot() + geom_line() + theme_minimal() |
Notes
Today’s lecture inverts the workflow direction of the previous two: where Lectures 1–2 took raw data and produced clean panels and pictures, Lecture 3 takes those panels and answers questions about them — descriptive (what does the data look like?), inferential (is what we see real or noise?), and modelling (what relationships and effects can we estimate?).
Every method introduced today builds on the combined_clean panel we constructed last week. If you didn’t follow Lecture 2’s pivot or join steps, the in-class code chunks will be hard to follow — revisit the corresponding ::: notes blocks in the Lecture 2 handout, or run last week’s script up to the combined_clean definition before the in-class session.
3.3 Live Coding Session 3
- 3.1 Course objectives
- 3.2 Recap from Lecture 2
- 3.3 Live Coding Session 3
- 3.4 Discussion of Assignment I
- 3.5 Conclusion of Lecture 3
Overview of basic statistical methods
- Sample spaces and events
- Random variables
- Expectations and moments
- Inequalities
- Convergence concepts
- Distributions
- Sampling
- Missing-value handling
- Central tendency, dispersion, shape
- Data summarization
- Point and interval estimation
- Hypothesis testing
- Non-parametric inference
- Bayesian inference
- Statistical decision theory
- Regression analysis
- Multivariate analysis
- ANOVA
- Time series & stochastic processes
- Survival analysis
- Causal inference
Always start with a thorough descriptive analysis to deeply understand your data before turning to advanced methods.
Notes
This is the bird’s-eye map. Most of what statistical software lets you compute falls into one of these four buckets, and a healthy empirical workflow uses all four in roughly the displayed order:
- Probability theory is the foundation — it tells you what assumptions you’re implicitly making when you pick any inferential procedure (e.g., a t-test assumes the test statistic follows a t distribution — exactly under normality, approximately in large samples via the CLT). You don’t need to derive these from scratch each time, but you should know which assumptions your tools encode.
- Descriptive statistics is the first pass on real data: means, dispersion, distributions, outliers, missingness. This is the “always do first” step the callout flags. Skipping descriptive analysis is the single most common source of overconfident results — your t-test assuming normality is meaningless if you didn’t notice that one outlier is pulling the mean.
- Inferential statistics is quantifying uncertainty: confidence intervals, hypothesis tests, p-values. The bridge from “we observed this in the sample” to “what can we say about the population”. Today’s t-test, Wilcoxon, Shapiro-Wilk, KS, and correlation tests live here.
- Statistical modelling is structured estimation — fitting a parametric form (linear regression, fixed effects, time series, causal models) to extract effects rather than just summary statistics. The regression and clustered-SE examples later today are the workhorses for academic finance.
The standard textbook references for empirical-economics modelling and identification are Mostly Harmless Econometrics (Angrist and Pischke 2009) and An Introduction to Statistical Learning (James et al. 2021); for the statistics themselves, Modern Statistics with R (Thulin 2024) walks through every test today’s lecture uses with worked R examples.
Descriptive Statistics: Overview
What is it?
- Summarises and organises raw data into meaningful patterns using averages, spreads, and visualisations.
- Presents data without drawing conclusions beyond the dataset (histograms, box plots).
- Includes data-type categorisation and basic central-tendency / variability statistics.
What is it used for?
- Initial overview — trends, outliers, patterns before deeper analysis.
- Communicating findings to non-experts.
- Data cleaning and quality checks.
- Sampling — methods, survey design, data types (nominal, ordinal, interval, ratio).
- Missing values — imputation, prediction, elimination.
- Central tendency — mean (arithmetic, geometric, harmonic), median, mode.
- Dispersion — variance, standard deviation, range, IQR, coefficient of variation.
- Shape & symmetry — skewness, kurtosis, normality (Q-Q plots).
- Summarization — frequency distributions, histograms, box / stem-and-leaf / scatter plots, ECDF, quantiles, outliers.
Notes
The order in which you should look at a new dataset for an empirical paper:
- Sample composition — nrow(), length(unique(group_var)), missingness map (naniar::vis_miss). Are observations balanced across groups, or is the panel unbalanced?
- Per-variable centre and spread — mean, median, sd, min/max. Use summary(df) for a quick look; tabulate properly via dplyr::summarise for inclusion in a paper.
- Distributional shape — skewness, kurtosis, formal normality tests (Shapiro-Wilk later in this deck), Q-Q plots. Many parametric tests assume approximate normality; checking is non-negotiable.
- Outliers and influential points — boxplots (last lecture’s example) or studentised residuals after a regression. Document and decide explicitly: keep, winsorise, or drop, and report the choice.
- Cross-sectional structure — correlation matrix, scatterplot matrix. Flags dependence between variables that affects which regression specifications make sense.
Every academic finance paper has a “Table 1” that reports per-variable means, SDs, percentiles for the regression sample. Building that table is the descriptive-statistics step productionised. The pivot+group_by+summarise pattern on the next slide is the canonical recipe.
Example: Summary statistics
library(dplyr)
library(tidyr)
library(readr)  # write_csv() below comes from readr
# Compact summary statistics per asset and per variable
desc_stats <- combined_clean %>%
select(Asset, Net_Longs, market_participation) %>%
pivot_longer(cols = -Asset, names_to = "Variable", values_to = "Value") %>%
mutate(Variable = case_when(
Variable == "Net_Longs" ~ "Net Longs",
Variable == "market_participation" ~ "Market Part."
)) %>%
group_by(Asset, Variable) %>%
summarise(
Mean = round(mean(Value, na.rm = TRUE), 0),
Std = round(sd(Value, na.rm = TRUE), 0),
Min = round(min(Value, na.rm = TRUE), 0),
P10 = round(quantile(Value, 0.1, na.rm = TRUE), 0),
P50 = round(median(Value, na.rm = TRUE), 0),
P90 = round(quantile(Value, 0.9, na.rm = TRUE), 0),
Max = round(max(Value, na.rm = TRUE), 0),
.groups = "drop"
)
write_csv(desc_stats, "desc_stats_clean.csv")
- Pivot to long format → group by (Asset, Variable) → summarise per cell.
- Reports central tendency (Mean, Median = P50), dispersion (Std), and the tails (Min, P10, P90, Max) — the standard Table-1 shape for empirical papers.
- case_when() cleans variable labels for the published version.
- .groups = "drop" quiets the grouping warning and produces an ungrouped result.
Notes
This is the canonical recipe for an empirical paper’s “Table 1”. A few stylistic choices worth flagging:
- Pivot to long, summarise, leave wide — the long-format intermediate makes it natural to add a new variable later (just include it in the select()); the final wide table reads as a paper-ready summary.
- Reporting both percentiles and (mean, SD) — central tendency and tails. Means alone hide skewness; reporting P10 and P90 (or Q1/Q3) lets the reader see the bulk of the distribution.
- Round at output time, not in storage — round(..., 0) here makes the table presentation-friendly. Keep the underlying values un-rounded in combined_clean for any further computation.
- case_when() for label cleanup — the raw column names (market_participation, non_commercial_longs) are CSV-friendly but ugly in a paper. Renaming inside the pivoted long frame keeps the source data names intact while presenting nicer labels.
- Persist the result — write_csv("desc_stats_clean.csv") saves the table for later use in your LaTeX paper (e.g. read_csv + kableExtra::kable for booktabs output, or paste straight into a tabular environment); a minimal sketch follows below.
Example: Distribution plots
library(ggplot2)
long_data <- combined_clean %>%
select(Asset, Net_Longs, market_participation,
non_commercial_shorts, non_commercial_longs) %>%
pivot_longer(-Asset, names_to = "Variable", values_to = "Value")
ggplot(long_data, aes(x = Value, color = Asset, fill = Asset)) +
geom_density(alpha = 0.5) + # transparent overlays
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
labs(title = "Faceted Density Plots: Distributions by Variable",
x = "Value", y = "Density")
ggsave("distribution_plot.pdf", width = 6, height = 4)geom_density()is a smoothed histogram — better for comparing shapes than raw bars.alpha = 0.5lets overlapping curves stay readable.facet_wrap(~ Variable, scales = "free")lets each variable use its own y-range.- Save as PDF for inclusion in your paper — vector plots scale without pixelation.
Notes
Density plots are usually preferable to histograms when you want to compare distributions across groups. Histograms encode bin choices as a parameter (and a misleading bin width can dramatically change the visual story); a kernel density estimate smooths over that choice with the bandwidth parameter, which geom_density() selects sensibly by default.
For per-group comparison, geom_density(aes(fill = group), alpha = 0.5) overlays the densities on the same panel — fast visual readout of “are these distributions the same shape, the same location, or different in both?” When the panels are very different in scale (as here — market_participation is in the thousands; Net_Longs ranges across hundreds of thousands), facet_wrap(~ Variable, scales = "free") is the right call: each panel uses its own x-axis range so neither variable is squashed.
The output of this slide is the visual counterpart to the previous slide’s tabular descriptive statistics. Reporting both — a Table 1 and a multi-panel density figure — is standard practice in modern empirical-finance papers.
Inferential Statistics: Overview
What is it?
- Uses sample data to generalise / predict about a population, via estimation and testing.
- Probability-based methods: hypothesis tests, confidence intervals, p-values to quantify uncertainty.
- Covers parametric (assume distribution) and non-parametric approaches; Bayesian techniques update beliefs.
What is it used for?
- Test hypotheses (e.g., does treatment have an effect?).
- Estimate population parameters from samples.
- Assess relationships and differences with controlled error rates.
- Point estimation — bias, variance, MSE; method of moments; MLE; asymptotic normality.
- Interval estimation — normal/t-based CIs; bootstrap; coverage probability.
- Hypothesis testing — null/alternative; test statistics & p-values; type I/II errors; common tests (t, ANOVA, Wald, LR, χ², permutation, correlation); multiple-testing adjustments (Bonferroni, FDR); goodness-of-fit (KS, χ²).
- Non-parametric inference — rank tests (Wilcoxon, Mann-Whitney); sign tests; kernel density basics.
- Bayesian inference — priors/posteriors, Bayes’ theorem, conjugate priors, credible intervals, MCMC.
- Decision theory — loss/risk; minimax/Bayes estimators.
Notes
Inferential statistics is the bridge from sample to population. Every test on this slide is a way of asking “does what I see in the sample tell me something about the underlying data-generating process, or could it be noise?” The answer is always probabilistic, and the standard reporting unit is the p-value (probability of observing at least this extreme a result under the null hypothesis).
A few framing reminders worth internalising before today’s tests:
- Statistical significance ≠ economic significance. A trillion observations make any small effect “significant” at p < 0.001; a point estimate of 0.0001% is not interesting in finance even if its standard error is tiny. Always report and interpret the magnitude of an effect, not just the p-value.
- Multiple-testing inflates the false-discovery rate. Run 20 t-tests at α = 0.05 and you should expect 1 spurious “significant” result by chance. For an honest analysis with many comparisons, adjust (Bonferroni for conservative control of the family-wise error rate; Benjamini–Hochberg for FDR) — see the sketch after this list.
- Choose parametric vs non-parametric based on what you can defend. If your data is approximately normal (Shapiro-Wilk passes) and the sample is large enough for the CLT to bite, parametric tests (t-test, Pearson correlation) are more powerful. If not, the rank-based equivalents (Wilcoxon, Spearman) cost some power but make weaker assumptions — usually the safer call when you’re unsure.
- Bayesian alternatives (brms, rstanarm in R) report the posterior distribution of the parameter rather than a binary “reject/fail-to-reject”, which many find more directly interpretable. Beyond today’s scope but worth knowing about for thesis-level work.
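A minimal base-R sketch of that adjustment step (the p-values below are made up for illustration):
# Hypothetical raw p-values from five pairwise tests
pvals <- c(0.003, 0.012, 0.041, 0.049, 0.210)
p.adjust(pvals, method = "bonferroni")  # conservative FWER control
p.adjust(pvals, method = "BH")          # Benjamini-Hochberg FDR control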
Reference: Mostly Harmless Econometrics (Angrist and Pischke 2009) for the empirical-economist’s view of inference and identification; Modern Statistics with R (Thulin 2024) for an R-first walkthrough of every test.
Example: Correlation matrix
library(dplyr); library(tidyr); library(tibble); library(ggplot2)  # tidyr: pivots; tibble: rownames_to_column()
cor_matrix <- combined_clean %>%
select(date, Asset, Net_Longs) %>%
pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
select(-date) %>%
cor(use = "pairwise.complete.obs", method = "pearson") %>%
as.data.frame() %>%
rownames_to_column(var = "Asset1") %>%
pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
mutate(Correlation = round(Correlation, 2)) %>%
filter(!is.na(Correlation))
ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Correlation), color = "black",
size = 3, fontface = "bold") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Matrix of Net Longs (Assets)",
subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
x = "Asset", y = "Asset")- Values from −1 to 1; diagonal = 1 (self), off-diagonal = strength/direction.
|r| > 0.5≈ strong; negative = inverse (hedging potential).pairwise.complete.obsignores NAs pairwise — usable when the panel isn’t perfectly balanced.- Heatmap + value annotation makes the matrix readable at a glance, even in a slide.
Notes
This is the same correlation-heatmap recipe as Lecture 2 — repeated here because the next slide turns the visual into a hypothesis test. The visualisation tells you which pairs might be worth a more rigorous test; the test on the next slide turns “looks correlated” into “we can reject zero correlation at p < α”.
Two technical notes for the assignment:
- use = "pairwise.complete.obs" computes each pairwise correlation on whatever observations are available for that pair, even if other columns have missingness on those rows. The default use = "everything" returns NA for any pair where any column has any NA — usually too aggressive. The pairwise option is the standard for unbalanced panels but introduces a subtle issue: different pairwise correlations are computed on different sample sizes, and the resulting matrix is no longer guaranteed to be positive semi-definite (occasionally matters for downstream uses like portfolio optimisation).
- method = "pearson" measures linear dependence and is sensitive to outliers. method = "spearman" uses ranks and measures monotonic dependence — robust to outliers and to non-linear-but-monotonic relationships. For finance returns where heavy tails are common, Spearman is often the safer default; report both if they disagree.
The correlation visualisation is the descriptive stop; running cor.test() (next slide) and reporting effect size + significance is the inferential stop.
Example: Pearson correlation test
gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]
# Full-dataset test
full_test <- cor.test(gold_net, btc_net)
print(full_test)
# e.g., r = -0.14, p = 0.03 — weak negative correlation, statistically significant
# Crisis subset (2022)
crisis_gold <- combined_clean$Net_Longs[
combined_clean$Asset == "Gold" & combined_clean$Year == 2022]
crisis_btc <- combined_clean$Net_Longs[
combined_clean$Asset == "Bitcoin" & combined_clean$Year == 2022]
crisis_test <- cor.test(crisis_gold, crisis_btc)
print(crisis_test)
# e.g., r = -0.06, p = 0.68 — not statistically significant
- r ranges −1 to 1; |r| > 0.5 ≈ strong; p < 0.05 rejects H₀ of no association.
- Subsetting (here: 2022 only) tests whether the relationship is regime-dependent.
- Pearson assumes linearity + normality — for non-normal data prefer Spearman (method = "spearman").
- Always report r, p, and the sample size alongside the test in your write-up.
Notes
cor.test() returns the Pearson correlation \(r\) along with a t-statistic, p-value, and 95 % CI. The reporting unit in any paper is the full triple (\(r\), \(p\), \(n\)) — never just the correlation, never just the p-value.
The example deliberately compares the full sample to a regime subset (2022 only) — this is the subsample style of analysis that academic papers use to surface regime-dependence: “the relationship between gold and bitcoin sentiment is weakly negative on average but not statistically significant in 2022, suggesting the decoupling hypothesis isn’t supported during this specific window”. That’s a meaningful, defensible empirical claim — much better than just “they’re correlated” or “they’re not”.
Three robustness checks worth running alongside cor.test():
- Different windows (rolling correlation, e.g. slider::slide_dbl(..., .before = 12) for a rolling 12-week correlation; see the sketch below) — visualise the time-variation in the relationship rather than a single point estimate.
- Outlier sensitivity — re-run after winsorising (DescTools::Winsorize at 1 % each tail) or trimming. If the relationship survives, it is more credible.
- Spearman comparison — if Pearson and Spearman disagree, your finding likely depends on a few influential observations or a non-linear shape; investigate before publishing.
When sample sizes are small (under ~30), Fisher’s z-transformation gives a more reliable confidence interval; cor.test() does this automatically.
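A sketch of the rolling-correlation check, assuming the combined_clean panel from the lecture; slide2_dbl() is slider's two-input cousin of slide_dbl(), and the 12-week window (.before = 11 plus the current week) is an arbitrary illustrative choice:
library(dplyr); library(tidyr); library(slider)
# Align the two series by date, then compute a 12-week rolling correlation
roll_cor <- combined_clean %>%
  filter(Asset %in% c("Gold", "Bitcoin")) %>%
  select(date, Asset, Net_Longs) %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs) %>%
  arrange(date) %>%
  mutate(cor_12w = slide2_dbl(Gold, Bitcoin,
                              ~ cor(.x, .y, use = "complete.obs"),
                              .before = 11, .complete = TRUE))
# Plot cor_12w against date to see when (and whether) the relationship flips sign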
Example: t-Test
library(broom)
trad_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Gold", "Silver")]
crypto_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Bitcoin", "Ethereum")]
# Full dataset
full_test <- t.test(trad_net, crypto_net)
print(full_test %>% tidy()) # tidy = clean table
# p < 0.05 → crypto means differ from traditional means
# Crisis subset (2022)
crisis_data <- combined_clean[combined_clean$Year == 2022, ]
non_crisis_data <- combined_clean[combined_clean$Year != 2022, ]
crisis_test <- t.test(crisis_data$Net_Longs, non_crisis_data$Net_Longs)
print(crisis_test %>% tidy())
- Parametric — assumes the two samples are roughly normal.
- Outputs t-statistic, p-value, and 95% CI.
- Low p (< 0.05) rejects H₀ of equal means.
- Use Wilcoxon if data are non-normal; ANOVA for >2 groups.
- broom::tidy() turns the test object into a single-row tibble — easy to bind into a results table.
Notes
The two-sample t-test asks: do these two groups have the same mean? It is the workhorse comparison in empirical work and is the right tool when:
- The two samples are roughly normal (check with Shapiro-Wilk later in this deck or visually with a Q-Q plot), OR the sample sizes are large enough for the CLT to kick in (~30+ per group is the conventional cut-off, more if the data is very skewed).
- The samples are independent (no observation pairing). For paired data — e.g. the same firm before vs after an event — use the paired t-test (t.test(x, y, paired = TRUE)) instead.
- The variance assumption is handled. R’s t.test() defaults to the Welch t-test, which does not assume equal variances — the safer default. The classical Student t-test (equal variances) is t.test(..., var.equal = TRUE); only use it if you’ve checked var.test() and the assumption holds.
The reporting triple is again (\(t\)-statistic, \(p\), group means and Ns). For a paper, also report the effect size — Cohen’s \(d\), computed as \((\bar{x}_1 - \bar{x}_2) / s_p\) where \(s_p\) is the pooled SD. R’s effectsize::cohens_d() computes it. A statistically significant t-test with d = 0.05 is statistically real but economically tiny.
broom::tidy() is the trick that makes test objects analysis-friendly: it turns the chunky htest object that t.test() returns into a one-row tibble (estimate1, estimate2, statistic, p.value, conf.low, conf.high, method, …). You can bind_rows many tidied tests into a single comparison table.
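A short sketch combining both points, reusing trad_net and crypto_net from the example above (the test labels are illustrative):
library(broom); library(dplyr); library(effectsize)
# Effect size alongside the test: Cohen's d with its 95% CI
cohens_d(trad_net, crypto_net)
# Stack tidied tests into a single comparison table
bind_rows(
  tidy(t.test(trad_net, crypto_net)) %>% mutate(test = "Welch t"),
  tidy(wilcox.test(trad_net, crypto_net)) %>% mutate(test = "Wilcoxon")
) %>%
  select(test, statistic, p.value)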
Example: Wilcoxon test
# Full dataset
wilcox_full <- wilcox.test(trad_net, crypto_net)
print(wilcox_full %>% tidy())
# Crisis subset
crisis_net <- crisis_data$Net_Longs
non_crisis_net <- non_crisis_data$Net_Longs
wilcox_crisis <- wilcox.test(crisis_net, non_crisis_net)
print(wilcox_crisis %>% tidy())
- Compares medians non-parametrically using ranks; outputs W-statistic and p-value.
- Robust to outliers and non-normality — your default when Shapiro shows non-normality.
- For >2 groups: Kruskal-Wallis.
- Pair with a density plot to communicate why medians differ.
Notes
The Wilcoxon rank-sum test (also known as Mann-Whitney U) is the non-parametric counterpart to the t-test. Instead of asking “are the means equal?”, it asks “are the distributions equal in central tendency?” — operationally, “if you randomly drew one observation from each group, what’s the probability that group 1’s draw exceeds group 2’s?”. When that probability is exactly 0.5, the distributions are identical in central tendency.
When to use Wilcoxon over t:
- The data is heavily skewed or has outliers — the rank transformation makes it robust.
- The sample is small (under ~30) and the data is non-normal — the t-test’s CLT-based approximation isn’t reliable yet.
- The data is ordinal (e.g., survey ratings 1–5) — means don’t really make sense; medians do.
When to prefer t:
- Both groups are roughly normal and reasonably sized — the t-test has higher power (more likely to detect a true effect when one exists).
In practice, run both and report whichever your data justifies. If they agree on the conclusion (both significant or both not), you’re robust; if they disagree, the parametric assumption is probably violated and Wilcoxon is the more credible call.
For >2 groups the analogues are ANOVA (parametric) and Kruskal-Wallis (non-parametric). For paired samples: paired t-test or Wilcoxon signed-rank (wilcox.test(..., paired = TRUE)).
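In code, a sketch of those variants on the lecture's combined_clean (the paired call is left commented out because the before/after vectors are hypothetical):
# >2 groups, non-parametric: Kruskal-Wallis across all four assets
kruskal.test(Net_Longs ~ factor(Asset), data = combined_clean)
# Paired samples: Wilcoxon signed-rank on two measurements of the same units
# wilcox.test(before, after, paired = TRUE)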
Example: Shapiro-Wilk test
library(broom)
shapiro_gold <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Gold"])
shapiro_btc <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
print(shapiro_gold %>% tidy()) # e.g., p < 0.05 → non-normal
print(shapiro_btc %>% tidy()) # e.g., p > 0.05 → normal
# Q-Q plot for visual check
qqnorm(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
qqline(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
# Density comparison
gold_btc_data <- combined_clean[combined_clean$Asset %in% c("Gold", "Bitcoin"), ]
ggplot(gold_btc_data, aes(x = Net_Longs, fill = Asset)) +
geom_density(alpha = 0.7) +
facet_wrap(~ Asset, scales = "free") +
theme_minimal() +
labs(title = "Net Longs Distributions: Gold vs Bitcoin (Faceted)",
x = "Net Longs", y = "Density")- Outputs W-statistic (1 = perfect normal) and a p-value.
- Low p rejects H₀ of normality → consider transformations (
log) or non-parametric tests. - The Q-Q plot is the visual companion — straight diagonal = normal, deviations show where it breaks.
- Run on each subgroup, not just the pooled data — averages can hide non-normality.
Notes
Shapiro-Wilk is the standard test for normality. The null hypothesis is “the data is normally distributed”; small p-values reject normality. The W-statistic itself is between 0 and 1 — closer to 1 means closer to normal — but the p-value is the operational reporting unit.
A few caveats that matter for empirical work:
- Shapiro-Wilk is sensitive to sample size. With \(n > 5{,}000\), the test will reject normality even on essentially-normal data because the small departures every real dataset has become statistically detectable. With \(n < 30\), the test has too little power to detect even substantial non-normality. The visual companions — Q-Q plot and density plot — remain useful regardless of n.
- Normality is a finite-sample concern. The CLT means that for the sampling distribution of the mean (which is what t-tests rely on), normality of the underlying data matters less and less as n grows. By \(n \sim 30\) per group the t-test is robust to mild non-normality; by \(n \sim 100\) it’s very robust.
- Test each subgroup separately — pooling can mask group-specific non-normality. The example does this correctly: separate Shapiro on Gold and on Bitcoin.
The Q-Q plot (qqnorm + qqline) is the visual diagnostic — points should fall along the diagonal. Heavy tails show as upward deviations at the extremes (commonly seen in financial returns); skewness as a curved deviation. Always inspect the Q-Q plot, not just the p-value.
If non-normality matters for your downstream test, options: (a) transform (log, Box-Cox), (b) use rank-based / non-parametric tests, (c) bootstrap the standard errors. Option (c) is increasingly the go-to for academic work because it makes minimal distributional assumptions.
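A minimal sketch of option (c), assuming the boot package and the lecture's combined_clean: a percentile bootstrap CI for the Gold vs Bitcoin difference in mean net longs, with strata keeping the two group sizes fixed across resamples:
library(boot)
# Statistic: difference in mean Net_Longs between Gold and Bitcoin
diff_stat <- function(d, idx) {
  d <- d[idx, ]
  mean(d$Net_Longs[d$Asset == "Gold"], na.rm = TRUE) -
    mean(d$Net_Longs[d$Asset == "Bitcoin"], na.rm = TRUE)
}
gb <- combined_clean[combined_clean$Asset %in% c("Gold", "Bitcoin"), ]
set.seed(1)
b <- boot(gb, diff_stat, R = 2000, strata = factor(gb$Asset))
boot.ci(b, type = "perc")  # percentile CI, no normality assumed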
Example: Kolmogorov-Smirnov test
gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]
ks_test <- ks.test(gold_net, btc_net)
print(ks_test %>% tidy()) # e.g., D = 0.25, p < 0.05 → different shapes
# ECDF visualisation
ggplot(gold_btc_data, aes(x = Net_Longs, color = Asset)) +
stat_ecdf() +
facet_wrap(~ Asset, scales = "free") +
theme_minimal() +
labs(title = "ECDF Plot: Gold vs Bitcoin (Faceted)",
x = "Net Longs", y = "Cumulative Probability")- Compares two distributions non-parametrically.
- D = max vertical distance between the two empirical CDFs.
- Low p rejects H₀ that the samples come from the same distribution.
- Useful as a prelude to a parametric test — confirms group differences before assuming a shared distribution.
Notes
The Kolmogorov-Smirnov (KS) test comes in two flavours, both of which ks.test() handles depending on its arguments:
- Two-sample KS (the example): “are these two empirical distributions different in any way — location, scale, or shape?”. The D-statistic is the maximum vertical gap between the two empirical CDFs. The test is sensitive to differences in shape that a t-test (location only) or F-test (scale only) would miss.
- One-sample KS: “is this sample drawn from a specified distribution?” — usually used to test against the normal: ks.test(x, "pnorm", mean(x), sd(x)). Largely superseded by Shapiro-Wilk for normality testing in modern practice.
KS is conservative — it has lower power than t or Wilcoxon when the difference is purely in location. Its strength is detecting differences in shape that the location-only tests miss; for a paper, reporting KS alongside t (and Wilcoxon) gives the reader a fuller picture. If t says “different means”, Wilcoxon says “different medians”, and KS says “different distributions”, you have very strong evidence the two groups come from different processes.
The ECDF plot (stat_ecdf in ggplot2) is the visual companion — vertical gaps directly correspond to the KS test statistic. A steeper or more shifted ECDF for one group is the visual “why” behind a significant KS p-value.
Statistical Modelling: Overview
What is it?
- Mathematical representations of data relationships (regression, multivariate) to explain or predict outcomes.
- Time-series forecasting, ML interfaces, computational simulations.
- Focuses on model fitting, diagnostics, validation in real-world data.
What is it used for?
- Forecasting (markets, disease).
- Causal inference, classification, dimensionality reduction.
- Optimization in big-data settings.
- Regression — simple/multiple linear, logistic, fixed/random effects, clustered/robust SEs, interactions, non-linear terms.
- Multivariate — PCA, factor analysis, canonical correlation, k-means, discriminant analysis.
- Time series — ACF/PACF, ADF/KPSS stationarity, ARIMA, Monte Carlo, GARCH.
- Survival — Kaplan-Meier, Cox PH, censoring, Weibull.
- Causal inference — IV, DID, PSM, DAGs, RDD.
- Non-linear / GLM — GLMs, Poisson, splines, quantile regression.
Notes
This is the high-level menu of “what kinds of relationships can statistical models estimate?”. For an empirical-finance master’s-level workflow, the operationally relevant subset is:
- Linear regression with fixed effects (next slide) — the workhorse for almost every paper. Pools observations across groups (firms, assets, dates) while absorbing group-specific means via fixed effects. Mostly Harmless Econometrics (Angrist and Pischke 2009) is the canonical reference.
- Robust / clustered standard errors (slide after) — an inference fix on top of OLS that accounts for within-group correlation in residuals (essential whenever your panel has natural groupings: industry, year, firm).
- Causal inference techniques (DiD, IV, RDD, propensity-score matching) — when the question is “does X cause Y?” rather than “is X correlated with Y?”. DiD: compare treatment vs control before/after an event; IV: use a third variable that affects X but not Y directly to identify the causal effect of X. Mostly Harmless covers these in detail; the modern Causal Inference: The Mixtape by Cunningham is also excellent.
- Time-series methods (ARIMA, GARCH for volatility, VAR for multivariate dynamics) — when the time-ordering of observations matters (autocorrelation, predictive lags, conditional volatility).
The Finance Project course (the sister course to this one) goes deeper into the modelling toolkit specifically: ridge, lasso, cross-validation, prediction-market microstructure. An Introduction to Statistical Learning (James et al. 2021) is the bridging textbook between this lecture’s classical inference and the predictive-modelling lens used there.
Example: Linear regression with fixed effects
library(broom); library(dplyr)
# Step 0: hedging net bias
combined_clean <- combined_clean %>%
mutate(net_commercial = commercial_longs - commercial_shorts)
# Step 1: simple LM
simple_lm <- lm(Net_Longs ~ net_commercial, data = combined_clean)
print(simple_lm %>% tidy())
# Step 2: + asset fixed effects
fe_asset <- lm(Net_Longs ~ net_commercial + factor(Asset), data = combined_clean)
print(fe_asset %>% tidy())
# Step 3: + asset & year fixed effects
fe_asset_year <- lm(Net_Longs ~ net_commercial + factor(Asset) + factor(Year),
data = combined_clean)
print(fe_asset_year %>% tidy())
- Absorb group-specific means → estimate within-group variation.
- Control for time-invariant confounders.
- Reduce omitted-variable bias.
- Limitation: fixed effects absorb all time-invariant group characteristics, so their effects cannot be estimated separately; for time-varying interactions use Year * Asset.
Notes
The progressive build (no FE → asset FE → asset + year FE) is exactly how you’d write the empirical-strategy section of a paper: each step adds one piece of structure, and the question at every step is what does my coefficient on net_commercial mean now?
- lm(Net_Longs ~ net_commercial) — pools across all assets and dates; the slope estimates the average relationship in the pooled data. Vulnerable to omitted-variable bias: any variable correlated with both Net_Longs and net_commercial that we haven’t controlled for biases the slope.
- + factor(Asset) — adds asset fixed effects. Conceptually equivalent to demeaning Net_Longs and net_commercial within each asset, then running OLS. The slope now estimates within-asset variation: “for a given asset, when its commercial net goes up by 1, its non-commercial net changes by β”. Controls for every time-invariant asset characteristic (Bitcoin’s higher baseline volatility, gold’s longer history) without us having to enumerate them.
- + factor(Year) — adds year fixed effects on top. Now the slope estimates the within-(asset × year) relationship: “in a given asset and year, when commercial net goes up, what happens to non-commercial net?”. Absorbs all year-specific shocks that affect all assets equally (a global risk-on vs risk-off year, common policy events).
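A quick sanity check of the demeaning equivalence in the second bullet; a sketch assuming combined_clean and fe_asset from the code above (the match is exact in a panel without missing values):
library(dplyr)
# Demean outcome and regressor within each asset, then run plain OLS:
# the slope should reproduce the net_commercial coefficient from fe_asset
demeaned <- combined_clean %>%
  group_by(Asset) %>%
  mutate(nl_dm = Net_Longs - mean(Net_Longs, na.rm = TRUE),
         nc_dm = net_commercial - mean(net_commercial, na.rm = TRUE)) %>%
  ungroup()
coef(lm(nl_dm ~ nc_dm, data = demeaned))["nc_dm"]
coef(fe_asset)["net_commercial"]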
For larger panels with many fixed effects, fixest::feols is much faster than lm() and has cleaner syntax: feols(Net_Longs ~ net_commercial | Asset + Year, data = ...) is the FE equivalent and natively supports clustered SEs.
The classical reference for fixed-effects panel methods is Mostly Harmless Econometrics (Angrist and Pischke 2009), chapter 5; Tidy Finance with R (Scheuch, Voigt, and Weiss 2023) applies them to asset-pricing regressions in the natural finance context.
Example: Clustered standard errors
library(sandwich); library(lmtest)
# HC1 clustered by Year — robust to heteroskedasticity & intra-cluster correlation
clustered_se <- coeftest(simple_lm, vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
print(clustered_se)
# Residuals plot — check linearity
plot(residuals(simple_lm) ~ fitted(simple_lm))
- Adjust for intra-group correlation (e.g., observations in the same year).
- Wider SEs may make p-values larger (e.g., 0.01 → 0.05) — conservative inference.
- Need ≥ 20 clusters; bootstrap for fewer.
- Always specify in your write-up: “SEs clustered at the year level to account for intra-year correlation and heteroskedasticity.”
Notes
OLS standard errors are correct only under (a) homoskedasticity and (b) no correlation across observations. Real financial data violates both routinely:
- Heteroskedasticity — residual variance depends on the right-hand-side variables (e.g. variance of returns scales with volatility regimes). HC robust SEs (the “HC” family) fix this.
- Cross-observation correlation — observations within the same group (firm, year, industry) share unobserved shocks, so residuals are correlated within groups. Clustered SEs (vcovCL with cluster = ~ Year) account for this.
The intuition: with intra-cluster correlation, you have fewer effective independent observations than the raw N. Treating dependent observations as independent makes standard errors too small and p-values misleadingly low. Clustering inflates SEs to the right scale.
When to cluster on what:
- One-way: cluster on the dimension you suspect has the most within-cluster correlation. For panel data with rare cross-firm shocks but big year-specific shocks, cluster on year. For an event study with cross-sectional events, cluster on firm.
- Two-way clustering (firm AND year, common in academic finance) — vcovCL(model, cluster = ~ firm + year) from the sandwich package. Robust to either source of correlation.
- Bootstrap — for very small numbers of clusters (<20), a cluster bootstrap (fwildclusterboot) gives more reliable inference.
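In code, both flavours via sandwich's vcovCL(), reusing fe_asset_year from the regression slide; a sketch (the two-term cluster formula needs a reasonably recent sandwich version):
library(sandwich); library(lmtest)
# One-way: cluster by year
coeftest(fe_asset_year, vcov = vcovCL(fe_asset_year, cluster = ~ Year))
# Two-way: cluster by asset and year
coeftest(fe_asset_year, vcov = vcovCL(fe_asset_year, cluster = ~ Asset + Year))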
The convention in modern empirical-finance papers is to report clustered SEs by default, with a note specifying the clustering dimension(s) — see the add.lines block in the next slide’s stargazer example.
Exporting results — stargazer
library(stargazer)
# Compute clustered SEs for each model
clustered_se_simple <- coeftest(simple_lm,
  vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset <- coeftest(fe_asset,
  vcov = vcovCL(fe_asset, cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset_year <- coeftest(fe_asset_year,
  vcov = vcovCL(fe_asset_year, cluster = ~ Year, type = "HC1"))
# Build LaTeX table
stargazer(
simple_lm, fe_asset, fe_asset_year,
type = "latex",
out = "regression_table.tex",
dep.var.caption = "Effects on Net Longs (Post-2021 Data)",
label = "tab:regression_net",
dep.var.labels = "Non-Commercial Longs",
covariate.labels = c("Commercial Longs"),
omit.stat = c("ser", "f"),
star.char = c("*", "**", "***"),
star.cutoffs = c(0.1, 0.05, 0.01),
add.lines = list(
c("Asset FE", "No", "Yes", "Yes"),
c("Year FE", "No", "No", "Yes")
),
omit = c("factor"),
header = FALSE, model.numbers = TRUE,
table.placement = "H", digits = 3,
se = list(clustered_se_simple[, 2],
clustered_se_fe_asset[, 2],
clustered_se_fe_asset_year[, 2])
)
- stargazer produces clean LaTeX/HTML/ASCII regression tables — multiple models side by side.
- add.lines documents which fixed effects each model includes — important for replicability.
- star.cutoffs sets the significance thresholds; report them in the table notes.
- Always report R², N, and the thresholds in the table notes, e.g.: “*** p < 0.01, ** p < 0.05, * p < 0.1.”
Notes
stargazer produces the multi-column regression table that is essentially universal in empirical-economics and finance papers — three or four nested specifications side-by-side, each adding a layer of fixed effects or controls, with R², N, and significance stars at the bottom.
The conventions encoded in the example are the academic-paper standard:
- Multiple models in adjacent columns (model 1 = no FE, model 2 = asset FE, model 3 = asset + year FE) tell the reader what changes as you add structure — does the coefficient remain stable (good — robust to specification choice) or move dramatically (warning — driven by what you absorbed)?
- add.lines documenting fixed effects — readers should see at a glance which columns control for what. Without these, your reader has to parse coefficient labels to figure it out.
- Clustered SEs in se = list(...) — by default stargazer shows OLS SEs from the model object; passing the clustered SEs explicitly is essential, otherwise the table reports the wrong inference.
- omit = c("factor") suppresses the dozens of fixed-effect dummies from the displayed table — they’re estimated but uninteresting; readers care about the coefficient on net_commercial, not on each year dummy.
- Significance stars + thresholds — star.cutoffs = c(0.1, 0.05, 0.01) is the most common convention (some journals prefer 0.05/0.01/0.001); always state which thresholds you used in the table notes.
Modern alternatives to stargazer: modelsummary (more flexible, supports more model types, prettier defaults) and fixest::etable (designed to pair with feols, handles many FE elegantly). For a thesis or paper, all three are acceptable; consistency within one document matters more than which one you chose.
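For comparison, a sketch of the same three-column table via modelsummary; the argument names follow modelsummary's documented interface, and the output file name is illustrative:
library(modelsummary)
modelsummary(
  list("(1)" = simple_lm, "(2)" = fe_asset, "(3)" = fe_asset_year),
  vcov = ~ Year,                 # clustered SEs, computed via sandwich
  coef_omit = "factor",          # hide the FE dummies
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  output = "regression_table_ms.tex"
)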
3.4 Discussion of Assignment I
- 3.1 Course objectives
- 3.2 Recap from Lecture 2
- 3.3 Live Coding Session 3
- 3.4 Discussion of Assignment I
- 3.5 Conclusion of Lecture 3
Assignment I — Problem Set
- Get the data: create a .R script, load libraries, set the API key, load the Quandl futures data (gold, silver, BTC, ETH). Merge by date into merged_*, drop NAs, sort ascending by date.
- Understand the data: read the Nasdaq & CFTC sources; write 3–5 sentences for your “Data” section.
- Join data sets: combine into combined, with an Asset variable.
- Clean & transform: filter post-1 April 2021, save as combined_clean. Derive Net Long Position per trader type, Weekly Change, and Year/Quarter/Month indicators.
- Descriptive analysis: mean, sd, min, P10, median, P90, max — table for the Data section.
- Analysis & plot: discuss patterns; implement min. two ggplot plots with 4–5 sentences each.
- Define a research question: simple, testable RQ from your patterns (e.g., “Do crypto assets show higher Net Long Position volatility?”). Briefly motivate it in the Introduction (≥ 0.5 page).
- Literature review: 1–2 papers to build on (3–4 sentences, ≥ 0.5 page).
- Define your approach: how you will answer it and why that makes sense (≥ 0.5 page).
- Report insights: Results (≥ 1.5 pages), min. two LaTeX tables via stargazer (academic style) and min. two captioned plots.
- Explain & discuss: Conclusion (≥ 0.5 page) — implications; quality over length.
Use the Overleaf template on Moodle. Page minima per section: Intro 0.5, Lit 0.5, Method 0.5, Data 1.5, Results 1.5, Conclusion 0.5.
Notes
The assignment is split into two halves on purpose: Exercise 1 (data preparation) is essentially a structured replication of the workflow we built in Lectures 1–3 — pull data, merge, clean, descriptive statistics, two plots — checking that you can put the toolkit into operation end-to-end. Exercise 2 (paper) is the harder and more rewarding half: define your own research question on the data and write a short academic paper.
A few recommendations for the paper half:
- Pick a research question whose answer you don’t know in advance. “Are gold and bitcoin correlated?” is too obvious — you’ve already plotted it. “Do crypto net-long positions show stronger weekly persistence than gold’s, and is the gap larger after FTX?” is sharper, testable, and the answer is genuinely uncertain.
- Two papers are enough for the lit review. This isn’t a thesis — find two recent, well-cited papers on a closely-related question, summarise what they did and what they found in 3–4 sentences each, and explain what your analysis adds.
- Methodology section should be specific enough that a competent reader could re-run your analysis from the description alone. List the regression specification (with cluster level), define every variable, state the sample.
- Results section should lead with the takeaway, then back it up with the table or figure. Don’t bury the headline finding in the third paragraph.
The page minima are floors, not ceilings. A 7-page paper that says one thing well beats a 12-page paper with three half-explored ideas. Marker fatigue is real — terse and crisp wins.
Assignment I — submission rules
Solve the problem set posted on Moodle, building on Lectures 1–3.
Submit ONLY two files:
- One .R script — well-commented, self-explanatory, efficient (meaningful names, functions for repetition, avoid for-loops). Include text/comments answering Exercises 1 and 2.
- One PDF compiled from the Overleaf template — outputs from your .R script (plots, calculations), stargazer LaTeX tables, captioned plots, written text. 11 pt Times New Roman, 1.5-spaced.
- Work in teams of up to 5 students.
- 50% of the final grade. Evaluated on code quality, creativity (plots/RQ), writing (concise, skim-friendly, with economic justifications).
- Deadline: 19 January 2026 (midnight). Email files to oliver.padmaperuma@uni-ulm.de, with andre.guettler@uni-ulm.de and your teammates in CC.
- Subject / file pattern: RiF_ProblemSet_surname1_surname2_...
- If attachments are too large, share a cloud link in the email body.
Notes
Submission mechanics worth nailing down well in advance of the deadline:
- Two files only — .R and .pdf. Not the Overleaf project ZIP (the marker shouldn’t have to recompile). Not the raw CSVs (your script should generate them or read_csv() from a documented source). The marker’s path: open the .R in RStudio, run top-to-bottom from a fresh session, expect it to run cleanly; then read the PDF in parallel.
- Self-contained .R script. Every dependency is stated at the top (library() calls); every input file is loaded with a relative path that the marker can satisfy by placing files in the same folder; no setwd() to a path that only exists on your machine. Reproducibility is graded.
- Group of up to 5. All names listed in the file pattern (RiF_ProblemSet_surname1_surname2_surname3...) so the marker can attribute. Each member emails CCing the others — this is the audit trail that you all submitted on time.
- Subject line follows the pattern verbatim — the marker’s inbox is filtered on this; deviations risk getting lost in the noise.
- Cloud link if attachments exceed mailbox limits. University SMTP often caps at 25 MB; PDF + R script should be well under, but if your output PDF embeds many figures, switch to a Nextcloud / OneDrive link.
Late submissions are evaluated case-by-case but treat the deadline as firm. The harder constraint is that the assignment counts for 50 % of the final grade — bake in a buffer day in case Overleaf compiles fail or one of the dataset URLs goes down.
3.5 Conclusion of Lecture 3
- 3.1 Course objectives
- 3.2 Recap from Lecture 2
- 3.3 Live Coding Session 3
- 3.4 Discussion of Assignment I
- 3.5 Conclusion of Lecture 3
Course at a glance
Basics
Course objectives, schedule, assignments · Introduction to R · Live coding
- Course objectives, schedule and assignments
- Introduction to R and RStudio
- Live coding: variables, vectors, matrices, data frames, lists, functions, loops
- Data import and export
Data Handling & Visualization
API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
- API access (Nasdaq Data Link / Quandl, FRED, Yahoo, Coingecko, Polygon)
- Import and cleanse: read_csv, mutate, types
- Merge and append data (merge, bind_rows)
- Filter and mutate (dplyr): subset rows, derive variables
- Group by and summarise
- Pivot wide / long
- Data visualization with ggplot2 (six-step pipeline)
- Introduction to LaTeX and Overleaf
Statistical Analysis
Descriptive · inferential · modelling — applied in R
- Descriptive statistics in R
- Correlation matrix and Pearson correlation test
- t-Test and Wilcoxon test
- Shapiro-Wilk and Kolmogorov-Smirnov tests
- Linear regression with fixed effects
- Clustered standard errors
- Exporting regression tables with stargazer
- Discussion of Assignment I (Problem Set)
Academic Publishing & Refereeing
What makes a great empirical paper · publication process · how to write a referee report
- What makes a good empirical paper (contribution, identification, write-up)
- The publication process step by step
- Top finance and economics journals
- Bad outcome vs revise & resubmit
- Referee Reports — summary, major issues, minor issues
- Referee checklist (question, identification, data, econometrics, results)
- Discussion of Assignment II (Referee Report)
Brown Bag Seminar
Engage with doctoral research and prepare your referee report
- Doctoral research presentations
- Apply empirical / writing tips for the referee report
- Group discussion and Q&A
Further reading
- Angrist and Pischke (2009) — the standard graduate text on identification and causal inference in empirical economics.
- Thulin (2024) — modern statistical methods worked through in R, free at https://www.modernstatisticswithr.com/.
- Navarro (2015) — accessible introduction to statistics with R, free at https://learningstatisticswithr.com/.
- Gentle (2020) — applied statistical analysis on financial data, with R examples.
- James et al. (2021) — covers regression, model selection, resampling, and tree-based methods.
Notes
Specialised picks for what you’re likely to need next:
- For the assignment paper specifically — Mostly Harmless Econometrics (Angrist and Pischke 2009) gives the right framing for “how do I know my regression is identifying anything causal?”. Tidy Finance (Scheuch, Voigt, and Weiss 2023) (Lecture 1’s reference) shows the canonical empirical-finance regression workflow in tidyverse R.
- If today’s tests still feel hand-wavy — Modern Statistics with R (Thulin 2024) walks each test in detail with R code and is free online. Learning Statistics with R (Navarro 2015) is gentler, almost a conversation, useful if formal stats notation is intimidating.
- For finance-specific applications — Statistical Analysis of Financial Data with R (Gentle 2020) covers heavy-tailed return distributions, GARCH, copula dependence — the concepts that actually matter for asset returns and that introductory stats books skip.
- Bridge to predictive modelling — An Introduction to Statistical Learning (James et al. 2021), already a Lecture 1 reference, is the right jump-off into the regularisation, resampling, and cross-validation methods that the Finance Project course (sister to this one) builds on.
You don’t need to read all of these. Pick the one whose framing matches the gap you currently feel.
Prepare before next lecture
- Document today’s code in a clean way and save it as an .Rmd file.
- Start sketching your problem-set paper — draft RQ, motivation, and the first plot.
Notes
Two pieces of homework before Lecture 4:
Document today’s code as .Rmd — same rationale as the previous lectures: rebuild the script, intersperse prose, render and read back. The new constructs from today (cor.test, t.test, wilcox.test, lm, coeftest with clustered SEs, stargazer) are the analytical core of the assignment, so spending an hour internalising them now saves the same hour of head-scratching during assignment week.
Start sketching the assignment paper — a one-page outline (RQ + motivation + a single plot you’ve already produced) is enough at this stage. Bringing this draft to the next lecture lets you sanity-check the question against the next topic (what makes a good empirical paper) before you commit to a direction.
A useful drafting heuristic: write the abstract first — three sentences saying what you ask, what you do, what you find. If you can’t write those three sentences, the paper isn’t ready to start. Once the abstract reads cleanly, the rest of the paper is filling it in.
See you next time
- Register for “exam” 13337 in campusonline by 30 November 2025.
- Lecture 4: Academic publishing and refereeing — what makes a great empirical paper, the publication process, how to write a referee report.