Lecture 3: Statistical Analysis

Descriptive · inferential · modelling — applied in R

Prof. Dr. Andre Guettler, Director of the Institute
Helmholtzstraße 22, Room 205
andre.guettler@uni-ulm.de
+49 731 50 31 030
Oliver Padmaperuma, Doctoral Candidate
Helmholtzstraße 22, Room 203
oliver.padmaperuma@uni-ulm.de
+49 731 50 31 036

3.1 Course objectives

  • 3.1 Course objectives
  • 3.2 Recap from Lecture 2
  • 3.3 Live Coding Session 3
  • 3.4 Discussion of Assignment I
  • 3.5 Conclusion of Lecture 3

Welcome to Research in Finance

  • Register for “exam” 13337 in campusonline by 30 November 2025. The registration is what binds you to the course requirements; without it you cannot submit. If you are registered but don’t submit, you receive a fail grade (5.0).
  • Ask questions during or right after each session — that is the preferred channel.
  • Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
  • Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
  • We also recommend the student advisory service.

Course Objective

Scope

We will:

  • Prepare Master's students for their empirical thesis
  • Give a hands-on R introduction to data management, visualization, cleaning, and basic modelling
  • Share writing tips for theses, including LaTeX & Overleaf
  • Write referee reviews of research presentations to build empirical critique skills

We will NOT:

  • Dive deep into advanced statistics or machine-learning methods
  • Cover specific finance topics (asset pricing, etc.)
  • Provide full training in thesis writing or research design

Approach

Part I — Learn the Basics

  • Hands-on R intro: a widely used language for statistical computing
  • Manage, visualize and clean data; run and interpret statistical models
  • Solve a real empirical problem set in R, in groups

Part II — Apply your learnings

  • Mandatory participation in the institute’s Brown Bag Seminar
  • Two assignments (group work and individual referee report) — see Assignments / Exams

Course at a glance

Basics

Week 1

29.10.2025

Course objectives, schedule, assignments · Introduction to R · Live coding

  • Course objectives, schedule and assignments
  • Introduction to R and RStudio
  • Live coding: variables, vectors, matrices, data frames, lists, functions, loops
  • Data import and export

Data Handling & Visualization

Week 2

05.11.2025

API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf

  • API access (Nasdaq Data Link / Quandl, FRED, Yahoo, Coingecko, Polygon)
  • Import and cleanse: read_csv, mutate, types
  • Merge and append data (merge, bind_rows)
  • Filter and mutate (dplyr): subset rows, derive variables
  • Group by and summarise
  • Pivot wide / long
  • Data visualization with ggplot2 (six-step pipeline)
  • Introduction to LaTeX and Overleaf

Statistical Analysis

Week 3

12.11.2025

Descriptive · inferential · modelling — applied in R

  • Descriptive statistics in R
  • Correlation matrix and Pearson correlation test
  • t-Test and Wilcoxon test
  • Shapiro-Wilk and Kolmogorov-Smirnov tests
  • Linear regression with fixed effects
  • Clustered standard errors
  • Exporting regression tables with stargazer
  • Discussion of Assignment I (Problem Set)

Academic Publishing & Refereeing

Week 4

19.11.2025

What makes a great empirical paper · publication process · how to write a referee report

  • What makes a good empirical paper (contribution, identification, write-up)
  • The publication process step by step
  • Top finance and economics journals
  • Bad outcome vs revise & resubmit
  • Referee Reports — summary, major issues, minor issues
  • Referee checklist (question, identification, data, econometrics, results)
  • Discussion of Assignment II (Referee Report)

Brown Bag Seminar

Week 13

20.01.2026

Engage with doctoral research and prepare your referee report

  • Doctoral research presentations
  • Apply empirical / writing tips for the referee report
  • Group discussion and Q&A

Assignments / Exams

Assignment I — Problem Set (50% of your grade)

  • Documented .R script + PDF write-up (Overleaf)
  • Groups of up to 5
  • Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…
  • Deadline: 19 January 2026

Assignment II — Referee Report

  • 2.5–3 page referee report on a Brown Bag presentation
  • Groups of up to 5
  • Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…
  • Deadline: 3 February 2026

3.2 Recap from Lecture 2


What we covered

  • API access: load remote data via authenticated queries (e.g., fetching CFTC time series). Syntax: Quandl.api_key("…"); Quandl.datatable("QDL/LFON", …)
  • Import & cleanse: load files (CSV) and apply initial transforms (e.g., date formatting). Syntax: read_csv() %>% mutate(date = as.Date(date))
  • Merge & append: join by keys or stack row-wise (multi-asset views). Syntax: merge(d1, d2, by = "date"), bind_rows(...)
  • Filter & mutate: subset rows, derive columns (post-2021 windows, net longs). Syntax: filter(date > "2021-01-01") %>% mutate(net = longs - shorts)
  • Group & summarise: group categories, aggregate (annual means / SD per asset). Syntax: group_by(Asset, Year) %>% summarise(...)
  • Pivot: reshape wide ⇄ long (correlation matrices). Syntax: pivot_wider(), pivot_longer()
  • Visualization: layered plotting system (publication-ready figures). Syntax: ggplot() + geom_line() + theme_minimal()
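
As a quick refresher, here is a minimal sketch that chains these verbs end to end. The toy tibble is made up for illustration (it is not the lecture's CFTC data); the verbs themselves are exactly those from the list above.

library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the CFTC series from Lecture 2
toy <- tibble(
  date   = rep(seq(as.Date("2021-01-08"), by = "week", length.out = 4), times = 2),
  Asset  = rep(c("Gold", "Bitcoin"), each = 4),
  longs  = c(100, 120, 110, 130, 40, 55, 50, 60),
  shorts = c(80, 90, 85, 95, 30, 35, 40, 45)
)

toy %>%
  filter(date > as.Date("2021-01-01")) %>%                 # subset rows
  mutate(net = longs - shorts) %>%                         # derive a column
  group_by(Asset) %>%                                      # group categories
  summarise(mean_net = mean(net), .groups = "drop") %>%    # aggregate
  pivot_wider(names_from = Asset, values_from = mean_net)  # reshape long -> wide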

3.3 Live Coding Session 3


Overview of basic statistical methods

  • Sample spaces and events
  • Random variables
  • Expectations and moments
  • Inequalities
  • Convergence concepts
  • Distributions
  • Sampling
  • Missing-value handling
  • Central tendency, dispersion, shape
  • Data summarization
  • Point and interval estimation
  • Hypothesis testing
  • Non-parametric inference
  • Bayesian inference
  • Statistical decision theory
  • Regression analysis
  • Multivariate analysis
  • ANOVA
  • Time series & stochastic processes
  • Survival analysis
  • Causal inference

Practical advice

Always start with a thorough descriptive analysis to deeply understand your data before turning to advanced methods.

Descriptive Statistics: Overview

What is it?

  • Summarises and organises raw data into meaningful patterns using averages, spreads, and visualisations.
  • Presents data without drawing conclusions beyond the dataset (histograms, box plots).
  • Includes data-type categorisation and basic central-tendency / variability statistics.

What is it used for?

  • Initial overview — trends, outliers, patterns before deeper analysis.
  • Communicating findings to non-experts.
  • Data cleaning and quality checks.
  • Sampling — methods, survey design, data types (nominal, ordinal, interval, ratio).
  • Missing values — imputation, prediction, elimination.
  • Central tendency — mean (arithmetic, geometric, harmonic), median, mode.
  • Dispersion — variance, standard deviation, range, IQR, coefficient of variation.
  • Shape & symmetry — skewness, kurtosis, normality (Q-Q plots).
  • Summarization — frequency distributions, histograms, box / stem-and-leaf / scatter plots, ECDF, quantiles, outliers.

Example: Summary statistics

library(dplyr)
library(tidyr)
library(readr)   # write_csv()

# Compact summary statistics per asset and per variable
desc_stats <- combined_clean %>%
  select(Asset, Net_Longs, market_participation) %>%
  pivot_longer(cols = -Asset, names_to = "Variable", values_to = "Value") %>%
  mutate(Variable = case_when(
    Variable == "Net_Longs" ~ "Net Longs",
    Variable == "market_participation" ~ "Market Part."
  )) %>%
  group_by(Asset, Variable) %>%
  summarise(
    Mean = round(mean(Value, na.rm = TRUE), 0),
    Std  = round(sd(Value,   na.rm = TRUE), 0),
    Min  = round(min(Value,  na.rm = TRUE), 0),
    P10  = round(quantile(Value, 0.1, na.rm = TRUE), 0),
    P50  = round(median(Value, na.rm = TRUE), 0),
    P90  = round(quantile(Value, 0.9, na.rm = TRUE), 0),
    Max  = round(max(Value,  na.rm = TRUE), 0),
    .groups = "drop"
  )

write_csv(desc_stats, "desc_stats_clean.csv")
  • Pivot to long format → group by (Asset, Variable) → summarise per cell.
  • Reports central tendency (Mean, Median = P50), dispersion (Std), and the tails (Min, P10, P90, Max) — the standard shape of a descriptive Table 1 in empirical papers.
  • case_when() cleans variable labels for the published version.
  • .groups = "drop" quiets the grouping warning and produces an ungrouped result.

Example: Distribution plots

library(ggplot2)

long_data <- combined_clean %>%
  select(Asset, Net_Longs, market_participation,
         non_commercial_shorts, non_commercial_longs) %>%
  pivot_longer(-Asset, names_to = "Variable", values_to = "Value")

ggplot(long_data, aes(x = Value, color = Asset, fill = Asset)) +
  geom_density(alpha = 0.5) +                      # transparent overlays
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Faceted Density Plots: Distributions by Variable",
       x = "Value", y = "Density")

ggsave("distribution_plot.pdf", width = 6, height = 4)
  • geom_density() is a smoothed histogram — better for comparing shapes than raw bars.
  • alpha = 0.5 lets overlapping curves stay readable.
  • facet_wrap(~ Variable, scales = "free") lets each variable use its own y-range.
  • Save as PDF for inclusion in your paper — vector plots scale without pixelation.

Inferential Statistics: Overview

What is it?

  • Uses sample data to draw conclusions about a population (generalisation, prediction) via estimation and testing.
  • Probability-based methods: hypothesis tests, confidence intervals, p-values to quantify uncertainty.
  • Covers parametric (assume distribution) and non-parametric approaches; Bayesian techniques update beliefs.

What is it used for?

  • Test hypotheses (e.g., does treatment have an effect?).
  • Estimate population parameters from samples.
  • Assess relationships and differences with controlled error rates.
  • Point estimation — bias, variance, MSE; method of moments; MLE; asymptotic normality.
  • Interval estimation — normal/t-based CIs; bootstrap; coverage probability.
  • Hypothesis testing — null/alternative; test statistics & p-values; type I/II errors; common tests (t, ANOVA, Wald, LR, χ², permutation, correlation); multiple-testing adjustments (Bonferroni, FDR); goodness-of-fit (KS, χ²); a short sketch of CIs and p-value adjustment follows this list.
  • Non-parametric inference — rank tests (Wilcoxon, Mann-Whitney); sign tests; kernel density basics.
  • Bayesian inference — priors/posteriors, Bayes’ theorem, conjugate priors, credible intervals, MCMC.
  • Decision theory — loss/risk; minimax/Bayes estimators.
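
The worked examples in this session cover the common tests, but not interval estimation or multiple-testing adjustments from the list above, so here is a minimal base-R sketch on a simulated sample (x is made up):

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)   # simulated sample; use your own series

# 95% t-based confidence interval for the mean (the same interval t.test() reports)
t.test(x)$conf.int

# Adjusting a vector of p-values from several related tests
p_raw <- c(0.010, 0.040, 0.030, 0.200)
p.adjust(p_raw, method = "bonferroni")  # family-wise error control
p.adjust(p_raw, method = "BH")          # false discovery rate (Benjamini-Hochberg)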

Example: Correlation matrix

library(dplyr); library(ggplot2)
library(tidyr); library(tibble)   # pivot_wider()/pivot_longer(), rownames_to_column()

cor_matrix <- combined_clean %>%
  select(date, Asset, Net_Longs) %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
  select(-date) %>%
  cor(use = "pairwise.complete.obs", method = "pearson") %>%
  as.data.frame() %>%
  rownames_to_column(var = "Asset1") %>%
  pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
  mutate(Correlation = round(Correlation, 2)) %>%
  filter(!is.na(Correlation))

ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Correlation), color = "black",
            size = 3, fontface = "bold") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Matrix of Net Longs (Assets)",
       subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
       x = "Asset", y = "Asset")
  • Values from −1 to 1; diagonal = 1 (self), off-diagonal = strength/direction.
  • |r| > 0.5 ≈ strong; negative = inverse (hedging potential).
  • pairwise.complete.obs ignores NAs pairwise — usable when the panel isn’t perfectly balanced.
  • Heatmap + value annotation makes the matrix readable at a glance, even in a slide.

Example: Pearson correlation test

gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net  <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]

# Full-dataset test
full_test <- cor.test(gold_net, btc_net)
print(full_test)
# e.g., r = -0.14, p = 0.03 — weak negative correlation, statistically significant

# Crisis subset (2022)
crisis_gold <- combined_clean$Net_Longs[
  combined_clean$Asset == "Gold"    & combined_clean$Year == 2022]
crisis_btc  <- combined_clean$Net_Longs[
  combined_clean$Asset == "Bitcoin" & combined_clean$Year == 2022]

crisis_test <- cor.test(crisis_gold, crisis_btc)
print(crisis_test)
# e.g., r = -0.06, p = 0.68 — not statistically significant
  • r ranges −1 to 1; |r| > 0.5 strong; p < 0.05 rejects H₀ of no association.
  • Subsetting (here: 2022 only) tests whether the relationship is regime-dependent.
  • Pearson assumes linearity + normality — for non-normal data prefer Spearman (method = "spearman").
  • Always report r, p, and the sample size alongside the test in your write-up.

Example: t-Test

library(broom)

trad_net   <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Gold", "Silver")]
crypto_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Bitcoin", "Ethereum")]

# Full dataset
full_test <- t.test(trad_net, crypto_net)
print(full_test %>% tidy())     # tidy = clean table
# p < 0.05 → crypto means differ from traditional means

# Crisis subset (2022)
crisis_data     <- combined_clean[combined_clean$Year == 2022, ]
non_crisis_data <- combined_clean[combined_clean$Year != 2022, ]

crisis_test <- t.test(crisis_data$Net_Longs, non_crisis_data$Net_Longs)
print(crisis_test %>% tidy())
  • Parametric — assumes the two samples are roughly normal.
  • Outputs t-statistic, p-value, and 95% CI.
  • Low p (< 0.05) rejects H₀ of equal means.
  • Use Wilcoxon if data are non-normal; ANOVA for >2 groups.
  • broom::tidy() turns the test object into a single-row tibble — easy to bind into a results table.

Example: Wilcoxon test

# Full dataset
wilcox_full <- wilcox.test(trad_net, crypto_net)
print(wilcox_full %>% tidy())

# Crisis subset
crisis_net     <- crisis_data$Net_Longs
non_crisis_net <- non_crisis_data$Net_Longs

wilcox_crisis <- wilcox.test(crisis_net, non_crisis_net)
print(wilcox_crisis %>% tidy())
  • Compares medians non-parametrically using ranks; outputs W-statistic and p-value.
  • Robust to outliers and non-normality — your default when Shapiro shows non-normality.
  • For >2 groups: Kruskal-Wallis (see the sketch after this list).
  • Pair with a density plot to communicate why medians differ.
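
For more than two groups, the analogues are one-way ANOVA (parametric) and Kruskal-Wallis (rank-based). A minimal sketch reusing combined_clean from the examples above:

# One-way ANOVA: do mean Net Longs differ across the four assets?
anova_fit <- aov(Net_Longs ~ Asset, data = combined_clean)
summary(anova_fit)

# Kruskal-Wallis: rank-based analogue, robust to non-normality
kruskal.test(Net_Longs ~ Asset, data = combined_clean)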

Example: Shapiro-Wilk test

library(broom)

shapiro_gold <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Gold"])
shapiro_btc  <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])

print(shapiro_gold %>% tidy())   # e.g., p < 0.05 → non-normal
print(shapiro_btc  %>% tidy())   # e.g., p > 0.05 → normal

# Q-Q plot for visual check
qqnorm(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
qqline(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])

# Density comparison
gold_btc_data <- combined_clean[combined_clean$Asset %in% c("Gold", "Bitcoin"), ]
ggplot(gold_btc_data, aes(x = Net_Longs, fill = Asset)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ Asset, scales = "free") +
  theme_minimal() +
  labs(title = "Net Longs Distributions: Gold vs Bitcoin (Faceted)",
       x = "Net Longs", y = "Density")
  • Outputs the W-statistic (values near 1 are consistent with normality) and a p-value.
  • Low p rejects H₀ of normality → consider transformations (log) or non-parametric tests.
  • The Q-Q plot is the visual companion — straight diagonal = normal, deviations show where it breaks.
  • Run on each subgroup, not just the pooled data — averages can hide non-normality.

Example: Kolmogorov-Smirnov test

gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net  <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]

ks_test <- ks.test(gold_net, btc_net)
print(ks_test %>% tidy())   # e.g., D = 0.25, p < 0.05 → different shapes

# ECDF visualisation
ggplot(gold_btc_data, aes(x = Net_Longs, color = Asset)) +
  stat_ecdf() +
  facet_wrap(~ Asset, scales = "free") +
  theme_minimal() +
  labs(title = "ECDF Plot: Gold vs Bitcoin (Faceted)",
       x = "Net Longs", y = "Cumulative Probability")
  • Compares two distributions non-parametrically.
  • D = max vertical distance between the two empirical CDFs.
  • Low p rejects H₀ that the samples come from the same distribution.
  • Useful as a prelude to a parametric test — confirms group differences before assuming a shared distribution.

Statistical Modelling: Overview

What is it?

  • Mathematical representations of data relationships (regression, multivariate) to explain or predict outcomes.
  • Time-series forecasting, ML interfaces, computational simulations.
  • Focuses on model fitting, diagnostics, validation in real-world data.

What is it used for?

  • Forecasting (markets, disease).
  • Causal inference, classification, dimensionality reduction.
  • Optimization in big-data settings.
  • Regression — simple/multiple linear, logistic, fixed/random effects, clustered/robust SEs, interactions, non-linear terms.
  • Multivariate — PCA, factor analysis, canonical correlation, k-means, discriminant analysis.
  • Time series — ACF/PACF, ADF/KPSS stationarity, ARIMA, Monte Carlo, GARCH (see the sketch after this list).
  • Survival — Kaplan-Meier, Cox PH, censoring, Weibull.
  • Causal inference — IV, DID, PSM, DAGs, RDD.
  • Non-linear / GLM — GLMs, Poisson, splines, quantile regression.
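
Most of these are beyond today's scope; as a taste of the time-series row, a base-R sketch on one asset's series (gold_net from the earlier examples):

acf(gold_net, main = "ACF: Gold Net Longs")   # autocorrelation structure
ar1 <- arima(gold_net, order = c(1, 0, 0))    # simple AR(1) fit
ar1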

Example: Linear regression with fixed effects

library(broom); library(dplyr)

# Step 0: net commercial position (hedgers' longs minus shorts)
combined_clean <- combined_clean %>%
  mutate(net_commercial = commercial_longs - commercial_shorts)

# Step 1: simple LM
simple_lm <- lm(Net_Longs ~ net_commercial, data = combined_clean)
print(simple_lm %>% tidy())

# Step 2: + asset fixed effects
fe_asset <- lm(Net_Longs ~ net_commercial + factor(Asset), data = combined_clean)
print(fe_asset %>% tidy())

# Step 3: + asset & year fixed effects
fe_asset_year <- lm(Net_Longs ~ net_commercial + factor(Asset) + factor(Year),
                    data = combined_clean)
print(fe_asset_year %>% tidy())
  • Absorb group-specific means → estimate within-group variation.
  • Control for time-invariant confounders.
  • Reduce omitted-variable bias.
  • Limitation: regressors that do not vary within a group (e.g., time-invariant asset characteristics) are absorbed by the fixed effects and cannot be estimated; for time-varying effects, interact the dummies, e.g., factor(Year) * factor(Asset). A compact alternative is sketched below.
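
A sketch of the same specification using the fixest package (our addition; the lecture itself sticks to lm()). feols() absorbs the fixed effects and clusters the SEs in one call:

# install.packages("fixest") if needed
library(fixest)

# Same model as Step 3: asset and year fixed effects, SEs clustered by year
fe_fit <- feols(Net_Longs ~ net_commercial | Asset + Year,
                data = combined_clean, cluster = ~ Year)
summary(fe_fit)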

Example: Clustered standard errors

library(sandwich); library(lmtest)

# Cluster-robust SEs by Year — robust to heteroskedasticity & intra-cluster correlation
# (note: sandwich's vcovHC() has no cluster argument; vcovCL() is the clustered estimator)
clustered_se <- coeftest(simple_lm, vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
print(clustered_se)

# Residuals plot — check linearity
plot(residuals(simple_lm) ~ fitted(simple_lm))
  • Adjust for intra-group correlation (e.g., observations in the same year).
  • Wider SEs may make p-values larger (e.g., 0.01 → 0.05) — conservative inference.
  • Need ≥ 20 clusters; with fewer, use a bootstrap (see the sketch below).
  • Always specify in your write-up: “SEs clustered at the year level to account for intra-year correlation and heteroskedasticity.”
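
On the bootstrap point above: a hedged sketch using sandwich's cluster bootstrap vcovBS(); treat the settings (the number of replications R, the seed) as illustrative rather than prescriptive:

set.seed(42)   # the cluster bootstrap resamples at random
boot_se <- coeftest(simple_lm,
                    vcov = vcovBS(simple_lm, cluster = ~ Year, R = 999))
print(boot_se)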

Exporting results — stargazer

library(stargazer)

# Compute clustered SEs for each model (vcovCL: cluster-robust covariance)
clustered_se_simple        <- coeftest(simple_lm,
  vcov = vcovCL(simple_lm,        cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset      <- coeftest(fe_asset,
  vcov = vcovCL(fe_asset,         cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset_year <- coeftest(fe_asset_year,
  vcov = vcovCL(fe_asset_year,    cluster = ~ Year, type = "HC1"))

# Build LaTeX table
stargazer(
  simple_lm, fe_asset, fe_asset_year,
  type = "latex",
  out  = "regression_table.tex",
  dep.var.caption = "Effects on Net Longs (Post-2021 Data)",
  label = "tab:regression_net",
  dep.var.labels  = "Non-Commercial Longs",
  covariate.labels = c("Net Commercial Position"),
  omit.stat = c("ser", "f"),
  star.char = c("*", "**", "***"),
  star.cutoffs = c(0.1, 0.05, 0.01),
  add.lines = list(
    c("Asset FE", "No", "Yes", "Yes"),
    c("Year FE",  "No", "No",  "Yes")
  ),
  omit = c("factor"),
  header = FALSE, model.numbers = TRUE,
  table.placement = "H", digits = 3,
  se = list(clustered_se_simple[, 2],
            clustered_se_fe_asset[, 2],
            clustered_se_fe_asset_year[, 2])
)
  • stargazer produces clean LaTeX/HTML/ASCII regression tables — multiple models side by side.
  • add.lines documents which fixed effects each model includes — important for replicability.
  • star.cutoffs sets the significance thresholds; report them in the table notes.
  • Always report R², N, and the significance levels in the table notes: “All models include year and asset fixed effects; *** p < 0.01, ** p < 0.05, * p < 0.1.”

3.4 Discussion of Assignment I


Assignment I — Problem Set

  1. Get the data: create .R script, load libraries, set API key, load Quandl futures data (gold, silver, BTC, ETH). Merge by date into merged_*, drop NAs, sort ascending by date.
  2. Understand the data: read Nasdaq & CFTC sources; write 3–5 sentences for your “Data” section.
  3. Join data sets: combine into combined, with an Asset variable.
  4. Clean & transform: filter post-1 April 2021, save as combined_clean. Derive Net Long Position per trader type, Weekly Change, and Year/Quarter/Month indicators.
  5. Descriptive analysis: mean, sd, min, P10, median, P90, max — table for the Data section.
  6. Analysis & plot: discuss patterns; implement min. two ggplot plots with 4–5 sentences each.
  7. Define a research question: a simple, testable RQ drawn from your patterns (e.g., “Do crypto assets show higher Net Long Position volatility?”). Briefly motivate it in the Introduction (≥ 0.5 page).
  8. Literature review: 1–2 papers to build on (3–4 sentences, ≥ 0.5 page).
  9. Define your approach: how you will answer the RQ and why the approach makes sense (≥ 0.5 page).
  10. Report insights: Results (≥ 1.5 pages), min. two LaTeX tables via stargazer (academic style) and min. two captioned plots.
  11. Explain & discuss: Conclusion (≥ 0.5 page) — implications; quality over length.

Use the Overleaf template on Moodle. Page minima per section: Intro 0.5, Lit 0.5, Method 0.5, Data 1.5, Results 1.5, Conclusion 0.5.

Assignment I — submission rules

Solve the problem set posted on Moodle, building on Lectures 1–3.

Submit ONLY two files:

  1. One .R script — well-commented, self-explanatory, efficient (meaningful names, functions for repetition, avoid for-loops; see the toy sketch after this list). Include text/comments answering Exercises 1 and 2.
  2. One PDF compiled from the Overleaf template — outputs from your .R script (plots, calculations), stargazer LaTeX tables, captioned plots, written text. 11 pt Times New Roman, 1.5 spaced.
  • Work in teams of up to 5 students.
  • 50% of the final grade. Evaluated on code quality, creativity (plots/RQ), writing (concise, skim-friendly, with economic justifications).
  • Deadline: 19 January 2026 (midnight). Email files to oliver.padmaperuma@uni-ulm.de, with andre.guettler@uni-ulm.de and your teammates in CC.
  • Subject / file pattern: RiF_ProblemSet_surname1_surname2_...
  • If attachments are too large, share a cloud link in the email body.
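
To make "functions for repetition, avoid for-loops" concrete, a toy sketch (summarise_asset is a hypothetical helper name, not part of the assignment):

library(dplyr)

# Write the logic once, then apply it across assets
summarise_asset <- function(df, asset) {
  df %>%
    filter(Asset == asset) %>%
    summarise(Asset    = asset,
              mean_net = mean(Net_Longs, na.rm = TRUE),
              sd_net   = sd(Net_Longs, na.rm = TRUE))
}

assets <- c("Gold", "Silver", "Bitcoin", "Ethereum")
bind_rows(lapply(assets, summarise_asset, df = combined_clean))

# For this particular task, group_by(Asset) %>% summarise(...) is even simpler;
# the point is the reusable-function pattern.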

3.5 Conclusion of Lecture 3


Course at a glance

The week-by-week schedule (Weeks 1–4 plus the Brown Bag Seminar in Week 13) is identical to the overview in Section 3.1. Next up: Week 4 (19.11.2025), Academic Publishing & Refereeing: what makes a great empirical paper, the publication process, and how to write a referee report.

Further reading

  • Angrist and Pischke (2009) — the standard graduate text on identification and causal inference in empirical economics.
  • Thulin (2024) — modern statistical methods worked through in R, free at https://www.modernstatisticswithr.com/.
  • Navarro (2015) — accessible introduction to statistics with R, free at https://learningstatisticswithr.com/.
  • Gentle (2020) — applied statistical analysis of financial data, with R examples.
  • James et al. (2021) — covers regression, model selection, resampling, and tree-based methods.

Prepare before next lecture

  1. Document today’s code cleanly and save it as an .Rmd file.
  2. Start sketching your problem-set paper — draft RQ, motivation, and the first plot.

See you next time

Reminder

  • Register for “exam” 13337 in campusonline by 30 November 2025.
  • Lecture 4: Academic publishing and refereeing — what makes a great empirical paper, the publication process, how to write a referee report.

References

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press. https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics.
Gentle, James E. 2020. Statistical Analysis of Financial Data: With Examples in R. Boca Raton, FL: CRC Press.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. New York, NY: Springer. https://www.statlearning.com/.
Navarro, Daniel J. 2015. Learning Statistics with R. https://learningstatisticswithr.com/.
Scheuch, Christoph, Stefan Voigt, and Patrick Weiss. 2023. Tidy Finance with R. Chapman & Hall/CRC. https://www.tidy-finance.org/r/.
Thulin, Måns. 2024. Modern Statistics with R. https://www.modernstatisticswithr.com/.