Descriptive · inferential · modelling — applied in R
Scope
We will:
We will NOT:
Approach
Part I — Learn the Basics
Part II — Apply what you have learned
Basics
Week 1
29.10.2025
Course objectives, schedule, assignments · Introduction to R · Live coding
Data Handling & Visualization
Week 2
05.11.2025
API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
Statistical Analysis
Week 3
12.11.2025
Descriptive · inferential · modelling — applied in R
Academic Publishing & Refereeing
Week 4
19.11.2025
What makes a great empirical paper · publication process · how to write a referee report
Brown Bag Seminar
Week 13
20.01.2026
Engage with doctoral research and prepare your referee report
Assignment I — Problem Set (50% of your grade)
Documented .R script + PDF write-up (Overleaf)
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…
Deadline: 19 January 2026
Assignment II — Referee Report (50% of your grade)
2.5–3 page referee report on a Brown-Bag presentation
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…
Deadline: 3 February 2026
| Topic | What is it? | What for? | Example syntax |
|---|---|---|---|
| API access | Load remote data via authenticated queries | Fetching CFTC time series | Quandl.api_key("…"); Quandl.datatable("QDL/LFON", …) |
| Import & cleanse | Load files (CSV) and initial transforms | Date formatting | read_csv() %>% mutate(date = as.Date(date)) |
| Merge & append | Join by keys (merge) or stack (bind_rows) | Multi-asset views | merge(d1, d2, by="date"), bind_rows(...) |
| Filter & mutate | Subset rows, derive columns | Post-2021 windows, net longs | filter(date > "2021-01-01") %>% mutate(net = longs - shorts) |
| Group & summarise | Group categories, aggregate | Annual means / SD per asset | group_by(Asset, Year) %>% summarise(...) |
| Pivot | Reshape wide ⇄ long | Correlation matrices | pivot_wider(), pivot_longer() |
| Visualization | Layered plotting system | Publication-ready figures | ggplot() + geom_line() + theme_minimal() |
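The verbs in the table chain together with the pipe. A minimal, self-contained sketch on hypothetical positioning data (column names and values are illustrative, not real CFTC data):

```r
library(dplyr)
library(tidyr)

# Hypothetical positioning data (illustrative values, not real CFTC data)
d <- tibble::tibble(
  date   = rep(as.Date(c("2022-01-07", "2022-01-14")), each = 2),
  Asset  = rep(c("Gold", "Bitcoin"), times = 2),
  longs  = c(100, 40, 110, 35),
  shorts = c(60, 50, 55, 45)
)

# filter / mutate / group / summarise in one chain
summary_tbl <- d %>%
  mutate(net = longs - shorts) %>%            # derive a column
  filter(date > as.Date("2022-01-01")) %>%    # subset rows
  group_by(Asset) %>%                         # aggregate per asset
  summarise(mean_net = mean(net), .groups = "drop")

# pivot: one column per asset, one row per date
wide <- d %>%
  mutate(net = longs - shorts) %>%
  select(date, Asset, net) %>%
  pivot_wider(names_from = Asset, values_from = net)
```

The same chain pattern carries over unchanged once the toy tibble is swapped for the real merged data.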
Practical advice
Always start with a thorough descriptive analysis to deeply understand your data before turning to advanced methods.
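That first descriptive pass needs nothing beyond base R. A quick sketch on a hypothetical weekly series (values are made up):

```r
# Hypothetical weekly net-long series (values are made up)
x <- c(12, 15, 9, 22, 18, 30, 25, 11)

summary(x)                 # min, quartiles, median, mean, max at a glance
sd(x)                      # dispersion
quantile(x, c(0.1, 0.9))   # tail behaviour
hist(x, main = "First look at the distribution")  # shape before modelling
```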
library(dplyr)
library(tidyr)
# Compact summary statistics per asset and per variable
desc_stats <- combined_clean %>%
select(Asset, Net_Longs, market_participation) %>%
pivot_longer(cols = -Asset, names_to = "Variable", values_to = "Value") %>%
mutate(Variable = case_when(
Variable == "Net_Longs" ~ "Net Longs",
Variable == "market_participation" ~ "Market Part."
)) %>%
group_by(Asset, Variable) %>%
summarise(
Mean = round(mean(Value, na.rm = TRUE), 0),
Std = round(sd(Value, na.rm = TRUE), 0),
Min = round(min(Value, na.rm = TRUE), 0),
P10 = round(quantile(Value, 0.1, na.rm = TRUE), 0),
P50 = round(median(Value, na.rm = TRUE), 0),
P90 = round(quantile(Value, 0.9, na.rm = TRUE), 0),
Max = round(max(Value, na.rm = TRUE), 0),
.groups = "drop"
)
readr::write_csv(desc_stats, "desc_stats_clean.csv")
- group_by(Asset, Variable) → summarise per cell.
- case_when() cleans variable labels for the published version.
- .groups = "drop" quiets the grouping warning and produces an ungrouped result.
library(ggplot2)
long_data <- combined_clean %>%
select(Asset, Net_Longs, market_participation,
non_commercial_shorts, non_commercial_longs) %>%
pivot_longer(-Asset, names_to = "Variable", values_to = "Value")
ggplot(long_data, aes(x = Value, color = Asset, fill = Asset)) +
geom_density(alpha = 0.5) + # transparent overlays
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
labs(title = "Faceted Density Plots: Distributions by Variable",
x = "Value", y = "Density")
ggsave("distribution_plot.pdf", width = 6, height = 4)
- geom_density() is a smoothed histogram — better for comparing shapes than raw bars.
- alpha = 0.5 lets overlapping curves stay readable.
- facet_wrap(~ Variable, scales = "free") lets each variable use its own y-range.
library(dplyr); library(tidyr); library(tibble); library(ggplot2)
cor_matrix <- combined_clean %>%
select(date, Asset, Net_Longs) %>%
pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
select(-date) %>%
cor(use = "pairwise.complete.obs", method = "pearson") %>%
as.data.frame() %>%
rownames_to_column(var = "Asset1") %>%
pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
mutate(Correlation = round(Correlation, 2)) %>%
filter(!is.na(Correlation))
ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Correlation), color = "black",
size = 3, fontface = "bold") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Matrix of Net Longs (Assets)",
subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
x = "Asset", y = "Asset")
- |r| > 0.5 ≈ strong; negative = inverse (hedging potential).
- pairwise.complete.obs ignores NAs pairwise — usable when the panel isn’t perfectly balanced.
gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]
# Full-dataset test
full_test <- cor.test(gold_net, btc_net)
print(full_test)
# e.g., r = -0.14, p = 0.03 — weak negative correlation, statistically significant
# Crisis subset (2022)
crisis_gold <- combined_clean$Net_Longs[
combined_clean$Asset == "Gold" & combined_clean$Year == 2022]
crisis_btc <- combined_clean$Net_Longs[
combined_clean$Asset == "Bitcoin" & combined_clean$Year == 2022]
crisis_test <- cor.test(crisis_gold, crisis_btc)
print(crisis_test)
# e.g., r = -0.06, p = 0.68 — not statistically significant
- |r| > 0.5 strong; p < 0.05 rejects H₀ of no association.
- For non-normal data, use a rank correlation instead (method = "spearman").
- Report r, p, and the sample size alongside the test in your write-up.
library(broom)
trad_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Gold", "Silver")]
crypto_net <- combined_clean$Net_Longs[combined_clean$Asset %in% c("Bitcoin", "Ethereum")]
# Full dataset
full_test <- t.test(trad_net, crypto_net)
print(full_test %>% tidy()) # tidy = clean table
# p < 0.05 → crypto means differ from traditional means
# Crisis subset (2022)
crisis_data <- combined_clean[combined_clean$Year == 2022, ]
non_crisis_data <- combined_clean[combined_clean$Year != 2022, ]
crisis_test <- t.test(crisis_data$Net_Longs, non_crisis_data$Net_Longs)
print(crisis_test %>% tidy())
- A small p-value (< 0.05) rejects H₀ of equal means.
- broom::tidy() turns the test object into a single-row tibble — easy to bind into a results table.
library(broom)
shapiro_gold <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Gold"])
shapiro_btc <- shapiro.test(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
print(shapiro_gold %>% tidy()) # e.g., p < 0.05 → non-normal
print(shapiro_btc %>% tidy()) # e.g., p > 0.05 → normal
# Q-Q plot for visual check
qqnorm(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
qqline(combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"])
# Density comparison
gold_btc_data <- combined_clean[combined_clean$Asset %in% c("Gold", "Bitcoin"), ]
ggplot(gold_btc_data, aes(x = Net_Longs, fill = Asset)) +
geom_density(alpha = 0.7) +
facet_wrap(~ Asset, scales = "free") +
theme_minimal() +
labs(title = "Net Longs Distributions: Gold vs Bitcoin (Faceted)",
x = "Net Longs", y = "Density")
- If normality is rejected, consider a transformation (e.g., log) or non-parametric tests.
gold_net <- combined_clean$Net_Longs[combined_clean$Asset == "Gold"]
btc_net <- combined_clean$Net_Longs[combined_clean$Asset == "Bitcoin"]
ks_test <- ks.test(gold_net, btc_net)
print(ks_test %>% tidy()) # e.g., D = 0.25, p < 0.05 → different shapes
# ECDF visualisation
ggplot(gold_btc_data, aes(x = Net_Longs, color = Asset)) +
stat_ecdf() +
facet_wrap(~ Asset, scales = "free") +
theme_minimal() +
labs(title = "ECDF Plot: Gold vs Bitcoin (Faceted)",
x = "Net Longs", y = "Cumulative Probability")
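When the normality checks above fail, a rank-based test sidesteps the distributional assumption entirely. A sketch using simulated stand-ins for the gold_net and btc_net vectors built earlier (values are illustrative, not real CFTC data):

```r
# Rank-based alternative to the t-test: no normality assumption needed.
# gold_net / btc_net come from earlier steps; simulated here so the
# snippet is self-contained (illustrative values, not real CFTC data).
set.seed(42)
gold_net <- rnorm(100, mean = 150000, sd = 30000)
btc_net  <- rnorm(100, mean = 5000,   sd = 2000)

wilcox_res <- wilcox.test(gold_net, btc_net)  # Wilcoxon rank-sum (Mann-Whitney)
print(wilcox_res)
```

A small p-value here says the two distributions differ in location, judged on ranks rather than means.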
library(broom); library(dplyr)
# Step 0: hedging net bias
combined_clean <- combined_clean %>%
mutate(net_commercial = commercial_longs - commercial_shorts)
# Step 1: simple LM
simple_lm <- lm(Net_Longs ~ net_commercial, data = combined_clean)
print(simple_lm %>% tidy())
# Step 2: + asset fixed effects
fe_asset <- lm(Net_Longs ~ net_commercial + factor(Asset), data = combined_clean)
print(fe_asset %>% tidy())
# Step 3: + asset & year fixed effects
fe_asset_year <- lm(Net_Longs ~ net_commercial + factor(Asset) + factor(Year),
data = combined_clean)
print(fe_asset_year %>% tidy())
- For asset-specific time effects, interact the dummies: Year * Asset.
library(sandwich); library(lmtest)
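The Year * Asset extension interacts the two dummy sets, giving every Year-Asset cell its own intercept. A self-contained sketch with simulated data standing in for combined_clean (names follow the code above, values are illustrative):

```r
# Sketch: interacted fixed effects (Year x Asset), giving every Year-Asset
# cell its own intercept. combined_clean is built in earlier steps; simulated
# here so the snippet runs on its own (values are illustrative).
set.seed(1)
combined_clean <- data.frame(
  Net_Longs      = rnorm(120),
  net_commercial = rnorm(120),
  Asset          = rep(c("Gold", "Silver", "Bitcoin", "Ethereum"), 30),
  Year           = rep(2021:2023, each = 40)
)

# factor(Year) * factor(Asset) expands to both main effects plus all interactions
fe_interacted <- lm(Net_Longs ~ net_commercial + factor(Year) * factor(Asset),
                    data = combined_clean)
summary(fe_interacted)
```

With 3 years and 4 assets this estimates 13 coefficients (intercept, slope, 2 year dummies, 3 asset dummies, 6 interactions), so check that every Year-Asset cell has observations before fitting.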
# Clustered by Year (HC1) — robust to heteroskedasticity & intra-cluster correlation
# Note: vcovHC() has no cluster argument; clustering needs sandwich::vcovCL()
clustered_se <- coeftest(simple_lm, vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
print(clustered_se)
# Residuals plot — check linearity
plot(residuals(simple_lm) ~ fitted(simple_lm))
stargazer
library(stargazer)
# Compute clustered SEs for each model
clustered_se_simple <- coeftest(simple_lm,
  vcov = vcovCL(simple_lm, cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset <- coeftest(fe_asset,
  vcov = vcovCL(fe_asset, cluster = ~ Year, type = "HC1"))
clustered_se_fe_asset_year <- coeftest(fe_asset_year,
  vcov = vcovCL(fe_asset_year, cluster = ~ Year, type = "HC1"))
# Build LaTeX table
stargazer(
simple_lm, fe_asset, fe_asset_year,
type = "latex",
out = "regression_table.tex",
dep.var.caption = "Effects on Net Longs (Post-2021 Data)",
label = "tab:regression_net",
dep.var.labels = "Non-Commercial Longs",
covariate.labels = c("Net Commercial Position"),
omit.stat = c("ser", "f"),
star.char = c("*", "**", "***"),
star.cutoffs = c(0.1, 0.05, 0.01),
add.lines = list(
c("Asset FE", "No", "Yes", "Yes"),
c("Year FE", "No", "No", "Yes")
),
omit = c("factor"),
header = FALSE, model.numbers = TRUE,
table.placement = "H", digits = 3,
se = list(clustered_se_simple[, 2],
clustered_se_fe_asset[, 2],
clustered_se_fe_asset_year[, 2])
)
- stargazer produces clean LaTeX/HTML/ASCII regression tables — multiple models side by side.
- add.lines documents which fixed effects each model includes — important for replicability.
- star.cutoffs sets the significance thresholds; report them in the table notes.
Problem-set workflow:
- Set up a .R script, load libraries, set the API key, and load the Quandl futures data (gold, silver, BTC, ETH). Merge by date into merged_*, drop NAs, sort ascending by date.
- Stack the merged data into one data frame, combined, with an Asset variable.
- Clean into combined_clean. Derive Net Long Position per trader type, Weekly Change, and Year/Quarter/Month indicators.
- Interpret the ggplot plots with 4–5 sentences each.
- Present results with stargazer (academic style) and min. 2 captioned plots.
- Use the Overleaf template on Moodle. Page minima per section: Intro 0.5, Lit 0.5, Method 0.5, Data 1.5, Results 1.5, Conclusion 0.5.
Solve the problem set posted on Moodle, building on Lectures 1–3.
Submit ONLY two files:
- .R script — well-commented, self-explanatory, efficient (meaningful names, functions for repetition, avoid for-loops). Include text/comments answering Exercise 1 and 2.
- PDF write-up — results from the .R script (plots, calculations), stargazer LaTeX tables, captioned plots, written text. 11 pt Times New Roman, 1.5 spaced.
File name pattern: RiF_ProblemSet_surname1_surname2_...
Institute of Strategic Management and Finance · Ulm University