Lecture 2: Data Handling & Visualization

API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf

Prof. Dr. Andre Guettler, Director of the Institute
Helmholtzstraße 22, Room 205
andre.guettler@uni-ulm.de
+49 731 50 31 030
Oliver Padmaperuma, Doctoral Candidate
Helmholtzstraße 22, Room 203
oliver.padmaperuma@uni-ulm.de
+49 731 50 31 036

2.1 Course objectives

  • 2.1 Course objectives
  • 2.2 Recap from Lecture 1
  • 2.3 Live Coding Session 2
  • 2.4 Introduction to Overleaf
  • 2.5 Conclusion of Lecture 2

Welcome to Research in Finance

  • Register for “exam” 13337 in campusonline by 30 November 2025. The registration is what binds you to the course requirements; without it you cannot submit. If you are registered but don’t submit, you receive a fail grade (5.0).
  • Ask questions during or right after each session — that is the preferred channel.
  • Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
  • Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
  • We also recommend the student advisory service.

Course Objective

Scope

We will:

  • Prepare Master's students for their empirical thesis
  • Give a hands-on introduction to R for data management, visualization, cleaning and basic modelling
  • Share writing tips for theses, including LaTeX and Overleaf
  • Build empirical critique skills through referee reports on research presentations

We will NOT:

  • Dive deep into advanced statistics or ML methods
  • Cover specific finance topics (asset pricing, etc.)
  • Provide full training in thesis writing or research design

Approach

Part I — Learn the Basics

  • Hands-on R intro: a widely used language for statistical computing
  • Manage, visualize and clean data; run and interpret statistical models
  • Solve a real empirical problem set in R, in groups

Part II — Apply your learnings

  • Mandatory participation in the institute’s Brown Bag Seminar
  • Two assignments (group work and individual referee report) — see Assignments / Exams

Course at a glance

Basics

Week 1

29.10.2025

Course objectives, schedule, assignments · Introduction to R · Live coding

  • Course objectives, schedule and assignments
  • Introduction to R and RStudio
  • Live coding: variables, vectors, matrices, data frames, lists, functions, loops
  • Data import and export

Data Handling & Visualization

Week 2

05.11.2025

API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf

  • API access (Nasdaq Data Link / Quandl, FRED, Yahoo, Coingecko, Polygon)
  • Import and cleanse: read_csv, mutate, types
  • Merge and append data (merge, bind_rows)
  • Filter and mutate (dplyr): subset rows, derive variables
  • Group by and summarise
  • Pivot wide / long
  • Data visualization with ggplot2 (six-step pipeline)
  • Introduction to LaTeX and Overleaf

Statistical Analysis

Week 3

12.11.2025

Descriptive · inferential · modelling — applied in R

  • Descriptive statistics in R
  • Correlation matrix and Pearson correlation test
  • t-Test and Wilcoxon test
  • Shapiro-Wilk and Kolmogorov-Smirnov tests
  • Linear regression with fixed effects
  • Clustered standard errors
  • Exporting regression tables with stargazer
  • Discussion of Assignment I (Problem Set)

Academic Publishing & Refereeing

Week 4

19.11.2025

What makes a great empirical paper · publication process · how to write a referee report

  • What makes a good empirical paper (contribution, identification, write-up)
  • The publication process step by step
  • Top finance and economics journals
  • Bad outcome vs revise & resubmit
  • Referee Reports — summary, major issues, minor issues
  • Referee checklist (question, identification, data, econometrics, results)
  • Discussion of Assignment II (Referee Report)

Brown Bag Seminar

Week 13

20.01.2026

Engage with doctoral research and prepare your referee report

  • Doctoral research presentations
  • Apply empirical / writing tips for the referee report
  • Group discussion and Q&A

Assignments / Exams

Assignment I: Problem Set (50% of your grade)

  • Deliverable: documented .R script + PDF write-up (Overleaf)
  • Group of up to 5
  • Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…
  • Deadline: 19 January 2026

Assignment II: Referee Report

  • Deliverable: 2.5–3 page referee report on a Brown-Bag presentation
  • Group of up to 5
  • Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…
  • Deadline: 3 February 2026

2.2 Recap from Lecture 1

What we covered

| Topic | What is it? | What for? | Example syntax |
|---|---|---|---|
| Basic R | Free language for statistical computing | Calculations, exploring data | 2 + 3, log(10) |
| Variables & basic ops | Store data; perform math/logic | Assigning prices, ratios | var <- value, var > 0 |
| Vectors | Ordered collections of one type | Time series of returns | c(1, 2, 3), length(vec) |
| Matrices | 2D arrays | Covariance matrices | matrix(...), mat1 %*% mat2 |
| Data frames | Mixed-column tables | Datasets for analysis | data.frame(...), df$col |
| Lists | Flexible mixed-type containers | Storing model outputs | list(a = 1, b = c("x", "y")) |
| Functions | Reusable code blocks | Sharpe ratio calc | f <- function(x) x*2 |
| Loops | Iterate over sequences | Monte Carlo runs | for (i in 1:3) {…} |
| Import / export | Load and save | CSV, RDS workflow | read_csv("f.csv") |

2.3 Live Coding Session 2

API access — overview

  • APIs (Application Programming Interfaces) let R fetch real-time / structured data from external sources — automating data collection for reproducible empirical analysis.
  • Many APIs exist (thousands for finance/STEM); effort varies — some (like Quandl) deliver ready tables, others require JSON parsing, field specification, authentication. Start simple to avoid frustration.
  • Use packages like httr/jsonlite for general calls, or wrappers (e.g., Quandl) for ease.
  • Whenever possible, get a free key and test with small queries first.

| API | Description | Ease | Best for |
|---|---|---|---|
| Nasdaq Data Link (Quandl) | Comprehensive finance datasets (e.g., CFTC positions) as tables | Easy (Quandl package) | Futures / sentiment analysis |
| FRED (Federal Reserve) | Economic indicators (CPI, unemployment) | Easy (fredr package) | Macro trends in empirical work |
| Yahoo Finance | Historical prices/volumes via quantmod | Easy (quantmod package) | Quick stock data for portfolios |
| Coingecko | Crypto prices, historical charts | Medium (JSON parsing) | Crypto time series for volatility studies |
| Polygon.io | Real-time stocks/crypto (free tier) | Medium (API calls) | High-frequency data for advanced models |
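
To make the "JSON parsing" effort level concrete, the sketch below turns a small JSON string into a data frame with jsonlite. The payload, its field names (asset, prices, date, close) and the numbers are invented for illustration and do not come from any of the APIs listed above; a real response would arrive through an HTTP call (for example with httr) and have its own structure.

```r
library(jsonlite)

# Hypothetical JSON payload, shaped like a typical price-history response.
# Field names and numbers are invented for illustration.
payload <- '{
  "asset": "bitcoin",
  "prices": [
    {"date": "2024-01-01", "close": 100.5},
    {"date": "2024-01-02", "close": 101.0},
    {"date": "2024-01-03", "close": 99.75}
  ]
}'

parsed <- fromJSON(payload)           # a JSON array of objects becomes a data frame
prices <- parsed$prices
prices$date  <- as.Date(prices$date)  # strings -> Date
prices$asset <- parsed$asset          # carry the asset name along as a column

str(prices)
```

Wrapper packages such as Quandl do this parsing for you, which is why they are rated "Easy" in the table.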

API access — Quandl example

library(Quandl)
library(tidyverse)
library(tidyquant)

# Specify API key (replace the placeholder with your own; never share or commit a real key)
Quandl.api_key("YOUR_API_KEY")

# Gold futures: positions and concentration ratios
com_gold  <- Quandl.datatable("QDL/LFON", contract_code = "088691", type = "FO_L_ALL")
conc_gold <- Quandl.datatable("QDL/FCR",  contract_code = "088691", type = "FO_L_ALL_CR")

# Silver, Bitcoin, Ethereum — same pattern, different contract codes
com_silver  <- Quandl.datatable("QDL/LFON", contract_code = "084691", type = "FO_L_ALL")
conc_silver <- Quandl.datatable("QDL/FCR",  contract_code = "084691", type = "FO_L_ALL_CR")
com_btc     <- Quandl.datatable("QDL/LFON", contract_code = "133741", type = "FO_L_ALL")
conc_btc    <- Quandl.datatable("QDL/FCR",  contract_code = "133741", type = "FO_L_ALL_CR")
com_eth     <- Quandl.datatable("QDL/LFON", contract_code = "146021", type = "FO_L_ALL")
conc_eth    <- Quandl.datatable("QDL/FCR",  contract_code = "146021", type = "FO_L_ALL_CR")
  • Futures are contracts to buy/sell assets at future prices/dates, used for hedging or speculation.
  • CFTC data is published weekly: positions by trader type (LFON) and concentration ratios (FCR).
  • Positions split by commercials (hedgers) vs non-commercials (speculators); spreads net out neutral bets.
  • Net positions (longs − shorts) gauge bullish/bearish sentiment; high concentration signals herding risk.
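
The net-position arithmetic in the last bullet can be sketched on a toy data frame. The three weekly rows below are made-up numbers, not CFTC data; only the column names follow the datasets above, and the sentiment column is a hypothetical helper.

```r
# Toy weekly positions (invented numbers, not CFTC data)
cot <- data.frame(
  date                  = as.Date(c("2024-01-02", "2024-01-09", "2024-01-16")),
  non_commercial_longs  = c(250000, 240000, 200000),
  non_commercial_shorts = c(100000, 120000, 210000)
)

# Net position of speculators: longs minus shorts
cot$net_longs <- cot$non_commercial_longs - cot$non_commercial_shorts

# Hypothetical label: the sign of the net position as a rough sentiment flag
cot$sentiment <- ifelse(cot$net_longs > 0, "net long", "net short")

print(cot)
```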

Key columns in CFTC datasets

Positions (QDL/LFON)

| Column | Description |
|---|---|
| contract_code | Financial contract code |
| type | F (futures) / FO (futures + options) |
| date | Date of record |
| market_participation | Number of traders |
| non_commercial_longs | Long positions, non-commercials |
| non_commercial_shorts | Short positions, non-commercials |
| commercial_longs | Long, commercial entities |
| commercial_shorts | Short, commercial entities |
| total_reportable_longs | Total reportable longs |
| total_reportable_shorts | Total reportable shorts |

Concentration ratios (QDL/FCR)

| Column | Description |
|---|---|
| largest_4_longs_gross | Gross long, top 4 traders |
| largest_4_shorts_gross | Gross short, top 4 |
| largest_8_longs_gross | Gross long, top 8 |
| largest_8_shorts_gross | Gross short, top 8 |
| largest_4_longs_net | Net long, top 4 |
| largest_4_shorts_net | Net short, top 4 |
| largest_8_longs_net | Net long, top 8 |
| largest_8_shorts_net | Net short, top 8 |

Import and cleanse CSV

library(readr)
library(dplyr)   # provides %>% and mutate()

# Gold
com_gold  <- read_csv("com_gold.csv") %>%
  mutate(date = as.Date(date), contract_code = as.character(contract_code))
conc_gold <- read_csv("conc_gold.csv") %>%
  mutate(date = as.Date(date), contract_code = as.character(contract_code))

# Silver, Bitcoin, Ethereum follow the same pattern …
  • read_csv() parses delimited text into a tibble (the tidyverse data frame).
  • mutate() updates / creates columns row-wise — here we coerce types so later joins behave.
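  • Always handle dates explicitly (as.Date): read_csv() usually recognises ISO dates (YYYY-MM-DD) on its own, but other formats stay as strings, which then join only on exact text matches and plot as categories instead of a time axis.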
  • The full tidyverse workflow: read → clean → merge → analyse → visualise.
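
As an alternative to fixing types after the import, column types can be declared in read_csv() itself. The sketch below writes a two-row CSV to a temporary file (invented values; the file stands in for com_gold.csv) and reads it back with explicit col_types, so the contract code keeps its leading zero and the date arrives as a Date.

```r
library(readr)

# Write a tiny CSV to a temporary file (invented values for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,contract_code,non_commercial_longs",
  "2024-01-02,088691,250000",
  "2024-01-09,088691,240000"
), tmp)

# Declare the type of every column at import time
com_toy <- read_csv(
  tmp,
  col_types = cols(
    date                 = col_date(format = "%Y-%m-%d"),
    contract_code        = col_character(),
    non_commercial_longs = col_double()
  )
)

str(com_toy)
```

Declaring types up front makes the script fail loudly if a file changes shape, instead of silently guessing a different type.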

Merge and append data

# Merge positions with concentration ratios per asset
merged_gold   <- merge(com_gold,   conc_gold,   by = "date", all = TRUE) %>% arrange(date)
merged_silver <- merge(com_silver, conc_silver, by = "date", all = TRUE) %>% arrange(date)
merged_btc    <- merge(com_btc,    conc_btc,    by = "date", all = TRUE) %>% arrange(date)
merged_eth    <- merge(com_eth,    conc_eth,    by = "date", all = TRUE) %>% arrange(date)

# Stack all four into one long data frame
combined <- bind_rows(
  merged_gold   %>% mutate(Asset = "Gold"),
  merged_silver %>% mutate(Asset = "Silver"),
  merged_btc    %>% mutate(Asset = "Bitcoin"),
  merged_eth    %>% mutate(Asset = "Ethereum")
)
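  • merge() joins two data frames by a common key (date here); all = TRUE makes it a full outer join, which keeps unmatched rows from both sides and fills their missing columns with NA. Columns that exist in both frames but are not listed in by (here contract_code and type) come back twice, with .x and .y suffixes.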
  • bind_rows() appends data frames row-wise — useful for stacking per-asset frames into a long panel.
  • arrange() sorts a frame by one or more columns.
  • Long-format panels are the natural input for dplyr and ggplot2.
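
A toy version of the merge step, with invented numbers, shows what all = TRUE does and why the choice of by matters. The frames pos and conc are hypothetical stand-ins for com_gold and conc_gold; merging on date and contract_code together is a variation on the code above, not the lecture's own call.

```r
# Two toy frames that overlap on two of their dates (invented numbers)
pos <- data.frame(
  date                 = as.Date(c("2024-01-02", "2024-01-09", "2024-01-16")),
  contract_code        = "088691",
  non_commercial_longs = c(250, 240, 200)
)
conc <- data.frame(
  date                = as.Date(c("2024-01-09", "2024-01-16", "2024-01-23")),
  contract_code       = "088691",
  largest_4_longs_net = c(30.1, 29.5, 31.0)
)

# Full outer join on date only: the shared non-key column is duplicated
by_date <- merge(pos, conc, by = "date", all = TRUE)
names(by_date)   # contract_code.x and contract_code.y appear

# Full outer join on both shared columns: one contract_code column remains
by_both <- merge(pos, conc, by = c("date", "contract_code"), all = TRUE)
by_both          # unmatched dates are kept, with NA in the missing columns
```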

Filter and mutate

combined_clean <- combined %>%
  filter(date > as.Date("2021-01-01")) %>%      # common sample window (aims at a balanced panel)
  mutate(
    Year          = lubridate::year(date),
    Asset_Factor  = factor(Asset),
    Net_Longs     = non_commercial_longs - non_commercial_shorts,
    Log_Net_Longs = log(abs(Net_Longs) + 1)     # +1 avoids log(0)
  ) %>%
  na.omit()
  • filter() subsets rows by a condition; mutate() adds or updates columns.
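  • lubridate::year(date) extracts the calendar year as a number, which is useful for grouping or faceting.
  • Derive new variables (here: net longs and their log transform) inside mutate() rather than in separate steps. Note that abs() discards the sign, so Log_Net_Longs measures the size of the net position, not its direction.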
  • na.omit() drops rows with any NA — be aware this can silently drop a lot of data; check nrow() before/after.
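
The warning about na.omit() can be checked on a toy frame. The data are invented, and tidyr::drop_na() is shown as one possible alternative (it is not used in the lecture code) that only looks at the columns you name.

```r
library(dplyr)
library(tidyr)

# Toy data with one NA in a position column and one NA in an unrelated column
toy <- tibble(
  date                  = as.Date("2024-01-02") + 7 * 0:3,
  non_commercial_longs  = c(250, 240, 200, NA),
  non_commercial_shorts = c(100, 120, 210, 150),
  note                  = c("a", NA, "c", "d")
) %>%
  mutate(Net_Longs = non_commercial_longs - non_commercial_shorts)

n_before  <- nrow(toy)
n_na_omit <- nrow(na.omit(toy))              # drops rows with an NA in ANY column
n_drop_na <- nrow(drop_na(toy, Net_Longs))   # drops rows with an NA in Net_Longs only

c(before = n_before, na_omit = n_na_omit, drop_na = n_drop_na)
```

Here the unrelated note column costs one extra row under na.omit(), which is the kind of silent loss worth checking on a merged frame with many columns.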

Group by and summarise

# Per-asset, per-year summary
summary_assets <- combined_clean %>%
  group_by(Asset, Year) %>%
  summarise(
    Mean_Net_Longs = mean(Net_Longs, na.rm = TRUE),
    SD_Net_Longs   = sd(Net_Longs,   na.rm = TRUE),
    .groups = "drop"
  )

# Z-scores within each Asset
combined_z <- combined_clean %>%
  group_by(Asset) %>%
  mutate(
    Z_Score = (Net_Longs - mean(Net_Longs, na.rm = TRUE)) /
              sd(Net_Longs, na.rm = TRUE)
  ) %>%
  ungroup()
  • group_by() partitions a frame into groups; summarise() collapses each group to one row.
  • Combine with mutate() (instead of summarise()) to add a column without collapsing — useful for z-scoring within groups.
  • Always ungroup() after group-wise mutations; lingering groups confuse downstream operations.
  • .groups = "drop" returns an ungrouped result and suppresses the message summarise() otherwise prints about the remaining grouping.
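
The difference between summarise() and a grouped mutate() is easiest to see in the row counts. The sketch below uses an invented six-row frame; the numbers are chosen so that the within-asset z-scores come out as round values.

```r
library(dplyr)

# Toy panel: three observations per asset (invented numbers)
toy <- tibble(
  Asset     = rep(c("Gold", "Bitcoin"), each = 3),
  Net_Longs = c(100, 150, 200, 1, 2, 3)
)

# summarise(): one row per group
per_asset <- toy %>%
  group_by(Asset) %>%
  summarise(Mean_Net_Longs = mean(Net_Longs), .groups = "drop")

# grouped mutate(): same number of rows as the input, statistics computed per group
with_z <- toy %>%
  group_by(Asset) %>%
  mutate(Z_Score = (Net_Longs - mean(Net_Longs)) / sd(Net_Longs)) %>%
  ungroup()

nrow(per_asset)  # number of assets
nrow(with_z)     # number of original rows
with_z
```

Although the two assets are on very different scales, their z-scores are directly comparable, which is the point of normalising within groups.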

Pivot data

# Wide to long — summary stats for ggplot2
long_summary <- summary_assets %>%
  pivot_longer(cols = c(Mean_Net_Longs, SD_Net_Longs),
               names_to = "Stat", values_to = "Value")

# Long to wide — scatterplot data
scatterplot_data <- combined_clean %>%
  select(date, Asset, Net_Longs) %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs) %>%
  na.omit()

# Pivot + correlate — full correlation matrix
cor_matrix <- combined_clean %>%
  select(date, Asset, Net_Longs) %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
  select(-date) %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column(var = "Asset1") %>%
  pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
  mutate(Correlation = round(Correlation, 2)) %>%
  filter(!is.na(Correlation))
  • Pivoting reshapes data between wide format (one row per date, one column per asset or statistic) and long format (one row per date and asset, with a name column and a value column).
  • pivot_longer() collapses many columns into key/value pairs; pivot_wider() does the inverse.
  • Long format plays best with ggplot2’s grammar; wide format is convenient for correlation matrices and pair-wise plots.
  • Pipelines often switch format along the way: reshape to whichever layout the next step expects (wide for cor(), long for ggplot2).
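
A minimal round trip on invented data shows that pivot_wider() and pivot_longer() are inverses of each other. The frame long_toy is a hypothetical stand-in for the date, Asset and Net_Longs columns of combined_clean.

```r
library(dplyr)
library(tidyr)

# Toy long-format data: one row per date and asset (invented numbers)
long_toy <- tibble(
  date      = rep(as.Date(c("2024-01-02", "2024-01-09")), times = 2),
  Asset     = rep(c("Gold", "Bitcoin"), each = 2),
  Net_Longs = c(150, 120, 5, -3)
)

# Long -> wide: one row per date, one column per asset
wide_toy <- long_toy %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs)

# Wide -> long again: every column except date is folded back into name/value pairs
back_toy <- wide_toy %>%
  pivot_longer(-date, names_to = "Asset", values_to = "Net_Longs")

wide_toy
back_toy
```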

Data visualization in R — package landscape

| Package | Description | Use it for | Weaknesses |
|---|---|---|---|
| ggplot2 | Layered grammar of graphics | Publication-ready plots | Steeper learning curve; verbose for simple plots |
| Base R graphics | plot(), hist(), barplot(), … | Rapid exploratory checks | Limited customization, dated look |
| lattice | Trellis-style multi-panel plots | Conditioned/multi-panel views | Outdated formula syntax |
| plotly | Interactive ggplot2/base extension | Web dashboards, hover/zoom | Overhead for static use |
| shiny | Reactive web apps | Custom dashboards | Steep learning, deployment needed |

ggplot2 — the six steps

  1. Select data: ggplot(data).
  2. Aesthetic mapping: aes(x, y, colour, size, …).
  3. Geometric objects: geom_*() (point, line, bar, …).
  4. Labels: labs().
  5. Facets: facet_wrap(), facet_grid().
  6. Theme: theme_minimal(), theme_classic(), etc.

Steps 1–3 are mandatory; 4–6 polish the output. For reference, see the ggplot2 cheat sheet (Posit 2022).

ggplot2 — building up step by step

library(ggplot2)

# Step 1: data
ggplot(combined_clean)

# Step 2: aesthetic mapping
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset))

# Step 3: geometric object
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line()

# Step 4: labels (Net_Longs is the raw net position, so the title says so)
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line() +
  labs(title = "Non-Commercial Net Longs by Asset",
       x = "Date", y = "Net Long Positions")

# Step 5: facets
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line() +
  labs(title = "Non-Commercial Net Longs by Asset",
       x = "Date", y = "Net Long Positions") +
  facet_wrap(~ Asset)

# Step 6: theme (here tidyquant's theme_tq(); needs library(tidyquant))
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line() +
  labs(title = "Non-Commercial Net Longs by Asset",
       x = "Date", y = "Net Long Positions") +
  facet_wrap(~ Asset) +
  theme_tq()
  • ggplot2 is declarative — you describe what to plot, not how.
  • Layers compose with +; each layer adds a geom, scale, or theme.
  • facet_wrap() splits one chart into one panel per category — great for cross-asset comparisons.
  • Pick a theme_*() once and reuse — keeps the deck visually consistent.
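
Because layers compose with +, the repeated ggplot(...) call in the steps above can be stored once and extended. The sketch uses a small invented data frame (toy) in place of combined_clean; the object names base, p_line and p_full are illustrative.

```r
library(ggplot2)

# Toy long-format panel (invented numbers)
toy <- data.frame(
  date      = rep(as.Date("2024-01-02") + 7 * 0:2, times = 2),
  Asset     = rep(c("Gold", "Bitcoin"), each = 3),
  Net_Longs = c(150, 120, 130, 5, -3, 8)
)

# Steps 1-2: data and aesthetic mapping, stored once
base <- ggplot(toy, aes(x = date, y = Net_Longs, color = Asset))

# Step 3: add a geom to the stored object
p_line <- base + geom_line()

# Steps 4-6: labels, facets and theme on top of the same object
p_full <- p_line +
  labs(title = "Net Longs by Asset", x = "Date", y = "Net Long Positions") +
  facet_wrap(~ Asset) +
  theme_minimal()

print(p_full)
```

Storing intermediate objects also makes it easy to try a different geom or theme without retyping the mapping.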

Example 1 — line chart with smoother

library(ggplot2)

ggplot(combined_z, aes(x = date, y = Z_Score, color = Asset)) +
  geom_smooth(method = "loess", se = FALSE) +                 # local-polynomial trend
  geom_hline(yintercept = 0, lty = 2) +                       # mean reference
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +   # 2022 break
  theme_minimal() +
  labs(title = "Z-Score Normalized Trends: Deviations from Mean Sentiment",
       subtitle = "Positive = above-average bullishness; e.g., crypto spikes in 2023 recovery")
  • geom_smooth(method = "loess") adds a non-parametric trend — useful when a straight line under-fits.
  • geom_hline() / geom_vline() mark reference levels (mean = 0, structural-break date).
  • Use a subtitle to encode interpretation directly in the chart — saves explaining in the caption.

Example 2 — bar chart with error bars

ggplot(summary_assets, aes(x = factor(Year), y = Mean_Net_Longs, fill = Asset)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = Mean_Net_Longs - SD_Net_Longs,
                    ymax = Mean_Net_Longs + SD_Net_Longs),
                position = position_dodge(width = 0.9), width = 0.2) +
  theme_minimal() +
  labs(title = "Mean Net Longs by Asset & Year", x = "Year", y = "Mean Net Longs")
  • geom_bar(stat = "identity") plots the column value as the bar height (stat = "count" would tally rows instead); geom_col() is the built-in shorthand for the same thing.
  • position = "dodge" puts groups side-by-side; position = "stack" would stack them.
  • Error bars (here ±1 SD) communicate dispersion alongside the mean.

Example 3 — faceted boxplots

ggplot(combined_clean, aes(x = factor(Year), y = Net_Longs, fill = Asset)) +
  geom_boxplot() +
  facet_wrap(~ Asset, scales = "free_y") +
  theme_minimal() +
  labs(title = "Annual Distributions: Volatility Comparison",
       x = "Year", y = "Net Longs")
  • Boxplots show median, IQR, and outliers — much more informative than means alone.
  • facet_wrap(~ Asset, scales = "free_y") lets each panel use its own y-range — better when assets are on different scales.
  • Useful for spotting outliers / distribution shifts year over year.

Example 4 — scatter + regression line

ggplot(scatterplot_data, aes(x = Gold, y = Bitcoin)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", colour = "red", se = FALSE) +
  theme_minimal() +
  labs(title = "Gold vs Bitcoin Net Longs (paired dates only)",
       x = "Gold Net Longs", y = "Bitcoin Net Longs")
  • alpha = 0.5 makes overlapping points readable in dense scatterplots.
  • geom_smooth(method = "lm") overlays an OLS regression line — quickly shows whether two series co-move.
  • Pair this with cor() to put a number on the visual relationship.
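
To put a number on a scatter plot like the one above, cor() and lm() can be run on the same two columns. The values below are invented stand-ins for the Gold and Bitcoin columns of scatterplot_data. The last line checks the textbook identity that the OLS slope equals the correlation times sd(y) / sd(x).

```r
# Invented paired observations standing in for scatterplot_data
toy <- data.frame(
  Gold    = c(100, 120, 90, 150, 130, 110),
  Bitcoin = c(4, 6, 3, 9, 6, 5)
)

r     <- cor(toy$Gold, toy$Bitcoin)       # Pearson correlation
fit   <- lm(Bitcoin ~ Gold, data = toy)   # the line geom_smooth(method = "lm") draws
slope <- unname(coef(fit)["Gold"])

# OLS slope = correlation * sd(y) / sd(x)
all.equal(slope, r * sd(toy$Bitcoin) / sd(toy$Gold))
```

The correlation is unit-free, while the slope depends on the scale of both series, so the two numbers answer slightly different questions.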

Example 5 — correlation heatmap

ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Correlation), color = "black", size = 3, fontface = "bold") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(size = 10)) +
  labs(title = "Correlation Matrix of Net Longs (Assets)",
       subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
       x = "Asset", y = "Asset")
  • Heatmaps make pairwise correlations easy to scan — colour shows sign and magnitude at a glance.
  • scale_fill_gradient2() with midpoint = 0 is the right default for correlation data (red/blue diverging palette).
  • Annotate cells with the value (geom_text()) so the reader doesn’t have to map colour to number.

Exporting plots

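# ggsave() writes the most recently displayed plot; the file extension picks the format
ggsave("myplot.png")
ggsave("myplot.pdf")
ggsave("myplot.svg")                      # SVG output needs the svglite package
ggsave("C:/Users/R Intro/myplot.tiff")    # full path; forward slashes also work on Windows

# Custom size (default unit: inches)
ggsave("plot.pdf", width = 6, height = 4)
ggsave("plot.pdf", width = 15, height = 10, units = "cm")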
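
ggsave() saves the last displayed plot by default, which is easy to get wrong in a longer script. A slightly more defensive sketch passes the plot object explicitly. It uses the built-in mtcars data and a temporary directory purely for illustration; in your own project you would save into your figures folder instead.

```r
library(ggplot2)

# Any plot object will do; mtcars ships with R
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Save a specific plot object with an explicit size in centimetres
out_file <- file.path(tempdir(), "myplot.pdf")
ggsave(out_file, plot = p, width = 15, height = 10, units = "cm")

file.exists(out_file)
```

Fixing width and height in the script keeps figures the same size every time the code is rerun, which helps when they are included in a LaTeX document.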

Troubleshooting R

  • RStudio Help tab (lower-right pane) — home icon for resources/manuals; search bar for functions and packages.
  • In the console or script editor:
    • help.start() opens the main Help page.
    • help(FunctionName) or ?FunctionName shows function help.
  • Online: RStudio cheatsheets, Posit Community, Stack Overflow, GitHub, LLMs.

2.4 Introduction to Overleaf

LaTeX vs Overleaf

LaTeX

  • A markup language for professional documents with precise control over layout, equations and tables; widely used in academia.
  • Plain text → formatted PDF; ideal for papers/theses with complex math (regressions, matrices).
  • Basic structure uses commands in braces, e.g. \section{Title}, and environments for tables/figures.
  • Advantages: automatic numbering, bibliographies, consistency. Steeper learning curve for beginners.

Overleaf

  • A free online editor for LaTeX with real-time collaboration and instant PDF previews; no local install needed.
  • Create a project, paste code, hit Recompile. Perfect for group assignments / thesis drafts.
  • Templates, spell-check, Git integration; sign in with university email for full access.
  • Start with our Moodle template: copy-paste into a new project, edit sections, export PDF.

Getting started with Overleaf

  • Setup: create a free account at overleaf.com; New project → Blank Project or import a template. Edit the TeX source in the left pane; the PDF preview appears in the right pane.
  • Key features: syntax highlighting, error finder, Git integration; share links for collaboration.
  • Workflow: write in sections (\section{}), insert tables (\begin{table}…\end{table} with tabular), add R figures (\includegraphics{plot.png}), cite (\cite{key} with BibTeX).
  • Assignment template: download from Moodle. Copy-paste into a new Overleaf project; replace placeholders (e.g., \section{Results}) with your content.
  • Tips: \usepackage{graphicx} for images, \begin{figure}…\end{figure} for floats. Debug errors shown in red (e.g., missing $ for math). Practice with a 1-page test before the full paper.

2.5 Conclusion of Lecture 2

Further reading

  • Posit (2022): quick reference for the ggplot2 grammar.
  • Wickham, Navarro, and Pedersen (2024): comprehensive ggplot2 reference, free draft at https://ggplot2-book.org/.
  • All Lecture 1 R recommendations.
  • Lamport (1994): Leslie Lamport’s classic LaTeX manual.
  • Flynn (2019): lshort, a beginner-friendly free LaTeX intro.
  • Overleaf (2023): Overleaf’s hands-on 30-minute LaTeX tutorial.

Prepare before next lecture

  1. Document today’s code in a clean way and save as .Rmd.
  2. Try out a few visualizations based on the data we prepared today.

See you next time

Reminder

  • Register for “exam” 13337 in campusonline by 30 November 2025.
  • Lecture 3: Statistical analysis in R — correlations, t-tests, Wilcoxon, Shapiro-Wilk, KS, linear regression, fixed effects.

References

Flynn, Peter. 2019. “Formatting Information: Introduction to LaTeX.” TeX Users Group (TUG). https://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf.
Lamport, Leslie. 1994. LaTeX: A Document Preparation System. 2nd ed. Reading, MA: Addison-Wesley. https://www.latex-project.org/help/documentation/.
Overleaf. 2023. “Overleaf Tutorial: Basic Guide to LaTeX.” https://www.overleaf.com/learn/latex/Learn_LaTeX_in_30_minutes.
Posit. 2022. “ggplot2 Cheat Sheet.” https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf.
Wickham, Hadley, and Mine Çetinkaya-Rundel. 2023. R for Data Science. 2nd ed. O’Reilly. https://r4ds.hadley.nz/.
Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2024. ggplot2: Elegant Graphics for Data Analysis. 3rd ed. https://ggplot2-book.org/.