Lecture 2: Data Handling & Visualization

API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf

Prof. Dr. Andre Guettler, Director of the Institute
Helmholtzstraße 22, Room 205
andre.guettler@uni-ulm.de
+49 731 50 31 030

Oliver Padmaperuma, Doctoral Candidate
Helmholtzstraße 22, Room 203
oliver.padmaperuma@uni-ulm.de
+49 731 50 31 036

2.1 Course objectives

2.1 Course objectives
2.2 Recap from Lecture 1
2.3 Live Coding Session 2
2.4 Introduction to Overleaf
2.5 Conclusion of Lecture 2

Welcome to Research in Finance
Course Objective
Course at a glance
Assignments / Exams

Welcome to Research in Finance

Register for “exam” 13337 in campusonline by 30 November 2025. The registration is what binds you to the course requirements; without it you cannot submit. If you are registered but don’t submit, you receive a fail grade (5.0).
Ask questions during or right after each session — that is the preferred channel.
Admin / studies / exam-eligibility questions go to the registrar’s office (Studiensekretariat) at studiensekretariat@uni-ulm.de.
Course-content questions outside class: email oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de.
We also recommend the student advisory service.

Course Objective

Scope

We will:

Prepare Master students for their empirical thesis
Hands-on R intro for data management, visualization, cleaning, basic modelling
Writing tips for theses, including LaTeX & Overleaf
Referee reviews on research presentations for empirical critique skills

We will NOT:

Deep dive into advanced stats or ML methods
Specific finance topics (asset pricing, etc.)
Full thesis writing / research design training

Approach

Part I — Learn the Basics

Hands-on R intro: a widely used language for statistical computing
Manage, visualize and clean data; run and interpret statistical models
Solve a real empirical problem set in R, in groups

Part II — Apply your learnings

Mandatory participation in the institute’s Brown Bag Seminar
Two assignments (group work and individual referee report) — see Assignments / Exams

Course at a glance

Basics

Week 1

29.10.2025

Course objectives, schedule, assignments · Introduction to R · Live coding

Course objectives, schedule and assignments
Introduction to R and RStudio
Live coding: variables, vectors, matrices, data frames, lists, functions, loops
Data import and export

Data Handling & Visualization

Week 2

05.11.2025

API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf

API access (Nasdaq Data Link / Quandl, FRED, Yahoo, Coingecko, Polygon)
Import and cleanse: read_csv, mutate, types
Merge and append data (merge, bind_rows)
Filter and mutate (dplyr): subset rows, derive variables
Group by and summarise
Pivot wide / long
Data visualization with ggplot2 (six-step pipeline)
Introduction to LaTeX and Overleaf

Statistical Analysis

Week 3

12.11.2025

Descriptive · inferential · modelling — applied in R

Descriptive statistics in R
Correlation matrix and Pearson correlation test
t-Test and Wilcoxon test
Shapiro-Wilk and Kolmogorov-Smirnov tests
Linear regression with fixed effects
Clustered standard errors
Exporting regression tables with stargazer
Discussion of Assignment I (Problem Set)

Academic Publishing & Refereeing

Week 4

19.11.2025

What makes a great empirical paper · publication process · how to write a referee report

What makes a good empirical paper (contribution, identification, write-up)
The publication process step by step
Top finance and economics journals
Bad outcome vs revise & resubmit
Referee Reports — summary, major issues, minor issues
Referee checklist (question, identification, data, econometrics, results)
Discussion of Assignment II (Referee Report)

Brown Bag Seminar

Week 13

20.01.2026

Engage with doctoral research and prepare your referee report

Doctoral research presentations
Apply empirical / writing tips for the referee report
Group discussion and Q&A

Assignments / Exams

Assignment I — Problem Set 50% of your grade

Documented .R script + PDF write-up (Overleaf)

Group of up to 5.

Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…

19 January 2026

Assignment II — Referee Report 50% of your grade

2.5–3 page referee report on a Brown-Bag presentation

Group of up to 5.

Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…

3 February 2026

2.2 Recap from Lecture 1

What we covered

Topic	What is it?	What for?	Example syntax
Basic R	Free language for statistical computing	Calculations, exploring data	`2 + 3`, `log(10)`
Variables & basic ops	Store data; perform math/logic	Assigning prices, ratios	`var <- value`, `var > 0`
Vectors	Ordered collections of one type	Time series of returns	`c(1, 2, 3)`, `length(vec)`
Matrices	2D arrays	Covariance matrices	`matrix(...)`, `mat1 %*% mat2`
Data frames	Mixed-column tables	Datasets for analysis	`data.frame(...)`, `df$col`
Lists	Flexible mixed-type containers	Storing model outputs	`list(a = 1, b = c("x", "y"))`
Functions	Reusable code blocks	Sharpe ratio calc	`f <- function(x) x*2`
Loops	Iterate over sequences	Monte Carlo runs	`for (i in 1:3) {…}`
Import / export	Load and save	CSV, RDS workflow	`read_csv("f.csv")`

2.3 Live Coding Session 2

API access — overview
API access — Quandl example
Key columns in CFTC datasets
Import and cleanse CSV
Merge and append data
Filter and mutate
Group by and summarise
Pivot data
Data visualization in R — package landscape
ggplot2 — the six steps
ggplot2 — building up step by step
Example 1 — line chart with smoother
Example 2 — bar chart with error bars
Example 3 — faceted boxplots
Example 4 — scatter + regression line
Example 5 — correlation heatmap
Exporting plots
Troubleshooting R

API access — overview

APIs (Application Programming Interfaces) let R fetch real-time / structured data from external sources — automating data collection for reproducible empirical analysis.
Many APIs exist (thousands for finance/STEM); effort varies — some (like Quandl) deliver ready tables, others require JSON parsing, field specification, authentication. Start simple to avoid frustration.
Use packages like httr/jsonlite for general calls, or wrappers (e.g., Quandl) for ease.
Whenever possible, get a free key and test with small queries first.
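To make the "JSON parsing" effort concrete, here is a minimal sketch, assuming the jsonlite package and a made-up payload in which each entry is a [millisecond timestamp, price] pair. The field name prices and all values are illustrative assumptions; a real API documents its own layout.

```r
library(jsonlite)

# Hypothetical payload (not taken from any real API response)
payload <- '{"prices": [[1704067200000, 42000.5], [1704153600000, 43100.0]]}'

# fromJSON() simplifies an array of equal-length numeric arrays to a matrix
parsed <- fromJSON(payload)

# Convert millisecond timestamps to Date and build a tidy two-column table
prices <- data.frame(
  date  = as.Date(as.POSIXct(parsed$prices[, 1] / 1000,
                             origin = "1970-01-01", tz = "UTC")),
  price = parsed$prices[, 2]
)
print(prices)
```

Wrapper packages do this conversion for you, which is why they are rated "Easy" in the table that follows.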

API	Description	Ease	Best for
Nasdaq Data Link (Quandl)	Comprehensive finance datasets (e.g., CFTC positions) as tables	Easy (`Quandl` package)	Futures / sentiment analysis
FRED (Federal Reserve)	Economic indicators (CPI, unemployment)	Easy (`fredr` package)	Macro trends in empirical work
Yahoo Finance	Historical prices/volumes via `quantmod`	Easy (`quantmod` package)	Quick stock data for portfolios
Coingecko	Crypto prices, historical charts	Medium (JSON parsing)	Crypto time series for volatility studies
Polygon.io	Real-time stocks/crypto (free tier)	Medium (API calls)	High-frequency data for advanced models

The shift from “downloading a CSV by hand” to “calling an API in code” is one of the biggest reproducibility wins in empirical finance. With manual downloads, the data-as-of date is whatever you happened to click; with an API call inside your script, the query itself (series, date range, filters) is documented, and re-running the script a year later repeats exactly the same request. Fix the date range and cache the result if you need an identical dataset, because providers revise and extend their series; leave the range open if you want fresh data. Both modes have their place (for one-off exploratory work a manual CSV is fine), but the assignment expects an API-driven pipeline so a marker can re-run your analysis end-to-end.

A pragmatic order to learn the APIs: FRED (the simplest, no JSON parsing: after fredr_set_key() with a free key, fredr(series_id = "UNRATE") returns a ready-made table) → Yahoo Finance via quantmod (one-line price downloads) → Quandl / Nasdaq Data Link (today’s example, which additionally requires looking up contract codes) → Polygon / Coingecko (raw JSON, more plumbing). Don’t fight a complex API for a simple question; FRED has thousands of macro series and is enough for most term papers.

API keys belong in environment variables (.Renviron), not in code that gets committed. Sys.setenv(NASDAQ_API_KEY = "...") for a session, or persist in .Renviron for permanence; usethis::edit_r_environ() opens the file at the right location. Never paste a key into a script you push to GitHub.
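A minimal sketch of reading the key at run time, using base R only. The helper name get_api_key is hypothetical, and the dummy value is set here purely so the example runs; in practice the value would come from .Renviron.

```r
# Hypothetical helper: fetch a key from the environment and fail loudly if absent
get_api_key <- function(var = "NASDAQ_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var,
         " is not set; add it to .Renviron and restart R.")
  }
  key
}

# For illustration only: set a dummy value for this session
Sys.setenv(NASDAQ_API_KEY = "dummy-key-for-illustration")
key <- get_api_key()
```

Failing early with a clear message is friendlier than letting the API reject an empty key several steps later.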

API access — Quandl example

library(Quandl)
library(tidyverse)
library(tidyquant)

# Read the API key from an environment variable (set NASDAQ_API_KEY in .Renviron);
# never hard-code a key in a script
Quandl.api_key(Sys.getenv("NASDAQ_API_KEY"))

# Gold futures: positions and concentration ratios
com_gold  <- Quandl.datatable("QDL/LFON", contract_code = "088691", type = "FO_L_ALL")
conc_gold <- Quandl.datatable("QDL/FCR",  contract_code = "088691", type = "FO_L_ALL_CR")

# Silver, Bitcoin, Ethereum — same pattern, different contract codes
com_silver  <- Quandl.datatable("QDL/LFON", contract_code = "084691", type = "FO_L_ALL")
conc_silver <- Quandl.datatable("QDL/FCR",  contract_code = "084691", type = "FO_L_ALL_CR")
com_btc     <- Quandl.datatable("QDL/LFON", contract_code = "133741", type = "FO_L_ALL")
conc_btc    <- Quandl.datatable("QDL/FCR",  contract_code = "133741", type = "FO_L_ALL_CR")
com_eth     <- Quandl.datatable("QDL/LFON", contract_code = "146021", type = "FO_L_ALL")
conc_eth    <- Quandl.datatable("QDL/FCR",  contract_code = "146021", type = "FO_L_ALL_CR")

Futures are standardised contracts to buy or sell an asset at a price fixed today, for delivery or settlement on a set future date; they are used for hedging or speculation.
CFTC data is published weekly: positions by trader type (LFON) and concentration ratios (FCR).
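Positions are split into commercials (hedgers) and non-commercials (speculators); spreading positions are a trader’s offsetting longs and shorts, reported separately because they carry no net directional view.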
Net positions (longs − shorts) gauge bullish/bearish sentiment; high concentration signals herding risk.

The CFTC’s Commitments of Traders (COT) report is one of the cleanest publicly available windows into who is positioned where in the futures market. Every Friday the CFTC releases a snapshot of open positions as of the prior Tuesday, broken down by trader type. That weekly cadence is short enough to react to news and long enough to be free of intraday noise — the academic literature uses COT data extensively to study sentiment, momentum, and the disaggregated demand for hedging vs speculation.

The two datasets we pull have complementary roles:

QDL/LFON (positions): how many contracts are held long vs short, by trader type. The headline number is the non-commercial net position, often used as a sentiment proxy — a large net long among speculators is conventionally read as bullish, though the literature is mixed on whether it predicts returns.
QDL/FCR (concentration ratios): what share of positions sits with the largest 4 or 8 traders. High concentration means a few players dominate; if their views happen to be wrong, unwinding is a source of price-impact risk.

The contract codes (088691 = gold, 084691 = silver, 133741 = bitcoin futures, 146021 = ether futures) are CFTC-internal identifiers — the full list is in the CFTC explanatory notes; assignment groups will look these up when extending the analysis to other markets.
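When the analysis grows to more markets, a named vector of contract codes keeps the repetition in check. The sketch below is an assumption-laden illustration: fetch_positions is a stand-in that returns a dummy row so the example runs offline; in real use its body would be the Quandl.datatable() call from the slide.

```r
# Contract codes from the text, stored as character so leading zeros survive
contract_codes <- c(Gold = "088691", Silver = "084691",
                    Bitcoin = "133741", Ethereum = "146021")

# Stand-in fetcher (hypothetical): replace the body with the real API call
fetch_positions <- function(code) {
  data.frame(contract_code = code,
             date = as.Date("2024-01-02"),
             non_commercial_longs = NA_real_)
}

# lapply() keeps the names, so the result is a named list of data frames
positions <- lapply(contract_codes, fetch_positions)
names(positions)
```

Adding a fifth market then means adding one element to contract_codes rather than two more copied lines.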

Key columns in CFTC datasets

Column	Description
`contract_code`	Financial contract code
`type`	F (futures) / FO (futures + options)
`date`	Date of record
`market_participation`	Number of traders
`non_commercial_longs`	Long positions, non-commercials
`non_commercial_shorts`	Short positions, non-commercials
`commercial_longs`	Long, commercial entities
`commercial_shorts`	Short, commercial entities
`total_reportable_longs`	Total reportable longs
`total_reportable_shorts`	Total reportable shorts

Column	Description
`largest_4_longs_gross`	Gross long, top 4 traders
`largest_4_shorts_gross`	Gross short, top 4
`largest_8_longs_gross`	Gross long, top 8
`largest_8_shorts_gross`	Gross short, top 8
`largest_4_longs_net`	Net long, top 4
`largest_4_shorts_net`	Net short, top 4
`largest_8_longs_net`	Net long, top 8
`largest_8_shorts_net`	Net short, top 8

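Both tables carry contract_code, type and date. Within a single contract, date identifies a row, which is why the merge later in this session joins on date alone; the type values differ between the two tables (FO_L_ALL vs FO_L_ALL_CR in the calls above), so type cannot serve as a join key. The columns you’ll most often build derived metrics from: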

Net positions = non_commercial_longs − non_commercial_shorts. A widely cited “speculator sentiment” measure. The corresponding commercial number is its near-mirror image — commercials hedge their physical exposure, so a long-cash producer is short futures.
Net % of open interest = net position / total open interest. Removes the scale effect (markets grow over time); allows cross-market comparison.
Top-4 / top-8 net concentration (from QDL/FCR) — early-warning indicator for crowded trades. A reading near the historical maximum often coincides with reversals (when the crowd unwinds).
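The first two metrics can be sketched on toy numbers as follows. The values are invented, and open_interest is an assumed column name (it is not in the column list above); check the actual dataset for the corresponding field.

```r
library(dplyr)

# Toy data; open_interest is an assumed column name for illustration
cot <- tibble(
  date                  = as.Date(c("2024-01-02", "2024-01-09")),
  non_commercial_longs  = c(250000, 240000),
  non_commercial_shorts = c(50000, 80000),
  open_interest         = c(500000, 400000)
)

cot_metrics <- cot %>%
  mutate(
    Net_Longs  = non_commercial_longs - non_commercial_shorts,  # contracts
    Net_Pct_OI = Net_Longs / open_interest                      # share of open interest
  )
print(cot_metrics)
```

Although the net position falls from 200,000 to 160,000 contracts, the share of open interest is 0.4 in both weeks, which is the scale effect the ratio is meant to remove.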

For the assignment, you’ll typically want a panel with one row per (date, asset) and columns for net positions, concentration, and any derived measures — that’s exactly what the merge + bind_rows pipeline two slides ahead constructs. The “long” panel layout makes group-wise statistics (group_by(Asset)) and faceted plots (facet_wrap(~ Asset)) one-liners.

Import and cleanse CSV

library(readr)
library(dplyr)   # provides %>% and mutate()

# Gold: read contract_code as text so the leading zero in "088691" survives the import
com_gold  <- read_csv("com_gold.csv",
                      col_types = cols(contract_code = col_character())) %>%
  mutate(date = as.Date(date))
conc_gold <- read_csv("conc_gold.csv",
                      col_types = cols(contract_code = col_character())) %>%
  mutate(date = as.Date(date))

# Silver, Bitcoin, Ethereum follow the same pattern …

read_csv() parses delimited text into a tibble (the tidyverse data frame).
mutate() adds or overwrites columns using vectorised expressions; here we coerce types so later joins behave.
Always handle dates explicitly (as.Date): leaving them as strings risks mismatched join keys in merge() and gives ggplot2 a discrete axis instead of a date axis.
The full tidyverse workflow: read → clean → merge → analyse → visualise.

The CSV-import pattern shown here is what you’ll use whenever the API call has been done once and saved to disk (the typical workflow during assignment work: pull once, cache, iterate locally). read_csv() from readr is preferred over base read.csv() for three reasons: (1) it is much faster on files of any meaningful size, (2) it never converts strings to factors (base read.csv() did so by default before R 4.0), and (3) the parsed-column-types summary it prints when you load a file is the first sanity check that nothing went sideways.
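The pull-once-then-cache workflow can be sketched as below. Both cached_fetch and fake_api are hypothetical names introduced for illustration; fake_api stands in for a real API call and counts how often it is invoked.

```r
library(readr)

# Hypothetical helper: call fetch_fun() only if the cache file does not exist yet
cached_fetch <- function(path, fetch_fun) {
  if (!file.exists(path)) {
    write_csv(fetch_fun(), path)
  }
  read_csv(path,
           col_types = cols(contract_code = col_character(),
                            date = col_date()))
}

# Stand-in for a real API call (assumption): returns one dummy row
n_calls <- 0
fake_api <- function() {
  n_calls <<- n_calls + 1
  data.frame(contract_code = "088691",
             date = as.Date("2024-01-02"),
             non_commercial_longs = 250000)
}

cache_file <- tempfile(fileext = ".csv")
first  <- cached_fetch(cache_file, fake_api)  # triggers the fetch
second <- cached_fetch(cache_file, fake_api)  # served from disk
```

Deleting the cache file is then the explicit, deliberate way to refresh the data.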

Two type-coercion habits worth burning in:

Dates as Date (not character) — as.Date(date_string) if your strings are ISO-format (“2024-03-15”), otherwise lubridate::ymd() / mdy() / dmy() parse most ambiguous formats. Date-typed columns sort correctly, support > / < comparisons, and play nicely with ggplot2’s date axes.
IDs as character (not numeric) — CFTC contract codes are integer-looking but you should never do arithmetic on them. Storing as character avoids leading-zero loss ("088691" stays “088691”; numeric coercion drops the leading zero).
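Both habits are easy to verify on small inputs. A minimal sketch, assuming the lubridate package is installed; the example strings are made up.

```r
library(lubridate)

# IDs: a numeric round-trip silently loses the leading zero
code_num  <- as.numeric("088691")
code_back <- as.character(code_num)   # no longer equal to "088691"

# Dates: ISO strings parse with as.Date(); other layouts need an explicit parser
d_iso <- as.Date("2024-03-15")
d_de  <- dmy("15.03.2024")   # day-month-year
d_us  <- mdy("03/15/2024")   # month-day-year

# Date-typed values support comparisons directly
is_before_april <- d_iso < as.Date("2024-04-01")
```

All three date variables hold the same Date value, whereas the ID cannot be recovered once it has been through a numeric type.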

The %>% pipe (or |> in base R 4.1+) is the tidyverse’s chaining operator: x %>% f(y) is the same as f(x, y). Chains read top-to-bottom like a recipe and avoid deeply nested function calls.
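The equivalence is quick to check on a toy vector. This sketch assumes the magrittr package for %>% and R 4.1 or later for |>; the return values are invented.

```r
library(magrittr)

returns <- c(0.01, -0.02, 0.015)   # made-up returns

nested <- round(mean(returns), 4)            # inside-out
piped  <- returns %>% mean() %>% round(4)    # magrittr pipe
native <- returns |> mean() |> round(4)      # base R pipe

identical(nested, piped)
identical(nested, native)
```

All three spell the same computation; the piped versions simply read in execution order.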

Merge and append data

# Merge positions with concentration ratios per asset
merged_gold   <- merge(com_gold,   conc_gold,   by = "date", all = TRUE) %>% arrange(date)
merged_silver <- merge(com_silver, conc_silver, by = "date", all = TRUE) %>% arrange(date)
merged_btc    <- merge(com_btc,    conc_btc,    by = "date", all = TRUE) %>% arrange(date)
merged_eth    <- merge(com_eth,    conc_eth,    by = "date", all = TRUE) %>% arrange(date)

# Stack all four into one long data frame
combined <- bind_rows(
  merged_gold   %>% mutate(Asset = "Gold"),
  merged_silver %>% mutate(Asset = "Silver"),
  merged_btc    %>% mutate(Asset = "Bitcoin"),
  merged_eth    %>% mutate(Asset = "Ethereum")
)

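merge() joins two data frames on a common key (date here); all = TRUE is a full outer join that keeps unmatched rows from both sides and fills the gaps with NA. Columns that share a name but are not part of the key (contract_code, type) come back with .x / .y suffixes.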
bind_rows() appends data frames row-wise — useful for stacking per-asset frames into a long panel.
arrange() sorts a frame by one or more columns.
Long-format panels are the natural input for dplyr and ggplot2.

Two operations to keep mentally distinct:

merge() (or dplyr::left_join / inner_join etc.) — combines two data frames side-by-side by matching rows on a shared key. Here we join positions and concentration tables for the same asset on date. all = TRUE is an outer join: rows that exist in only one table are kept, with NA filling the missing columns. all.x = TRUE is a left outer join (keep all rows from the left frame); all = FALSE (the default) is an inner join (keep only matched rows). Choose based on whether you want to preserve “we have positions data on this date but no concentration data” rows.
bind_rows(): stacks data frames on top of each other. Columns are matched by name; a column missing from one frame is filled with NA, and same-named columns must have compatible types. Useful when each asset’s data lives in its own frame (as here) and you want one long panel keyed by asset + date.
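The join variants are easiest to tell apart by counting rows. A minimal sketch on invented data, where the two toy tables overlap on two of their three dates:

```r
library(dplyr)

pos <- data.frame(
  date = as.Date(c("2024-01-02", "2024-01-09", "2024-01-16")),
  non_commercial_longs = c(100, 120, 110)
)
conc <- data.frame(
  date = as.Date(c("2024-01-09", "2024-01-16", "2024-01-23")),
  largest_4_longs_net = c(30, 32, 31)
)

inner <- merge(pos, conc, by = "date")                 # matched dates only
left  <- merge(pos, conc, by = "date", all.x = TRUE)   # every row of pos
outer <- merge(pos, conc, by = "date", all = TRUE)     # every date from either table

# Tag each frame, then stack: the key-tag-then-stack pattern
stacked <- bind_rows(
  pos %>% mutate(Asset = "Gold"),
  pos %>% mutate(Asset = "Silver")
)

c(inner = nrow(inner), left = nrow(left), outer = nrow(outer), stacked = nrow(stacked))
```

Comparing nrow() across the variants is a cheap check that a join kept the rows you expected.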

The classic workflow: per-source read_csv() → per-asset merge() of related tables → bind_rows() of all assets into one long panel → arrange() for chronological order. Once the long panel is in hand, dplyr verbs become the analysis language and the per-asset frames can be discarded.

The Asset column added by mutate(Asset = "Gold") etc. is what makes the stack distinguishable downstream — without it you’d lose the per-asset identity in the bind. This pattern (key-tag-then-stack) is ubiquitous in panel construction.

Filter and mutate

combined_clean <- combined %>%
  filter(date > as.Date("2021-01-01")) %>%      # common sample window for all assets
  mutate(
    Year          = lubridate::year(date),
    Asset_Factor  = factor(Asset),
    Net_Longs     = non_commercial_longs - non_commercial_shorts,
    Log_Net_Longs = log(abs(Net_Longs) + 1)     # +1 avoids log(0)
  ) %>%
  na.omit()

filter() subsets rows by a condition; mutate() adds or updates columns.
lubridate::year(date) extracts the year as an integer — useful for grouping or faceting.
Derive new variables (here: net longs, log-transformed) inside mutate() rather than as separate steps.
na.omit() drops rows with any NA — be aware this can silently drop a lot of data; check nrow() before/after.

filter() and mutate() are the two dplyr verbs you’ll use most. Their distinction is sometimes confusing because both take a frame and return a frame, but their effects are different:

filter(condition) keeps rows where condition is TRUE. Drops rows. Number of columns unchanged.
mutate(new_col = expr) adds (or overwrites) a column computed from existing columns. Number of rows unchanged.

The example does three useful things at once:

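Filter to a common window: filter(date > as.Date("2021-01-01")) restricts every asset to the same start date (note that > excludes 1 January itself). This only approximates a balanced panel: an asset whose history starts later still has fewer rows, so confirm with count(Asset) before comparing descriptive statistics across assets. Without a common window, one asset’s mean would be computed over a longer history than another’s.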
Derive analytical variables in one place — Net_Longs, Year, the log-transform — all near the top of the script. Putting derivations in a single mutate() chain (rather than scattered through the file) makes the script easier to audit.
Drop NAs explicitly — na.omit() is a sledgehammer; it drops a row if any column is NA. Always print nrow() before and after to know how much data you’re throwing away. For more surgical missing-data handling use filter(!is.na(specific_col)) or tidyr::drop_na(col1, col2).
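The difference between the sledgehammer and the targeted approach shows up in the row counts. A minimal sketch on invented data, assuming dplyr and tidyr:

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  Net_Longs           = c(10, NA, -5, 7),
  largest_4_longs_net = c(NA, 30, 31, 32)
)

n_before   <- nrow(toy)
n_any_na   <- nrow(na.omit(toy))              # drops a row if ANY column is NA
n_targeted <- nrow(drop_na(toy, Net_Longs))   # drops only rows missing Net_Longs

c(before = n_before, na_omit = n_any_na, drop_na = n_targeted)
```

Here na.omit() discards half the data, while the targeted version keeps every row that is usable for an analysis of Net_Longs alone.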

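The log(abs(Net_Longs) + 1) transform handles two issues at once: abs() because net longs can be negative, and + 1 so that log(0) does not return -Inf. It is equivalent to log1p(abs(Net_Longs)). Note that abs() discards the sign, so a large net short and a large net long map to the same value; if direction matters, use a signed variant such as sign(x) * log1p(abs(x)) or asinh(x), both common stabilising transforms in empirical finance.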
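To see what the transform does to the sign of a position, compare it with two sign-preserving alternatives on a toy vector. This is a base R sketch with invented values; the signed variants are offered as options, not as the lecture's method.

```r
net <- c(-1000, -10, 0, 10, 1000)   # made-up net positions

unsigned <- log(abs(net) + 1)            # transform used in the slide
signed   <- sign(net) * log1p(abs(net))  # same magnitude, keeps direction
ihs      <- asinh(net)                   # inverse hyperbolic sine, also signed

data.frame(net, unsigned, signed, ihs)
```

In the unsigned column the entries for -1000 and 1000 coincide, while the two signed columns are mirror images around zero.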

Group by and summarise

# Per-asset, per-year summary
summary_assets <- combined_clean %>%
  group_by(Asset, Year) %>%
  summarise(
    Mean_Net_Longs = mean(Net_Longs, na.rm = TRUE),
    SD_Net_Longs   = sd(Net_Longs,   na.rm = TRUE),
    .groups = "drop"
  )

# Z-scores within each Asset
combined_z <- combined_clean %>%
  group_by(Asset) %>%
  mutate(
    Z_Score = (Net_Longs - mean(Net_Longs, na.rm = TRUE)) /
              sd(Net_Longs, na.rm = TRUE)
  ) %>%
  ungroup()

group_by() partitions a frame into groups; summarise() collapses each group to one row.
Combine with mutate() (instead of summarise()) to add a column without collapsing — useful for z-scoring within groups.
Always ungroup() after group-wise mutations; lingering groups confuse downstream operations.
.groups = "drop" silences the grouping message and produces an ungrouped result.

group_by + summarise is the equivalent of SQL’s GROUP BY + aggregation, and once the pattern is familiar it replaces most of the for-loops you might otherwise write.

The two patterns shown here:

Collapsing aggregation — group_by(Asset, Year) %>% summarise(...): produces one row per (Asset, Year) combination, with summary statistics as columns. Use when you want a smaller frame describing each group.
Window operation (group_by + mutate) — group_by(Asset) %>% mutate(Z_Score = ...): leaves the row count unchanged but the calculation inside mutate() happens within each group. The Z-score example computes deviation-from-mean within each asset, so a value of +2 means “2 SDs above this asset’s own mean”, not “2 SDs above the pooled mean”.

The within-group window pattern is essential in panel finance: lagged returns within a stock, cumulative returns within a portfolio and ranking within a date all rely on group_by() + mutate(), because each keeps one output row per input row.
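Two of these window operations, a lag within each asset and a rank within each date, can be sketched on a toy panel. The data are invented and the column names Lag_Net_Longs, Change and Rank are illustrative.

```r
library(dplyr)

panel <- tibble(
  Asset     = rep(c("Gold", "Silver"), each = 3),
  date      = rep(as.Date("2024-01-02") + c(0, 7, 14), times = 2),
  Net_Longs = c(100, 120, 90, 40, 35, 50)
)

panel_w <- panel %>%
  group_by(Asset) %>%
  arrange(date, .by_group = TRUE) %>%        # lag() depends on row order
  mutate(
    Lag_Net_Longs = lag(Net_Longs),          # previous week, same asset
    Change        = Net_Longs - Lag_Net_Longs
  ) %>%
  ungroup() %>%
  group_by(date) %>%
  mutate(Rank = min_rank(desc(Net_Longs))) %>%   # 1 = largest net long that week
  ungroup()

print(panel_w)
```

The first row of each asset has an NA lag, which confirms the lag never reaches across assets.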

ungroup() is non-negotiable after group-wise operations. A frame that quietly retains its grouping will silently group future operations too — summarise() will produce per-group output instead of a global summary, leading to head-scratching debugging. The .groups = "drop" argument inside summarise() does the ungroup automatically.

Pivot data

# Wide to long — summary stats for ggplot2
long_summary <- summary_assets %>%
  pivot_longer(cols = c(Mean_Net_Longs, SD_Net_Longs),
               names_to = "Stat", values_to = "Value")

# Long to wide — scatterplot data
scatterplot_data <- combined_clean %>%
  select(date, Asset, Net_Longs) %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs) %>%
  na.omit()

# Pivot + correlate — full correlation matrix
cor_matrix <- combined_clean %>%
  select(date, Asset, Net_Longs) %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
  select(-date) %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column(var = "Asset1") %>%
  pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
  mutate(Correlation = round(Correlation, 2)) %>%
  filter(!is.na(Correlation))

Pivoting reshapes data between wide format (one row per unit, one column per variable or category) and long format (one row per unit-variable pair, with a name column and a value column).
pivot_longer() collapses many columns into key/value pairs; pivot_wider() does the inverse.
Long format plays best with ggplot2’s grammar; wide format is convenient for correlation matrices and pair-wise plots.
Chains often switch format more than once (long → wide → long), depending on what the next step expects.

The wide vs. long distinction is one of the foundational ideas in tidy data and underpins almost every data manipulation in modern R. Hadley Wickham’s “Tidy Data” paper formalised three rules: each variable has its own column, each observation has its own row, each type of observational unit is a table — see (Wickham and Çetinkaya-Rundel 2023) for the canonical exposition.

In a long frame, each row is one (Asset, Date) observation with a single value column. In a wide frame, each row is one Date with one value column per Asset. The same information, two layouts.
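The two layouts and the round trip between them, sketched on a toy panel with invented values (assuming dplyr and tidyr):

```r
library(dplyr)
library(tidyr)

long <- tibble(
  date      = rep(as.Date(c("2024-01-02", "2024-01-09")), times = 2),
  Asset     = rep(c("Gold", "Silver"), each = 2),
  Net_Longs = c(100, 120, 40, 35)
)

# Long -> wide: one row per date, one column per asset
wide <- long %>%
  pivot_wider(names_from = Asset, values_from = Net_Longs)

# Wide -> long: back to one row per (date, Asset)
back <- wide %>%
  pivot_longer(-date, names_to = "Asset", values_to = "Net_Longs")

dim(long); dim(wide); dim(back)
```

The long frame has 4 rows and the wide frame 2; no values are lost in either direction, only the row order may differ.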

When to use each:

Long for dplyr group-wise operations, ggplot2 plotting (mapping a column like Asset to colour requires it to live in a single column), and panel regressions (fixest::feols, plm::plm expect long format).
Wide for correlation matrices (cor() operates on a numeric matrix), pair-wise scatterplots, and presentation tables for humans.

The third example (the correlation pipeline) is the classic round-trip: long → wide for the cor() call → back to long for ggplot2. It looks complicated but each step is reversible. Once you internalise the pattern, building correlation heatmaps or pairwise comparisons becomes mechanical.

Data visualization in R — package landscape

Package	Description	Use it for	Weaknesses
ggplot2	Layered grammar of graphics	Publication-ready plots	Steeper learning curve; verbose for simple plots
Base R graphics	`plot()`, `hist()`, `barplot()`, …	Rapid exploratory checks	Limited customization, dated look
lattice	Trellis-style multi-panel plots	Conditioned/multi-panel views	Outdated formula syntax
plotly	Interactive `ggplot2`/base extension	Web dashboards, hover/zoom	Overhead for static use
shiny	Reactive web apps	Custom dashboards	Steep learning, deployment needed

For empirical-finance work — papers, theses, term-paper figures — ggplot2 is the default choice and is what the rest of this lecture uses. Its core advantages: (a) the same grammar handles everything from a histogram to a faceted multi-panel chart with conditional aesthetics; (b) the output is publication-quality with little tweaking; (c) the ecosystem (gganimate, ggtext, patchwork, ggrepel) extends it for almost any need.

When to reach for one of the others:

Base R plot() / hist() — for very quick exploratory looks. hist(returns) is shorter to type than the ggplot equivalent. Once you intend to keep the chart, switch to ggplot2.
plotly — when interactivity matters: hover tooltips, zoom, pan. Easy to add via plotly::ggplotly(p) over an existing ggplot.
shiny — for fully interactive dashboards (parameter sliders, reactive plots). Steep learning curve; usually overkill for an academic paper but excellent for prototypes you want stakeholders to play with.

lattice predates ggplot2 by years and you’ll occasionally see it in older R code; for new work, ggplot2 has decisively won the academic mindshare.

ggplot2 — the six steps

Select data — ggplot(data).
Aesthetic mapping — aes(x, y, colour, size, …).
Geometric objects — geom_*() (point, line, bar, …).
Labels — labs().
Facets — facet_wrap(), facet_grid().
Theme — theme_minimal(), theme_classic(), etc.

Steps 1–3 are mandatory; 4–6 polish the output. Reference the ggplot2 cheat sheet.

The “grammar of graphics” idea is that any chart can be described as a small number of orthogonal choices: what’s the data, how are columns mapped to visual properties (aesthetics), and what shape draws the data (geom). Once that mental model clicks, ggplot2 stops feeling verbose — you’re literally specifying each independent choice.

The six-step list maps onto a typical authoring sequence:

Data — ggplot(df) opens an empty canvas tied to a frame. The frame should usually be in long format (one row per observation).
Aesthetic mapping — aes(x = …, y = …, colour = …) says “use this column for the x-axis, that one for y, that one for colour”. Aesthetics are mappings, not constant settings — aes(colour = "red") makes “red” a category, not the colour red. To set a constant, put it outside aes(): geom_point(colour = "red").
Geom — geom_point(), geom_line(), geom_bar(), geom_smooth(), etc. The geom decides the visual encoding. Multiple geoms in one chart layer naturally (geom_point() + geom_smooth()).
Labels — labs(title, subtitle, x, y, caption). A subtitle that encodes the takeaway (not just a longer title) earns its place; readers should grasp the chart’s message in 5 seconds.
Facets — facet_wrap(~ var) or facet_grid(rowvar ~ colvar) for small-multiple comparisons. Often more legible than colour-coding many series on one panel.
Theme — theme_minimal(), theme_classic(), tidyquant::theme_tq() for a finance look. Customise once, reuse across the project for visual consistency.
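The mapping-versus-constant distinction from step 2 can be checked by inspecting what ggplot2 computes for a layer. A minimal sketch on invented data; layer_data() returns the data a layer will actually draw.

```r
library(ggplot2)

df <- data.frame(x = 1:3, y = c(2, 4, 3))

# Inside aes(): "red" is treated as a category and gets a colour from the scale
p_mapped <- ggplot(df, aes(x, y)) + geom_point(aes(colour = "red"))

# Outside aes(): the points are literally drawn in red
p_const  <- ggplot(df, aes(x, y)) + geom_point(colour = "red")

mapped_colours <- layer_data(p_mapped)$colour
const_colours  <- layer_data(p_const)$colour

unique(mapped_colours)
unique(const_colours)
```

The mapped version also produces a legend with a single entry labelled "red", which is usually the first visible hint that a constant ended up inside aes().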

Reference the Posit ggplot2 cheat sheet (Posit 2022) when reaching for an unfamiliar geom; the canonical book is ggplot2: Elegant Graphics for Data Analysis (Wickham, Navarro, and Pedersen 2024), free online.

ggplot2 — building up step by step

library(ggplot2)

# Step 1: data
ggplot(combined_clean)

# Step 2: aesthetic mapping
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset))

# Step 3: geometric object
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line()

# Step 4: labels (the y variable is the raw net position, so the title says so)
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line() +
  labs(title = "Net Long Positions by Asset",
       x = "Date", y = "Net Long Positions")

# Step 5: facets
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line() +
  labs(title = "Net Long Positions by Asset",
       x = "Date", y = "Net Long Positions") +
  facet_wrap(~ Asset)

# Step 6: theme (or tidyquant's theme_tq)
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
  geom_line() +
  labs(title = "Net Long Positions by Asset",
       x = "Date", y = "Net Long Positions") +
  facet_wrap(~ Asset) +
  theme_tq()

ggplot2 is declarative — you describe what to plot, not how.
Layers compose with +; each layer adds a geom, scale, or theme.
facet_wrap() splits one chart into one panel per category — great for cross-asset comparisons.
Pick a theme_*() once and reuse — keeps the deck visually consistent.

The progressive build is the most useful debugging mode for a ggplot: when a chart doesn’t look right, remove layers one at a time, starting with the last one added, until the problem disappears. Common surprises:

Step 1 alone shows an empty panel with no axes; that is correct. ggplot2 needs at least an aesthetic mapping (step 2) and a geom (step 3) before it draws any data.
Step 2 alone shows axes and gridlines but no data; also correct. You’ve declared the mapping but haven’t told ggplot what to draw at each (x, y) position.
Adding geom_line() in step 3 is the moment the data first appear. From there, every subsequent layer composes with +.

facet_wrap(~ Asset) produces one panel per asset using the same y-scale — useful when you want to compare magnitudes. Use scales = "free_y" (as the boxplot example does later) when assets are on very different scales and you care about shape more than level.
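The effect of the scales argument is clearest when two assets differ by orders of magnitude. A minimal sketch with invented numbers:

```r
library(ggplot2)

toy <- data.frame(
  date      = rep(as.Date("2024-01-02") + 7 * (0:3), times = 2),
  Asset     = rep(c("Gold", "Bitcoin"), each = 4),
  Net_Longs = c(200000, 210000, 190000, 205000,   # large market
                1500, -500, 800, 1200)            # small market
)

# Shared y-axis: levels are comparable, but the small market looks flat
p_fixed <- ggplot(toy, aes(x = date, y = Net_Longs)) +
  geom_line() +
  facet_wrap(~ Asset)

# Free y-axis: each panel uses its own range, so the shape is visible
p_free <- ggplot(toy, aes(x = date, y = Net_Longs)) +
  geom_line() +
  facet_wrap(~ Asset, scales = "free_y")

print(p_fixed)
print(p_free)
```

With free scales, say so in the caption or subtitle, because readers tend to assume panels share an axis.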

theme_tq() from tidyquant gives a clean finance-paper look with a slightly tinted background; it’s a stylistic choice, not a substantive one.

Example 1 — line chart with smoother

library(ggplot2)

ggplot(combined_z, aes(x = date, y = Z_Score, color = Asset)) +
  geom_smooth(method = "loess", se = FALSE) +                 # local-polynomial trend
  geom_hline(yintercept = 0, lty = 2) +                       # mean reference
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +   # 2022 break
  theme_minimal() +
  labs(title = "Z-Score Normalized Trends: Deviations from Mean Sentiment",
       subtitle = "Positive = above-average bullishness; e.g., crypto spikes in 2023 recovery")

geom_smooth(method = "loess") adds a non-parametric trend — useful when a straight line under-fits.
geom_hline() / geom_vline() mark reference levels (mean = 0, structural-break date).
Use a subtitle to encode interpretation directly in the chart — saves explaining in the caption.

The smoother (geom_smooth) is the headline addition here. Two flavours matter:

method = "loess" (locally-estimated scatterplot smoothing): a non-parametric trend that adapts to the local data without imposing a functional form. Good first look at “is there a trend, and if so what shape?” Avoid in production figures unless you can defend the bandwidth choice (the span argument).
method = "lm" — linear regression line. Use when you want a single straight-line summary, e.g. on a scatter of two series.

se = FALSE suppresses the confidence ribbon. The default se = TRUE is informative for academic figures (it shows uncertainty around the trend) but adds visual noise on dense plots — judge per chart.

The reference lines (geom_hline, geom_vline) are a small but important communication device: they orient the reader to the right comparison. Here, because the series are z-scores, the horizontal y = 0 line marks each asset's own mean: above it, sentiment is more bullish than average; below it, less so. The vertical 2022-01-01 line marks a period of interest. Always pair reference lines with axis labels or a subtitle that explains why they’re there; readers should never need to guess.
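To see how much the smoother's settings matter, the sketch below overlays two loess fits with different span values and one linear fit on the same toy series. The data (toy) and the chosen span values are illustrative assumptions, not taken from the lecture data.

```r
library(ggplot2)

# Toy data (hypothetical): a noisy cyclical z-score series, 104 weeks
set.seed(42)
toy <- data.frame(
  date = seq(as.Date("2021-01-05"), by = "week", length.out = 104)
)
toy$Z_Score <- sin(seq(0, 4 * pi, length.out = 104)) + rnorm(104, sd = 0.4)

p_smooth <- ggplot(toy, aes(x = date, y = Z_Score)) +
  geom_point(alpha = 0.3) +
  # Small span: follows local wiggles; ribbon shown (se = TRUE)
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3,
              se = TRUE, colour = "steelblue") +
  # Large span: much stiffer trend; ribbon suppressed
  geom_smooth(method = "loess", formula = y ~ x, span = 0.9,
              se = FALSE, colour = "firebrick") +
  # Straight-line summary for comparison
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE, colour = "grey30", linetype = "dashed") +
  geom_hline(yintercept = 0, linetype = "dashed") +   # mean reference
  theme_minimal() +
  labs(x = NULL, y = "Z-score")

print(p_smooth)
```

The three lines tell visibly different stories about the same points, which is why the smoothing choice belongs in the caption or the methods text.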

Example 2 — bar chart with error bars

ggplot(summary_assets, aes(x = factor(Year), y = Mean_Net_Longs, fill = Asset)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = Mean_Net_Longs - SD_Net_Longs,
                    ymax = Mean_Net_Longs + SD_Net_Longs),
                position = position_dodge(width = 0.9), width = 0.2) +
  theme_minimal() +
  labs(title = "Mean Net Longs by Asset & Year", x = "Year", y = "Mean Net Longs")

geom_bar(stat = "identity") plots the column value as the bar height (the default, stat = "count", would tally rows instead); geom_col() is the shorthand for the same thing.
position = "dodge" puts groups side-by-side; position = "stack" would stack them.
Error bars (here ±1 SD) communicate dispersion alongside the mean.

Bar charts of means are everywhere in finance papers but they are also routinely abused. Two pitfalls to avoid:

Always show dispersion, not just the mean. The error bars in the example show ±1 SD, which describes the spread of the data; 95 % confidence intervals, which describe the uncertainty of the mean itself, are often the better choice (mean_se(x, mult = 1.96), or mean_cl_normal(), ggplot2's wrapper around Hmisc::smean.cl.normal()). State in the caption which of the two you plot. A bar of 10 ± 9 tells a very different story from 10 ± 1; without the error bar both look identical.
Bar charts hide distributional information. The boxplot in Example 3 is usually more informative for continuous data because it shows median, IQR, and outliers. Reserve bar-of-mean charts for discrete categorical comparisons where you genuinely want to encode the value as bar height (counts, sums, percentages of a total).

position = "dodge" places groups side-by-side; position = "stack" stacks them within each x. position_dodge(width = 0.9) on the error bars matches the dodge width of the bars (0.9 is the default bar width), so each error bar sits centred over its bar. It is a small detail that’s easy to forget.
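A minimal sketch of the confidence-interval variant, assuming dplyr and ggplot2 are installed. The data frame toy_positions and all column names other than Year and Asset are invented for illustration; the interval uses the normal approximation (mean ± 1.96 × standard error).

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical): 50 weekly observations per asset and year
set.seed(7)
toy_positions <- data.frame(
  Year      = rep(c(2022, 2023), each = 100),
  Asset     = rep(rep(c("Gold", "Bitcoin"), each = 50), times = 2),
  Net_Longs = rnorm(200, mean = 10, sd = 4)
)

# Mean, SD, standard error and a normal-approximation 95 % interval per group
toy_summary <- toy_positions %>%
  group_by(Year, Asset) %>%
  summarise(
    n    = n(),
    Mean = mean(Net_Longs),
    SD   = sd(Net_Longs),
    SE   = SD / sqrt(n),
    .groups = "drop"
  ) %>%
  mutate(CI_low = Mean - 1.96 * SE, CI_high = Mean + 1.96 * SE)

p_ci <- ggplot(toy_summary, aes(x = factor(Year), y = Mean, fill = Asset)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = CI_low, ymax = CI_high),
                position = position_dodge(width = 0.9), width = 0.2) +
  theme_minimal() +
  labs(x = "Year", y = "Mean net longs (95 % CI)")

print(p_ci)
```

With 50 observations per group the interval is much narrower than ±1 SD, which is exactly why the caption has to say which one the bars show.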

Example 3 — faceted boxplots

ggplot(combined_clean, aes(x = factor(Year), y = Net_Longs, fill = Asset)) +
  geom_boxplot() +
  facet_wrap(~ Asset, scales = "free_y") +
  theme_minimal() +
  labs(title = "Annual Distributions: Volatility Comparison",
       x = "Year", y = "Net Longs")

Boxplots show median, IQR, and outliers — much more informative than means alone.
facet_wrap(~ Asset, scales = "free_y") lets each panel use its own y-range — better when assets are on different scales.
Useful for spotting outliers / distribution shifts year over year.

A boxplot displays the five-number summary (min ignoring outliers, Q1, median, Q3, max ignoring outliers) plus dots for outliers (default: points beyond Q1−1.5×IQR or Q3+1.5×IQR). It tells you, at a glance: median, spread, skewness (if the median sits off-centre in the box), and outlier presence. For comparing distributions across several groups (here, several years), few charts pack as much into so little space.

facet_wrap(~ Asset, scales = "free_y") is the right choice when assets are on different absolute scales: one asset's net longs may dwarf another's in raw contract numbers, and a fixed common y-axis would compress the smaller series into a sliver. Free scales let each panel use its own axis range; the trade-off is that magnitudes are no longer directly comparable across panels (so always label the scale clearly).

A useful enhancement when doing exploratory work: overlay raw points (geom_jitter(width = 0.2, alpha = 0.3)) on the boxplot. Box + points together show both the summary and the underlying distribution, which is especially useful when group sizes are small enough that the box’s IQR is a rough approximation.
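The overlay could look like the sketch below. The data frame toy_box is invented, and hiding the boxplot's own outlier dots with outlier.shape = NA is a choice made here (so that no observation is drawn twice), not something the lecture code does.

```r
library(ggplot2)

# Toy data (hypothetical): 15 observations per year, deliberately small groups
set.seed(3)
toy_box <- data.frame(
  Year      = factor(rep(2021:2023, each = 15)),
  Net_Longs = c(rnorm(15, 100, 10), rnorm(15, 120, 25), rnorm(15, 90, 15))
)

p_box <- ggplot(toy_box, aes(x = Year, y = Net_Longs)) +
  # Hide the boxplot's outlier dots: the jitter layer already draws every point
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only (height = 0), so y-values stay exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  theme_minimal() +
  labs(x = "Year", y = "Net Longs")

print(p_box)
```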

Example 4 — scatter + regression line

ggplot(scatterplot_data, aes(x = Gold, y = Bitcoin)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", colour = "red", se = FALSE) +
  theme_minimal() +
  labs(title = "Gold vs Bitcoin Net Longs (paired dates only)",
       x = "Gold Net Longs", y = "Bitcoin Net Longs")

alpha = 0.5 makes overlapping points readable in dense scatterplots.
geom_smooth(method = "lm") overlays an OLS regression line — quickly shows whether two series co-move.
Pair this with cor() to put a number on the visual relationship.

The scatterplot + regression-line is the workhorse plot for “do these two things move together?” A few production-quality refinements worth knowing:

alpha = 0.5 (or lower) is essential whenever you have many points overlapping. Without it, the densest regions look identical to the sparsest, hiding the structure.
geom_smooth(method = "lm") draws a confidence ribbon around the line by default (se = TRUE), which is informative, especially in small samples where the regression slope is uncertain. Suppress it with se = FALSE only when the ribbon clutters a complex chart.
Add the regression equation on the plot for academic figures: ggpmisc::stat_poly_eq() overlays the fitted equation and R² in a single line; useful when the figure has to stand alone in a paper.

For a more rigorous statistical answer you’d run lm(Bitcoin ~ Gold, data = scatterplot_data) and report the slope, standard error, and R² — Lecture 3 covers regression in depth. The visual scatter is meant as the first look; the regression makes it precise.
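Putting a number on the visual relationship takes only base R. The sketch below uses an invented data frame toy_pairs in place of scatterplot_data; the coefficients used to simulate it are arbitrary.

```r
# Toy data (hypothetical): paired net-long observations for two assets
set.seed(11)
toy_pairs <- data.frame(Gold = rnorm(120, mean = 200, sd = 30))
toy_pairs$Bitcoin <- 5 + 0.02 * toy_pairs$Gold + rnorm(120, sd = 1)

# Pearson correlation: the one-number summary of the scatter
r <- cor(toy_pairs$Gold, toy_pairs$Bitcoin, use = "complete.obs")

# OLS regression: the same line geom_smooth(method = "lm") draws
fit   <- lm(Bitcoin ~ Gold, data = toy_pairs)
coefs <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)

slope    <- coefs["Gold", "Estimate"]
slope_se <- coefs["Gold", "Std. Error"]
r2       <- summary(fit)$r.squared

cat(sprintf("r = %.3f, slope = %.4f (SE %.4f), R^2 = %.3f\n",
            r, slope, slope_se, r2))
```

In a regression with a single regressor, R² equals the squared Pearson correlation, so cor() and lm() are two views of the same relationship.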

Example 5 — correlation heatmap

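# cor_matrix: long-format data frame, one row per (Asset1, Asset2) pair, Correlation already rounded
ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +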
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Correlation), color = "black", size = 3, fontface = "bold") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(size = 10)) +
  labs(title = "Correlation Matrix of Net Longs (Assets)",
       subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
       x = "Asset", y = "Asset")

Heatmaps make pairwise correlations easy to scan — colour shows sign and magnitude at a glance.
scale_fill_gradient2() with midpoint = 0 is the right default for correlation data (red/blue diverging palette).
Annotate cells with the value (geom_text()) so the reader doesn’t have to map colour to number.

Correlation heatmaps are an extraordinarily efficient way to surface dependence structure in a panel of assets. Three design decisions worth being deliberate about:

Diverging palette centred at zero: scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0). Sequential palettes (e.g. viridis) have no neutral midpoint, so the reader cannot tell at a glance where positive turns into negative; diverging palettes encode sign in hue and magnitude in saturation.
Symmetric limits: midpoint = 0 pins white to “exactly zero”; limits = c(-1, 1) additionally fixes the ends of the scale, so full saturation always means ±1 and a given shade means the same correlation in every heatmap you draw. If you skip it, ggplot scales the colours to the observed range, which exaggerates weak correlations and makes period-by-period heatmaps incomparable.
Annotate the cells — geom_text(aes(label = Correlation)). Visual encoding alone is too imprecise for many readers to extract numeric values from; printing the rounded correlation in each cell removes the ambiguity. Use bold + a contrasting text colour so values stay readable across the whole gradient.

For finance work specifically, correlations near 1 in a heatmap of returns warn of redundant series (e.g. two share classes of the same stock); near 0 (or negative) is the cluster you’d want for diversification. During crises, correlations across asset classes tend to rise toward 1; a heatmap drawn period by period makes this shift visible immediately.
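The heatmap code expects the correlations in long format (one row per pair of assets). One way to get there from a wide data frame is sketched below with base R; toy_wide and its columns are invented, and as.table() is just one of several possible reshaping routes.

```r
library(ggplot2)

# Toy data (hypothetical): one column of net longs per asset
set.seed(5)
toy_wide <- data.frame(Gold = rnorm(60), Silver = rnorm(60), Bitcoin = rnorm(60))
toy_wide$Silver <- 0.8 * toy_wide$Gold + 0.6 * toy_wide$Silver  # make two series co-move

# Wide data -> 3 x 3 correlation matrix -> long data frame (Var1, Var2, Freq)
cor_wide <- cor(toy_wide, use = "pairwise.complete.obs")
cor_long <- as.data.frame(as.table(cor_wide))
names(cor_long) <- c("Asset1", "Asset2", "Correlation")
cor_long$Correlation <- round(cor_long$Correlation, 2)

p_heat <- ggplot(cor_long, aes(x = Asset2, y = Asset1, fill = Correlation)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = sprintf("%.2f", Correlation)), size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  labs(x = "Asset", y = "Asset")

print(p_heat)
```

Running the same steps on subsets of the rows (for example, one per year) gives the period-by-period comparison, and the fixed limits keep the colours comparable.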

Exporting plots

ggsave("myplot.png")
ggsave("myplot.pdf")
ggsave("myplot.svg")   # SVG output requires the svglite package
ggsave("C:/Users/R Intro/myplot.tiff")

# Custom size (default unit: inches)
ggsave("plot.pdf", width = 6, height = 4)
ggsave("plot.pdf", width = 15, height = 10, units = "cm")

ggsave() saves the last plot displayed by default, or whichever ggplot object you pass via the plot argument: ggsave("plot.pdf", plot = my_plot). (The first argument is always the file name.) The format is inferred from the file extension: .png, .pdf, .svg, .tiff, .jpg. For academic papers, PDF is the right default (vector graphics scale without loss when the editor resizes the figure); PNG is appropriate for slides and the web.

Always specify width and height explicitly. By default ggsave uses the size of the current plotting device, which depends on your RStudio pane geometry, so saving without dimensions produces inconsistent output between your machine and a co-author’s. Typical academic-paper figure sizes: full-width 6×4 inches, half-width 3.5×2.5 inches. For raster formats destined for print, use at least dpi = 300 (which is ggsave's default).

For a multi-figure paper, define a single set of size constants at the top of your script and reuse them: FIG_W <- 6; FIG_H <- 4 and then every ggsave() call uses them. Keeps figures visually consistent without per-call typing.
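A minimal sketch of that pattern. The plot uses the built-in mtcars data as a placeholder, the file names are invented, and everything is written to tempdir() so the example does not touch your project folder.

```r
library(ggplot2)

# Size constants defined once and reused by every ggsave() call
FIG_W <- 6   # inches, full-width figure
FIG_H <- 4

# Placeholder plot on built-in data
my_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

out_dir  <- tempdir()   # swap for your figures folder in a real project
pdf_path <- file.path(out_dir, "fig_scatter.pdf")
png_path <- file.path(out_dir, "fig_scatter.png")

# Pass the plot explicitly instead of relying on "last plot displayed"
ggsave(pdf_path, plot = my_plot, width = FIG_W, height = FIG_H)             # vector, for the paper
ggsave(png_path, plot = my_plot, width = FIG_W, height = FIG_H, dpi = 300)  # raster, for slides
```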

Troubleshooting R

RStudio Help tab (lower-right pane) — home icon for resources/manuals; search bar for functions and packages.
In the Script Editor:
- help.start() opens the main Help page.
- help(FunctionName) or ?FunctionName shows function help.
Online: RStudio cheatsheets, Posit Community, Stack Overflow, GitHub, LLMs.

A productive triage when an R command misbehaves:

?function_name — the local help page is usually the fastest answer. Pay attention to the Examples section at the bottom: a working example from the docs that you can paste into the console and modify is often more illuminating than reading the argument descriptions.
traceback() after an error — when a function fails with an opaque message, traceback() shows the full call stack so you can see where the failure happened, not just what failed.
Paste the error message verbatim into Google or your LLM of choice: modern R errors (especially from tidyverse packages) are usually distinctive enough to land on a relevant Stack Overflow answer or a GitHub issue.
Reach for an LLM for “how do I…?” questions (“how do I add a regression line to a ggplot grouped by a factor?”). Caveat: LLMs often hallucinate package names or argument names, particularly for fast-moving packages. Always run the suggested code and check it does what you expect — never paste blindly.
Post on Stack Overflow as a last resort for genuinely hard questions; the [r] and [ggplot2] tags are well-curated. Search before posting, because your question has almost certainly been asked.

Reading other people’s R code (CRAN package vignettes, GitHub repos, Tidy Finance code blocks) is the highest-leverage way to internalise idioms beyond what tutorials cover.

2.4 Introduction to Overleaf

LaTeX vs Overleaf

LaTeX: a markup language for professional documents with precise control over layout, equations, tables; widely used in academia.
Plain text → formatted PDF; ideal for papers/theses with complex math (regressions, matrices).
Basic structure uses commands in braces, e.g. \section{Title}, and environments for tables/figures.
Advantages: automatic numbering, bibliographies, consistency. Steeper learning curve for beginners.

Overleaf: a free online editor for LaTeX with real-time collaboration and instant PDF previews; no local install.
Create a project, paste code, hit Recompile. Perfect for group assignments / thesis drafts.
Templates, spell-check, Git integration; sign in with university email for full access.
Start with our Moodle template: copy-paste into a new project, edit sections, export PDF.

LaTeX is the standard typesetting system for academic finance papers: top journals accept LaTeX submissions, and many provide their templates as .tex files. The pay-off is precise control over equations, tables, references, and cross-referencing, exactly the things Word handles poorly. The cost is a learning curve: a .tex source file looks more like code than a document, and small syntax errors (a missing }, an unclosed environment) prevent compilation.

Overleaf sidesteps the install pain entirely: edit in the browser, compile in the cloud, get a PDF preview in seconds. The free plan is enough to get started. If the university holds an institutional licence, signing in with your university email also unlocks the premium features (such as unlimited collaborators and Git integration); check the IT page if Overleaf asks for payment.

Lamport (Lamport 1994) is the foundational reference (Leslie Lamport created LaTeX); Flynn’s introduction (Flynn 2019) is the more practical guide for learning the basics; Overleaf’s own 30-minute tutorial (Overleaf 2023) is the fastest “from zero to your first compiled paper” path. For the assignment we provide a Moodle template that already has the right structure (abstract, sections, bibliography); your job is to fill in the prose.

Getting started with Overleaf

Setup: create a free account at overleaf.com; New project → Blank Project or import template. Edit left pane (TeX), preview right pane.
Key features: syntax highlighting, error finder, Git integration; share links for collaboration.
Workflow: write in sections (\section{}), insert tables (\begin{table}…\end{table} with tabular), add R figures (\includegraphics{plot.png}), cite (\cite{key} with BibTeX).
Assignment template: download from Moodle. Copy-paste into a new Overleaf project; replace placeholders (e.g., \section{Results}) with your content.
Tips: \usepackage{graphicx} for images, \begin{figure}…\end{figure} for floats. Debug errors shown in red (e.g., missing $ for math). Practice with a 1-page test before the full paper.

A starter Overleaf workflow that gets you to a compiled draft fast:

Create a project from a template: New Project → Templates → Academic Journal is a sensible default. The template ships with a working \documentclass, the packages it needs already loaded, and a \begin{document} … \end{document} skeleton.
Compile early and often: after every section you add, hit Recompile (or press Ctrl + S, which triggers a recompile). LaTeX errors are easiest to fix when only the last few lines are new; if you write 10 pages without compiling and then hit a cascade of errors, untangling them is painful.
Wrap R figures in \begin{figure} … \end{figure} — the figure environment makes the image a float that LaTeX places near the reference, not necessarily where you put the code. Always include \caption{...} and \label{fig:something}, then reference with Figure~\ref{fig:something} in the prose.
For tables, use booktabs (\toprule, \midrule, \bottomrule) for professional-looking horizontal rules. R packages such as xtable and kableExtra (and stargazer, which Lecture 3 uses) can write regression results straight to LaTeX table code, which eliminates manual transcription and the typos it causes.
References go in a .bib file; cite with \cite{key} (or \citet/\citep if you load natbib). Your assignment template comes with a starter .bib; add new entries to it as you cite new papers. This is the same workflow as the references.bib we use on this teaching site.
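The R-to-LaTeX step for tables can be sketched as follows. As an assumption for illustration, this uses knitr::kable() (the function kableExtra builds on) rather than xtable or stargazer, a regression on the built-in mtcars data as a placeholder, and an invented file name.

```r
library(knitr)

# Placeholder regression on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table as a data frame: one row per regressor
coef_table <- as.data.frame(coef(summary(fit)))
names(coef_table) <- c("Estimate", "Std. Error", "t value", "p value")

# booktabs = TRUE makes kable emit \toprule / \midrule / \bottomrule
tex_code <- as.character(
  kable(coef_table, format = "latex", booktabs = TRUE, digits = 3)
)

# Write the tabular code to a .tex fragment (tempdir() keeps the example tidy)
tex_path <- file.path(tempdir(), "regression_table.tex")
writeLines(tex_code, tex_path)

cat(tex_code)
```

Upload the resulting .tex file to the Overleaf project, load booktabs in the preamble with \usepackage{booktabs}, and pull the fragment into a table environment with \input{regression_table}. Re-running the R script then updates the table without any retyping.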

Common LaTeX errors and their meaning:
- Missing $ inserted: math notation (e.g. x_t) outside a $ … $ environment. Wrap math in dollar signs.
- Undefined control sequence: typo in a command name (\beggin{equation}) or you forgot to \usepackage{...} the package that defines it.
- Underfull / overfull hbox: line-breaking warnings, usually safe to ignore in drafts but worth fixing for the final PDF.

2.5 Conclusion of Lecture 2

Course at a glance

Basics

Week 1

29.10.2025

Course objectives, schedule, assignments · Introduction to R · Live coding

Course objectives, schedule and assignments
Introduction to R and RStudio
Live coding: variables, vectors, matrices, data frames, lists, functions, loops
Data import and export

Data Handling & Visualization

Week 2

05.11.2025

API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf

API access (Nasdaq Data Link / Quandl, FRED, Yahoo, Coingecko, Polygon)
Import and cleanse: read_csv, mutate, types
Merge and append data (merge, bind_rows)
Filter and mutate (dplyr): subset rows, derive variables
Group by and summarise
Pivot wide / long
Data visualization with ggplot2 (six-step pipeline)
Introduction to LaTeX and Overleaf

Statistical Analysis

Week 3

12.11.2025

Descriptive · inferential · modelling — applied in R

Descriptive statistics in R
Correlation matrix and Pearson correlation test
t-Test and Wilcoxon test
Shapiro-Wilk and Kolmogorov-Smirnov tests
Linear regression with fixed effects
Clustered standard errors
Exporting regression tables with stargazer
Discussion of Assignment I (Problem Set)

Academic Publishing & Refereeing

Week 4

19.11.2025

What makes a great empirical paper · publication process · how to write a referee report

What makes a good empirical paper (contribution, identification, write-up)
The publication process step by step
Top finance and economics journals
Bad outcome vs revise & resubmit
Referee Reports — summary, major issues, minor issues
Referee checklist (question, identification, data, econometrics, results)
Discussion of Assignment II (Referee Report)

Brown Bag Seminar

Week 13

20.01.2026

Engage with doctoral research and prepare your referee report

Doctoral research presentations
Apply empirical / writing tips for the referee report
Group discussion and Q&A

Prepare before next lecture

Document today’s code in a clean way and save as .Rmd.
Try out a few visualizations based on the data we prepared today.

Two pieces of practice before Lecture 3:

Document the script as .Rmd — same logic as Lecture 1: take the live-coding script, intersperse prose explaining each block, render to HTML and check the narrative reads cleanly. Doing this is the fastest way to find bits you didn’t fully understand in class — if you can’t write a sentence describing what a chunk does, that’s the chunk to revisit.

Try a few additional visualisations on the CFTC data — pick one chart from today’s examples, swap the asset, change the time window, add a geom_* you haven’t used. The mechanical practice builds the muscle memory; once aes() + geom_x() + facet_y() + theme_z() is automatic, complex compositions stop feeling intimidating.

If you finish quickly, an excellent stretch is to compute a rolling 12-week correlation between two assets’ net longs (zoo::rollapplyr or slider::slide2_dbl, the two-input variant of slide_dbl) and plot it. That’s a good exam-level exercise blending today’s dplyr work with a time-series construct.
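One possible starting point for the stretch exercise, written with a plain base-R loop instead of zoo or slider so that it runs without extra packages. The data frame toy_roll is simulated; with the CFTC data you would substitute the two net-long columns.

```r
library(ggplot2)

# Toy data (hypothetical): 80 weeks of net longs for two assets
set.seed(9)
n_weeks  <- 80
toy_roll <- data.frame(
  date    = seq(as.Date("2023-01-03"), by = "week", length.out = n_weeks),
  Gold    = cumsum(rnorm(n_weeks)),
  Bitcoin = cumsum(rnorm(n_weeks))
)

# Right-aligned rolling window: the value at week i uses weeks (i - 11) .. i
window <- 12
toy_roll$Roll_Cor <- NA_real_
for (i in window:n_weeks) {
  idx <- (i - window + 1):i
  toy_roll$Roll_Cor[i] <- cor(toy_roll$Gold[idx], toy_roll$Bitcoin[idx])
}

# The first 11 weeks have no full window, so drop them before plotting
p_roll <- ggplot(subset(toy_roll, !is.na(Roll_Cor)), aes(x = date, y = Roll_Cor)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_cartesian(ylim = c(-1, 1)) +
  theme_minimal() +
  labs(x = NULL, y = "Rolling 12-week correlation")

print(p_roll)
```

Once the loop version is clear, replacing it with a single call to one of the rolling helpers is a good check that you understand what the window does.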

See you next time

Reminder

Register for “exam” 13337 in campusonline by 30 November 2025.
Lecture 3: Statistical analysis in R — correlations, t-tests, Wilcoxon, Shapiro-Wilk, KS, linear regression, fixed effects.

References

Flynn, Peter. 2019. “Formatting Information: Introduction to LaTeX.” TeX Users Group (TUG). https://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf.

Lamport, Leslie. 1994. LaTeX: A Document Preparation System. 2nd ed. Reading, MA: Addison-Wesley. https://www.latex-project.org/help/documentation/.

Overleaf. 2023. “Overleaf Tutorial: Basic Guide to LaTeX.” https://www.overleaf.com/learn/latex/Learn_LaTeX_in_30_minutes.

Posit. 2022. “ggplot2 Cheat Sheet.” https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf.

Wickham, Hadley, and Mine Çetinkaya-Rundel. 2023. R for Data Science. 2nd ed. O’Reilly. https://r4ds.hadley.nz/.

Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2024. ggplot2: Elegant Graphics for Data Analysis. 3rd ed. https://ggplot2-book.org/.
