API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
Scope
We will:
We will NOT:
Approach
Part I — Learn the Basics
Part II — Apply your learnings
Basics
Week 1
29.10.2025
Course objectives, schedule, assignments · Introduction to R · Live coding
Data Handling & Visualization
Week 2
05.11.2025
API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
Statistical Analysis
Week 3
12.11.2025
Descriptive · inferential · modelling — applied in R
Academic Publishing & Refereeing
Week 4
19.11.2025
What makes a great empirical paper · publication process · how to write a referee report
Brown Bag Seminar
Week 13
20.01.2026
Engage with doctoral research and prepare your referee report
Assignment I — Problem Set 50% of your grade
Documented .R script + PDF write-up (Overleaf)
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-1-problem-set_surname1_surname2_…
19 January 2026
Assignment II — Referee Report 50% of your grade
2.5–3 page referee report on a Brown-Bag presentation
Group of up to 5.
Submit by emailing oliver.padmaperuma@uni-ulm.de, CC andre.guettler@uni-ulm.de. Subject pattern: Research in Finance_assignment-2-referee-report_surname1_surname2_…
3 February 2026
| Topic | What is it? | What for? | Example syntax |
|---|---|---|---|
| Basic R | Free language for statistical computing | Calculations, exploring data | 2 + 3, log(10) |
| Variables & basic ops | Store data; perform math/logic | Assigning prices, ratios | var <- value, var > 0 |
| Vectors | Ordered collections of one type | Time series of returns | c(1, 2, 3), length(vec) |
| Matrices | 2D arrays | Covariance matrices | matrix(...), mat1 %*% mat2 |
| Data frames | Mixed-column tables | Datasets for analysis | data.frame(...), df$col |
| Lists | Flexible mixed-type containers | Storing model outputs | list(a = 1, b = c("x", "y")) |
| Functions | Reusable code blocks | Sharpe ratio calc | f <- function(x) x*2 |
| Loops | Iterate over sequences | Monte Carlo runs | for (i in 1:3) {…} |
| Import / export | Load and save | CSV, RDS workflow | read_csv("f.csv") |
httr/jsonlite for general calls, or wrappers (e.g., Quandl) for ease.| API | Description | Ease | Best for |
|---|---|---|---|
| Nasdaq Data Link (Quandl) | Comprehensive finance datasets (e.g., CFTC positions) as tables | Easy (Quandl package) |
Futures / sentiment analysis |
| FRED (Federal Reserve) | Economic indicators (CPI, unemployment) | Easy (fredr package) |
Macro trends in empirical work |
| Yahoo Finance | Historical prices/volumes via quantmod |
Easy (quantmod package) |
Quick stock data for portfolios |
| Coingecko | Crypto prices, historical charts | Medium (JSON parsing) | Crypto time series for volatility studies |
| Polygon.io | Real-time stocks/crypto (free tier) | Medium (API calls) | High-frequency data for advanced models |
library(Quandl)
library(tidyverse)
library(tidyquant)
# Specify API key (paste yours)
Quandl.api_key("57NfEEmRhMSGmivKw_kx")
# Gold futures: positions and concentration ratios
com_gold <- Quandl.datatable("QDL/LFON", contract_code = "088691", type = "FO_L_ALL")
conc_gold <- Quandl.datatable("QDL/FCR", contract_code = "088691", type = "FO_L_ALL_CR")
# Silver, Bitcoin, Ethereum — same pattern, different contract codes
com_silver <- Quandl.datatable("QDL/LFON", contract_code = "084691", type = "FO_L_ALL")
conc_silver <- Quandl.datatable("QDL/FCR", contract_code = "084691", type = "FO_L_ALL_CR")
com_btc <- Quandl.datatable("QDL/LFON", contract_code = "133741", type = "FO_L_ALL")
conc_btc <- Quandl.datatable("QDL/FCR", contract_code = "133741", type = "FO_L_ALL_CR")
com_eth <- Quandl.datatable("QDL/LFON", contract_code = "146021", type = "FO_L_ALL")
conc_eth <- Quandl.datatable("QDL/FCR", contract_code = "146021", type = "FO_L_ALL_CR")LFON) and concentration ratios (FCR).| Column | Description |
|---|---|
contract_code |
Financial contract code |
type |
F (futures) / FO (futures + options) |
date |
Date of record |
market_participation |
Number of traders |
non_commercial_longs |
Long positions, non-commercials |
non_commercial_shorts |
Short positions, non-commercials |
commercial_longs |
Long, commercial entities |
commercial_shorts |
Short, commercial entities |
total_reportable_longs |
Total reportable longs |
total_reportable_shorts |
Total reportable shorts |
| Column | Description |
|---|---|
largest_4_longs_gross |
Gross long, top 4 traders |
largest_4_shorts_gross |
Gross short, top 4 |
largest_8_longs_gross |
Gross long, top 8 |
largest_8_shorts_gross |
Gross short, top 8 |
largest_4_longs_net |
Net long, top 4 |
largest_4_shorts_net |
Net short, top 4 |
largest_8_longs_net |
Net long, top 8 |
largest_8_shorts_net |
Net short, top 8 |
library(readr)
# Gold
com_gold <- read_csv("com_gold.csv") %>%
mutate(date = as.Date(date), contract_code = as.character(contract_code))
conc_gold <- read_csv("conc_gold.csv") %>%
mutate(date = as.Date(date), contract_code = as.character(contract_code))
# Silver, Bitcoin, Ethereum follow the same pattern …read_csv() parses delimited text into a tibble (the tidyverse data frame).mutate() updates / creates columns row-wise — here we coerce types so later joins behave.as.Date) — leaving them as strings breaks merge() and time-series plots later.tidyverse workflow: read → clean → merge → analyse → visualise.# Merge positions with concentration ratios per asset
merged_gold <- merge(com_gold, conc_gold, by = "date", all = TRUE) %>% arrange(date)
merged_silver <- merge(com_silver, conc_silver, by = "date", all = TRUE) %>% arrange(date)
merged_btc <- merge(com_btc, conc_btc, by = "date", all = TRUE) %>% arrange(date)
merged_eth <- merge(com_eth, conc_eth, by = "date", all = TRUE) %>% arrange(date)
# Stack all four into one long data frame
combined <- bind_rows(
merged_gold %>% mutate(Asset = "Gold"),
merged_silver %>% mutate(Asset = "Silver"),
merged_btc %>% mutate(Asset = "Bitcoin"),
merged_eth %>% mutate(Asset = "Ethereum")
)merge() joins two data frames by a common key (date here); all = TRUE is an outer join (keeps NAs).bind_rows() appends data frames row-wise — useful for stacking per-asset frames into a long panel.arrange() sorts a frame by one or more columns.dplyr and ggplot2.filter() subsets rows by a condition; mutate() adds or updates columns.lubridate::year(date) extracts the year as an integer — useful for grouping or faceting.mutate() rather than as separate steps.na.omit() drops rows with any NA — be aware this can silently drop a lot of data; check nrow() before/after.# Per-asset, per-year summary
summary_assets <- combined_clean %>%
group_by(Asset, Year) %>%
summarise(
Mean_Net_Longs = mean(Net_Longs, na.rm = TRUE),
SD_Net_Longs = sd(Net_Longs, na.rm = TRUE),
.groups = "drop"
)
# Z-scores within each Asset
combined_z <- combined_clean %>%
group_by(Asset) %>%
mutate(
Z_Score = (Net_Longs - mean(Net_Longs, na.rm = TRUE)) /
sd(Net_Longs, na.rm = TRUE)
) %>%
ungroup()group_by() partitions a frame into groups; summarise() collapses each group to one row.mutate() (instead of summarise()) to add a column without collapsing — useful for z-scoring within groups.ungroup() after group-wise mutations; lingering groups confuse downstream operations..groups = "drop" quiets the grouping warning and produces an ungrouped result.# Wide to long — summary stats for ggplot2
long_summary <- summary_assets %>%
pivot_longer(cols = c(Mean_Net_Longs, SD_Net_Longs),
names_to = "Stat", values_to = "Value")
# Long to wide — scatterplot data
scatterplot_data <- combined_clean %>%
select(date, Asset, Net_Longs) %>%
pivot_wider(names_from = Asset, values_from = Net_Longs) %>%
na.omit()
# Pivot + correlate — full correlation matrix
cor_matrix <- combined_clean %>%
select(date, Asset, Net_Longs) %>%
pivot_wider(names_from = Asset, values_from = Net_Longs, values_fill = NA) %>%
select(-date) %>%
cor(use = "pairwise.complete.obs") %>%
as.data.frame() %>%
rownames_to_column(var = "Asset1") %>%
pivot_longer(-Asset1, names_to = "Asset2", values_to = "Correlation") %>%
mutate(Correlation = round(Correlation, 2)) %>%
filter(!is.na(Correlation))pivot_longer() collapses many columns into key/value pairs; pivot_wider() does the inverse.ggplot2’s grammar; wide format is convenient for correlation matrices and pair-wise plots.| Package | Description | Use it for | Weaknesses |
|---|---|---|---|
| ggplot2 | Layered grammar of graphics | Publication-ready plots | Steeper learning curve; verbose for simple plots |
| Base R graphics | plot(), hist(), barplot(), … |
Rapid exploratory checks | Limited customization, dated look |
| lattice | Trellis-style multi-panel plots | Conditioned/multi-panel views | Outdated formula syntax |
| plotly | Interactive ggplot2/base extension |
Web dashboards, hover/zoom | Overhead for static use |
| shiny | Reactive web apps | Custom dashboards | Steep learning, deployment needed |
ggplot(data).aes(x, y, colour, size, …).geom_*() (point, line, bar, …).labs().facet_wrap(), facet_grid().theme_minimal(), theme_classic(), etc.Steps 1–3 are mandatory; 4–6 polish the output. Reference the ggplot2 cheat sheet.
library(ggplot2)
# Step 1: data
ggplot(combined_clean)
# Step 2: aesthetic mapping
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset))
# Step 3: geometric object
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
geom_line()
# Step 4: labels
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
geom_line() +
labs(title = "Normalized Net Longs: Relative Changes",
x = "Date", y = "Net Long Positions")
# Step 5: facets
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
geom_line() +
labs(title = "Normalized Net Longs: Relative Changes",
x = "Date", y = "Net Long Positions") +
facet_wrap(~ Asset)
# Step 6: theme (or tidyquant's theme_tq)
ggplot(combined_clean, aes(x = date, y = Net_Longs, color = Asset)) +
geom_line() +
labs(title = "Normalized Net Longs: Relative Changes",
x = "Date", y = "Net Long Positions") +
facet_wrap(~ Asset) +
theme_tq()ggplot2 is declarative — you describe what to plot, not how.+; each layer adds a geom, scale, or theme.facet_wrap() splits one chart into one panel per category — great for cross-asset comparisons.theme_*() once and reuse — keeps the deck visually consistent.library(ggplot2)
ggplot(combined_z, aes(x = date, y = Z_Score, color = Asset)) +
geom_smooth(method = "loess", se = FALSE) + # local-polynomial trend
geom_hline(yintercept = 0, lty = 2) + # mean reference
geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) + # 2022 break
theme_minimal() +
labs(title = "Z-Score Normalized Trends: Deviations from Mean Sentiment",
subtitle = "Positive = above-average bullishness; e.g., crypto spikes in 2023 recovery")geom_smooth(method = "loess") adds a non-parametric trend — useful when a straight line under-fits.geom_hline() / geom_vline() mark reference levels (mean = 0, structural-break date).ggplot(summary_assets, aes(x = factor(Year), y = Mean_Net_Longs, fill = Asset)) +
geom_bar(stat = "identity", position = "dodge") +
geom_errorbar(aes(ymin = Mean_Net_Longs - SD_Net_Longs,
ymax = Mean_Net_Longs + SD_Net_Longs),
position = position_dodge(width = 0.9), width = 0.2) +
theme_minimal() +
labs(title = "Mean Net Longs by Asset & Year", x = "Year", y = "Mean Net Longs")geom_bar(stat = "identity") plots the column value as the bar height (stat = "count" would tally rows instead).position = "dodge" puts groups side-by-side; position = "stack" would stack them.facet_wrap(~ Asset, scales = "free_y") lets each panel use its own y-range — better when assets are on different scales.alpha = 0.5 makes overlapping points readable in dense scatterplots.geom_smooth(method = "lm") overlays an OLS regression line — quickly shows whether two series co-move.cor() to put a number on the visual relationship.ggplot(cor_matrix, aes(x = Asset2, y = Asset1, fill = Correlation)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Correlation), color = "black", size = 3, fontface = "bold") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(size = 10)) +
labs(title = "Correlation Matrix of Net Longs (Assets)",
subtitle = "Negative values show decoupling (e.g., Gold vs Bitcoin in crises)",
x = "Asset", y = "Asset")scale_fill_gradient2() with midpoint = 0 is the right default for correlation data (red/blue diverging palette).geom_text()) so the reader doesn’t have to map colour to number.help.start() opens the main Help page.help(FunctionName) or ?FunctionName shows function help.\section{Title}, and environments for tables/figures.\section{}), insert tables (\begin{table}…\end{table} with tabular), add R figures (\includegraphics{plot.png}), cite (\cite{key} with BibTeX).\section{Results}) with your content.\usepackage{graphicx} for images, \begin{figure}…\end{figure} for floats. Debug errors shown in red (e.g., missing $ for math). Practice with a 1-page test before the full paper.Basics
Week 1
29.10.2025
Course objectives, schedule, assignments · Introduction to R · Live coding
Data Handling & Visualization
Week 2
05.11.2025
API access, merging, cleansing, transforming and visualising financial data in R · Introduction to Overleaf
Statistical Analysis
Week 3
12.11.2025
Descriptive · inferential · modelling — applied in R
Academic Publishing & Refereeing
Week 4
19.11.2025
What makes a great empirical paper · publication process · how to write a referee report
Brown Bag Seminar
Week 13
20.01.2026
Engage with doctoral research and prepare your referee report
.Rmd.Reminder
Institute of Strategic Management and Finance · Ulm University