Foundations
The foundations module is where the platform starts. It establishes the tutorial’s vocabulary, the local DuckDB substrate, the news-event shape every later module depends on, and the ADR trail that pins down the choices.
Read this module first if you have not touched DuckDB before, or if you want to anchor every later term — impression, click, session, candidate set, ranking, editorial constraint — in concrete tables you can query.
Why DuckDB and not Postgres or Snowflake
DuckDB earns its central role here because the tutorial’s whole shape leans on a single-machine, file-backed analytical engine. ADR-0016 frames DuckDB over Parquet as the analytical platform, not just a convenient development database: the same engine that scales to a billion rows in the storage / lakehouse module is the one you query in the very first lesson. That continuity is the point. Foundations is small, the production lesson is large, and the substrate is unchanged.
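To make the continuity concrete, here is a minimal sketch of running the same aggregate against an in-memory table and against a Parquet file. The table name news_events, the three sample rows, and the file path news_events.parquet are assumptions for illustration; they are not taken from the lesson or from ADR-0016.

```sql
-- Illustrative rows only; not the lesson's actual data.
CREATE OR REPLACE TABLE news_events AS
SELECT *
FROM (VALUES
    ('u1', 'a1', 'impression', TIMESTAMP '2024-01-01 08:00:00', 'sports'),
    ('u1', 'a1', 'click',      TIMESTAMP '2024-01-01 08:00:05', 'sports'),
    ('u2', 'a2', 'impression', TIMESTAMP '2024-01-01 09:00:00', 'politics')
) AS t(user_id, article_id, event_type, event_at, topic);

-- Write the table out as Parquet (hypothetical path in the working directory).
COPY news_events TO 'news_events.parquet' (FORMAT parquet);

-- Query the in-memory table.
SELECT event_type, COUNT(*) AS n
FROM news_events
GROUP BY event_type
ORDER BY event_type;

-- Run the same query directly over the Parquet file.
SELECT event_type, COUNT(*) AS n
FROM read_parquet('news_events.parquet')
GROUP BY event_type
ORDER BY event_type;
```

Only the FROM clause changes between the two queries, which is the sense in which the substrate stays the same as the data grows.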
What lesson 1 does
The first lesson is a single self-contained SQL file:
tutorial/foundations/lesson-01-news-events.sql.
Open it in your editor or run it directly:
duckdb < tutorial/foundations/lesson-01-news-events.sql
The lesson builds the smallest dataset that still teaches the recommender shape (three users, six articles, fourteen events) straight from VALUES.
It introduces:
- A news event as the atomic record: (user_id, article_id, event_type, event_at, topic). Two event_type values: impression (an article appeared in the user’s view) and click (the user opened it).
- An article-metrics view computing per-article impressions, clicks, and click-through rate (CTR) with a divide-by-zero guard.
- A user-topic-affinity view computing per-(user, topic) impressions, clicks, and topic CTR.
- A starter candidate score combining user-topic affinity and article CTR (0.7 · topic_ctr + 0.3 · article_ctr) for every (user, article) pair the user hasn’t already seen. This is the simplest possible candidate-generation slice.
- A checkpoint question at the end: which user has the strongest sports signal, and which unseen article would the starter score rank highest for that user? Run the file and answer it from the output; that confirms the lesson worked.
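The event shape and the two aggregate views described above can be sketched roughly as follows. This is not the content of the lesson file: the object names (news_events, article_metrics, user_topic_affinity), the sample rows, and the choice to map an undefined CTR to 0 rather than leave it NULL are all assumptions made for illustration.

```sql
-- Illustrative events; the real lesson defines its own fourteen rows.
CREATE OR REPLACE TABLE news_events AS
SELECT *
FROM (VALUES
    ('u1', 'a1', 'impression', TIMESTAMP '2024-01-01 08:00:00', 'sports'),
    ('u1', 'a1', 'click',      TIMESTAMP '2024-01-01 08:00:05', 'sports'),
    ('u1', 'a2', 'impression', TIMESTAMP '2024-01-01 08:01:00', 'politics'),
    ('u2', 'a1', 'impression', TIMESTAMP '2024-01-01 09:00:00', 'sports'),
    ('u2', 'a3', 'impression', TIMESTAMP '2024-01-01 09:02:00', 'culture'),
    ('u2', 'a3', 'click',      TIMESTAMP '2024-01-01 09:02:10', 'culture')
) AS t(user_id, article_id, event_type, event_at, topic);

-- Per-article impressions, clicks, and CTR.
CREATE OR REPLACE VIEW article_metrics AS
WITH counts AS (
    SELECT
        article_id,
        COUNT(*) FILTER (WHERE event_type = 'impression') AS impressions,
        COUNT(*) FILTER (WHERE event_type = 'click')      AS clicks
    FROM news_events
    GROUP BY article_id
)
SELECT
    article_id,
    impressions,
    clicks,
    -- Divide-by-zero guard: NULLIF turns a zero denominator into NULL,
    -- and COALESCE maps the resulting NULL ratio to 0 (an assumption).
    COALESCE(CAST(clicks AS DOUBLE) / NULLIF(impressions, 0), 0.0) AS article_ctr
FROM counts;

-- Per-(user, topic) impressions, clicks, and topic CTR, with the same guard.
CREATE OR REPLACE VIEW user_topic_affinity AS
WITH counts AS (
    SELECT
        user_id,
        topic,
        COUNT(*) FILTER (WHERE event_type = 'impression') AS impressions,
        COUNT(*) FILTER (WHERE event_type = 'click')      AS clicks
    FROM news_events
    GROUP BY user_id, topic
)
SELECT
    user_id,
    topic,
    impressions,
    clicks,
    COALESCE(CAST(clicks AS DOUBLE) / NULLIF(impressions, 0), 0.0) AS topic_ctr
FROM counts;

SELECT * FROM article_metrics ORDER BY article_id;
SELECT * FROM user_topic_affinity ORDER BY user_id, topic;
```

Both views are plain GROUP BY aggregates over the same event table, so every number in them can be traced back to visible rows.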
Vocabulary this module introduces
The terminology established here is canonical for the rest of the tutorial.
The repo’s CONTEXT.md is the authoritative glossary; lesson 1 is its
runnable companion.
- News event: one row of behaviour. Either an impression (article shown) or a click (article opened).
- Article metrics: per-article aggregates, including CTR.
- User topic affinity: per-(user, topic) aggregates. A simple proxy for what a user is interested in. The first cold-start signal.
- Candidate set: the pool of articles eligible for ranking for a given user. In lesson 1 this is “every article the user hasn’t seen yet.” Later modules replace the trivial scoring with sentence embeddings + a proper ranker, but the shape (candidate set first, ranking second) is established here.
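The two-step shape, candidate set first and ranking second, might look like the single query below. It is a sketch under stated assumptions: the rows are invented, the CTE names are hypothetical, and treating a missing user-topic history as a topic CTR of 0 is a choice made here, not something the lesson confirms. The weights follow the starter score quoted earlier (0.7 · topic_ctr + 0.3 · article_ctr).

```sql
WITH news_events(user_id, article_id, event_type, topic) AS (
    -- Illustrative rows only; not the lesson's data.
    VALUES
        ('u1', 'a1', 'impression', 'sports'),
        ('u1', 'a1', 'click',      'sports'),
        ('u1', 'a2', 'impression', 'politics'),
        ('u2', 'a3', 'impression', 'sports'),
        ('u2', 'a3', 'click',      'sports'),
        ('u2', 'a2', 'impression', 'politics')
),
article_metrics AS (
    SELECT
        article_id,
        topic,
        COALESCE(
            CAST(COUNT(*) FILTER (WHERE event_type = 'click') AS DOUBLE)
            / NULLIF(COUNT(*) FILTER (WHERE event_type = 'impression'), 0),
            0.0
        ) AS article_ctr
    FROM news_events
    GROUP BY article_id, topic
),
user_topic_affinity AS (
    SELECT
        user_id,
        topic,
        COALESCE(
            CAST(COUNT(*) FILTER (WHERE event_type = 'click') AS DOUBLE)
            / NULLIF(COUNT(*) FILTER (WHERE event_type = 'impression'), 0),
            0.0
        ) AS topic_ctr
    FROM news_events
    GROUP BY user_id, topic
),
-- Step 1: candidate set = every (user, article) pair with no prior event.
candidates AS (
    SELECT u.user_id, a.article_id, a.topic, a.article_ctr
    FROM (SELECT DISTINCT user_id FROM news_events) AS u
    CROSS JOIN article_metrics AS a
    WHERE NOT EXISTS (
        SELECT 1
        FROM news_events AS e
        WHERE e.user_id = u.user_id
          AND e.article_id = a.article_id
    )
)
-- Step 2: ranking = score each candidate, best first per user.
SELECT
    c.user_id,
    c.article_id,
    0.7 * COALESCE(t.topic_ctr, 0.0) + 0.3 * c.article_ctr AS starter_score
FROM candidates AS c
LEFT JOIN user_topic_affinity AS t
       ON t.user_id = c.user_id
      AND t.topic = c.topic
ORDER BY c.user_id, starter_score DESC;
```

Keeping the candidates CTE separate from the final SELECT means a later ranker can replace step 2 without touching how eligibility is decided in step 1.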
Why this is lesson 1, not a sidebar
The platform-as-leverage thesis (ADR-0004) needs editorial decisions to be visible at every layer. That starts with the data shape. Lesson 1 deliberately keeps everything in SQL, in fourteen visible rows, with no Python, no orchestration, no platform plumbing. Once you have run it, you can argue about every later abstraction against a concrete substrate. That is the foundations claim.
After this module
The natural next step is ingestion: replacing the toy
VALUES block with real dlt-driven loads of EB-NeRD, Adressa, and MIND.
The data shape is the same; the volume is not.
The evaluation module eventually closes the loop on these same events: it measures NDCG@10 over actual clicks, computes intra-list diversity over the categories you saw in lesson 1, and renders the Pareto chart on top of the same DuckDB substrate.