Foundations

The foundations module is where the platform starts — and, under ADR-0033, it is rep #1: the first lesson you re-run until its concepts are automatic, and the quality template every later module copies. It establishes the tutorial’s vocabulary, the local DuckDB substrate, the news-event shape every later module depends on, and the ADR trail that pins down the choices.

Read this module first if you have not touched DuckDB before, or if you want to anchor every later term — impression, click, candidate set, ranking, editorial constraint — in concrete tables you can query.

Why DuckDB and not Postgres or Snowflake

DuckDB earns its central role here because the tutorial’s whole shape leans on a single-machine, file-backed analytical engine. Postgres needs a running server and Snowflake needs a cloud account; DuckDB runs in-process against local files, so lesson 1 has no service to provision. ADR-0016 frames DuckDB over Parquet as the analytical platform, not just a convenient development database: the same engine that scales to a billion rows in storage is the one you query in the very first lesson. That continuity is the point. Foundations is small and the production lesson is large, but the substrate is unchanged.

How to run the rep

The lesson is one self-contained SQL file: tutorial/foundations/lesson-01-news-events.sql.

Run it and read the labelled output:

make -C tutorial/foundations run

make run uses the duckdb Python library through uv, so it works even on a machine without the DuckDB command-line client. If you do have the CLI, piping the file straight into it still works and is worth seeing too:

duckdb < tutorial/foundations/lesson-01-news-events.sql

When you think you understand it, prove the rep passed:

make -C tutorial/foundations check

make check is the checkpoint turned into assertions. A rep is not “it printed something”; it is “the output says what I predicted.” Every module’s Makefile offers the same three verbs: run, check, and clean.
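One way to picture “the output says what I predicted” is the sketch below. It is not the contents of tests/test_lesson_01.py, which this page does not show; check_prediction and the dictionary keys are hypothetical names chosen for illustration. The only idea it carries is that you write the prediction down first and let the comparison fail loudly when the output disagrees.

```python
def check_prediction(predicted: dict, observed: dict) -> None:
    """Raise AssertionError unless every predicted value matches the output.

    Hypothetical helper for illustration; not part of the tutorial repo.
    """
    for key, expected in predicted.items():
        assert key in observed, f"output has no value for {key!r}"
        actual = observed[key]
        assert actual == expected, f"{key}: predicted {expected!r}, got {actual!r}"


# The dataset sizes this page states for lesson 1, written down as a prediction.
predicted = {"users": 3, "articles": 6, "events": 14}

# Stand-in for values you would read off the lesson's labelled output.
observed = {"users": 3, "articles": 6, "events": 14}

check_prediction(predicted, observed)  # passes silently when they agree
```

If observed held 13 events instead, the call would raise an AssertionError naming the key, the prediction, and the actual value, which is the difference between a check and a printout.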

What the lesson builds

The lesson builds the smallest dataset that still teaches the recommender shape — three users, six articles, fourteen events — straight from VALUES, in three phases you will meet again in every later module:

Seed. A news_events table where one row is the atomic record: (user_id, article_id, event_type, event_at, topic). Two event_types: impression (an article appeared in the user’s view) and click (the user opened it). Every click is preceded by an impression — you cannot click what you were never shown.
Shape. Two views turn raw behaviour into metrics: an article_metrics view (per-article impressions, clicks, click-through rate) and a user_topic_affinity view (per-(user, topic) CTR). Both use the nullif(count, 0) divide-by-zero guard — a pattern you will reuse every time a later metric divides by a count.
Score. A starter_candidate_scores view — the simplest possible candidate generator — blends the two signals (0.7 · user_topic_ctr + 0.3 · article_ctr) for every (user, article) pair the user has not already seen.
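The Seed and Shape phases can be sketched in a few lines. This is a hedged illustration, not the lesson file: the four rows are invented and are not the lesson’s fourteen events, the view body is one plausible way to write article_metrics, and Python’s built-in sqlite3 stands in for DuckDB so the sketch runs without installing anything. The SQL sticks to constructs both engines accept.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # SQLite stand-in for DuckDB (assumption)
con.executescript("""
-- Seed: illustrative rows only, NOT the lesson's real data.
CREATE TABLE news_events (
    user_id TEXT, article_id TEXT, event_type TEXT, event_at TEXT, topic TEXT
);
INSERT INTO news_events VALUES
    ('u001', 'a001', 'impression', '2024-01-01 09:00:00', 'politics'),
    ('u001', 'a001', 'click',      '2024-01-01 09:00:05', 'politics'),
    ('u002', 'a001', 'impression', '2024-01-01 09:10:00', 'politics'),
    ('u002', 'a002', 'impression', '2024-01-01 09:11:00', 'sports');

-- Shape: per-article impressions, clicks and CTR.
-- NULLIF(x, 0) turns a zero denominator into NULL, so the ratio is NULL
-- instead of depending on how the engine treats division by zero.
CREATE VIEW article_metrics AS
SELECT
    article_id,
    SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
    SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)      AS clicks,
    1.0 * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0) AS ctr
FROM news_events
GROUP BY article_id;
""")

rows = con.execute(
    "SELECT article_id, impressions, clicks, ctr FROM article_metrics ORDER BY article_id"
).fetchall()
print(rows)  # [('a001', 2, 1, 0.5), ('a002', 1, 0, 0.0)]

# The guard in isolation: zero clicks over zero impressions yields NULL (None).
guarded = con.execute("SELECT 1.0 * 0 / NULLIF(0, 0)").fetchone()[0]
print(guarded)  # None
```

A user_topic_affinity view would follow the same pattern with GROUP BY user_id, topic in place of GROUP BY article_id.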

Vocabulary this module introduces

The terminology established here is canonical for the rest of the tutorial. The repo’s CONTEXT.md is the authoritative glossary; lesson 1 is its runnable companion.

News event — one row of behaviour. Either an impression (article shown) or a click (article opened).
Article metrics — per-article aggregates, including CTR.
User topic affinity — per-(user, topic) aggregates. A simple proxy for what a user is interested in, and the first cold-start signal.
Candidate set — the pool of articles eligible for ranking for a given user. In lesson 1 this is “every article the user hasn’t seen yet.” Later modules replace the trivial scoring with sentence embeddings and a proper ranker, but the shape — candidate set first, ranking second — is established here.
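“Every article the user hasn’t seen yet” is an anti-join: pair every user with every article, then drop the pairs that already have an impression. The sketch below shows that shape under stated assumptions. The rows are invented, “seen” is taken to mean “has an impression row”, and sqlite3 again stands in for DuckDB; the lesson’s own starter_candidate_scores view may be written differently.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # SQLite stand-in for DuckDB (assumption)
con.executescript("""
-- Illustrative rows only, NOT the lesson's real data.
CREATE TABLE news_events (user_id TEXT, article_id TEXT, event_type TEXT);
INSERT INTO news_events VALUES
    ('u001', 'a001', 'impression'),
    ('u001', 'a001', 'click'),
    ('u001', 'a002', 'impression'),
    ('u002', 'a002', 'impression'),
    ('u002', 'a003', 'impression');
""")

# Candidate set: all (user, article) pairs minus the pairs already shown.
candidates = con.execute("""
    SELECT u.user_id, a.article_id
    FROM (SELECT DISTINCT user_id FROM news_events) AS u
    CROSS JOIN (SELECT DISTINCT article_id FROM news_events) AS a
    WHERE NOT EXISTS (
        SELECT 1
        FROM news_events AS e
        WHERE e.user_id = u.user_id
          AND e.article_id = a.article_id
          AND e.event_type = 'impression'
    )
    ORDER BY u.user_id, a.article_id
""").fetchall()

print(candidates)  # [('u001', 'a003'), ('u002', 'a001')]
```

Ranking then operates only on these pairs, which is the “candidate set first, ranking second” order in miniature.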

The checkpoint, worked

The lesson ends with a question you should answer from section 4’s output before reading on:

Which user has the strongest sports signal, and which unseen article would the starter score rank highest for that user?

Strongest sports signal: u003. It is the only user who has ever clicked a sports article (it clicked a002), and every sports impression it received became a click, so its sports topic_ctr is 1.0 while u001 and u002 sit at 0.0.

Highest-scoring unseen article for u003: a001 (politics), score 0.85. This is the instructive twist. You might expect a sports article, since u003 loves sports — but the only unseen sports article, a005, has zero clicks, so its article_ctr is 0 and it scores just 0.7. Meanwhile u003 also has a perfect politics signal (topic_ctr 1.0 from clicking a006), and a001 is a politics article with a respectable article_ctr of 0.5:

a001:  0.7 · 1.0 (politics affinity) + 0.3 · 0.5 (article ctr) = 0.85
a005:  0.7 · 1.0 (sports affinity)   + 0.3 · 0.0 (article ctr) = 0.70

The lesson in the lesson: a strong topic signal does not guarantee a topic match in the results, because article quality (CTR) is mixed in too. That tension between “what the user likes” and “what performs” is the whole reason the later platform separates a model’s candidate set from an editor’s ranking.
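The worked arithmetic for u003 can be replayed as a small function. The weights and the four input values come from the checkpoint above; the function name starter_score and the tuple layout are assumptions made for this sketch, and the lesson itself computes the score in SQL.

```python
def starter_score(user_topic_ctr: float, article_ctr: float,
                  w_topic: float = 0.7, w_article: float = 0.3) -> float:
    """Blend topic affinity and article CTR with the lesson's 0.7 / 0.3 weights."""
    return w_topic * user_topic_ctr + w_article * article_ctr


# (article_id, topic, u003's CTR for that topic, the article's own CTR)
candidates = [
    ("a005", "sports", 1.0, 0.0),
    ("a001", "politics", 1.0, 0.5),
]

# Score each candidate, round for display, and sort best first.
ranked = sorted(
    ((article_id, round(starter_score(affinity, ctr), 2))
     for article_id, _topic, affinity, ctr in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)

print(ranked)  # [('a001', 0.85), ('a005', 0.7)]
```

Because u003’s affinity is 1.0 for both topics, the article CTR term alone decides the order here, so a001 stays on top for any positive article weight.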

The platform-as-leverage thesis (ADR-0004) needs editorial decisions to be visible at every layer. That starts with the data shape. Lesson 1 deliberately keeps everything in SQL, in fourteen visible rows, with no real Python logic, no orchestration, no platform plumbing. Once you have run it, you can argue about every later abstraction against a concrete substrate. That is the foundations claim.

What makes this a good rep (the template)

Because foundations is rep #1, it sets the bar the other modules copy:

Standalone. The DROP ... IF EXISTS header resets everything, so the lesson runs identically the first time and the hundredth.
One run command, one check command. make run works without any CLI install; make check mechanically proves the checkpoint.
A self-test, not just output. The checkpoint question is the rep’s correctness gate, and tests/test_lesson_01.py encodes its answer.
Verbose, not minimal. The SQL is heavily commented and this page works the checkpoint in full. “Focused” here means deep enough to stick, not short.
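The “Standalone” property is easy to test for yourself: run the same script several times and confirm the state does not drift. The sketch below assumes a reset header of the DROP ... IF EXISTS kind followed by a seed; the script text, the view name article_impressions, and the rows are invented for illustration, and sqlite3 stands in for DuckDB.

```python
import sqlite3

# Hypothetical miniature of a lesson file: reset header first, then seed.
LESSON_SQL = """
DROP VIEW IF EXISTS article_impressions;
DROP TABLE IF EXISTS news_events;

CREATE TABLE news_events (user_id TEXT, article_id TEXT, event_type TEXT);
INSERT INTO news_events VALUES
    ('u001', 'a001', 'impression'),
    ('u001', 'a001', 'click');

CREATE VIEW article_impressions AS
SELECT article_id, COUNT(*) AS impressions
FROM news_events
WHERE event_type = 'impression'
GROUP BY article_id;
"""

con = sqlite3.connect(":memory:")  # SQLite stand-in for DuckDB (assumption)

counts = []
for _ in range(3):
    # Re-running on the same connection would fail on CREATE TABLE,
    # or double the rows, if the header did not reset everything.
    con.executescript(LESSON_SQL)
    counts.append(con.execute("SELECT COUNT(*) FROM news_events").fetchone()[0])

print(counts)  # [2, 2, 2]
```

Dropping the view before the table it reads from keeps the reset order safe in engines that track that dependency.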

After this module

The natural next step is ingestion — replacing the toy VALUES block with real dlt-driven loads of EB-NeRD, Adressa, and MIND. The data shape is the same; the volume is not.

The evaluation module eventually closes the loop on these same events: it measures NDCG@10 over actual clicks, computes intra-list diversity over the topics you saw in lesson 1, and renders the Pareto chart on top of the same DuckDB substrate.
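For orientation, the two metrics named above can be sketched on lesson-sized data. These are common textbook definitions, assumed here rather than taken from the evaluation module: NDCG@k with binary relevance (clicked or not), and intra-list diversity as the fraction of item pairs in a list whose topics differ. The function names are hypothetical.

```python
import math
from itertools import combinations


def ndcg_at_k(ranked: list[str], clicked: set[str], k: int = 10) -> float:
    """NDCG@k with binary gains: 1.0 if the article was clicked, else 0.0."""
    gains = [1.0 if article in clicked else 0.0 for article in ranked[:k]]
    dcg = sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))
    # Ideal ordering puts every clicked article at the top of the list.
    ideal_hits = min(len(clicked), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def intra_list_diversity(topics: list[str]) -> float:
    """Share of item pairs in the list that have different topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


best = ndcg_at_k(["a001", "a005"], clicked={"a001"})
worse = ndcg_at_k(["a005", "a001"], clicked={"a001"})
ild = intra_list_diversity(["politics", "sports", "sports"])

print(best)             # 1.0
print(round(worse, 4))  # 0.6309
print(round(ild, 4))    # 0.6667
```

The two numbers pull in different directions, relevance rewarding the clicked article at the top and diversity rewarding topic mix, which is why a trade-off chart is a natural way to read them together.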