Ingestion

The ingestion module is where the platform stops being a toy. Lesson 1 of foundations built a news_events table from a literal VALUES block; ingestion keeps that exact shape but changes where the data comes from — it ingests rows with dlt, lands them as Parquet, and reads them back with DuckDB. The promise foundations made — “the data shape is the same; only the source changes” — is the thing this module delivers.

The point of the module is not “we can call an HTTP endpoint.” The point is that each publisher has its own raw shape, and the ingestion layer has to preserve that fidelity while still letting downstream models work against one canonical unified view.

The runnable rep

Rep #2 is a single self-contained lesson: tutorial/ingestion/lesson-01-dlt-news-events.py. It teaches dlt from scratch on two tiny made-up publishers, so you build the muscle, not just run someone else’s pipeline.

make -C tutorial/ingestion run     # land the feeds and print the unified result
make -C tutorial/ingestion check   # assert the unification + checkpoint answer
make -C tutorial/ingestion clean   # remove the Parquet the lesson wrote

It runs in three phases — the same SEED → LAND → UNIFY you repeat whenever a new source enters the platform:

Seed. Two feeds with deliberately different raw vocabularies. Ekstra Bladet logs (reader_id, story_id, behaviour, happened_at, section) and records a click as 'open'; MIND logs (uid, news_id, interaction, ts, category) and records a click as 'click'. Neither is wrong: real publishers never agree on schema, and the lesson refuses to pretend they do.
Land. A dlt pipeline writes each feed — raw shape untouched — to partitioned Parquet under a work directory, using the filesystem destination (ADR-0016). The dlt resource is the unit you learn here: a Python generator yielding the publisher’s rows verbatim.
Unify. DuckDB reads the Parquet back and projects both publishers onto one canonical shape — (user_id, article_id, event_type, event_at, topic, publisher) — the same shape as foundations, now with a publisher column.

The big idea: dlt preserves each source’s raw shape; the unification happens downstream, in plain SQL, where it is visible and testable. The 'open' → 'click' rename is not hidden inside an ingest script — it is a CASE expression you can read in the unified view.

The checkpoint, worked

The lesson ends with a question you answer from the output:

Ekstra Bladet logs a click as open and MIND logs it as click. After unification into one event_type vocabulary, how many click events does each publisher contribute, and how many unified events are there in total?

Ekstra Bladet contributes 2 clicks (its two open rows become click), MIND contributes 1 click, and there are 8 unified events in total (5 from Ekstra Bladet, 3 from MIND). The lesson inside the lesson: once both feeds speak the same event_type vocabulary, a metric like “clicks per publisher” is a one-line GROUP BY — but only because the messy rename happened explicitly at the unification boundary, not silently at ingest time.
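The unification and the checkpoint arithmetic can be sketched in a few lines. This is a minimal illustration, not the lesson's code: it uses Python's built-in sqlite3 in place of DuckDB and in-memory tables in place of the landed Parquet, so it runs with the standard library alone. The table and view names (raw_ekstra_bladet, raw_mind, unified_events), the publisher labels, and every row value are invented for illustration; only the column names, the 'open' → 'click' rename, and the 5 + 3 row counts come from the text above.

```python
import sqlite3

# Hypothetical rows. The 'scroll' and 'view' behaviours are invented fillers;
# the lesson only specifies that Ekstra Bladet logs a click as 'open'.
EB_ROWS = [
    ("r1", "s1", "open",   "2024-01-01T08:00:00", "sport"),
    ("r1", "s2", "scroll", "2024-01-01T08:01:00", "sport"),
    ("r2", "s1", "open",   "2024-01-01T09:00:00", "politics"),
    ("r2", "s3", "scroll", "2024-01-01T09:02:00", "politics"),
    ("r3", "s2", "scroll", "2024-01-01T10:00:00", "culture"),
]
MIND_ROWS = [
    ("u1", "n1", "click", "2024-01-01T08:30:00", "news"),
    ("u1", "n2", "view",  "2024-01-01T08:31:00", "news"),
    ("u2", "n1", "view",  "2024-01-01T11:00:00", "finance"),
]

# The rename lives in a readable CASE expression at the unification boundary.
UNIFY_SQL = """
CREATE VIEW unified_events AS
SELECT reader_id   AS user_id,
       story_id    AS article_id,
       CASE behaviour WHEN 'open' THEN 'click' ELSE behaviour END AS event_type,
       happened_at AS event_at,
       section     AS topic,
       'ekstra_bladet' AS publisher
FROM raw_ekstra_bladet
UNION ALL
SELECT uid, news_id, interaction, ts, category, 'mind'
FROM raw_mind
"""


def build_unified(conn: sqlite3.Connection) -> None:
    """Load each feed in its raw shape, then define the unified view over both."""
    conn.execute(
        "CREATE TABLE raw_ekstra_bladet "
        "(reader_id TEXT, story_id TEXT, behaviour TEXT, happened_at TEXT, section TEXT)"
    )
    conn.execute(
        "CREATE TABLE raw_mind "
        "(uid TEXT, news_id TEXT, interaction TEXT, ts TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO raw_ekstra_bladet VALUES (?, ?, ?, ?, ?)", EB_ROWS)
    conn.executemany("INSERT INTO raw_mind VALUES (?, ?, ?, ?, ?)", MIND_ROWS)
    conn.execute(UNIFY_SQL)


def clicks_per_publisher(conn: sqlite3.Connection) -> dict:
    """The 'one-line GROUP BY' metric, possible only after unification."""
    rows = conn.execute(
        "SELECT publisher, COUNT(*) FROM unified_events "
        "WHERE event_type = 'click' GROUP BY publisher ORDER BY publisher"
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_unified(conn)
print(clicks_per_publisher(conn))  # {'ekstra_bladet': 2, 'mind': 1}
print(conn.execute("SELECT COUNT(*) FROM unified_events").fetchone()[0])  # 8
```

The raw tables keep each publisher's own column names and vocabulary; nothing is renamed until the view. That is why the GROUP BY stays trivial and the rename stays auditable.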

The grown-up version: three real publishers

The toy lesson scales to the real thing without changing shape. The production ingestion code lives in tutorial/serving/src/serving/dlt_pipeline.py and lands three real datasets into Parquet under data/raw/<publisher>/ (git-ignored by design — ADR-0014: data is regenerated, never committed):

EB-NeRD — the primary substrate. Danish article text, per-article sentiment, impression logs at real volume (Ekstra Bladet, inside the JP/Politikens media group).
Adressa — a Norwegian local-news comparison point. Different session shape, no sentiment scores, more cold-start sparsity.
MIND — the Microsoft News dataset. English, broader taxonomy, longer impression slates; it tests whether the platform survives a publisher whose semantics differ at the schema level.

Their payoff is the same as the lesson’s, one rung up: the stg_unified_impressions view, where every publisher meets on equal terms. Everything publisher-specific stays in stg_<publisher>_* models; everything cross-publisher reads the unified view. A sentiment metric cannot pretend Adressa has EB-NeRD’s sentiment_score; the ranker degrades gracefully and the evaluation harness reports metric availability per publisher rather than inventing values. Graceful degradation is an accountability feature.
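One way to picture "reports metric availability rather than inventing values" is a metric function that returns None when a publisher lacks the column it needs. The sketch below is hypothetical throughout: the function names, the PUBLISHER_COLUMNS inventory, and the sample scores are assumptions made for illustration, not the evaluation harness's actual interface. The only grounded fact is that EB-NeRD carries sentiment_score and Adressa does not.

```python
from typing import Optional

# Assumed column inventory. Only the presence or absence of sentiment_score
# is taken from the text; article_id is a placeholder column.
PUBLISHER_COLUMNS = {
    "eb_nerd": {"article_id", "sentiment_score"},
    "adressa": {"article_id"},
}


def mean_sentiment(publisher: str, rows: list) -> Optional[float]:
    """Return the mean sentiment, or None when the publisher has no such column."""
    if "sentiment_score" not in PUBLISHER_COLUMNS[publisher]:
        return None  # unavailable: do not substitute 0.0 or a default
    scores = [row["sentiment_score"] for row in rows]
    return sum(scores) / len(scores) if scores else None


def availability_report(data: dict) -> dict:
    """Report, per publisher, whether the metric exists and its value if so."""
    report = {}
    for publisher, rows in data.items():
        value = mean_sentiment(publisher, rows)
        report[publisher] = {"available": value is not None, "mean_sentiment": value}
    return report


# Invented sample rows for illustration.
sample = {
    "eb_nerd": [
        {"article_id": "a1", "sentiment_score": 0.25},
        {"article_id": "a2", "sentiment_score": 0.75},
    ],
    "adressa": [{"article_id": "b1"}],
}
print(availability_report(sample))
```

Returning None keeps "no data" distinct from "a neutral score", so a downstream reader can see that the metric was never computed for Adressa.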

What makes this a good rep

Following the foundations template (ADR-0033):

Standalone and hermetic. The lesson wipes its work directory on each run and dlt uses write_disposition="replace", so it is identical the first time and the hundredth — no network, no real datasets, no shared state.
run / check / clean. make run needs no DuckDB CLI; make check turns the checkpoint into assertions over the real dlt → Parquet → DuckDB path.
Teaches the concept, not the tool’s output. You write the dlt resources and read the unification SQL; the serving pipeline is the grown-up version you graduate to, not a black box you invoke.
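The "wipe the work directory on each run" half of hermeticity can be sketched with the standard library alone. The helper name reset_work_dir and the file names are hypothetical, and the dlt side (write_disposition="replace") is not shown here; this only illustrates why a second run starts from the same empty state as the first.

```python
import shutil
import tempfile
from pathlib import Path


def reset_work_dir(work_dir: Path) -> Path:
    """Delete work_dir if it exists, then recreate it empty."""
    shutil.rmtree(work_dir, ignore_errors=True)
    work_dir.mkdir(parents=True)
    return work_dir


# Demonstration in a throwaway temp location (paths are illustrative).
base = Path(tempfile.mkdtemp())
work = base / "lesson_work"

reset_work_dir(work)                                   # first run: created empty
(work / "stale.parquet").write_text("from a previous run")
reset_work_dir(work)                                   # second run: stale file gone
print(sorted(p.name for p in work.iterdir()))          # []
```

Because the directory is rebuilt before anything is written, a leftover file from an earlier run cannot leak into the result.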

After this module

Once data is landed and unified, transformation reshapes it into mart-level features: article metadata, user histories, embeddings. Orchestration is where the dlt sources become Dagster assets with checks, schedules, and sensors.