
Ingestion

The ingestion module is where the platform stops being a toy. Lesson 1 of foundations builds tables from VALUES; ingestion replaces that with dlt pipelines that pull three real Scandinavian and international news-recommendation datasets into local Parquet files queryable by DuckDB.

The point of the module is not “we can call an HTTP endpoint.” The point is that each publisher has its own raw shape, and the ingestion layer has to preserve that fidelity while still letting downstream models work against a canonical unified view.

  • EB-NeRD is the primary substrate. Released by Ekstra Bladet (part of the JP/Politikens Hus media group), it carries Danish article text, per-article sentiment scores, and impression logs at real volume.
  • Adressa adds a Norwegian local-news comparison point. Compared with EB-NeRD it has a different session shape, no sentiment scores, and more cold-start sparsity.
  • MIND is the Microsoft News dataset — English, broader category taxonomy, longer impression slates. It tests whether the platform survives a publisher whose semantics differ at the schema level.

The cross-publisher work is the test of the architectural claim. If the ranker, evaluation, and editor only worked on EB-NeRD, the platform would not actually be a platform — it would be a single-publisher application with a docs site stapled on.

The ingestion code lives in tutorial/serving/src/serving/dlt_pipeline.py. Three dlt sources land into local Parquet under data/raw/<publisher>/, git-ignored by design (ADR-0014: data is regenerated or re-downloaded, never committed). The Dagster wrapping that exposes each source as a materialisable asset is in tutorial/serving/src/serving/definitions.py — see the orchestration module for how schedules and sensors are wired on top.

Tests anchor the publisher-specific behaviour:

The whole module’s payoff is one downstream view: stg_unified_impressions (source). It is the first place every publisher meets on equal terms, sharing the same columns: user_id, article_id, event_type, event_at, publisher. Everything publisher-specific stays in stg_<publisher>_* models; everything cross-publisher reads from the unified view.
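
To make the canonical shape concrete, here is a minimal Python sketch of what "meeting on equal terms" means. It is an illustration only, not the stg_unified_impressions model itself. The raw field names (clicked, impression_time, userId, id, time), the publisher labels, and the event_type values are hypothetical stand-ins; only the five canonical column names come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UnifiedImpression:
    """One row in the canonical shape: the five columns of the unified view."""
    user_id: str
    article_id: str
    event_type: str
    event_at: datetime
    publisher: str


def from_ebnerd(row: dict) -> UnifiedImpression:
    # Hypothetical raw shape: ISO timestamps and a boolean click flag.
    return UnifiedImpression(
        user_id=str(row["user_id"]),
        article_id=str(row["article_id"]),
        event_type="click" if row["clicked"] else "impression",
        event_at=datetime.fromisoformat(row["impression_time"]),
        publisher="ebnerd",
    )


def from_adressa(row: dict) -> UnifiedImpression:
    # Hypothetical raw shape: epoch seconds and differently named keys.
    # Assumption for this sketch: every logged row counts as a click.
    return UnifiedImpression(
        user_id=str(row["userId"]),
        article_id=str(row["id"]),
        event_type="click",
        event_at=datetime.fromtimestamp(row["time"], tz=timezone.utc),
        publisher="adressa",
    )


# Two raw rows with different shapes end up as the same kind of record.
unified = [
    from_ebnerd({
        "user_id": 42,
        "article_id": 9001,
        "clicked": True,
        "impression_time": "2023-05-01T08:15:00+00:00",
    }),
    from_adressa({"userId": "cx:abc", "id": "a-17", "time": 1682928900}),
]
```

Downstream code in this sketch never branches on the raw shape; it only sees UnifiedImpression records and, where it matters, the publisher column.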

That boundary is not just plumbing. It is how the platform keeps comparison honest. A sentiment metric cannot pretend Adressa has EB-NeRD’s sentiment_score. The ranker degrades gracefully (the sentiment soft-term contributes zero when no score exists), and the evaluation harness reports metric availability per publisher rather than inventing values. Graceful degradation is an accountability feature.
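
The degradation rule can be sketched in a few lines of Python. This is a hedged illustration, not the platform's ranker or evaluation harness: the function names, the additive scoring form, and the default weight of 0.1 are assumptions made for the example. What it takes from the text is the behaviour: a missing sentiment score contributes zero, and availability is reported per publisher rather than filled in.

```python
from typing import Dict, List, Optional


def score(relevance: float,
          sentiment_score: Optional[float],
          sentiment_weight: float = 0.1) -> float:
    """Relevance plus a soft sentiment term; the term is zero when no score exists."""
    if sentiment_score is None:
        sentiment_term = 0.0  # degrade gracefully instead of inventing a value
    else:
        sentiment_term = sentiment_weight * sentiment_score
    return relevance + sentiment_term


def sentiment_availability(rows_by_publisher: Dict[str, List[dict]]) -> Dict[str, bool]:
    """Report, per publisher, whether any row carries a sentiment score."""
    return {
        publisher: any(row.get("sentiment_score") is not None for row in rows)
        for publisher, rows in rows_by_publisher.items()
    }


# Hypothetical sample rows: one publisher has scores, the other does not.
sample = {
    "ebnerd": [{"article_id": "a1", "sentiment_score": 0.5}],
    "adressa": [{"article_id": "b1"}],
}
availability = sentiment_availability(sample)
```

With this shape, an evaluation report can mark a sentiment metric as unavailable for a publisher instead of showing a number that was never measured.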

Once data is landed and unified, transformation reshapes it into mart-level features: article metadata, user histories, embeddings. Orchestration is where the dlt sources become Dagster assets with checks, schedules, and sensors.