Transformation
Transformation is where raw landed data becomes the analytical contract that the FastAPI app and notebook-driven analysts both consume. This module makes the platform’s data model (articles, impressions, clicks, embeddings, editorial config, sensitivity flags) first-class, documented, and tested.
Every model lives in tutorial/serving/dbt/.
The dbt project is materialised against the same DuckDB+Parquet substrate
the FastAPI service queries, which is the operational shape of the
two-contracts argument (ADR-0006).
What the dbt project covers
Three model groups live under dbt/models/:
- Staging (models/staging/): one staging model per publisher per source table (stg_ebnerd_*, stg_adressa_*, stg_mind_*) plus the unified view stg_unified_impressions.sql. Renames + casts to canonical column types. Nothing semantic; that work happens downstream.
- Editorial (models/editorial/): the constraint configuration table (constraint_configurations.sql) and the article_sensitivity Python model (article_sensitivity.py) that combines NER hits and EB-NeRD sentiment scores into a boolean is_sensitive flag. This is where editorial concerns enter the analytical contract as queryable rows, not buried code.
- Embeddings (models/staging/article_embeddings.py): a dbt Python model that runs an off-the-shelf sentence-transformer once per article and writes the vector column into Parquet for the recommender to pick up.
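To make the Editorial group concrete, here is a minimal sketch of how NER hits and a sentiment score could be folded into a boolean is_sensitive flag. It is not the contents of article_sensitivity.py: the entity labels, the threshold, the OR rule, and the ArticleSignals type are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical values: these labels and this threshold are illustrative
# assumptions, not taken from article_sensitivity.py.
SENSITIVE_ENTITY_LABELS = frozenset({"VICTIM", "MINOR", "CRIME"})
NEGATIVE_SENTIMENT_THRESHOLD = 0.8


@dataclass(frozen=True)
class ArticleSignals:
    """Per-article inputs the flag is derived from (hypothetical shape)."""

    article_id: str
    ner_labels: tuple          # entity labels found by NER for this article
    negative_sentiment: float  # assumed score in [0, 1]; higher = more negative


def is_sensitive(signals: ArticleSignals) -> bool:
    """Assumed rule: flag on any sensitive entity OR strongly negative sentiment."""
    has_sensitive_entity = any(
        label in SENSITIVE_ENTITY_LABELS for label in signals.ner_labels
    )
    strongly_negative = signals.negative_sentiment >= NEGATIVE_SENTIMENT_THRESHOLD
    return has_sensitive_entity or strongly_negative


def flag_articles(rows):
    """Return one queryable row per article: article_id plus is_sensitive."""
    return [
        {"article_id": row.article_id, "is_sensitive": is_sensitive(row)}
        for row in rows
    ]
```

Keeping the rule in a small pure function like this means the same logic can be unit-tested outside dbt and then called from the Python model that writes the rows.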
Tests + docs are the deliverable, not a side effect
The whole module’s value is that the analytical contract is falsifiable. That means:
- Every model has a schema.yml declaring expected columns and at least one test (non-null primary keys, referential checks across publishers).
- A custom data test lives at dbt/tests/assert_article_sensitivity_seeded_cases.sql; it asserts that known-sensitive seed articles get flagged and known-benign ones don’t.
- dbt docs generate runs in CI and the output is embedded in this docs site under Data reference so an analyst can browse lineage, column descriptions, and tests without reading Python or SQL.
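A dbt singular test is a query that returns the failing rows, so the test passes when the result is empty. The sketch below shows what a seeded-cases check of that shape might look like. It runs the SQL against an in-memory SQLite database from the Python standard library as a stand-in for DuckDB; the sensitivity_seed_cases table and the expected_sensitive column are hypothetical names, and the query is not the contents of the real test file.

```python
import sqlite3

# Hypothetical singular-test query: every returned row is a failure.
# A seed case fails if the model has no row for it or the flag disagrees.
FAILING_ROWS_SQL = """
SELECT s.article_id, s.expected_sensitive, a.is_sensitive
FROM sensitivity_seed_cases AS s
LEFT JOIN article_sensitivity AS a ON a.article_id = s.article_id
WHERE a.is_sensitive IS NULL
   OR a.is_sensitive <> s.expected_sensitive
ORDER BY s.article_id
"""


def failing_seed_cases(model_rows, seed_rows):
    """Load both tables into SQLite and return the rows the test would fail on.

    model_rows: iterable of (article_id, is_sensitive) with 0/1 flags.
    seed_rows:  iterable of (article_id, expected_sensitive) with 0/1 flags.
    """
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE article_sensitivity (article_id TEXT, is_sensitive INTEGER)"
        )
        con.execute(
            "CREATE TABLE sensitivity_seed_cases "
            "(article_id TEXT, expected_sensitive INTEGER)"
        )
        con.executemany("INSERT INTO article_sensitivity VALUES (?, ?)", model_rows)
        con.executemany("INSERT INTO sensitivity_seed_cases VALUES (?, ?)", seed_rows)
        return con.execute(FAILING_ROWS_SQL).fetchall()
    finally:
        con.close()
```

The LEFT JOIN matters here: a seed article that never reaches the model output shows up as a failure instead of silently dropping out of the comparison.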
Why dbt and not raw SQL scripts
The thesis (ADR-0004) says the platform is the product. That only works if the analytical contract is documented, tested, and lineage-tracked at the same fidelity as the HTTP contract. Raw SQL scripts hide the dependency graph; dbt forces it into the open. The cost, one more tool, is bought back many times over by the auto-generated docs, the per-column lineage, and the test discipline forcing schema changes to flow through the platform deliberately.
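The point about the dependency graph can be illustrated with a toy version of what ref() buys you. The sketch below pulls ref('...') calls out of model SQL with a regular expression and derives a build order using graphlib from the standard library (Python 3.9+). This is a simplification: dbt resolves references by rendering Jinja, not by regex, and the model names and SQL bodies in MODELS are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical models: names and SQL are illustrative only.
MODELS = {
    "stg_ebnerd_impressions": "select * from raw_ebnerd",
    "stg_mind_impressions": "select * from raw_mind",
    "stg_unified_impressions": (
        "select * from {{ ref('stg_ebnerd_impressions') }} "
        "union all select * from {{ ref(\"stg_mind_impressions\") }}"
    ),
    "daily_impressions": "select * from {{ ref('stg_unified_impressions') }}",
}


def extract_refs(sql: str) -> set:
    """Return the set of model names a SQL body depends on via ref()."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict) -> list:
    """Topologically sort models so every dependency is built first."""
    graph = {name: extract_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())
```

With plain SQL scripts the same ordering exists only in someone's head or in a shell script; declaring dependencies through ref() is what lets the tool compute it and render it as lineage.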
After this module
The transformation outputs feed three downstream consumers:
- Modeling reads article and user features to compute embeddings + candidate sets.
- Editorial reads constraint_configurations and article_sensitivity to apply ranking.
- Evaluation sweeps configurations and writes results back into the same DuckDB substrate as the analytical contract.
Orchestration wraps the whole dbt project as Dagster
assets so a single dagster materialize traces all the way from raw landing
to evaluated outputs.