Modeling lessons

The modeling lessons cover the recommender model — deliberately modest per ADR-0007. Everything lives in the serving package: tutorial/serving/src/serving/embeddings.py and tutorial/serving/src/serving/recommendations.py.

L01 — Sentence embeddings as a dbt model

The article embedding pass runs as a dbt Python model so it gets the same lineage tracking and documentation as every other column in the analytical contract: dbt/models/staging/article_embeddings.py.

uv run --package tutorial-serving dbt run --select article_embeddings \
--project-dir tutorial/serving/dbt --profiles-dir tutorial/serving/dbt

The model uses an off-the-shelf multilingual sentence-transformer that handles Danish, Norwegian, and English article text. Encoding is deterministic for a given model and input, so the output Parquet is reproducible across re-runs in the same environment; small floating-point differences can still appear across hardware or library versions.

User vectors are computed in recommendations.py: take the user’s last-K read articles, look up their embeddings, average them. That mean vector is the user’s representation.
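The averaging step can be sketched in a few lines. This is a minimal illustration, not the code in recommendations.py: the function name `user_vector`, the `last_k` default of 20, and the choice to return `None` for users without usable history are all assumptions made for the example. Plain Python lists keep it dependency-free; a real implementation would more likely use NumPy.

```python
from typing import Mapping, Optional, Sequence

Vector = list[float]


def user_vector(
    read_history: Sequence[str],
    embeddings: Mapping[str, Sequence[float]],
    last_k: int = 20,  # hypothetical default; the tutorial's K is not stated
) -> Optional[Vector]:
    """Average the embeddings of the user's last_k read articles.

    read_history is ordered oldest to newest. Returns None when none of the
    recent reads has an embedding, which a caller can treat as cold start.
    """
    if last_k <= 0:
        raise ValueError("last_k must be positive")
    recent = read_history[-last_k:]
    # Skip article ids that have no embedding instead of failing.
    vectors = [embeddings[a] for a in recent if a in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    # Component-wise mean over the selected article vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

With `last_k=2` and a history of three articles, only the two most recent reads contribute to the mean.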

It is the simplest possible user model. It does not learn user preferences over time. It does not weight recent reads more heavily. It does not penalise topics the user has unsubscribed from. All of those would be recommender-research moves, and per ADR-0007 the model is not where the tutorial competes.

For a known user: take the user vector, compute cosine similarity against every article embedding, exclude already-read articles, return the top-N nearest. That’s the candidate set.
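A hedged sketch of that candidate step follows. The names `cosine` and `top_n_candidates` are hypothetical, and the brute-force scan over every article simply mirrors the description above; the tie-break on article id is an added assumption so that results are stable in tests.

```python
import math
from typing import Iterable, Mapping, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_candidates(
    user_vec: Sequence[float],
    article_embeddings: Mapping[str, Sequence[float]],
    already_read: Iterable[str],
    n: int = 10,
) -> list[str]:
    """Return the ids of the n unread articles most similar to user_vec."""
    read = set(already_read)
    scored = [
        (cosine(user_vec, vec), article_id)
        for article_id, vec in article_embeddings.items()
        if article_id not in read  # exclude already-read articles
    ]
    # Highest similarity first; ties broken by id (assumption, for determinism).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored[:n]]
```

Scanning every article is linear in catalogue size per request, which fits a deliberately modest model; an approximate nearest-neighbour index would be the usual next step if that ever became a bottleneck.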

For the cold-start path (no read history): fall back to popularity-by-category over a recent time window. Articles are scored by category-level CTR plus recency, then sampled to ensure a mix of categories. The cold-start fallback is in the same file so the routing logic (if user has embedding...) sits next to both paths.
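One way the cold-start scoring could look is sketched below. Everything specific here is an assumption for illustration: the `Article` dataclass, the 48-hour window, the linear recency decay, and the `recency_weight` of 0.5. The category mix is produced by a deterministic round-robin over categories, which stands in for the sampling step described above rather than reproducing it.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Article:  # hypothetical record shape
    article_id: str
    category: str
    age_hours: float


def cold_start_candidates(
    articles: Sequence[Article],
    category_ctr: Mapping[str, float],
    n: int = 10,
    window_hours: float = 48.0,   # assumed "recent" window
    recency_weight: float = 0.5,  # assumed weight of recency against CTR
) -> list[str]:
    """Score by category CTR plus recency, then interleave categories."""

    def score(a: Article) -> float:
        recency = max(0.0, 1.0 - a.age_hours / window_hours)
        return category_ctr.get(a.category, 0.0) + recency_weight * recency

    # Group in-window articles by category, best-scoring first.
    by_category: dict[str, list[Article]] = {}
    for a in articles:
        if a.age_hours <= window_hours:
            by_category.setdefault(a.category, []).append(a)
    for group in by_category.values():
        group.sort(key=lambda a: (-score(a), a.article_id))

    # Order categories by their top article, then take one per category per round.
    ordered = sorted(
        by_category.values(), key=lambda g: (-score(g[0]), g[0].article_id)
    )
    picked = [
        a.article_id
        for round_ in zip_longest(*ordered)
        for a in round_
        if a is not None
    ]
    return picked[:n]
```

The round-robin guarantees that a second article from one category never appears before the first article from another, which is the property the category mix is meant to provide.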

test_embeddings.py covers the encoder behaviour. The deeper test is test_recommendations.py, which exercises candidate generation with seeded user histories and embedding fixtures. The cold-start path has its own test — confirming that a user with no reads returns popular-by-category articles, not an empty list.

All tests run in milliseconds against in-memory fixtures, which is the test discipline ADR-0007 promises.
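The shape of such a test can be illustrated with a self-contained sketch. The `recommend` function below is a hypothetical stand-in for the routing logic, with both paths injected as callables so the example needs no real embeddings; the actual contents of test_recommendations.py are not shown in this lesson and may differ.

```python
from typing import Callable, Mapping, Sequence


def recommend(
    user_id: str,
    histories: Mapping[str, Sequence[str]],
    known_path: Callable[[Sequence[str]], list[str]],
    cold_start_path: Callable[[], list[str]],
) -> list[str]:
    """Route to the embedding path when history exists, else to cold start."""
    history = histories.get(user_id, [])
    if history:
        return known_path(history)
    return cold_start_path()


def test_cold_start_returns_popular_articles_not_empty_list() -> None:
    # Seeded, in-memory fixtures: one known reader, one canned popular list.
    histories = {"reader-1": ["a1", "a2"]}
    popular = ["p1", "p2"]
    result = recommend(
        "new-user",
        histories,
        known_path=lambda history: ["x1"],
        cold_start_path=lambda: popular,
    )
    assert result == popular
    assert result != []


def test_known_user_uses_embedding_path() -> None:
    histories = {"reader-1": ["a1", "a2"]}
    result = recommend(
        "reader-1",
        histories,
        known_path=lambda history: [f"near-{history[-1]}"],
        cold_start_path=lambda: ["p1"],
    )
    assert result == ["near-a2"]
```

Because nothing touches disk, a model, or a database, tests in this style finish in milliseconds under pytest.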

  • No training. No fine-tuning. No domain adaptation.
  • No two-stage architecture (collaborative recall + content rerank).
  • No personalised content models, no sequence models, no learned re-rankers.

Each of those would put the model in the spotlight, which is exactly where the platform-as-leverage thesis says the model should not be.

The candidate sets produced here flow into the editorial module, where the ranker applies the five editorial constraints. The evaluation module sweeps constraint configurations against these candidates.