Evaluation README

The evaluation module makes recommendation tradeoffs visible through metrics, fixtures, and Pareto-frontier artefacts.

The runnable rep

Rep #7 is a self-contained sweep: tutorial/evaluation/lesson-01-constraint-sweep.py. It takes one candidate set with known clicks, ranks it under a sweep of diversity weights, and reports two metrics side by side: accuracy (NDCG@3) and breadth (diversity@3).

make -C tutorial/evaluation run     # print the NDCG-vs-diversity frontier
make -C tutorial/evaluation check   # assert the frontier + the checkpoint
make -C tutorial/evaluation clean   # nothing to clean (pure in-memory)

The clicks are concentrated in sport, so the frontier shows the editorial cost of clicks directly:

  div_weight  top-3       NDCG@3   diversity@3
  ----------  ----------  -------  -----------
         0.0  a1 a2 a3      0.765        0.667   <- click-only: most accurate
        0.25  a1 a3 a2      0.704        0.667   <- dominated: loses NDCG, no gain
         0.5  a1 a3 a5      0.469          1.0   <- full diversity, real cost
         1.0  a1 a3 a5      0.469          1.0

The checkpoint, worked

Which is the smallest diversity weight that reaches full topic diversity (1.0) at k=3, and what does it cost in NDCG@3?

Weight 0.5, which drops NDCG@3 from 0.765 to 0.469 (a cost of 0.296). Two lessons hide in the table. First, 0.25 is a dominated point: NDCG@3 falls from 0.765 to 0.704 while diversity@3 stays at 0.667, so it sacrifices accuracy for no diversity gain and no rational editor would pick it. Second, there is no “right” answer to the trade: the platform’s job is to make the frontier visible so the newsroom can choose where on it to sit. This makes the click-cost argument (ADR-0009) measurable instead of resting on a claim that the model is “good.”
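
The dominated-point reasoning and the checkpoint answer can be reproduced from the four rows of the table. What follows is a minimal sketch, not the lesson script itself: the Point type and the helpers dominates, pareto_frontier, and checkpoint are hypothetical names, and the numbers are copied from the frontier table above.

```python
from typing import NamedTuple


class Point(NamedTuple):
    div_weight: float
    ndcg: float        # NDCG@3 from the table
    diversity: float   # diversity@3 from the table


# Rows copied from the frontier table above.
ROWS = [
    Point(0.0, 0.765, 0.667),
    Point(0.25, 0.704, 0.667),
    Point(0.5, 0.469, 1.0),
    Point(1.0, 0.469, 1.0),
]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.ndcg >= b.ndcg and a.diversity >= b.diversity
    better = a.ndcg > b.ndcg or a.diversity > b.diversity
    return no_worse and better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def checkpoint(points: list[Point], target: float = 1.0) -> tuple[float, float]:
    """Smallest weight reaching the target diversity, and its NDCG cost
    relative to the lowest-weight (click-only) row."""
    baseline = min(points, key=lambda p: p.div_weight)
    hit = min(
        (p for p in points if p.diversity >= target),
        key=lambda p: p.div_weight,
    )
    return hit.div_weight, round(baseline.ndcg - hit.ndcg, 3)


if __name__ == "__main__":
    print([p.div_weight for p in pareto_frontier(ROWS)])  # [0.0, 0.5, 1.0]
    print(checkpoint(ROWS))                               # (0.5, 0.296)
```

Weights 0.5 and 1.0 both stay on the frontier because their metrics are identical, so neither dominates the other; only 0.25 is removed.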

What makes this a good rep

The rep follows the foundations template (ADR-0033): pure functions, deterministic output, and no I/O. make run prints the frontier; make check asserts every cell plus the checkpoint. The grown-up version (below) sweeps a 5×5×5 grid across three publishers; the rep teaches the shape of the argument first.

Cross-publisher Pareto chart

The evaluation harness sweeps the soft editorial-control grid as a 5 by 5 by 5 configuration matrix:

diversity_weight: 0.00, 0.15, 0.30, 0.60, 1.00
recency_weight: 0.00, 0.20, 0.40, 0.70, 1.00
sentiment_weight: 0.00, 0.10, 0.20, 0.50, 1.00

That grid includes the click-only baseline (0, 0, 0) and the platform default (0.30, 0.40, 0.20). The Dagster asset now runs the same sweep over the EB-NeRD, Adressa, and MIND holdouts, then materialises eval_sweep_results with one row per (config, dataset, metric), including the full configuration columns so analysts can query the same contract from SQL or notebooks.
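
The shape of the sweep can be checked with a few lines of Python. This is an illustrative sketch only: build_grid and config_dataset_pairs are hypothetical helpers, not the Dagster asset's code, and the dictionary keys simply reuse the weight names from the grid above.

```python
from itertools import product

# Weight axes copied from the grid above.
DIVERSITY = (0.00, 0.15, 0.30, 0.60, 1.00)
RECENCY = (0.00, 0.20, 0.40, 0.70, 1.00)
SENTIMENT = (0.00, 0.10, 0.20, 0.50, 1.00)
DATASETS = ("EB-NeRD", "Adressa", "MIND")


def build_grid() -> list[dict[str, float]]:
    """Enumerate every (diversity, recency, sentiment) combination."""
    return [
        {"diversity_weight": d, "recency_weight": r, "sentiment_weight": s}
        for d, r, s in product(DIVERSITY, RECENCY, SENTIMENT)
    ]


def config_dataset_pairs(grid: list[dict[str, float]]) -> list[dict]:
    """Attach each dataset to each configuration, keeping the full
    configuration columns on every record (hypothetical row skeleton;
    the real table adds one row per metric on top of this)."""
    return [{**config, "dataset": name} for config in grid for name in DATASETS]


if __name__ == "__main__":
    grid = build_grid()
    baseline = {"diversity_weight": 0.0, "recency_weight": 0.0, "sentiment_weight": 0.0}
    default = {"diversity_weight": 0.30, "recency_weight": 0.40, "sentiment_weight": 0.20}
    print(len(grid))                         # 125
    print(baseline in grid, default in grid)  # True True
    print(len(config_dataset_pairs(grid)))   # 375
```

With 125 configurations and three holdouts, the sweep covers 375 (config, dataset) pairs, each of which contributes one row per metric.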

The metric set is NDCG@10, MRR, hit-rate@10, intra-list diversity, catalog coverage, median recency, sentiment-distribution divergence, and sensitive-topic exposure.
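
For readers new to the three ranking metrics in that list, here is a minimal sketch using their textbook binary-relevance definitions. It is an assumption that the harness uses binary clicks as relevance; it may well use graded relevance or a different ideal-list convention. The function names and the item ids in the example are hypothetical.

```python
import math
from typing import Sequence


def ndcg_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the top-k list over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


if __name__ == "__main__":
    ranked = ["n3", "n1", "n7", "n2"]   # hypothetical ranked slate
    clicked = {"n1", "n2"}              # hypothetical clicks
    print(round(ndcg_at_k(ranked, clicked, k=3), 3))  # 0.387
    print(reciprocal_rank(ranked, clicked))           # 0.5
    print(hit_at_k(ranked, clicked, k=3))             # 1.0
```

MRR and hit-rate@10 are the means of reciprocal_rank and hit_at_k (with k=10) over all evaluated sessions; NDCG@10 is averaged the same way.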

The chart below plots the smaller, labelled diversity slice of that sweep: NDCG@10 against intra-list diversity, with one Pareto frontier per publisher. The same ranking code produces different tradeoff curves because the publisher context changes the candidate shape:

EB-NeRD is the primary Danish-news substrate. In this holdout it behaves like a deep single-topic session: click accuracy is easiest when the list stays narrow, and diversity has a visible cost.
Adressa’s local-news fixture has shorter, more mixed local sessions. The curve moves toward higher diversity earlier, which is the cold-start risk in miniature: fewer repeated signals make broad local coverage easier to justify but harder to rank confidently.
MIND exposes a broader category taxonomy and longer impression slates. Its session-length shape starts more diverse, so the diversity frontier is flatter than EB-NeRD’s.

Figure: EB-NeRD, Adressa, and MIND NDCG@10 versus intra-list diversity Pareto frontiers.

Deterministic metrics stay in dbt because they are reproducible checks over materialised platform outputs: diversity, recency, source mix, sensitivity exposure, and similar table-shaped facts. Evalite is reserved for editorial assistant behaviour where model text has to be scored: faithfulness, reason validity, changed-constraint coverage, register, and length. See ADR-0020 for the split.