Evaluation README

The evaluation module makes recommendation tradeoffs visible through metrics, fixtures, and Pareto-frontier artefacts.

The evaluation harness sweeps the soft editorial-control grid as a 5 by 5 by 5 configuration matrix (125 configurations):

  • diversity_weight: 0.00, 0.15, 0.30, 0.60, 1.00
  • recency_weight: 0.00, 0.20, 0.40, 0.70, 1.00
  • sentiment_weight: 0.00, 0.10, 0.20, 0.50, 1.00

That grid includes the click-only baseline (0, 0, 0) and the platform default (0.30, 0.40, 0.20). The Dagster asset now runs the same sweep over the EB-NeRD, Adressa, and MIND holdouts and materialises eval_sweep_results with one row per (config, dataset, metric). Each row carries the full configuration columns, so analysts can query the same contract from SQL or notebooks.
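As a rough illustration of the grid and row shape described above, the sketch below enumerates the 125 configurations and the (config, dataset, metric) keys. The weight values come from the list above; the function names, the snake_case metric identifiers, and the dict layout are assumptions for illustration, not the harness's actual code or the real eval_sweep_results schema.

```python
from itertools import product

# Grid values copied from the weight lists in this README.
DIVERSITY_WEIGHTS = (0.00, 0.15, 0.30, 0.60, 1.00)
RECENCY_WEIGHTS = (0.00, 0.20, 0.40, 0.70, 1.00)
SENTIMENT_WEIGHTS = (0.00, 0.10, 0.20, 0.50, 1.00)
DATASETS = ("EB-NeRD", "Adressa", "MIND")

# Hypothetical identifiers for the eight metrics; the real column values may differ.
METRICS = (
    "ndcg_at_10", "mrr", "hit_rate_at_10", "intra_list_diversity",
    "catalog_coverage", "median_recency",
    "sentiment_distribution_divergence", "sensitive_topic_exposure",
)


def sweep_configs():
    """Return every (diversity, recency, sentiment) weight triple in the grid."""
    return list(product(DIVERSITY_WEIGHTS, RECENCY_WEIGHTS, SENTIMENT_WEIGHTS))


def result_keys():
    """Return one key dict per (config, dataset, metric), config columns included."""
    return [
        {
            "diversity_weight": d,
            "recency_weight": r,
            "sentiment_weight": s,
            "dataset": dataset,
            "metric": metric,
        }
        for (d, r, s), dataset, metric in product(sweep_configs(), DATASETS, METRICS)
    ]


if __name__ == "__main__":
    configs = sweep_configs()
    print(len(configs))        # 5 * 5 * 5 = 125
    print(len(result_keys()))  # 125 configs * 3 datasets * 8 metrics = 3000
```

Keeping the three weights as explicit columns on every row, rather than an opaque config id, is what lets a plain SQL filter such as "diversity_weight = 0.30" select a slice without a join.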

The metric set is NDCG@10, MRR, hit-rate@10, intra-list diversity, catalog coverage, median recency, sentiment-distribution divergence, and sensitive-topic exposure.
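For orientation, here is a minimal sketch of four of these metrics under common textbook definitions. It assumes binary click relevance and uses category mismatch as the pairwise distance for intra-list diversity. Both are assumptions, and the platform's actual definitions (for example an embedding-based distance) may differ. All function names are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, clicked_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the list divided by the ideal DCG."""
    clicked = set(clicked_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_ids[:k], start=1)
        if item in clicked
    )
    ideal_hits = min(len(clicked), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked_ids, clicked_ids):
    """1 / rank of the first clicked item; MRR is the mean of this over sessions."""
    clicked = set(clicked_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in clicked:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, clicked_ids, k=10):
    """1.0 if any clicked item appears in the top k, else 0.0."""
    clicked = set(clicked_ids)
    return 1.0 if any(item in clicked for item in ranked_ids[:k]) else 0.0


def intra_list_diversity(categories):
    """Share of item pairs in the list whose categories differ (assumed distance)."""
    n = len(categories)
    if n < 2:
        return 0.0
    differing = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if categories[i] != categories[j]
    )
    return differing / (n * (n - 1) / 2)
```

Hit-rate@10 and MRR would then be the averages of hit_at_k and reciprocal_rank over all holdout sessions.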

The chart below plots the smaller, labelled diversity slice of the sweep: NDCG@10 against intra-list diversity, with one Pareto frontier per publisher. The same ranking code produces different tradeoff curves because the publisher context changes the candidate shape:

  • EB-NeRD is the primary Danish-news substrate. In this holdout it behaves like a deep single-topic session: click accuracy is easiest when the list stays narrow, and diversity has a visible cost.
  • Adressa’s local-news fixture has shorter, more mixed local sessions. The curve moves toward higher diversity earlier, which is the cold-start risk in miniature: fewer repeated signals make broad local coverage easier to justify but harder to rank confidently.
  • MIND exposes a broader category taxonomy and longer impression slates. Its session-length shape starts more diverse, so the diversity frontier is flatter than EB-NeRD’s.

Figure: NDCG@10 versus intra-list diversity Pareto frontiers for EB-NeRD, Adressa, and MIND.
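A per-publisher frontier of this kind can be derived from sweep rows roughly as follows. This is a sketch, assuming both intra-list diversity and NDCG@10 are treated as higher-is-better and that rows arrive as (dataset, diversity, ndcg) tuples. The function names and row shape are hypothetical, and the code that produced the chart may work differently.

```python
from collections import defaultdict


def pareto_frontier(points):
    """Return the (diversity, ndcg) points not dominated by any other point.

    A point is dominated when another point is at least as good on both axes
    and strictly better on one. Both axes are maximised.
    """
    # Walk from highest diversity down; within equal diversity, highest NDCG first.
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    frontier = []
    best_ndcg = float("-inf")
    for diversity, ndcg in ordered:
        # A less diverse point survives only if it buys strictly more NDCG.
        if ndcg > best_ndcg:
            frontier.append((diversity, ndcg))
            best_ndcg = ndcg
    return sorted(frontier)


def frontiers_by_dataset(rows):
    """Group (dataset, diversity, ndcg) rows and compute one frontier per dataset."""
    grouped = defaultdict(list)
    for dataset, diversity, ndcg in rows:
        grouped[dataset].append((diversity, ndcg))
    return {dataset: pareto_frontier(pts) for dataset, pts in grouped.items()}
```

Computing the frontier separately per dataset matters here: pooling all three publishers would let one publisher's points dominate another's and hide the differences in curve shape described above.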

Deterministic metrics stay in dbt because they are reproducible checks over materialised platform outputs: diversity, recency, source mix, sensitivity exposure, and similar table-shaped facts. Evalite is reserved for editorial assistant behaviour where model text has to be scored: faithfulness, reason validity, changed-constraint coverage, register, and length. See ADR-0020 for the split.