Accountable recommendations need a platform

The editorial cost of clicks is easy to state: a newsroom can become narrower, faster, and more reactive without ever deciding to become that. A recommender does not need to be malicious or badly built to create that cost. If it is asked to maximise clicks, and the platform only measures clicks, it will keep learning from the narrowest signal in the room.

That is the problem this tutorial is built around. A news recommender changes what readers see. That makes it an editorial system, not just a machine-learning component. The newsroom therefore needs more than a ranked list of articles. It needs to see which editorial constraints shaped that list, who can change those constraints, how a change affects the reader-facing slate, and what evidence is available after the change ships.

I argue that editorial accountability is a platform-layer concern, not a model-layer concern. That is the core thesis recorded in ADR-0004, and it is the reason the recommender model in this tutorial stays deliberately simple. The model is not the star. The platform around the model is the point.

Where this essay fits. This is the domain-reasoning half of what the lab teaches, one of the two things its owner is here to practise (the other is the technical stack). It makes the platform-as-leverage argument recorded in ADR-0004. The prior question, why this lab exists at all, is answered on the front door and in ADR-0033: the lab exists so its owner builds durable fluency by re-running this material many times. The essay below is the thing worth understanding deeply enough to rebuild and defend from memory.

The shortest version

The model finds articles that might be relevant. The platform decides how those articles can be used responsibly.

In implementation terms, the model produces a candidate set and the ranker turns that candidate set into the final recommendation list. In plain English: the model says, “these articles look plausible for this reader.” The ranker says, “given the current editorial policy, this is the order we are willing to show.”

That split matters. If the model and the editorial policy are blended together, the newsroom has very little room to govern the result. If they are separated, the same candidate set can be ranked one way for a click-only baseline, another way for a high-diversity front, and another way for a breaking-news situation where freshness matters more than usual. The model does not have to be retrained every time the editorial stance changes.

This is the platform leverage: editorial judgment becomes a configurable, inspectable part of the system.
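
The split can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's ranker: the names Candidate, RankingConfig, and rank, the half-life freshness decay, and the three example articles are all assumptions made for the sketch. What it shows is the same candidate set producing three different orders under three configurations, with no change to the model scores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Candidate:
    article_id: str
    category: str
    relevance: float  # model score, assumed to lie in [0, 1]
    published_at: datetime


@dataclass(frozen=True)
class RankingConfig:
    # Hypothetical soft weights; the tutorial's real schema may differ.
    recency_weight: float = 0.0
    diversity_weight: float = 0.0


def rank(candidates, config, now, half_life_hours=24.0):
    """Greedily order candidates; a pure function of its arguments."""
    remaining = list(candidates)
    seen_categories = set()
    ranked = []

    def score(c):
        age_hours = (now - c.published_at).total_seconds() / 3600
        freshness = 0.5 ** (age_hours / half_life_hours)  # 1.0 when brand new
        novelty = 0.0 if c.category in seen_categories else 1.0
        return (c.relevance
                + config.recency_weight * freshness
                + config.diversity_weight * novelty)

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best.article_id)
        seen_categories.add(best.category)
        remaining.remove(best)
    return ranked


# Illustrative candidates only; not drawn from any dataset.
now = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)
candidates = [
    Candidate("a1", "sport", 0.9, now - timedelta(hours=48)),
    Candidate("a2", "sport", 0.8, now - timedelta(hours=2)),
    Candidate("a3", "politics", 0.6, now - timedelta(hours=1)),
]

click_only = rank(candidates, RankingConfig(), now)
high_diversity = rank(candidates, RankingConfig(diversity_weight=0.5), now)
breaking_news = rank(candidates, RankingConfig(recency_weight=1.0), now)
```

With zero weights the order follows relevance alone (a1, a2, a3). Raising the diversity weight lifts the politics story above the second sport story, and raising the recency weight pushes the two-day-old article to the bottom.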

What the editor is actually controlling

The tutorial uses five editorial constraints. Three are soft weights:

Topical diversity: how much the list should spread across categories.
Recency: how strongly newer stories should be preferred.
Sentiment balance: how close the list should stay to the target tone.

Two are hard rules:

Editorial promotion: articles the newsroom has consciously decided must appear.
Sensitive-topic exposure: a cap that prevents sensitive material from dominating a reader’s list.

ADR-0010 names this mixed enforcement model: hard rules plus weighted soft constraints. ADR-0015 gives the scoring detail. The distinction is important because not every editorial concern should be reduced to one more number inside a score. Some concerns are preferences. Others are promises.

Topical diversity can be a judgment call. On a one-topic news day, a tighter front may be justified. On a quieter day, a broader mix can be healthier. Recency is similar: a live news period and a weekend feature package should not use the same freshness bias. Sentiment balance also depends on context.

Promotion and sensitivity are different. A promoted investigation should appear because the newsroom made a conscious editorial choice. A sensitive-topic cap should hold because the newsroom has decided that vulnerable or distressing material should not take over the list. Those rules may still require careful design, but they should not disappear as weak hints inside a weighted average.
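
One way to picture the difference between a preference and a promise is to apply the hard rules as a separate pass after soft scoring. The sketch below is an assumption-laden illustration, not the scoring described in ADR-0015: Scored, enforce_hard_rules, and every parameter name are hypothetical. It also makes two simplifying choices that a real design would need to debate: promoted articles are placed at the top of the slate, and the sensitive-topic cap applies to promoted articles too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    article_id: str
    soft_score: float      # output of the weighted soft constraints
    sensitive: bool = False


def enforce_hard_rules(scored, promoted_ids, max_sensitive, slate_size):
    """Build a slate in which promotion and the sensitive cap always hold."""
    by_score = sorted(scored, key=lambda s: s.soft_score, reverse=True)
    # Promotion: promoted articles are considered first, whatever their score.
    promoted = [s for s in by_score if s.article_id in promoted_ids]
    rest = [s for s in by_score if s.article_id not in promoted_ids]

    slate = []
    sensitive_count = 0
    for item in promoted + rest:
        if len(slate) == slate_size:
            break
        if item.sensitive:
            # Cap: skip sensitive items once the limit is reached.
            if sensitive_count >= max_sensitive:
                continue
            sensitive_count += 1
        slate.append(item.article_id)
    return slate


# Illustrative input only.
scored = [
    Scored("investigation", 0.1),
    Scored("s1", 0.9, sensitive=True),
    Scored("s2", 0.8, sensitive=True),
    Scored("n1", 0.7),
    Scored("n2", 0.2),
]
slate = enforce_hard_rules(
    scored, promoted_ids={"investigation"}, max_sensitive=1, slate_size=3
)
```

The promoted investigation appears despite having the lowest soft score, and the second sensitive article is left out despite having the second-highest. Neither outcome could be guaranteed by adjusting a weight.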

Why this belongs in the platform

A newsroom cannot govern a recommender through architectural intent alone. It needs surfaces where responsibility can be exercised.

That is why the tutorial has an editor interface. Sliders, number inputs, saved configurations, previews, and audit records are ordinary on purpose. They are not glamorous, but they make the editorial policy usable. An editor should be able to move the diversity weight, preview the changed recommendation list, and decide whether the new list better matches the editorial intent before committing it.
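
The preview step can be thought of as a comparison between the slate the reader sees now and the slate the proposed settings would produce. The helper below is a hypothetical sketch of that comparison (preview_diff and its output keys are not taken from the tutorial's code), assuming a slate is an ordered list of article identifiers.

```python
def preview_diff(current, proposed):
    """Summarise how a proposed slate differs from the current one."""
    current_pos = {article: i for i, article in enumerate(current)}
    proposed_pos = {article: i for i, article in enumerate(proposed)}
    return {
        # Articles the reader would newly see.
        "entered": [a for a in proposed if a not in current_pos],
        # Articles the reader would no longer see.
        "dropped": [a for a in current if a not in proposed_pos],
        # Articles in both slates whose position changed: (old, new).
        "moved": {
            a: (current_pos[a], proposed_pos[a])
            for a in proposed
            if a in current_pos and current_pos[a] != proposed_pos[a]
        },
    }


diff = preview_diff(["a1", "a2", "a3"], ["a1", "a3", "a4"])
```

An editor looking at this summary sees that a4 would enter, a2 would drop out, and a3 would move up one place, which is the kind of evidence needed before committing a change.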

That is also why the tutorial has two contracts.

The app contract is for editor-facing tools and application clients. It is an HTTP/OpenAPI boundary: stable endpoints, typed requests, typed responses, and forms that can be validated before they reach the platform. The editor interface consumes this contract.
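
What "typed requests that can be validated before they reach the platform" might look like is sketched below using only the Python standard library. It is a stand-in for whatever request models the tutorial's API actually defines: the class name, the field names, and the rule that soft weights lie between 0 and 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintConfigRequest:
    # Hypothetical fields mirroring the five editorial constraints.
    diversity_weight: float
    recency_weight: float
    sentiment_weight: float
    max_sensitive_per_slate: int
    promoted_article_ids: tuple = ()

    def __post_init__(self):
        # Assumed rule: soft weights are bounded to [0, 1].
        for name in ("diversity_weight", "recency_weight", "sentiment_weight"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        if self.max_sensitive_per_slate < 0:
            raise ValueError("max_sensitive_per_slate must not be negative")


def parse_config(payload):
    """Turn an untyped payload into a validated request, or raise ValueError."""
    try:
        return ConstraintConfigRequest(
            diversity_weight=float(payload["diversity_weight"]),
            recency_weight=float(payload["recency_weight"]),
            sentiment_weight=float(payload["sentiment_weight"]),
            max_sensitive_per_slate=int(payload["max_sensitive_per_slate"]),
            promoted_article_ids=tuple(payload.get("promoted_article_ids", ())),
        )
    except KeyError as missing:
        raise ValueError(f"missing field: {missing.args[0]}") from None


valid = parse_config({
    "diversity_weight": 0.4,
    "recency_weight": 0.3,
    "sentiment_weight": 0.1,
    "max_sensitive_per_slate": 2,
})
```

The design point is that a malformed or out-of-range configuration is rejected at the boundary, so the ranker never has to decide what a diversity weight of 1.7 means.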

The analytical contract is for analysts. It is the DuckDB-readable table and view schema over the Parquet-backed platform data. Analysts should not have to reverse-engineer an editor UI when the natural question is SQL-shaped. They should be able to inspect articles, impressions, configurations, changes, and evaluation outputs directly.

ADR-0006 frames this two-contract design. Different consumers get different surfaces over the same platform, rather than different systems with drifting definitions.

Why the data substrate matters

Accountability without a queryable record is theatre. A platform can claim to be responsible, but if nobody can inspect what happened, the claim is weak.

ADR-0016 frames DuckDB over Parquet as the local-first analytical substrate. In this tutorial, that choice is not just a convenient development setup. It means the data platform can be exercised on small fixtures while keeping a boundary that still makes sense for larger partitioned data. It also means the same underlying records can serve both application workflows and analyst workflows.

The article tables, impression views, constraint configurations, editorial change records, and evaluation outputs are not private internals of an app. They are part of the evidence layer. A future editor or analyst should be able to ask what changed, who changed it, what recommendation list moved, and whether a click gain came with a visible editorial cost.

What evaluation is for

A standard recommender report can say whether the system predicted clicks well. That is useful, but incomplete. A click-only metric cannot tell the whole editorial story.

ADR-0009 adds a second family of measurements: diversity, coverage, recency, sentiment distribution, and sensitive-topic exposure. The point is not that every metric is perfect. The point is that the tradeoff is visible. A click-only configuration can be compared with a balanced configuration or a high-diversity configuration. If one setting buys a small accuracy gain while losing a large amount of coverage, that is no longer a hidden side effect. It is an editorial choice.
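
Two of these measurements are simple enough to sketch directly. The functions below are illustrative assumptions, not the definitions ADR-0009 specifies: diversity is taken here to be the Shannon entropy of the category mix in a slate, and coverage the share of the catalogue that appears in at least one slate. The article identifiers, categories, and slates are invented fixtures.

```python
import math
from collections import Counter


def category_entropy(categories):
    """Shannon entropy (in bits) of the category mix in one slate."""
    counts = Counter(categories)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def catalogue_coverage(slates, catalogue_size):
    """Share of the catalogue that appears in at least one slate."""
    shown = {article_id for slate in slates for article_id in slate}
    return len(shown) / catalogue_size


def evaluate(slates, category_of, catalogue_size):
    """Summarise one configuration's slates with both measurements."""
    entropies = [
        category_entropy([category_of[a] for a in slate]) for slate in slates
    ]
    return {
        "mean_entropy_bits": sum(entropies) / len(entropies),
        "coverage": catalogue_coverage(slates, catalogue_size),
    }


# Invented fixtures: six articles, two readers per configuration.
category_of = {
    "a1": "sport", "a2": "sport", "a3": "sport",
    "a4": "politics", "a5": "culture", "a6": "sport",
}
slates_by_config = {
    "click_only": [["a1", "a2", "a3"], ["a1", "a2", "a6"]],
    "balanced": [["a1", "a4", "a5"], ["a2", "a4", "a6"]],
}
report = {
    name: evaluate(slates, category_of, catalogue_size=len(category_of))
    for name, slates in slates_by_config.items()
}
```

Placed next to a click-prediction score for each configuration, a report like this turns the tradeoff into something an editor can read: in these fixtures the click-only slates show only sport and reach four of six articles, while the balanced slates mix categories and reach five.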

This is where the platform becomes more useful than a single model benchmark. The benchmark asks, “did the recommender predict clicks?” The platform also asks, “what did that optimisation cost the newsroom?”

What this tutorial does not claim

ADR-0007 makes the model intentionally modest: content similarity over article text, with a cold-start path when the platform has no useful read history for a reader. That is enough to create plausible candidates for a teaching platform. It is not enough to claim deep recommender research.

That limitation is deliberate. If the platform argument only works when the model is impressive, the platform is not doing much work. The stronger claim is that even a simple model becomes more useful, inspectable, and governable when the surrounding platform exposes the right controls, contracts, and evidence.

The publisher context needs the same honesty. EB-NeRD is a strong primary dataset for this tutorial because it comes from Ekstra Bladet, which sits inside the broad JP/Politikens Hus media context. Adressa and MIND add comparison points. None of this amounts to production newsroom recommender experience, and the tutorial does not claim it. Nor does it overclaim JP/Politikens Hus domain depth.

The point is more practical: use public Scandinavian and news-recommendation datasets to make the learning credible, then put the real effort into the platform and product engineering judgment around the limitations. That is where the interesting work is. A stronger model would not automatically create a system an organisation can operate, question, and review. Learning to build a system that can be operated, questioned, and reviewed is exactly what this material is for.

Why cross-publisher data appears

EB-NeRD, Adressa, and MIND do not share exactly the same raw shape. Article identifiers differ. Category structures differ. Impression and click semantics differ. Sentiment fields are not always available. The tutorial keeps those differences visible at the staging boundary instead of pretending every publisher records the same world.

That is not just data plumbing. It is part of the accountability argument. A sentiment metric cannot pretend Adressa and MIND have EB-NeRD’s sentiment column. A dwell-time analysis cannot pretend every publisher records the same event. Graceful degradation is an accountability feature: the platform should show where comparison is possible and where it is not.
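
Graceful degradation can be as plain as a metric that reports "not available" instead of a number it cannot justify. The sketch below assumes staged rows are dictionaries and that sentiment lives in a field called sentiment_score; both are assumptions for illustration, and the rows are invented fixtures rather than real records from any of the three datasets.

```python
from typing import Mapping, Optional, Sequence


def mean_sentiment(rows: Sequence[Mapping]) -> Optional[float]:
    """Return the mean sentiment, or None when the source records none."""
    values = [
        row["sentiment_score"]
        for row in rows
        if row.get("sentiment_score") is not None
    ]
    if not values:
        return None  # Say "unknown" rather than report a misleading 0.0.
    return sum(values) / len(values)


def sentiment_report(datasets: Mapping[str, Sequence[Mapping]]) -> dict:
    """Report sentiment per dataset and flag where comparison is possible."""
    report = {}
    for name, rows in datasets.items():
        value = mean_sentiment(rows)
        report[name] = {"mean_sentiment": value, "comparable": value is not None}
    return report


# Invented fixtures: only the first source carries a sentiment field.
staged = {
    "eb_nerd": [
        {"article_id": "e1", "sentiment_score": 0.25},
        {"article_id": "e2", "sentiment_score": 0.75},
    ],
    "adressa": [{"article_id": "n1"}],
    "mind": [{"article_id": "m1"}],
}
sentiment = sentiment_report(staged)
```

The explicit comparable flag is the accountability feature: a downstream chart or query can show a gap where the data does not support a comparison, instead of silently plotting a default.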

The deeper argument

The deeper work in accountable recommenders is making editorial judgment legible, tunable, testable, and reviewable across the platform.

Legible means the system can be understood without one specialist knowing where every rule is hidden. Tunable means editors can adjust explicit constraints without changing model code. Testable means the ranker can be exercised as a pure transformation from candidate set and configuration to ranked list. Reviewable means the platform stores enough evidence for analysts and managers to reconstruct what happened later.

This is why the tutorial is built as a working platform rather than a detached essay. A reader can inspect the data entering through dlt, landing as Parquet, taking queryable shape through DuckDB and dbt, moving through FastAPI, and appearing in the TypeScript editor interface. The generated data reference makes the analytical contract visible. The generated API reference makes the app contract visible.

The most dangerous recommender failure in a media context is not always an obviously bad recommendation. Sometimes it is a locally successful system whose broader editorial effects nobody can see clearly. A platform-as-leverage approach does not solve that risk by declaring the model responsible. It solves it by giving the organisation surfaces where responsibility can be exercised: candidate generation with modest claims, ranking with explicit mixed enforcement, an editor interface with previewable settings, two contracts for two audiences, and evaluation artefacts that show what click optimisation costs.

That is the architecture this tutorial argues for, and the standard by which the rest of the work should be judged.

From thesis to receipts

ADR-0004 becomes operational in the accountability module README: each successful editor workflow mutation is recorded as an append-only editorial_changes row, and the deployed change log on the analyst surface lets a reviewer inspect who changed what and when. That receipt does not make the model more sophisticated. It makes the platform more accountable, which is the point of the thesis.
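
The idea of an append-only change record with a SQL-shaped review question can be sketched as follows. To keep the example runnable with the standard library alone, it uses Python's sqlite3 module as a stand-in for the tutorial's DuckDB-over-Parquet substrate, so the mechanics differ from the real platform. Apart from the table name editorial_changes, every column name, the trigger-based enforcement, and the two example rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE editorial_changes (
    change_id  INTEGER PRIMARY KEY,
    changed_at TEXT NOT NULL,
    changed_by TEXT NOT NULL,
    setting    TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT
);
-- Append-only in this sketch: updates and deletes are rejected.
CREATE TRIGGER editorial_changes_no_update
BEFORE UPDATE ON editorial_changes
BEGIN SELECT RAISE(ABORT, 'editorial_changes is append-only'); END;
CREATE TRIGGER editorial_changes_no_delete
BEFORE DELETE ON editorial_changes
BEGIN SELECT RAISE(ABORT, 'editorial_changes is append-only'); END;
""")


def record_change(changed_at, changed_by, setting, old_value, new_value):
    """Append one row describing a successful configuration change."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO editorial_changes "
            "(changed_at, changed_by, setting, old_value, new_value) "
            "VALUES (?, ?, ?, ?, ?)",
            (changed_at, changed_by, setting, old_value, new_value),
        )


# Invented example rows.
record_change("2024-01-02T09:00:00Z", "editor_a", "diversity_weight", "0.2", "0.5")
record_change("2024-01-02T14:30:00Z", "editor_b", "max_sensitive_per_slate", "3", "1")

# The reviewer's question: who changed what, and when?
change_log = conn.execute(
    "SELECT changed_at, changed_by, setting, old_value, new_value "
    "FROM editorial_changes ORDER BY changed_at"
).fetchall()
```

Whatever the storage engine, the property worth keeping is the same: a change is recorded once, it cannot be quietly rewritten, and a reviewer can reconstruct the sequence with an ordinary query.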