Lakehouse / Storage lessons

The lakehouse lessons walk through the local Parquet substrate, the billion-row scale harness, and the prose-only cloud-migration pathway. The code lives at tutorial/storage/.

L01 — Partitioned Parquet layout

The first lesson reads the harness source (src/storage_scale/harness.py) to understand the partitioning strategy: Parquet files are split by publisher and event date so DuckDB can skip irrelevant files at scan time. That partition layout is the only “architecture” the lakehouse needs: no external metadata service and no separate query-planning layer beyond DuckDB itself.
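To make the file-skipping idea concrete, here is a minimal sketch of partition pruning over directory names. It assumes a Hive-style key=value layout (publisher=.../event_date=...), which is the convention DuckDB's read_parquet can interpret through its hive_partitioning option. The actual directory names used by harness.py may differ, and the functions partition_path, parse_partition, and prune are hypothetical helpers written for illustration, not part of the harness.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, publisher: str, event_date: date) -> PurePosixPath:
    """Directory that would hold the Parquet files for one (publisher, day) pair."""
    return (
        PurePosixPath(root)
        / f"publisher={publisher}"
        / f"event_date={event_date.isoformat()}"
    )


def parse_partition(path: PurePosixPath) -> dict[str, str]:
    """Recover the partition keys from the key=value directory names in a path."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


def prune(
    files: list[PurePosixPath], publisher: str, start: date, end: date
) -> list[PurePosixPath]:
    """Keep only files whose directory names satisfy the filter.

    No file is opened: the decision uses the path alone, which is the
    reason a partitioned layout lets a scan skip irrelevant files.
    """
    kept = []
    for f in files:
        keys = parse_partition(f)
        day = date.fromisoformat(keys["event_date"])
        if keys["publisher"] == publisher and start <= day <= end:
            kept.append(f)
    return kept


# Illustrative layout: two publishers, three days, one file per partition.
files = [
    partition_path("data/events", pub, date(2023, 5, d)) / "part-0.parquet"
    for pub in ("alpha", "beta")
    for d in (1, 2, 3)
]
selected = prune(files, "alpha", date(2023, 5, 2), date(2023, 5, 3))
print([str(f) for f in selected])
```

With six files on disk, a filter on one publisher and two days leaves only two candidates to read. A query engine applies the same reasoning before it touches any Parquet data.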

L02 — Running the scale harness

Run the harness from the CLI:

uv run --package tutorial-storage python -m storage_scale.harness --target-rows 1_000_000_000

This generates a synthetic billion-row news-event dataset expanded from the EB-NeRD shape, lands it as partitioned Parquet, and runs representative analytical queries (counts, aggregations, candidate-set derivation) against the result. The harness reports peak memory, ingestion throughput in rows/sec, and query latency.
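The three reported metrics can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the harness's implementation: measure_ingest and synthetic_batches are hypothetical names, and tracemalloc only sees allocations made through Python's allocator, so a real harness that hands data to DuckDB or Arrow would more likely sample process-level memory instead.

```python
import time
import tracemalloc
from typing import Callable, Iterable, Iterator


def measure_ingest(batches: Iterable[list], sink: Callable[[list], None]) -> dict:
    """Feed batches to sink; report rows, elapsed seconds, rows/sec, peak memory."""
    tracemalloc.start()
    started = time.perf_counter()
    rows = 0
    for batch in batches:
        sink(batch)
        rows += len(batch)
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "rows": rows,
        "seconds": elapsed,
        "rows_per_sec": rows / elapsed if elapsed > 0 else float("inf"),
        "peak_mib": peak_bytes / (1024 * 1024),
    }


def synthetic_batches(total_rows: int, batch_size: int) -> Iterator[list]:
    """Yield lists of fake (event_id, publisher) tuples until total_rows is reached."""
    produced = 0
    while produced < total_rows:
        n = min(batch_size, total_rows - produced)
        yield [(produced + i, f"publisher-{(produced + i) % 4}") for i in range(n)]
        produced += n


# Small stand-in run: the sink just collects batches in memory.
written: list = []
stats = measure_ingest(synthetic_batches(10_000, 1_000), written.append)
print(stats)
```

Generating rows in batches keeps the generator's own footprint bounded, which matters once the target is a billion rows rather than ten thousand.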

L03 — Reading `scale-results.md`

The harness writes its findings to scale-results.md — checked into the repo so a reader can compare their hardware against the published numbers. The file includes:

Host hardware spec (CPU, RAM, disk, OS)
Rows-per-second ingestion
Peak memory during ingest and query
Query latency on at least two representative queries

The point is falsifiability. “DuckDB+Parquet is the analytical platform” becomes a set of numbers you can check on your own laptop.
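A report with the fields listed above could be assembled roughly as follows. This is a sketch, not the format the harness actually writes: host_spec and render_report are hypothetical, the section headings are invented for illustration, and the numbers passed in are placeholders rather than measurements. RAM and disk details are left out because the standard library has no portable way to read them.

```python
import os
import platform


def host_spec() -> dict:
    """Collect the host details that the standard library exposes portably."""
    return {
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
    }


def render_report(
    host: dict,
    ingest_rows_per_sec: float,
    peak_mib: float,
    query_latency_ms: dict,
) -> str:
    """Render the metrics as a small Markdown document."""
    lines = ["# Scale results", "", "## Host"]
    lines += [f"- {key}: {value}" for key, value in host.items()]
    lines += [
        "",
        "## Ingest",
        f"- rows/sec: {ingest_rows_per_sec:,.0f}",
        f"- peak memory (MiB): {peak_mib:,.1f}",
        "",
        "## Query latency (ms)",
    ]
    lines += [f"- {name}: {ms:.1f}" for name, ms in query_latency_ms.items()]
    return "\n".join(lines) + "\n"


# Placeholder values only, chosen to show the formatting.
report = render_report(
    host_spec(),
    ingest_rows_per_sec=1_234_567.0,
    peak_mib=512.5,
    query_latency_ms={"count_by_publisher": 12.3, "daily_aggregate": 45.6},
)
print(report)
```

Recording the host next to the numbers is what makes the comparison meaningful: a reader can see at a glance whether a slower result reflects their hardware or a regression.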

L04 — Cloud-migration pathway (prose only)

Three files document how the same code moves to cloud-backed Parquet through configuration changes alone:

object-storage-migration.md — the walkthrough.
cloud.dlt.example.toml — the dlt config diff (local path → S3 URL).
cloud.example.env — the env vars needed for the managed catalog connection.

Per ADR-0014, the lesson is explicitly configuration plus prose, with no deployment step. No AWS account is required to run any lesson in this tutorial.
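The "local path to S3 URL" idea can be illustrated with a small sketch in which the storage root comes from configuration and the rest of the code does not care which kind it is. The environment variable LAKEHOUSE_BUCKET_URL, the default path, and the helper names are all hypothetical. They are not taken from cloud.dlt.example.toml or cloud.example.env, whose actual keys should be read from those files.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

DEFAULT_LOCAL_ROOT = "data/lakehouse"  # hypothetical local default


def resolve_storage_root(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured storage root, falling back to the local path."""
    env = os.environ if env is None else env
    return env.get("LAKEHOUSE_BUCKET_URL", DEFAULT_LOCAL_ROOT)


def is_object_storage(root: str) -> bool:
    """True when the root is a URL with an object-storage scheme."""
    return urlparse(root).scheme in {"s3", "gs", "az"}


def events_glob(root: str) -> str:
    """Build the Parquet glob the same way for local and cloud roots."""
    return root.rstrip("/") + "/events/**/*.parquet"


# Local run: nothing is set, so the default path is used.
local_root = resolve_storage_root({})
print(local_root, is_object_storage(local_root), events_glob(local_root))

# Cloud configuration: only the configured value changes.
cloud_root = resolve_storage_root(
    {"LAKEHOUSE_BUCKET_URL": "s3://example-bucket/lakehouse/"}
)
print(cloud_root, is_object_storage(cloud_root), events_glob(cloud_root))
```

Because the glob is built identically in both cases, the switch from local disk to object storage stays a configuration diff rather than a code change, which is the property the three files describe.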

L05 — Tests

tests/test_scale_harness.py runs the harness against a much smaller target (a few thousand rows) and asserts the partition layout, file counts, and report shape. That keeps the lesson runnable in CI without a billion-row generation step every time.
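The shape of such a test might look like the sketch below. It does not reproduce tests/test_scale_harness.py: tiny_harness is a hypothetical stand-in that writes empty placeholder files instead of real Parquet, the key=value directory names are an assumption, and the report keys are invented. The test function takes a directory argument the way a pytest tmp_path fixture would supply one, and the last lines call it directly so the example also runs without pytest.

```python
import tempfile
from pathlib import Path


def tiny_harness(
    out_dir: Path, target_rows: int, publishers: list[str], days: list[str]
) -> dict:
    """Hypothetical stand-in for the harness: one placeholder file per partition."""
    partitions = len(publishers) * len(days)
    rows_per_partition = target_rows // partitions
    for pub in publishers:
        for day in days:
            part_dir = out_dir / f"publisher={pub}" / f"event_date={day}"
            part_dir.mkdir(parents=True)
            # Empty placeholder, not a real Parquet file.
            (part_dir / "part-0.parquet").write_bytes(b"")
    return {"rows": rows_per_partition * partitions, "partitions": partitions}


def test_small_target_layout(tmp_path: Path) -> None:
    """Assert partition layout, file count, and report shape on a small target."""
    report = tiny_harness(
        tmp_path, 4_000, ["alpha", "beta"], ["2023-05-01", "2023-05-02"]
    )
    files = sorted(tmp_path.rglob("*.parquet"))
    assert len(files) == 4
    assert all(f.parent.name.startswith("event_date=") for f in files)
    assert all(f.parent.parent.name.startswith("publisher=") for f in files)
    assert set(report) == {"rows", "partitions"}
    assert report["rows"] == 4_000


# Run the check directly against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    test_small_target_layout(Path(tmp))
```

The assertions target structure rather than speed, so the test stays stable across machines while the performance claims remain in the checked-in results file.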

After this module

The lakehouse substrate is what every downstream module reads. Transformation builds dbt models against this Parquet directory; modeling reads the article and embedding columns from it; serving opens DuckDB connections directly against it.