Lakehouse / Storage lessons

The lakehouse lessons walk through the local Parquet substrate, the billion-row scale harness, and the prose-only cloud-migration pathway. The code lives at tutorial/storage/.

The first lesson reads the harness source (src/storage_scale/harness.py) to understand the partitioning strategy: Parquet files are split by publisher and event date so DuckDB can skip irrelevant files at scan time. That partition layout is the only “architecture” the lakehouse needs: there is no external metadata service and no separate query-planning service.
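To make the file-skipping idea concrete, here is a minimal sketch of partition pruning in plain Python. It assumes a Hive-style directory layout (`publisher=<name>/event_date=<YYYY-MM-DD>/`), which the harness may or may not use verbatim; the root path, publisher names, and helper functions are all hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root, publisher, event_date, part=0):
    """Build root/publisher=<p>/event_date=<YYYY-MM-DD>/part-<n>.parquet (assumed layout)."""
    return (
        PurePosixPath(root)
        / f"publisher={publisher}"
        / f"event_date={event_date.isoformat()}"
        / f"part-{part}.parquet"
    )


def partition_values(path):
    """Recover the key=value partition columns from the directory names alone."""
    return dict(seg.split("=", 1) for seg in path.parts if "=" in seg)


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Placeholder publishers and dates, purely for illustration.
files = [
    partition_path("data/events", pub, date(2023, 5, day))
    for pub in ("alpha", "beta")
    for day in (1, 2, 3)
]
kept = prune(files, publisher="alpha", event_date="2023-05-02")
print(f"{len(kept)} of {len(files)} files need to be opened")
```

The filter never opens a file: the directory names carry enough information to discard five of the six paths. DuckDB applies the same kind of pruning when a query filters on partition columns and the files are read with `read_parquet(..., hive_partitioning = true)`.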

Run the harness from the CLI:

uv run --package tutorial-storage python -m storage_scale.harness --target-rows 1_000_000_000

This generates a synthetic billion-row news-event dataset expanded from the EB-NeRD shape, lands it as partitioned Parquet, and runs representative analytical queries (counts, aggregations, candidate-set derivation) against the result. The harness reports peak memory, ingestion throughput (rows/sec), and query latency.
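The three reported metrics can be captured with a small wrapper. The sketch below is not the harness's own instrumentation; `measure` and `fake_ingest` are hypothetical names, and `tracemalloc` only sees allocations made by the Python interpreter, so a real harness would more likely sample process-level memory to account for DuckDB's native allocations.

```python
import time
import tracemalloc


def measure(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def fake_ingest(n_rows):
    """Placeholder for ingestion: materialises n_rows tuples and returns the count."""
    rows = [(i, i % 7) for i in range(n_rows)]
    return len(rows)


n, seconds, peak = measure(fake_ingest, 100_000)
print(f"rows={n} rows/sec={n / seconds:,.0f} peak_mib={peak / 2**20:.1f}")
```

The same wrapper works for query latency: pass a function that runs one query and read the elapsed time from the second element of the tuple.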

The harness writes its findings to scale-results.md — checked into the repo so a reader can compare their hardware against the published numbers. The file includes:

  • Host hardware spec (CPU, RAM, disk, OS)
  • Rows-per-second ingestion
  • Peak memory during ingest and query
  • Query latency on at least two representative queries

The point is falsifiability. “DuckDB+Parquet is the analytical platform” becomes a number you can check against your own laptop.
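A report with the four parts listed above could be rendered along these lines. This is a sketch under assumptions: the section headings, the `render_report` and `host_spec` helpers, and the query names are hypothetical, and the numbers passed in are placeholders rather than measurements. The standard library has no portable call for RAM or disk details, so those fields are left out here.

```python
import os
import platform


def host_spec():
    """Collect the host details the standard library exposes portably."""
    return {
        "OS": f"{platform.system()} {platform.release()}",
        "CPU": platform.machine() or "unknown",
        "Logical cores": str(os.cpu_count() or "unknown"),
    }


def render_report(host, rows, ingest_seconds, peak_mib, query_ms):
    """Render host spec, ingestion throughput, peak memory, and query latency as Markdown."""
    lines = ["# Scale results", "", "## Host", ""]
    lines += [f"- {key}: {value}" for key, value in host.items()]
    lines += [
        "", "## Ingestion", "",
        f"- Rows: {rows:,}",
        f"- Rows/sec: {rows / ingest_seconds:,.0f}",
        f"- Peak memory (MiB): {peak_mib:.1f}",
        "", "## Query latency", "",
    ]
    lines += [f"- {name}: {ms:.1f} ms" for name, ms in query_ms.items()]
    return "\n".join(lines) + "\n"


# Placeholder inputs only; these are not published results.
report = render_report(
    host_spec(),
    rows=1_000,
    ingest_seconds=0.5,
    peak_mib=12.0,
    query_ms={"count_all": 1.0, "top_publishers": 2.5},
)
print(report)
```

Recording the host spec next to the numbers is what makes the comparison meaningful: throughput on one machine says little without knowing what it ran on.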

L04 — Cloud-migration pathway (prose only)

Three files document how the same code goes to cloud-backed Parquet without redeploying anything:

Per ADR-0014, the lesson is explicitly configuration and prose only, with no deployment step. No AWS account is required to run any lesson in this tutorial.
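One way a configuration-only switch can look is sketched below. It is an illustration of the idea, not the tutorial's actual mechanism: the `LAKEHOUSE_ROOT` variable, the default path, and the bucket name are all hypothetical.

```python
import os
from urllib.parse import urlsplit


def parquet_glob(env=os.environ):
    """Build the glob the reader scans; only the root changes between local and cloud."""
    # LAKEHOUSE_ROOT is a hypothetical setting name, not one defined by the tutorial.
    root = env.get("LAKEHOUSE_ROOT", "data/events").rstrip("/")
    return f"{root}/**/*.parquet"


def is_remote(uri):
    """True when the root carries a URL scheme such as s3://."""
    return urlsplit(uri).scheme != ""


local = parquet_glob({})
cloud = parquet_glob({"LAKEHOUSE_ROOT": "s3://example-bucket/events/"})
print(local, is_remote(local))
print(cloud, is_remote(cloud))
```

Because the query code only ever sees the resulting glob string, moving the data means changing one setting rather than shipping new code.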

tests/test_scale_harness.py runs the harness against a much smaller target (a few thousand rows) and asserts the partition layout, file counts, and report shape. That keeps the lesson runnable in CI without a billion-row generation step every time.
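The shape of such a test can be sketched without the harness itself. In the example below, `write_partitions` is a hypothetical stand-in for the real writer and emits small CSV files instead of Parquet so that it runs on the standard library alone; the layout and counts it checks are assumptions, not the assertions in tests/test_scale_harness.py.

```python
import csv
import tempfile
from pathlib import Path


def write_partitions(root, rows):
    """Hypothetical stand-in writer: one file per (publisher, event_date) pair."""
    groups = {}
    for publisher, event_date, value in rows:
        groups.setdefault((publisher, event_date), []).append(value)
    for (publisher, event_date), values in groups.items():
        part_dir = root / f"publisher={publisher}" / f"event_date={event_date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as fh:
            csv.writer(fh).writerows([v] for v in values)
    return len(groups)


def check_layout():
    """Write a few thousand rows, then assert layout, file count, and rows per file."""
    rows = [
        (pub, f"2023-05-0{day}", n)
        for pub in ("alpha", "beta")
        for day in (1, 2)
        for n in range(500)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        n_parts = write_partitions(root, rows)
        files = sorted(root.glob("publisher=*/event_date=*/part-*.csv"))
        assert n_parts == 4
        assert len(files) == 4
        assert all(len(f.read_text().splitlines()) == 500 for f in files)
        return [f.relative_to(root).as_posix() for f in files]


layout = check_layout()
print(layout)
```

Two thousand rows are enough to exercise every partition boundary, which is why a test like this stays fast while still catching layout regressions.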

The lakehouse substrate is what every downstream module reads. Transformation builds dbt models against this Parquet directory; modeling reads the article and embedding columns from it; serving opens DuckDB connections directly against it.