Lakehouse / Storage lessons
The lakehouse lessons walk through the local Parquet substrate, the
billion-row scale harness, and the prose-only cloud-migration pathway.
The code lives at
tutorial/storage/.
L01 — Partitioned Parquet layout
The first lesson reads the harness source
(src/storage_scale/harness.py)
to understand the partitioning strategy: Parquet files split by
publisher and event date so DuckDB can skip irrelevant files at scan
time. That partition layout is the only “architecture” the lakehouse
needs: no external metadata service and no separate query-planning service.
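To make the pruning idea concrete, here is a minimal pure-Python sketch of a Hive-style `key=value` directory layout and of how a filter on partition values narrows the set of files to read. DuckDB performs this kind of pruning itself when it scans a Hive-partitioned Parquet directory; the sketch only illustrates the principle. The root directory, the file name `part-0.parquet`, the publisher names, and the helper functions are all assumptions for illustration and are not taken from `harness.py`.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, publisher: str, event_date: date) -> PurePosixPath:
    """Build a Hive-style partition directory from key=value path segments."""
    return (
        PurePosixPath(root)
        / f"publisher={publisher}"
        / f"event_date={event_date.isoformat()}"
    )


def parse_partition(path: PurePosixPath) -> dict:
    """Recover partition values from the key=value segments of a file path."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


def prune(files: list, **wanted: str) -> list:
    """Keep only files whose partition values match every requested filter."""
    return [
        f
        for f in files
        if all(parse_partition(f).get(key) == value for key, value in wanted.items())
    ]


# Hypothetical publishers and dates, for illustration only.
files = [
    partition_path("data/events", pub, day) / "part-0.parquet"
    for pub in ("alpha", "beta")
    for day in (date(2024, 1, 1), date(2024, 1, 2))
]

# A filter on both partition keys leaves one of the four files to scan.
selected = prune(files, publisher="alpha", event_date="2024-01-02")
print(selected)
```

Because the partition values live in the path, the other three files can be ruled out without opening them, which is why the layout can stand in for a metadata service at this scale.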
L02 — Running the scale harness
The CLI:
uv run --package tutorial-storage python -m storage_scale.harness --target-rows 1_000_000_000
This generates a synthetic billion-row news-event dataset expanded from the EB-NeRD shape, lands it as partitioned Parquet, and runs representative analytical queries (counts, aggregations, candidate-set derivation) against the result. Memory peak, ingestion rows/sec, and query latency are reported.
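The three reported figures can be gathered with the standard library alone. The following is a rough sketch of one way to do it, not the harness's actual instrumentation: `measure` and `fake_ingest` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so memory used natively by an engine such as DuckDB would need a different probe.

```python
import time
import tracemalloc
from typing import Callable


def measure(label: str, work: Callable[[], int]) -> dict:
    """Run `work`, which returns a row count, and record time and peak memory."""
    tracemalloc.start()
    started = time.perf_counter()
    rows = work()
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "label": label,
        "rows": rows,
        "seconds": elapsed,
        "rows_per_sec": rows / elapsed if elapsed > 0 else float("inf"),
        "peak_mib": peak_bytes / (1024 * 1024),
    }


def fake_ingest(n: int = 100_000) -> int:
    """Stand-in for ingestion: materialise n synthetic rows and count them."""
    rows = [(i, i % 7) for i in range(n)]
    return len(rows)


result = measure("ingest", fake_ingest)
print(result)
```

The same wrapper can time a query by passing a callable that runs the query and returns the number of result rows; in that case `seconds` is the query latency.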
L03 — Reading scale-results.md
The harness writes its findings to
scale-results.md
— checked into the repo so a reader can compare their hardware against
the published numbers. The file includes:
- Host hardware spec (CPU, RAM, disk, OS)
- Rows-per-second ingestion
- Peak memory during ingest and query
- Query latency on at least two representative queries
The point is falsifiability. “DuckDB+Parquet is the analytical platform” becomes a number you can check against your own laptop.
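For readers who want to record their own numbers in a comparable shape, a report of this kind could be rendered roughly as follows. This is a sketch under assumptions: the headings, field names, and placeholder values are invented here and may not match the real scale-results.md. The standard library does not expose RAM or disk details portably, so those fields would have to be filled in by hand or by another tool.

```python
import os
import platform


def host_spec() -> dict:
    """Collect the host facts the standard library can report portably."""
    return {
        "OS": f"{platform.system()} {platform.release()}",
        "Machine": platform.machine(),
        "Logical CPUs": str(os.cpu_count()),
        "Python": platform.python_version(),
    }


def render_report(spec: dict, metrics: dict) -> str:
    """Render host spec and measurements as two Markdown bullet lists."""
    lines = ["# Scale results", "", "## Host"]
    lines += [f"- {key}: {value}" for key, value in spec.items()]
    lines += ["", "## Measurements"]
    lines += [f"- {key}: {value}" for key, value in metrics.items()]
    return "\n".join(lines) + "\n"


# Placeholder values only: replace them with figures from a real harness run.
metrics = {
    "Ingestion rows/sec": "<measured>",
    "Peak memory (MiB)": "<measured>",
    "Query latency, count (s)": "<measured>",
    "Query latency, aggregation (s)": "<measured>",
}

report = render_report(host_spec(), metrics)
print(report)
```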
L04 — Cloud-migration pathway (prose only)
Three files document how the same code goes to cloud-backed Parquet without redeploying anything:
- object-storage-migration.md: the walkthrough.
- cloud.dlt.example.toml: the dlt config diff (local path → S3 URL).
- cloud.example.env: the env vars needed for the managed catalog connection.
The lesson is explicitly configuration + prose, no deploy, per ADR-0014. No AWS account is required to run any lesson in this tutorial.
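The "same code, different configuration" idea can be illustrated with a small resolver that returns a local directory by default and an object-store URL when one is configured. This is only a sketch: the variable name `LAKEHOUSE_BUCKET_URL`, the default path, and the bucket name are hypothetical and are not taken from the example files above. Nothing here contacts a cloud service.

```python
from urllib.parse import urlparse


def resolve_storage_root(env: dict, default: str = "data/lakehouse") -> str:
    """Return the Parquet root: a local path unless an object-store URL is set.

    LAKEHOUSE_BUCKET_URL is a hypothetical variable name used for this sketch.
    """
    url = env.get("LAKEHOUSE_BUCKET_URL", "").strip()
    if not url:
        return default
    scheme = urlparse(url).scheme
    if scheme not in {"s3", "file"}:
        raise ValueError(f"unsupported storage scheme: {scheme!r}")
    # Normalise away a trailing slash so callers can append partition paths.
    return url.rstrip("/")


# With no variable set, the code keeps using the local directory.
local_root = resolve_storage_root({})

# With the variable set, the same call yields the bucket location instead.
cloud_root = resolve_storage_root(
    {"LAKEHOUSE_BUCKET_URL": "s3://example-bucket/lakehouse/"}
)
print(local_root, cloud_root)
```

In real use the mapping passed in would typically be `os.environ`; taking it as a parameter keeps the function easy to test.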
L05 — Tests
tests/test_scale_harness.py
runs the harness against a much smaller target (a few thousand rows) and
asserts the partition layout, file counts, and report shape. That keeps
the lesson runnable in CI without a billion-row generation step every
time.
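A layout assertion of the kind described might look like the sketch below. It is not the content of tests/test_scale_harness.py: the regular expression, the helper `layout_violations`, and the file name `part-0.parquet` are assumptions, and empty placeholder files stand in for real harness output so the example runs without generating any data.

```python
import re
import tempfile
from pathlib import Path

# Assumed layout: publisher=<name>/event_date=<YYYY-MM-DD>/<file>.parquet
PARTITION_RE = re.compile(
    r"publisher=[^/]+/event_date=\d{4}-\d{2}-\d{2}/[^/]+\.parquet"
)


def layout_violations(root: Path) -> list:
    """Return relative paths of Parquet files that break the partition layout."""
    bad = []
    for f in sorted(root.rglob("*.parquet")):
        rel = f.relative_to(root).as_posix()
        if not PARTITION_RE.fullmatch(rel):
            bad.append(rel)
    return bad


def test_partition_layout() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Empty placeholder files stand in for real harness output here.
        for pub in ("alpha", "beta"):
            part = root / f"publisher={pub}" / "event_date=2024-01-01"
            part.mkdir(parents=True)
            (part / "part-0.parquet").touch()
        assert layout_violations(root) == []
        assert len(list(root.rglob("*.parquet"))) == 2
        # A file outside the partition tree should be flagged.
        (root / "stray.parquet").touch()
        assert layout_violations(root) == ["stray.parquet"]


test_partition_layout()
```

Under pytest the final call would be dropped, since the runner discovers `test_partition_layout` on its own.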
After this module
The lakehouse substrate is what every downstream module reads. Transformation builds dbt models against this Parquet directory; modeling reads the article and embedding columns from it; serving opens DuckDB connections directly against it.