Lakehouse / Storage

The lakehouse module is where the platform proves it scales without abandoning local-first ergonomics. It is not a separate database; it is the substrate the rest of the platform sits on — Parquet files in a partition layout DuckDB can scan efficiently, with a runnable scale harness that loads and queries at billion-row volume on a developer laptop.

This module’s role is captured by two ADRs:

ADR-0016: DuckDB over Parquet is the analytical platform. Not a development database; the platform.
ADR-0014: Local-first scale story. Every lesson runs on a laptop. The cloud-migration pathway is documented but not deployed.

What the code looks like

The module is rooted under tutorial/storage/:

src/storage_scale/harness.py is the scale harness. It generates a billion-row synthetic news-event dataset expanded from the EB-NeRD shape, lands it as partitioned Parquet, and runs representative analytical queries (counts, aggregations, candidate-set derivation) against it.
scripts/run_scale_harness.py is the CLI entry point. It writes results to scale-results.md with explicit hardware spec, ingestion rows/sec, peak memory, and query latency numbers.
object-storage-migration.md, cloud.dlt.example.toml, and cloud.example.env document the difference between the local setup and S3-backed Parquet with a managed catalog. They are configuration only; no deployment is in scope.
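
The generate-then-land flow of the harness can be pictured as a batched generator. The sketch below is an illustration under stated assumptions, not the code in harness.py: the field names (event_id, publisher, event_date, article_id, read_time_s), the publisher list, the date range, and the batch size are all hypothetical, and it stops short of writing Parquet so that it runs with the standard library alone.

```python
import random
from datetime import date, timedelta
from itertools import islice
from typing import Iterable, Iterator

PUBLISHERS = ["pub_a", "pub_b", "pub_c"]  # hypothetical publisher ids
START_DATE = date(2023, 1, 1)             # hypothetical first event date


def synthetic_events(total_rows: int, seed: int = 0) -> Iterator[dict]:
    """Yield deterministic synthetic events one at a time (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so repeated runs produce identical data
    for i in range(total_rows):
        day = START_DATE + timedelta(days=rng.randrange(30))
        yield {
            "event_id": i,
            "publisher": rng.choice(PUBLISHERS),
            "event_date": day.isoformat(),
            "article_id": rng.randrange(10_000),
            "read_time_s": round(rng.expovariate(1 / 45), 1),
        }


def batches(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group a row stream into fixed-size lists so memory stays bounded."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


if __name__ == "__main__":
    # Ten rows in batches of four; a real run would use far larger numbers.
    for batch in batches(synthetic_events(total_rows=10), batch_size=4):
        print(len(batch), batch[0]["publisher"], batch[0]["event_date"])
```

Because only one batch is held in memory at a time, the same loop shape works whether the total is ten rows or a billion; each batch would be handed to whatever Parquet writer the harness actually uses.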

What “scale demonstrated” means here

The scale harness publishes concrete numbers, not aspirations. The scale-results.md file is checked into the repo so a reader can see, for the exact laptop the test ran on, how many rows per second landed, how much RAM the worst case used, and how long the representative queries took. That makes the claim “DuckDB+Parquet is the analytical platform” falsifiable rather than ceremonial.
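
To make the measurement idea concrete, here is a minimal sketch of timing a step and formatting it as a results-table row. It is an assumption about how such numbers could be collected, not the harness's own code: the helper names measure and to_markdown_row are hypothetical, and tracemalloc only sees Python-level allocations, so a real harness measuring DuckDB's native memory would need a process-level figure instead.

```python
import time
import tracemalloc
from typing import Callable


def measure(label: str, fn: Callable[[], int]) -> dict:
    """Run fn once and record duration, throughput, and peak Python memory.

    fn must return the number of rows it processed.
    """
    tracemalloc.start()
    started = time.perf_counter()
    rows = fn()
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "label": label,
        "rows": rows,
        "seconds": elapsed,
        "rows_per_sec": rows / elapsed if elapsed > 0 else float("inf"),
        "peak_mib": peak_bytes / 2**20,
    }


def to_markdown_row(result: dict) -> str:
    """Format one measurement as a Markdown table row."""
    return (
        f"| {result['label']} | {result['rows']:,} | {result['seconds']:.3f} | "
        f"{result['rows_per_sec']:,.0f} | {result['peak_mib']:.1f} |"
    )


if __name__ == "__main__":
    def fake_ingest() -> int:
        # Stand-in workload; a real step would land rows as Parquet.
        data = [i * 2 for i in range(200_000)]
        return len(data)

    print("| step | rows | seconds | rows/sec | peak MiB |")
    print("|---|---|---|---|---|")
    print(to_markdown_row(measure("ingest", fake_ingest)))
```

Recording the hardware spec next to rows like these is what lets a reader compare their own run against the checked-in figures.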

Partitioning is the only “architecture” you really need

The most important file in the module is the partitioning layout itself. The harness writes Parquet files partitioned by publisher and event date, which is what lets DuckDB skip irrelevant files at scan time without an external query planner or metadata service. If you read one thing from this module, read how the partitioning is chosen and how a single SQL query benefits from it.
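
The file-skipping effect described above can be sketched without any query engine. The example assumes a Hive-style key=value directory layout (publisher=…/event_date=…/), which the text does not confirm; the directory names, file names, and helper functions are hypothetical. It creates empty placeholder files and shows that a filter on the partition keys decides which files matter from the paths alone, before a single byte of Parquet is read.

```python
import tempfile
from pathlib import Path


def write_layout(root: Path, publishers: list[str], dates: list[str]) -> None:
    """Create empty placeholder files in an assumed Hive-style partition layout."""
    for pub in publishers:
        for day in dates:
            part = root / f"publisher={pub}" / f"event_date={day}"
            part.mkdir(parents=True)
            (part / "part-0.parquet").touch()


def partition_keys(path: Path, root: Path) -> dict:
    """Parse the key=value directory names between root and the file."""
    return dict(p.split("=", 1) for p in path.relative_to(root).parts[:-1])


def prune(root: Path, **wanted: str) -> list[Path]:
    """Keep only files whose partition keys match every requested value."""
    files = sorted(root.glob("*/*/*.parquet"))
    return [
        f for f in files
        if all(partition_keys(f, root).get(k) == v for k, v in wanted.items())
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_layout(root, ["pub_a", "pub_b"],
                     ["2023-01-01", "2023-01-02", "2023-01-03"])
        total = len(list(root.glob("*/*/*.parquet")))
        kept = prune(root, publisher="pub_a", event_date="2023-01-02")
        print(f"{len(kept)} of {total} files need scanning")
```

DuckDB applies the same idea when a query filters on partition columns. Under the same layout assumption, a query such as SELECT count(*) FROM read_parquet('events/*/*/*.parquet', hive_partitioning = true) WHERE publisher = 'pub_a' AND event_date = '2023-01-02' only needs to open the files in the matching directory.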

After this module

The storage substrate is what every other module relies on, but only this module owns it explicitly. Transformation materialises dbt models against the same Parquet directory the harness wrote. Modeling precomputes embeddings into the same substrate. Serving reads from it directly through DuckDB connections.

The cloud-migration lesson does not deploy. It demonstrates that swapping the local Parquet directory for an S3 URL is a dlt config diff plus a DuckDB connection string — not a rewrite.
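
On the query side, the swap can be pictured as changing one base URI while the SQL stays the same. This is a sketch under assumptions: the bucket name, local path, two-level partition glob, and function names are hypothetical, and reading s3:// paths from DuckDB additionally requires its httpfs extension and credentials, which are not shown here.

```python
LOCAL_BASE = "data/events"                 # hypothetical local Parquet directory
CLOUD_BASE = "s3://example-bucket/events"  # hypothetical S3 prefix


def events_glob(base_uri: str) -> str:
    """Build the scan glob for a two-level partition layout under base_uri."""
    return base_uri.rstrip("/") + "/*/*/*.parquet"


def count_events_sql(base_uri: str) -> str:
    """Return a DuckDB query that counts rows; only the base URI varies."""
    return (
        "SELECT count(*) FROM "
        f"read_parquet('{events_glob(base_uri)}', hive_partitioning = true)"
    )


if __name__ == "__main__":
    # The two statements differ only in the path literal.
    print(count_events_sql(LOCAL_BASE))
    print(count_events_sql(CLOUD_BASE))
```

Keeping the base URI in configuration rather than in the queries is what keeps the migration a diff instead of a rewrite.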