Lakehouse / Storage
The lakehouse module is where the platform proves it scales without abandoning local-first ergonomics. It is not a separate database; it is the substrate the rest of the platform sits on — Parquet files in a partition layout DuckDB can scan efficiently, with a runnable scale harness that loads and queries at billion-row volume on a developer laptop.
This module’s role is captured by two ADRs:
- ADR-0016: DuckDB over Parquet is the analytical platform. Not a development database; the platform.
- ADR-0014: Local-first scale story. Every lesson runs on a laptop. The cloud-migration pathway is documented but not deployed.
What the code looks like
The module is rooted under tutorial/storage/:
- src/storage_scale/harness.py is the scale harness. It generates a billion-row synthetic news-event dataset expanded from the EB-NeRD shape, lands it as partitioned Parquet, and runs representative analytical queries (counts, aggregations, candidate-set derivation) against it.
- scripts/run_scale_harness.py is the CLI entry point. It writes results to scale-results.md with explicit hardware spec, ingestion rows/sec, peak memory, and query latency numbers.
- object-storage-migration.md, cloud.dlt.example.toml, and cloud.example.env document the diff between local and S3-backed Parquet plus a managed catalog: configuration only, no deployment in scope.
What “scale demonstrated” means here
The scale harness publishes concrete numbers, not aspirations. The
scale-results.md
file is checked into the repo so a reader can see — on the exact laptop the
test ran on — how many rows per second landed, how much RAM the worst case
hit, and how long the representative queries took. That makes the claim
“DuckDB+Parquet is the analytical platform” falsifiable rather than
ceremonial.
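As a rough illustration of the kind of measurement described above, here is a minimal sketch that times one step, records peak memory, and renders a Markdown table. It is not the code in harness.py: the names `StepResult`, `measure`, and `to_markdown`, and the table layout, are hypothetical. It also assumes `tracemalloc` is an acceptable stand-in for peak memory, which is only true for Python-heap allocations; memory allocated natively by DuckDB would need a process-level measurement instead.

```python
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    rows: int
    seconds: float
    peak_mib: float

    @property
    def rows_per_sec(self) -> float:
        # Guard against a zero-length timing on very fast steps.
        return self.rows / self.seconds if self.seconds > 0 else float("inf")


def measure(name, fn, rows):
    """Run fn() once, recording wall-clock time and peak Python-heap memory."""
    tracemalloc.start()
    start = time.perf_counter()
    fn()
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return StepResult(name, rows, seconds, peak_bytes / 2**20)


def to_markdown(results):
    """Render results as a Markdown table (hypothetical layout)."""
    lines = [
        "| step | rows | seconds | rows/sec | peak MiB |",
        "| --- | ---: | ---: | ---: | ---: |",
    ]
    for r in results:
        lines.append(
            f"| {r.name} | {r.rows} | {r.seconds:.3f} "
            f"| {r.rows_per_sec:.0f} | {r.peak_mib:.1f} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in workload: build a list of synthetic rows in memory.
    n = 100_000
    result = measure("ingest", lambda: [(i, i % 7) for i in range(n)], n)
    print(to_markdown([result]))
```

Recording the hardware spec next to a table like this is what makes the numbers comparable across machines.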
Partitioning is the only “architecture” you really need
The most important file in the module is the partitioning layout itself.
The harness writes Parquet files partitioned by publisher and event date,
which is what lets DuckDB skip irrelevant files at scan time without an
external query planner or metadata service. If you read one thing from this
module, read how the partitioning is chosen and how a single SQL query
benefits from it.
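To make the file-skipping idea concrete, the sketch below mimics partition pruning in plain Python. It assumes a Hive-style `key=value` directory layout (`publisher=.../event_date=.../`), which the text does not confirm, and the helper names `partition_path`, `partition_values`, and `prune` are hypothetical. It illustrates the principle only; it is not how DuckDB implements pruning internally.

```python
from pathlib import PurePosixPath


def partition_path(root, publisher, event_date, part=0):
    """Build root/publisher=<p>/event_date=<d>/part-<n>.parquet."""
    return (
        PurePosixPath(root)
        / f"publisher={publisher}"
        / f"event_date={event_date}"
        / f"part-{part}.parquet"
    )


def partition_values(path):
    """Extract key=value directory segments from a path into a dict."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


def prune(paths, **filters):
    """Keep only files whose partition values match every filter."""
    kept = []
    for path in paths:
        values = partition_values(path)
        if all(values.get(k) == v for k, v in filters.items()):
            kept.append(path)
    return kept


if __name__ == "__main__":
    files = [
        partition_path("data/events", pub, day)
        for pub in ("alpha", "beta")
        for day in ("2023-05-01", "2023-05-02", "2023-05-03")
    ]
    # A filter on both partition columns is answered from paths alone,
    # before any Parquet file is opened.
    hits = prune(files, publisher="beta", event_date="2023-05-02")
    print(f"scanning {len(hits)} of {len(files)} files")
    for path in hits:
        print(path)
```

In DuckDB the same effect comes from reading the directory with `read_parquet(..., hive_partitioning = true)` and filtering on the partition columns in the `WHERE` clause; no separate catalog is involved.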
After this module
The storage substrate is what every other module relies on, but only this module owns it explicitly. Transformation materialises dbt models against the same Parquet directory the harness wrote. Modeling precomputes embeddings into the same substrate. Serving reads from it directly through DuckDB connections.
The cloud-migration lesson does not deploy. It demonstrates that swapping
the local Parquet directory for an S3 URL is a dlt config diff plus a
DuckDB connection string — not a rewrite.
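A small sketch of why the swap is a configuration change: if query text is built from a storage root, only that root differs between a local directory and an S3 URL. The environment variable `STORAGE_ROOT`, the bucket name, the two-level glob, and the function names are all hypothetical and are not taken from cloud.example.env or cloud.dlt.example.toml. Running the S3 variant in DuckDB would additionally need the httpfs extension and credentials, which this sketch does not configure.

```python
import os


def events_glob(storage_root):
    """Glob covering two partition levels under the root (assumed layout)."""
    return storage_root.rstrip("/") + "/*/*/*.parquet"


def daily_counts_sql(storage_root):
    """Return DuckDB SQL counting events per day; only the root varies."""
    return (
        "SELECT event_date, count(*) AS events "
        f"FROM read_parquet('{events_glob(storage_root)}', "
        "hive_partitioning = true) "
        "GROUP BY event_date ORDER BY event_date"
    )


if __name__ == "__main__":
    # STORAGE_ROOT is a hypothetical variable name used for illustration.
    root = os.environ.get("STORAGE_ROOT", "data/events")
    print(daily_counts_sql(root))
    # Same query text, different root.
    print(daily_counts_sql("s3://example-bucket/events"))
```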