For years, data teams were forced to make a choice: build on a data lake for flexibility and cost, or build on a data warehouse for reliability and performance. The data lakehouse pattern has emerged as the industry's answer to this false dichotomy — and it's fundamentally changing how teams architect their analytical systems.
In this article, we'll break down exactly what a lakehouse is, why the open table formats (Delta Lake, Apache Iceberg, and Apache Hudi) are the key enabling technology, and how to think about choosing between them for your organization.
The Problem the Lakehouse Solves
To understand the lakehouse, you need to understand why the previous options were both inadequate. Data lakes — essentially cheap object storage (S3, GCS, ADLS) filled with raw files — offer nearly unlimited scale and support any file format. But they have no ACID guarantees, no schema enforcement, poor query performance on large datasets, and make operations like updates and deletes extremely difficult.
Data warehouses (Snowflake, BigQuery, Redshift) solved those problems but introduced new ones: proprietary formats that lock your data in, high cost at scale, limited support for unstructured data, and friction when data scientists want to access raw data or train ML models.
"The lakehouse is not a compromise between lakes and warehouses. It's a new category that adds a transactional metadata layer on top of open object storage — giving you warehouse-grade reliability on lake-grade infrastructure."
The Three Open Table Formats
The lakehouse pattern is enabled by a metadata layer that sits between your raw files and your query engines. The three dominant formats all solve the same core problems — ACID transactions, schema evolution, time travel, and efficient reads — but they have different design philosophies and trade-offs.
Delta Lake
Delta Lake was created by Databricks and open-sourced in 2019. It stores metadata in a _delta_log directory of JSON files that form a transaction log. Every write operation appends a new entry to this log. On read, Delta constructs the current state of the table by replaying the log.
-- Creating a Delta table in Spark
CREATE TABLE orders
USING delta
LOCATION 's3://my-bucket/data/orders'
AS SELECT * FROM raw_orders;
-- Time travel: query data as of 7 days ago
SELECT * FROM orders
TIMESTAMP AS OF date_sub(current_date(), 7);
-- MERGE (upsert) operation
MERGE INTO orders AS target
USING new_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
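To make the log concrete, here is a minimal sketch that lists and decodes the most recent _delta_log entry directly from S3. It assumes boto3 credentials and reuses the hypothetical s3://my-bucket/data/orders location from the example above; real tables also interleave Parquet checkpoint files with the JSON entries.
# Sketch: peeking at the Delta transaction log (bucket and prefix are
# the hypothetical ones from the CREATE TABLE example)
import json
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/orders/_delta_log/")
log_files = sorted(obj["Key"] for obj in resp.get("Contents", [])
                   if obj["Key"].endswith(".json"))

# The highest-numbered file is the latest commit; each line is one action
latest = s3.get_object(Bucket="my-bucket", Key=log_files[-1])
for line in latest["Body"].read().decode("utf-8").splitlines():
    action = json.loads(line)
    print(next(iter(action)))  # e.g. commitInfo, metaData, add, remove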
Delta's strengths are its deep Spark integration, excellent performance optimizations (Z-ordering, data skipping, auto-optimize), and the maturity of the Databricks ecosystem around it. Its weakness historically has been interoperability — other engines had to support Delta explicitly.
Apache Iceberg
Iceberg, originally created at Netflix, takes a more ambitious approach. Its metadata is a tree structure: a catalog points to metadata files, which point to manifest lists, which point to manifests, which point to data files. This design enables operations that are slow or unsupported in Delta, most notably partition evolution and hidden partitioning.
# Iceberg with PyIceberg
from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import MonthTransform
from pyiceberg.types import StringType

catalog = load_catalog("glue", **{
    "type": "glue",
    "warehouse": "s3://my-bucket/warehouse",
})
table = catalog.load_table("analytics.orders")

# Schema evolution without rewriting data
with table.update_schema() as update:
    update.add_column("customer_tier", StringType())

# Partition evolution - change partitioning without rewriting
# (arguments are: source column, transform, new partition field name)
with table.update_spec() as update:
    update.add_field("order_date", MonthTransform(), "month")
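The tree is easy to see from PyIceberg itself. A minimal sketch, continuing from the catalog and table loaded above (the final scan additionally assumes pyarrow is installed):
# Walking the metadata tree for the table loaded above
snap = table.current_snapshot()
print(snap.snapshot_id, snap.summary)  # current snapshot and its stats
print(snap.manifest_list)              # object-store path to the manifest list

# Every historical snapshot remains addressable from the metadata file...
for old in table.metadata.snapshots:
    print(old.snapshot_id, old.timestamp_ms)

# ...which is why time travel is a pointer lookup, not a data copy
df = table.scan(snapshot_id=snap.snapshot_id).to_arrow()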
Iceberg's interoperability story is the best of the three — it's a true open standard with native support across Spark, Flink, Trino, Dremio, Athena, and BigQuery Omni. If multi-engine access is a requirement, Iceberg is often the right choice.
Apache Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) was created at Uber for a specific use case: efficiently updating and deleting records in a large dataset. It pioneered the concept of record-level upserts and incremental processing on top of object storage.
Hudi offers two storage types: Copy on Write (CoW), which rewrites the affected Parquet files in full on every write (optimized for reads), and Merge on Read (MoR), which appends row-level delta logs and merges them at query time (optimized for writes). This flexibility makes Hudi well suited to CDC (Change Data Capture) workloads where you need near-real-time data with frequent updates.
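A minimal PySpark sketch of a MoR upsert with the Hudi Spark datasource; the table name, key fields, and path are placeholders, updates_df stands in for a DataFrame of changed rows, and the Hudi Spark bundle is assumed to be on the classpath:
# Hudi Merge on Read upsert (illustrative names)
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
}

(updates_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/data/hudi/orders"))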
The short version: choose Delta Lake if your primary compute is Databricks/Spark and you want the deepest optimization; Iceberg if you need multi-engine access or plan to use Trino, Athena, or Flink heavily; and Hudi if your primary workload is CDC-based ingestion with frequent row-level updates.
What You Actually Gain
Beyond the marketing, here's what the lakehouse pattern delivers in practice:
ACID Transactions. Multiple writers can update a table concurrently without corruption. Failed writes don't leave partial data. This alone eliminates entire categories of data quality bugs that plague pure lake architectures.
Time Travel & Audit. Every version of your table is preserved in the transaction log. You can query data as it existed at any point in time, or roll back to a previous version after a bad write. This is invaluable for debugging and compliance.
Schema Evolution. Add columns, rename columns, change types — all without rewriting your data files. The table format handles the translation between old and new schemas transparently.
Efficient Updates and Deletes. Previously, updating a single row in a 500 GB Parquet dataset meant rewriting entire files or partitions. Open table formats track which files contain which records and rewrite only the affected files.
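To make a couple of these concrete, here is a minimal sketch of time travel, rollback, and a row-level delete against the Delta orders table from earlier (the version numbers and predicate are illustrative, and RESTORE requires Delta Lake 1.2+):
# Time travel, rollback, and a targeted delete on the orders table
spark.sql("SELECT * FROM orders VERSION AS OF 42")     # read an old version
spark.sql("RESTORE TABLE orders TO VERSION AS OF 42")  # undo a bad write
spark.sql("DELETE FROM orders WHERE order_id = 1001")  # rewrites only affected files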
The Architecture in Practice
A typical lakehouse architecture looks like this: raw data lands in object storage (usually in its original format — JSON, CSV, Avro, whatever the source system produces). An ingestion layer (Spark, Flink, or a purpose-built tool like Fivetran or Airbyte) reads this raw data and writes it into your open table format as a Bronze layer.
Transformation jobs (typically dbt on Spark or Trino) then read from Bronze and produce Silver (cleaned, validated, conformed) and Gold (aggregated, business-logic-applied) layers — all stored as the same open table format. BI tools, notebooks, and ML platforms then query the Gold layer through SQL engines like Trino or Athena.
# Example: Ingesting CDC data into an Iceberg Bronze layer with Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-bucket/warehouse") \
    .getOrCreate()

# Read from Kafka CDC stream
cdc_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "orders_cdc") \
    .load()

# Keep the raw payload but add an ingestion timestamp for lineage
bronze = cdc_stream \
    .select(col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp").alias("kafka_ts")) \
    .withColumn("ingested_at", current_timestamp())

# Append to the Bronze Iceberg table (assumes the table already exists)
bronze.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("path", "glue_catalog.bronze.orders_raw") \
    .option("checkpointLocation", "s3://bucket/checkpoints/orders") \
    .start()
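Downstream, the Bronze-to-Silver hop is just another job over the same tables. A minimal batch sketch (table and column names are illustrative and assume the Bronze payload has already been parsed into typed columns; the dedup rule stands in for real validation and conformance logic):
# Bronze -> Silver: keep only the latest CDC record per key
spark.sql("""
    MERGE INTO glue_catalog.silver.orders AS t
    USING (
        SELECT order_id, customer_id, amount, updated_at
        FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY order_id ORDER BY updated_at DESC) AS rn
            FROM glue_catalog.bronze.orders_raw
        ) ranked
        WHERE rn = 1
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")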
What to Watch Out For
The lakehouse is not without its challenges. Small file proliferation is a real operational concern — streaming workloads and frequent upserts generate many small files that degrade read performance over time. All three formats have compaction mechanisms, but you need to run them regularly (typically via scheduled Spark jobs or Databricks OPTIMIZE).
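Both maintenance paths amount to a one-liner in a scheduled job. A sketch reusing the Iceberg catalog from the ingestion example (the Delta line assumes Delta Lake 2.0+ or Databricks):
# Scheduled compaction for the tables used in this article
# Iceberg: rewrite small files via the built-in Spark procedure
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'bronze.orders_raw')")
# Delta: compact files and optionally co-locate frequently filtered columns
spark.sql("OPTIMIZE orders ZORDER BY (order_id)")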
Catalog management is another complexity: you need a catalog service (AWS Glue, Hive Metastore, Nessie, Unity Catalog) to store table metadata and enable cross-engine access. Getting this wrong causes hard-to-debug consistency issues.
Finally, don't underestimate the learning curve. Your team needs to understand transaction semantics, compaction, vacuum (cleaning up old files), and snapshot expiry. These are new operational concerns that didn't exist in a pure warehouse world.
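The cleanup side of that list looks like this in practice (retention values are illustrative; set them to match your time-travel and compliance requirements):
# Routine cleanup: reclaim storage held by old table versions
# Delta: delete files no longer referenced by the log (7-day retention)
spark.sql("VACUUM orders RETAIN 168 HOURS")
# Iceberg: expire old snapshots while always keeping the last 10
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'bronze.orders_raw',
        retain_last => 10)
""")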
Should You Migrate to a Lakehouse?
If you're running significant analytical workloads and face any of the following, the lakehouse pattern is worth serious consideration: your data warehouse costs are exploding with scale; you need to support ML workloads that require raw or semi-structured data; you're doing CDC and struggling with updates in a pure lake; or you need to give different teams different query engines over the same data.
If your team is small, your data volumes are manageable, and your workloads are primarily SQL analytics, a managed warehouse (Snowflake, BigQuery) may still be the right choice — the operational simplicity is worth the cost and lock-in trade-offs.
The lakehouse is not a silver bullet. It's a powerful architectural pattern that shifts complexity from the warehouse vendor to your own operations team. Make that trade consciously.