If you've spent any time in the modern data stack, you've probably heard the terms "Bronze," "Silver," and "Gold" thrown around. The Medallion architecture — popularized by Databricks — is a data design pattern that organizes data in a lakehouse into three progressive quality layers. It's elegant in concept, but the implementation details matter enormously.

This guide covers what each layer actually means, how to implement them with Delta Lake, and the common mistakes teams make when adopting the pattern.

Why the Medallion Pattern Exists

Before the lakehouse era, teams faced a familiar dilemma: data lakes gave you cheap storage and flexibility, but poor reliability. Data warehouses gave you reliability and query performance, but at steep cost with limited raw data access. The Medallion architecture bridges this gap by bringing warehouse-like structure (without the rigidity) to a lakehouse.

The Medallion architecture isn't a technology — it's a philosophy of progressive data refinement. The goal is to land data fast and clean it iteratively, not to land it perfectly.

The Three Layers

🥉 Bronze: The Raw Landing Zone

Bronze is your append-only, raw ingestion layer. Data arrives here in its original form — no transformations, no schema enforcement, no business logic. The only things you might add are ingestion metadata: an _ingested_at timestamp, a source system identifier, and maybe a row hash.

Key principles for Bronze:

  • Idempotent ingestion — re-running the ingestion job should not create duplicates
  • Schema-on-read — accept the data as-is, including malformed records
  • Immutability — once written, Bronze records are never modified
  • Full history — keep all versions, including records that will be filtered later

# Example: Bronze ingestion with Auto Loader (Databricks)
from pyspark.sql import functions as F

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
  .load("/mnt/raw/orders/")
  # ingestion metadata: a hash of the raw record, arrival time, and source file
  .withColumn("_row_hash", F.sha2(F.to_json(F.struct("*")), 256))
  .withColumn("_ingested_at", F.current_timestamp())
  .withColumn("_source_file", F.input_file_name())
  .writeStream
  .format("delta")
  # the checkpoint tracks which files were already processed, making re-runs idempotent
  .option("checkpointLocation", "/mnt/checkpoints/bronze_orders")
  .outputMode("append")
  .table("bronze.orders")
)

💡
Pro Tip: Schema Evolution. Let Auto Loader track new columns (cloudFiles.schemaEvolutionMode) and set mergeSchema on the Delta writer to handle schema evolution gracefully. When upstream systems add columns, your Bronze layer absorbs them without pipeline failures.
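
As a sketch, here is how those two knobs fit into the Bronze stream above (paths reused from that example; addNewColumns is Auto Loader's default mode when a schemaLocation is set):

# Sketch: schema evolution on the Bronze stream
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
  # record new upstream columns in the schema location; the stream stops
  # once when it sees a new column and picks it up on restart
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  .load("/mnt/raw/orders/")
  .writeStream
  .format("delta")
  # let the Delta writer add those new columns to bronze.orders
  .option("mergeSchema", "true")
  .option("checkpointLocation", "/mnt/checkpoints/bronze_orders")
  .table("bronze.orders")
)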

🥈 Silver: The Cleansed & Conformed Layer

Silver is where data earns its reliability. Here you apply:

  • Deduplication (using MERGE INTO for CDC patterns)
  • Data type casting and null handling
  • Business rule validation (reject or quarantine invalid records; a quarantine sketch follows the MERGE example below)
  • Light joins across Bronze sources (e.g., enriching orders with customer data)
  • Consistent naming conventions and column standardization

# Example: Bronze → Silver with MERGE (CDC pattern)
from delta.tables import DeltaTable
from pyspark.sql import functions as F

silver_orders = DeltaTable.forName(spark, "silver.orders")

# High-water mark: only process Bronze rows newer than what Silver already holds
# (for the very first load, when Silver is empty, skip this filter)
last_processed = spark.table("silver.orders").agg(F.max("_ingested_at")).collect()[0][0]

# Deduplicate and upsert
updates = (spark.table("bronze.orders")
  .filter(F.col("_ingested_at") > F.lit(last_processed))
  # keeps an arbitrary duplicate; use a row_number window over _ingested_at
  # if you need strict latest-record semantics
  .dropDuplicates(["order_id"])
  .withColumn("total_amount", F.col("quantity").cast("int") * F.col("unit_price").cast("decimal(10,2)"))
  .filter("order_id IS NOT NULL AND total_amount > 0")
)

silver_orders.alias("target").merge(
  updates.alias("source"),
  "target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
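
One way to implement the "quarantine" half of the validation bullet above, as a minimal sketch (silver.orders_quarantine is a hypothetical table, and the rules mirror the filter in the MERGE example):

# Sketch: split valid and invalid rows instead of silently dropping rejects
raw = (spark.table("bronze.orders")
  .dropDuplicates(["order_id"])
  .withColumn("total_amount",
              F.col("quantity").cast("int") * F.col("unit_price").cast("decimal(10,2)")))

# coalesce makes rows with a null total_amount (failed casts) count as invalid
is_valid = F.col("order_id").isNotNull() & (F.coalesce(F.col("total_amount"), F.lit(0)) > 0)

valid = raw.filter(is_valid)
invalid = (raw.filter(~is_valid)
  .withColumn("_rejected_at", F.current_timestamp())
  .withColumn("_rejection_reason",
              F.when(F.col("order_id").isNull(), F.lit("missing order_id"))
               .otherwise(F.lit("non-positive or unparseable total_amount"))))

invalid.write.format("delta").mode("append").saveAsTable("silver.orders_quarantine")
# `valid` then feeds the MERGE shown above in place of `updates`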

🥇 Gold: The Business-Ready Layer

Gold tables are purpose-built for consumption — by BI tools, data scientists, or application APIs. Each Gold table answers a specific business question or serves a specific domain.

Gold layer characteristics:

  • Aggregated and pre-joined — optimized for query performance
  • Business terminology — column names that non-technical stakeholders understand
  • Slowly Changing Dimensions (SCDs) handled explicitly (a Type 2 sketch follows the example below)
  • Documented and contractual — downstream consumers can rely on stability

# Example: Gold layer — Daily Revenue by Region
from datetime import date, timedelta

# recompute a rolling 7-day window; with replaceWhere, the written data
# must satisfy the predicate, so we filter the input to the same window
cutoff = (date.today() - timedelta(days=7)).isoformat()

(spark.table("silver.orders")
  .filter(F.col("order_timestamp") >= F.lit(cutoff))
  .join(spark.table("silver.customers"), "customer_id")
  .join(spark.table("silver.products"), "product_id")
  .groupBy(
    F.to_date("order_timestamp").alias("order_date"),
    "region",
    "product_category"
  )
  .agg(
    F.sum("total_amount").alias("gross_revenue"),
    F.countDistinct("order_id").alias("order_count"),
    F.countDistinct("customer_id").alias("unique_customers"),
    F.avg("total_amount").alias("avg_order_value")
  )
  .write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", f"order_date >= '{cutoff}'")
  .saveAsTable("gold.daily_revenue_by_region")
)
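
The SCD bullet above deserves its own example. Here is a minimal Type 2 sketch with Delta's MERGE, assuming a hypothetical gold.dim_customer table with string customer_id, region, is_current, valid_from, and valid_to columns (region stands in for whichever attributes you track):

# Sketch: SCD Type 2 upsert into a customer dimension
from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "gold.dim_customer")
changes = spark.table("silver.customers")

# customers whose tracked attribute changed: these must both close the
# current dimension row and insert a fresh one
changed = (changes.alias("s")
  .join(dim.toDF().alias("t"),
        (F.col("s.customer_id") == F.col("t.customer_id")) & F.col("t.is_current"))
  .where("s.region <> t.region")
  .select("s.*"))

# staged source: changed rows get a NULL merge key (forcing an insert of the
# new version), and every incoming row also appears under its real key so
# the existing current row can be matched and closed
staged = (changed.withColumn("merge_key", F.lit(None).cast("string"))
  .unionByName(changes.withColumn("merge_key", F.col("customer_id"))))

(dim.alias("t")
  .merge(staged.alias("s"), "t.customer_id = s.merge_key AND t.is_current = true")
  .whenMatchedUpdate(
    condition="t.region <> s.region",
    set={"is_current": "false", "valid_to": "current_timestamp()"})
  .whenNotMatchedInsert(values={
    "customer_id": "s.customer_id",
    "region": "s.region",
    "is_current": "true",
    "valid_from": "current_timestamp()",
    "valid_to": "CAST(NULL AS TIMESTAMP)"})
  .execute())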

Common Mistakes to Avoid

Mistake 1: Putting business logic in Bronze

Bronze should be a faithful copy of source data. The moment you start filtering, renaming, or transforming in Bronze, you lose the ability to replay from raw when business definitions change. And they always change.

Mistake 2: Too many Gold tables

Teams often end up with dozens of Gold tables, each slightly different, serving different consumers. This creates a "Gold spaghetti" problem. Use a semantic layer (Looker, dbt metrics, or Cube) to serve variations without duplicating Gold tables.

Mistake 3: Ignoring Silver data quality metrics

Silver is where quality is established. Without tracking rejection rates, null rates, and deduplication counts at this layer, you're flying blind. Build quality metrics into your Silver jobs and surface them in a data catalog or observability tool.
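
A minimal sketch of what that could look like, reusing the valid/invalid split from the quarantine example and a hypothetical silver.quality_metrics table:

# Sketch: emit per-run quality metrics from the Silver job
rows_in = raw.count()
rows_rejected = invalid.count()

metrics = (spark.createDataFrame(
    [("silver.orders", rows_in, rows_rejected,
      rows_rejected / rows_in if rows_in else 0.0)],
    "table_name STRING, rows_in LONG, rows_rejected LONG, rejection_rate DOUBLE")
  .withColumn("_run_at", F.current_timestamp()))

metrics.write.format("delta").mode("append").saveAsTable("silver.quality_metrics")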

⚠️
On the "Platinum" layer Some organizations add a "Platinum" layer for ML feature stores or specific high-SLA use cases. This is fine — the pattern is a guideline, not a law. Name your layers consistently and document them clearly.

Putting It Together: Orchestration

The Medallion pattern works best when each hop is orchestrated with explicit dependencies. Use Databricks Workflows, Apache Airflow, or Prefect to chain Bronze → Silver → Gold with proper error handling and alerting at each stage.

A healthy Medallion pipeline has: Bronze jobs running every 5–15 minutes for near-real-time freshness, Silver jobs running every 15–60 minutes with quality checks, and Gold jobs running hourly or on-demand, triggered by Silver completion.
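
As a sketch, here is what that chaining could look like in Apache Airflow 2.4+ (the task bodies are hypothetical placeholders; in Databricks Workflows the same dependencies are declared between job tasks):

# Sketch: Bronze → Silver → Gold as an Airflow DAG
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_layer(layer: str) -> None:
    # placeholder: submit the real Spark job for this layer and raise on failure
    print(f"running {layer} job")

with DAG(
    dag_id="medallion_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # the Gold cadence; faster Bronze/Silver schedules can live in their own DAGs
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="bronze_ingest", python_callable=run_layer, op_args=["bronze"])
    silver = PythonOperator(task_id="silver_merge", python_callable=run_layer, op_args=["silver"])
    gold = PythonOperator(task_id="gold_aggregates", python_callable=run_layer, op_args=["gold"])

    # explicit hop dependencies: Silver waits for Bronze, Gold waits for Silver
    bronze >> silver >> gold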

Conclusion

The Medallion architecture is powerful precisely because it's simple: land it raw, clean it progressively, serve it reliably. The implementation details — deduplication strategy, quality enforcement, orchestration — are where the craft lives. Get those right, and you have a data platform that teams can trust and build on.

In the next article, we'll look at how to implement data contracts between your Silver and Gold layers so downstream consumers are protected from breaking changes upstream.