dbt has become the de facto standard for the SQL transformation layer — and for good reason. But "using dbt" and "using dbt well" are two different things. Projects that start clean tend to develop entropy as teams grow. This guide documents the organizational patterns that scale.

Folder Structure: The Foundation

dbt's recommended folder structure maps to the layers of your data architecture:

models/
├── staging/          # 1:1 with source tables, light casting only
│   ├── salesforce/
│   │   ├── _salesforce__sources.yml
│   │   ├── _salesforce__models.yml
│   │   ├── stg_salesforce__accounts.sql
│   │   └── stg_salesforce__opportunities.sql
│   └── stripe/
│       ├── _stripe__sources.yml
│       └── stg_stripe__charges.sql
├── intermediate/     # Business logic, not exposed to BI
│   └── int_orders__joined.sql
├── marts/            # Business domain groupings
│   ├── finance/
│   │   ├── _finance__models.yml
│   │   ├── fct_orders.sql
│   │   └── dim_customers.sql
│   └── marketing/
│       ├── fct_campaigns.sql
│       └── dim_channels.sql
└── utilities/        # Date spines and other shared helper models (macros live in macros/)
💡 Naming convention: prefix model names with their type (stg_ for staging, int_ for intermediate, fct_ for fact tables, dim_ for dimensions), and include the source system in staging names: stg_salesforce__accounts (the double underscore separates source from entity).

Staging Models: The Right Level of Abstraction

Staging models should do exactly one thing: faithfully represent the source table with minimal, non-destructive transformations. That means:

-- stg_salesforce__opportunities.sql
-- GOOD: light transforms only
WITH source AS (
  SELECT * FROM {{ source('salesforce', 'opportunity') }}
),

renamed AS (
  SELECT
    id                          AS opportunity_id,
    account_id,
    name                        AS opportunity_name,
    amount                      AS opportunity_amount_usd,
    stagename                   AS stage_name,
    closedate                   AS close_date,
    isclosed                    AS is_closed,
    iswon                       AS is_won,
    createddate                 AS created_at,
    lastmodifieddate            AS updated_at
  FROM source
)

SELECT * FROM renamed
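
The source() call at the top of that model resolves against a source definition. A minimal sketch of what _salesforce__sources.yml might look like (the database and schema names are placeholders; point them at wherever your ingestion tool lands raw data):

# _salesforce__sources.yml
version: 2

sources:
  - name: salesforce
    database: raw                    # placeholder
    schema: salesforce               # placeholder
    tables:
      - name: account
      - name: opportunity
        loaded_at_field: lastmodifieddate
        freshness:
          warn_after: {count: 24, period: hour}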

Testing Strategy

dbt's built-in tests cover a lot of ground. A good baseline:

# _finance__models.yml
version: 2

models:
  - name: fct_orders
    description: "One row per order. Grain: order_id."
    columns:
      - name: order_id
        description: "Primary key from the orders source system."
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
      - name: total_amount_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Custom Tests for Business Logic

Built-in and package tests can't express every rule. For anything more specific, write a singular test: a SQL file in tests/ that passes only when the query returns zero rows.

-- tests/assert_orders_have_valid_totals.sql
-- This test should return zero rows to pass
SELECT order_id
FROM {{ ref('fct_orders') }}
WHERE total_amount_usd < 0
   OR (order_status = 'delivered' AND total_amount_usd = 0)

Materializations: Choosing Wisely

Materialization choices have major performance and cost implications. A sensible default mapping (with a project-level config sketch after the list):

  • Staging models: view (always fresh, no storage cost)
  • Intermediate models: ephemeral (inlined as CTEs, no table created) or view
  • Fact tables: incremental (append only new/changed rows)
  • Dimension tables: table (full refresh, usually fast enough)
  • Large historical tables: incremental with unique_key for upsert behavior
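
These defaults are easiest to set per folder in dbt_project.yml rather than model by model. A sketch, assuming the project is named my_project:

# dbt_project.yml (excerpt; project name is a placeholder)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table    # individual facts override this with incremental configs
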
-- fct_orders.sql (incremental example)
{{ config(
  materialized='incremental',
  unique_key='order_id',
  incremental_strategy='merge',
  cluster_by=['order_date', 'customer_id']
) }}

SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.total_amount_usd,
  o.updated_at,
  c.region,
  CURRENT_TIMESTAMP() AS dbt_updated_at
FROM {{ ref('stg_orders__orders') }} o
JOIN {{ ref('dim_customers') }} c USING (customer_id)

{% if is_incremental() %}
  -- Filter on the source's own updated_at high-water mark in the target;
  -- comparing against dbt_updated_at (a run timestamp) sets the bar too high
  -- and can silently skip rows that landed after the last run started
  WHERE o.updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Documentation That Stays Current

The hardest part of dbt documentation is keeping it accurate. Two practices that help:

1. Use doc blocks for shared definitions. Define key business terms once in docs/ and reference them across models:

-- docs/definitions.md
{% docs order_status %}
The current fulfillment status of the order. Values: pending (payment received, not yet 
picked), processing (in warehouse), shipped (with carrier), delivered (confirmed receipt), 
cancelled (order voided).
{% enddocs %}
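
Then reference the block wherever a description is needed; a sketch of the corresponding .yml entry:

# _finance__models.yml (excerpt)
models:
  - name: fct_orders
    columns:
      - name: order_status
        description: '{{ doc("order_status") }}'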

2. Run dbt docs generate in CI. Make documentation generation part of your deployment pipeline so the catalog stays in sync with your models, then publish the generated site (the contents of target/) somewhere the team can reach it, such as Netlify or an S3 static site.
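
A rough sketch of that step in GitHub Actions (the adapter, Python version, and publishing approach are assumptions; warehouse credentials and profiles.yml setup are omitted):

# .github/workflows/dbt-docs.yml (illustrative sketch)
name: dbt-docs
on:
  push:
    branches: [main]

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake        # assumption: swap in your adapter
      - run: dbt deps
      - run: dbt docs generate --target prod
      # target/ now contains index.html, catalog.json, and manifest.json;
      # add a step here to publish that directory to your static host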

Packages Worth Adding

  • dbt-utils — surrogate keys, date spines, generic tests
  • dbt-expectations — Great Expectations-style tests in YAML
  • dbt-audit-helper — compare model outputs between branches
  • dbt-meta-testing — enforce that all models have tests and descriptions
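
Installing them is a packages.yml entry plus dbt deps. The version ranges below are illustrative only; check hub.getdbt.com for current releases:

# packages.yml (version ranges are illustrative)
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: dbt-labs/audit_helper
    version: [">=0.12.0", "<0.13.0"]
  - package: tnightengale/dbt_meta_testing
    version: [">=0.3.0", "<0.4.0"]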

The "Definition of Done" for a dbt Model

A model isn't done until it has:

  • a grain statement in its description
  • a primary key with unique + not_null tests
  • at least one relationship test linking it to upstream models
  • accepted_values tests on all enum/status columns
  • a business owner documented in its .yml meta section

Encode this as a dbt-meta-testing check in CI and it enforces itself.
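
The business-owner requirement is just a meta entry on the model; a minimal sketch (the owner key name is a team convention, not something dbt enforces):

# _finance__models.yml (excerpt)
models:
  - name: fct_orders
    description: "One row per order. Grain: order_id."
    meta:
      owner: "finance-analytics"    # team or person accountable for this model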