The word "governance" makes data engineers reach for the exit. It conjures images of approval committees, endless policy documents, and bureaucratic processes that slow everything down. But ungoverned data has its own costs: nobody knows which numbers to trust, critical decisions get made on wrong figures, and compliance audits turn into all-hands emergencies.
The solution isn't less governance or more governance. It's right-sized governance — just enough structure to maintain trust, applied where it matters most, automated wherever possible.
The Minimal Viable Governance Stack
If you're starting from zero, focus on four things: data catalog, data ownership, access control, and change management. Everything else is optional until you're larger.
1. Data Catalog: Make Data Discoverable
Before you can govern data, people need to know it exists. A data catalog is a searchable inventory of your data assets — tables, columns, lineage, owners, and descriptions. Without this, governance is impossible because people don't know what they're supposed to be governing.
Start simple. Even a well-maintained README and a dbt docs site are better than nothing. Purpose-built catalogs (DataHub, Atlan, Alation, Collibra) add lineage tracking, business glossaries, and usage analytics, which become valuable at scale.
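To make "searchable inventory" concrete, here is a minimal in-memory sketch. The `CatalogEntry` fields and the `search` helper are illustrative assumptions, not the data model of any catalog product named above; the dataset names reuse the examples from this article.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in a minimal, in-memory catalog (hypothetical model)."""
    name: str
    owner: str
    description: str
    columns: list[str] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)  # lineage: tables this one is built from


def search(catalog: list[CatalogEntry], term: str) -> list[str]:
    """Return names of entries whose name, description, or columns mention the term."""
    needle = term.lower()
    hits = []
    for entry in catalog:
        haystack = " ".join([entry.name, entry.description, *entry.columns]).lower()
        if needle in haystack:
            hits.append(entry.name)
    return hits


catalog = [
    CatalogEntry(
        name="orders_daily",
        owner="data-platform-team",
        description="One row per order per day, used for finance reporting.",
        columns=["order_id", "order_date", "amount_usd"],
        upstream=["orders_raw"],
    ),
    CatalogEntry(
        name="user_events_raw",
        owner="product-analytics-team",
        description="Raw clickstream events.",
        columns=["user_id", "email", "ip_address", "event_type"],
    ),
]

print(search(catalog, "email"))
```

Even this much answers the two questions people ask most often: "does a table with this field exist?" and "who owns it?".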
2. Data Ownership: Assign Accountability
Every dataset needs a named owner — a specific person or team who is accountable for its quality, freshness, and documentation. Without explicit ownership, governance questions go unanswered and quality declines by default.
# Ownership manifest — track this in version control
datasets:
  - id: orders_daily
    owner: data-platform-team
    steward: jane.smith@company.com
    classification: business_critical
    consumers: [finance, analytics, ml-platform]
    sla:
      freshness_hours: 2
      availability_pct: 99.9
  - id: user_events_raw
    owner: product-analytics-team
    steward: bob.jones@company.com
    classification: pii_sensitive
    pii_fields: [user_id, email, ip_address]
    retention_days: 365
    access_policy: restricted
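A manifest like this only helps if something checks it. The sketch below validates entries that have already been loaded into Python dictionaries (for example, by a YAML parser). The list of required fields and the PII rule are assumptions chosen for illustration; adjust them to your own policy.

```python
# Assumed minimum: every dataset must name these fields.
REQUIRED_FIELDS = ("id", "owner", "steward", "classification")


def validate_manifest(datasets: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the manifest passes."""
    problems = []
    for ds in datasets:
        label = ds.get("id", "<missing id>")
        for key in REQUIRED_FIELDS:
            if not ds.get(key):
                problems.append(f"{label}: missing {key}")
        # Assumed rule: PII datasets must list their PII fields and a retention period.
        if ds.get("classification") == "pii_sensitive":
            if not ds.get("pii_fields"):
                problems.append(f"{label}: pii_sensitive but no pii_fields listed")
            if "retention_days" not in ds:
                problems.append(f"{label}: pii_sensitive but no retention_days set")
    return problems


manifest = [
    {
        "id": "orders_daily",
        "owner": "data-platform-team",
        "steward": "jane.smith@company.com",
        "classification": "business_critical",
    },
    {
        # Deliberately incomplete: no owner, no retention_days.
        "id": "user_events_raw",
        "steward": "bob.jones@company.com",
        "classification": "pii_sensitive",
        "pii_fields": ["user_id", "email", "ip_address"],
    },
]

for problem in validate_manifest(manifest):
    print(problem)
```

Running a check like this in CI means a dataset cannot be registered without an owner.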
3. Access Control: Protect What Needs Protection
Not all data needs the same access controls. Classify your datasets into tiers and apply access policies accordingly. A practical three-tier model:
- Open: Aggregated, anonymized business metrics. Accessible to all employees by default.
- Restricted: Contains PII, financial data, or customer behavior. Requires team-level access request.
- Confidential: M&A data, executive compensation, security logs. Access by exception only, time-limited.
Implement access control through your data platform's native RBAC, not manually. Every data access decision that requires a Jira ticket and a 3-day wait creates shadow IT — people find workarounds rather than going through proper channels.
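The three tiers can be expressed as a small policy function. This is a sketch of the decision logic only, under the assumption that team membership and time-limited exceptions are tracked somewhere; in practice the rules would be encoded as roles and grants in your platform's RBAC rather than in application code. The names `Tier` and `can_read` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Tier(Enum):
    OPEN = "open"
    RESTRICTED = "restricted"
    CONFIDENTIAL = "confidential"


def can_read(
    tier: Tier,
    user_teams: set[str],
    granted_teams: set[str],
    exception_expires_at: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> bool:
    """Decide read access for one user and one dataset under the three-tier model."""
    now = now or datetime.now(timezone.utc)
    if tier is Tier.OPEN:
        # Open data is readable by every employee by default.
        return True
    if tier is Tier.RESTRICTED:
        # Restricted data requires membership in a team that was granted access.
        return bool(user_teams & granted_teams)
    # Confidential data: only an explicit exception that has not yet expired.
    return exception_expires_at is not None and now < exception_expires_at


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(can_read(Tier.RESTRICTED, {"finance"}, {"finance", "analytics"}, now=now))
print(can_read(Tier.CONFIDENTIAL, {"finance"}, set(), now + timedelta(days=1), now=now))
```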
4. Change Management: Prevent Breaking Changes
The most common governance failure isn't malicious — it's a well-intentioned engineer renaming a column without realizing that 12 downstream dashboards depend on it. Implement lightweight change management for high-impact datasets.
# Schema change detection in CI/CD (GitHub Actions)
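name: Schema Change Check
on: [pull_request]
jobs:
  schema-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so origin/main exists for the diff
      - name: Detect breaking schema changes
        id: schema
        run: |
          # Compare schema against registered data contracts.
          # Assumption: the script writes "breaking=true" or "breaking=false" to $GITHUB_OUTPUT.
          python scripts/check_schema_changes.py \
            --contract-dir contracts/ \
            --changed-models "$(git diff --name-only origin/main...HEAD)"
      - name: Block if breaking change without deprecation notice
        if: steps.schema.outputs.breaking == 'true'
        run: |
          # Assumption: a deprecation notice is a file changed under deprecations/ in the same PR.
          notices="$(git diff --name-only origin/main...HEAD -- deprecations/)"
          if [ -z "$notices" ]; then
            echo "Breaking schema change detected."
            echo "File a deprecation notice and notify consumers."
            exit 1
          fi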
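The workflow calls scripts/check_schema_changes.py without showing it. The core comparison such a script might perform can be sketched as follows, assuming each data contract reduces to a mapping of column name to type. The function name and the rule that added columns are non-breaking are assumptions for illustration, not a description of any particular tool.

```python
def find_breaking_changes(contract: dict[str, str], current: dict[str, str]) -> list[str]:
    """Compare a contract's column -> type mapping against the current schema.

    Removed or renamed columns and changed types are treated as breaking.
    Columns that exist only in the current schema are treated as safe additions.
    """
    breaking = []
    for column, expected_type in contract.items():
        if column not in current:
            breaking.append(f"column removed or renamed: {column}")
        elif current[column] != expected_type:
            breaking.append(f"type changed: {column} {expected_type} -> {current[column]}")
    return breaking


contract = {"order_id": "bigint", "amount_usd": "numeric", "order_date": "date"}
current = {
    "order_id": "bigint",
    "amount": "numeric",        # renamed from amount_usd
    "order_date": "timestamp",  # type changed
    "channel": "text",          # new column, non-breaking
}

for change in find_breaking_changes(contract, current):
    print(change)
```

A rename shows up as a removal because the contract has no way of knowing that the new column is the old one under another name, which is exactly the case downstream dashboards cannot survive.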
The Business Glossary: Solving the "One Number" Problem
Every organization has a version of this problem: Finance says revenue was $12.4M last quarter. Sales says it was $13.1M. Both are technically correct — they're just measuring different things with the same word.
A business glossary defines canonical terms with agreed-upon business logic. Not every term, just the ones that matter most — the ones that appear in executive presentations, that multiple teams measure independently, or that have caused confusion in the past.
# Business glossary entry example
term: Monthly Recurring Revenue (MRR)
alias: [mrr, monthly_revenue, subscription_revenue]
definition: >
  The total normalized monthly subscription revenue from active paying
  customers. Includes only recurring subscription components.
  EXCLUDES: one-time setup fees, professional services, usage overages,
  trials (even paid trials in first 30 days), and churned customers.
  CALCULATION: SUM(subscription_amount / billing_period_months)
  for all active subscriptions as of the last day of the month.
canonical_table: marts.finance.fct_mrr_monthly
canonical_column: mrr_amount_usd
owner: finance-analytics
approved_by: [CFO, Head of Finance, Head of Data]
last_reviewed: 2025-03-01
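A worked version of the CALCULATION line shows how precise a definition has to be before two teams get the same number. This is a sketch under stated assumptions: the `Subscription` record and its fields are hypothetical, a subscription whose end date falls on or before the last day of the month is treated as churned, and trials are marked with a simple flag rather than derived from the 30-day rule.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class Subscription:
    subscription_amount: Decimal   # amount billed per billing period
    billing_period_months: int     # 1 = monthly, 12 = annual
    start_date: date
    end_date: Optional[date] = None  # None means the subscription is still running
    is_trial: bool = False           # assumed flag; trials are excluded from MRR


def mrr(subscriptions: list[Subscription], month_end: date) -> Decimal:
    """SUM(subscription_amount / billing_period_months) over active, non-trial subscriptions."""
    total = Decimal("0")
    for sub in subscriptions:
        active = sub.start_date <= month_end and (
            sub.end_date is None or sub.end_date > month_end
        )
        if active and not sub.is_trial:
            total += sub.subscription_amount / sub.billing_period_months
    return total


subs = [
    Subscription(Decimal("1200"), 12, date(2025, 1, 15)),                   # annual: 100 per month
    Subscription(Decimal("50"), 1, date(2024, 11, 1)),                      # monthly: 50 per month
    Subscription(Decimal("30"), 1, date(2025, 2, 20), is_trial=True),       # trial: excluded
    Subscription(Decimal("80"), 1, date(2024, 6, 1), date(2025, 2, 10)),    # churned: excluded
]

print(mrr(subs, date(2025, 2, 28)))
```

Each assumption in the code (what counts as churned on the boundary day, how a trial is identified) is a place where Finance and Sales could diverge, which is why the glossary entry spells out exclusions explicitly.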
Governance That Works in Practice
The governance programs that succeed share a few characteristics. They are automated first — manual processes rot. Schema validation in CI/CD beats a "please check with the data team" Slack reminder. They are enforced at the platform level, not by asking people nicely. And they are proportional to risk — a table used by one analyst for exploratory work needs different governance than the revenue table that feeds the board deck.
- Week 1: Inventory your top 20 datasets and assign owners.
- Week 2: Classify each as Open/Restricted/Confidential and set access policies.
- Week 3: Document the top 10 most-contested business terms in a glossary.
- Week 4: Add schema change detection to CI/CD for your 5 most critical datasets.
That's a governance program you can actually maintain.
Governance is not a project you complete. It's a practice you maintain. Start small, make it easy to comply, and expand as your organization grows. The goal isn't perfect governance — it's enough governance that people trust the data and can find what they need.