Kafka is everywhere in data engineering. It's in every modern data stack diagram, every job description, and every architecture review. But most tutorials show you a docker-compose file and call it production-ready. Here's what actually matters when you're running Kafka at scale.
Understanding Kafka's Core Model
Before touching configuration, internalize the data model: Kafka is a distributed, partitioned, replicated commit log. Producers write records to topics. Topics are divided into partitions. Partitions are replicated across brokers. Consumers read from partitions, and their position (offset) is tracked per-consumer-group.
Everything else — throughput, ordering guarantees, consumer lag, replication — flows from this model. Understanding it deeply lets you reason about behavior without consulting the docs for every issue.
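A minimal consumer sketch makes the model concrete; the topic and group names ('orders', 'orders-etl') are illustrative:

# Minimal consumer sketch: offsets are tracked per consumer group
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'broker1:9092',
    'group.id': 'orders-etl',          # offset commits are scoped to this group
    'auto.offset.reset': 'earliest',   # where to start with no committed offset
})
consumer.subscribe(['orders'])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        print(f'partition={msg.partition()} offset={msg.offset()} key={msg.key()}')
finally:
    consumer.close()   # Commits final offsets and leaves the group cleanly

With the model in hand, topic design is the first concrete decision: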
# Topic design: partitions determine max consumer parallelism
# Rule of thumb: partitions = target_throughput / throughput_per_partition
# Measure your throughput_per_partition in staging first
kafka-topics.sh --create \
  --bootstrap-server broker:9092 \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config min.insync.replicas=2 \
  --config compression.type=lz4
# retention.ms=604800000: 7 days
# min.insync.replicas=2: with acks=all, writes survive one replica failure
# compression.type=lz4: fast codec; the compression ratio depends on your data
Producer Configuration for Reliability
The default producer settings are optimized for throughput, not reliability. For production data pipelines, you need to be explicit about your durability guarantees.
# Python producer with reliability settings
from confluent_kafka import Producer
import json

producer = Producer({
    'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092',
    # Reliability: wait for all in-sync replicas to acknowledge
    'acks': 'all',
    # Idempotent producer: retries cannot create duplicates (Kafka 0.11+).
    # Full exactly-once semantics additionally requires transactions.
    'enable.idempotence': True,
    # Retry on transient errors; with idempotence enabled, up to 5
    # in-flight requests still preserve per-partition ordering
    'retries': 10,
    'max.in.flight.requests.per.connection': 5,
    # Compression: reduces network + storage cost significantly
    'compression.type': 'lz4',
    # Batching: increases throughput, adds small latency
    'linger.ms': 5,
    'batch.size': 65536,  # 64KB
})
def delivery_callback(err, msg):
    if err:
        print(f'Delivery failed for key {msg.key()}: {err}')
        # Log to your observability system

order = {'customer_id': 'c-1042', 'total': 99.90}  # example record

producer.produce(
    topic='orders',
    key=order['customer_id'].encode(),  # Key determines partition
    value=json.dumps(order).encode(),
    callback=delivery_callback,
)
producer.poll(0)   # Serve delivery callbacks for completed sends
producer.flush()   # On shutdown: block until buffered records are delivered
The Operational Realities
Consumer lag is your most important metric. Set up alerts for when any consumer group's lag exceeds your latency SLA; Burrow or the Confluent metrics reporter can track it. Lag that grows without bound means your consumers can't keep up with your producers: add consumer instances (up to the partition count), increase the partition count, or optimize your processing logic.
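For a scriptable spot-check outside those tools, a sketch like this computes per-partition lag with confluent-kafka-python; the group and topic names are assumptions carried over from the earlier examples:

# Sketch: per-partition lag for a consumer group, without joining it
from confluent_kafka import Consumer, TopicPartition

c = Consumer({
    'bootstrap.servers': 'broker1:9092',
    'group.id': 'orders-etl',        # the group whose lag we inspect
    'enable.auto.commit': False,
})
partitions = [TopicPartition('orders', p) for p in range(12)]
for tp in c.committed(partitions, timeout=10):
    lo, hi = c.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else lo  # negative = no commit yet
    print(f'partition {tp.partition}: lag={hi - committed}')
c.close()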
Log retention is a cost lever. Kafka stores data on disk until the retention period expires. A topic ingesting 10MB/s in aggregate with 7-day retention uses about 18TB of disk with 3x replication (10MB/s × 604,800s ≈ 6TB, × 3 replicas), before accounting for compression. Model this before going to production.
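The arithmetic is simple enough to keep in runnable form next to your topic configs; a minimal sketch using the numbers from the paragraph above:

# Back-of-envelope retention sizing
ingest_mb_per_s = 10
retention_s = 7 * 24 * 3600            # 604,800 s
replication_factor = 3
total_tb = ingest_mb_per_s * retention_s * replication_factor / 1_000_000
print(f'{total_tb:.1f} TB')            # ~18.1 TB, before compression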
Schema Registry is non-negotiable for team use. When multiple teams produce and consume from the same topics, you need a schema registry (Confluent Schema Registry or AWS Glue Schema Registry) to stop producers from publishing incompatible schema changes that silently break downstream consumers.
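One way this looks in confluent-kafka-python is a sketch like the following; the registry URL, schema, and record values are illustrative. The registry rejects incompatible schema changes at registration time:

# Sketch: Avro producing against Confluent Schema Registry
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{"type": "record", "name": "Order", "fields": [
  {"name": "customer_id", "type": "string"},
  {"name": "total", "type": "double"}]}
"""
registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})

producer = SerializingProducer({
    'bootstrap.servers': 'broker1:9092',
    'value.serializer': AvroSerializer(registry, schema_str),
})
producer.produce(topic='orders', key='c-1042',
                 value={'customer_id': 'c-1042', 'total': 99.90})
producer.flush()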
You cannot decrease a topic's partition count once it has data; the only path is creating a new topic and migrating. And increasing partitions changes the key-to-partition mapping, breaking key-based ordering for existing keys. Plan your partition counts carefully upfront: too few limits throughput; too many wastes resources and increases rebalance time. Start with a rough calculation (see the sketch below) and validate with load testing before production.
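A runnable version of the rule of thumb from the topic-creation snippet, with illustrative throughput numbers:

import math

# Rule of thumb: partitions = target_throughput / throughput_per_partition;
# both numbers are illustrative, measure per-partition throughput in staging
target_mb_per_s = 120
per_partition_mb_per_s = 10
print(math.ceil(target_mb_per_s / per_partition_mb_per_s))  # 12 partitions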
Kafka Connect: Getting Data In and Out
Kafka Connect is the missing piece most tutorials skip. It's a framework for scalable, reliable data pipelines between Kafka and external systems — databases, object stores, search engines, SaaS APIs.
# Debezium connector for PostgreSQL CDC
curl -X POST http://connect:8083/connectors -H 'Content-Type: application/json' -d '{
  "name": "postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "${file:/secrets/postgres.properties:password}",
    "database.dbname": "production",
    "database.server.name": "prod_pg",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "publication.name": "dbz_publication",
    "slot.name": "debezium",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}'
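Once registered, the connector's health is visible through Connect's REST status endpoint; a quick check might look like this (the connector name matches the example above):

# Sketch: check connector and task state via the Connect REST API
import requests

status = requests.get('http://connect:8083/connectors/postgres-cdc/status').json()
print(status['connector']['state'])   # RUNNING, PAUSED, or FAILED
for task in status['tasks']:
    print(f"task {task['id']}: {task['state']}")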
Running Kafka at scale is genuinely hard. But the combination of Kafka + Connect + Schema Registry gives you a data streaming backbone that scales to throughput far beyond what most teams will ever need. Build it right from the start.