Spark performance tuning has a reputation for being a dark art. Too much of the advice online is either outdated (written for Spark 2.x) or decontextualized (works in benchmarks, fails in production). This guide focuses on the optimizations that consistently make a difference on real workloads — with the reasoning behind each one.

1. Understand Your Partitions First

Almost every Spark performance problem is a partition problem. Too few partitions and you're underutilizing your cluster. Too many and you're drowning in scheduling overhead and small files.

# Check current partition count and sizes
df.rdd.getNumPartitions()

# Target: 128-256MB per partition
# Rule of thumb: cores_in_cluster * 2-3 for batch jobs
spark.conf.set("spark.sql.shuffle.partitions", "200")  # default, often wrong

# Better: let AQE right-size shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")

2. Enable Adaptive Query Execution (AQE)

AQE is Spark 3's biggest performance gift. It rewrites query plans at runtime based on actual partition statistics — handling skew, coalescing small partitions, and switching join strategies automatically.

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

AQE is almost always a win. Enable it by default in your Spark configs. The only cases where you might disable it: highly predictable workloads where plan stability matters more than plan quality, or debugging situations where you need deterministic plans.
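
To confirm AQE actually kicked in, check the physical plan: with AQE on, it is wrapped in an AdaptiveSparkPlan node, and once the query has run, explain() also shows the re-optimized final plan. A quick check on any DataFrame (some_column is a placeholder):

df.groupBy("some_column").count().explain()  # look for AdaptiveSparkPlan at the top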

3. Broadcast Joins for Small Tables

A broadcast join sends a small table to every executor, eliminating the shuffle for the large table. Spark auto-broadcasts tables under spark.sql.autoBroadcastJoinThreshold (default 10MB), but you often want to be explicit:

from pyspark.sql.functions import broadcast

# Explicit broadcast hint
result = large_orders.join(
  broadcast(small_product_lookup),
  "product_id"
)

# Or raise the auto-broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50MB")

4. Avoid Wide Transformations Where Possible

Every wide transformation (groupBy, join, repartition, distinct) forces a shuffle, and shuffles are expensive. Reduce them by filtering before joining to cut data volume, pushing predicates as early as possible, and using reduceByKey instead of groupByKey when working at the RDD level.
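
The RDD-level advice deserves a concrete example. Both results below are per-key sums, but groupByKey ships every raw value across the network, while reduceByKey pre-aggregates within each partition and shuffles only partial sums. A minimal sketch, assuming sc is your SparkContext:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Shuffles every (key, value) pair, then sums on the reduce side
sums_slow = pairs.groupByKey().mapValues(sum)

# Combines values map-side first, so far less data crosses the network
sums_fast = pairs.reduceByKey(lambda x, y: x + y)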

5. Persist/Cache Strategically

from pyspark import StorageLevel

# For DataFrames reused multiple times in the same job
df_cleaned = (
  raw_df
  .filter(...)
  .withColumn(...)
).persist(StorageLevel.MEMORY_AND_DISK)  # PySpark always stores serialized; spills to disk under pressure

# persist() is lazy; caching happens on the first action.
# Always unpersist when done to free executor memory
df_cleaned.unpersist()

6. Partition Pruning with Columnar Formats

Using Parquet or Delta Lake with proper partition columns is the single most impactful "optimization" for reducing data scanned. Partition on columns you filter by (date, region, status). Never partition on high-cardinality columns such as user IDs or UUIDs; you'll end up with millions of tiny files and slow directory listings.

# Write with partitioning
df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

# Read will automatically prune irrelevant partitions
spark.read.parquet("/path/to/output").filter("year=2025 AND month=3")
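
To confirm pruning happened, check the physical plan: the file scan lists partition filters separately from ordinary data filters, and only matching directories are read. Exact plan text varies by version, but roughly:

# The scan node should show the pruned columns under PartitionFilters
spark.read.parquet("/path/to/output") \
  .filter("year = 2025 AND month = 3") \
  .explain()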

7. Handle Data Skew Explicitly

Skewed data — where a few partition keys have vastly more records than others — causes a single executor to do most of the work while the rest sit idle. When AQE's skew-join handling (section 2) doesn't cover your case, the manual fix is salting: append a random suffix to the hot key so its rows spread across many partitions.

from pyspark.sql import functions as F

SALT_BUCKETS = 10  # how many ways to split each hot key

# Salt the skewed key on the large side: each hot key's rows scatter
# across SALT_BUCKETS distinct join keys
large_df_salted = large_df.withColumn(
  "skewed_key_salted",
  F.concat(F.col("skewed_key"), F.lit("_"),
           (F.rand() * SALT_BUCKETS).cast("int").cast("string"))
)

# Replicate the small side once per salt value so every salted key
# still finds its match
small_df_replicated = small_df.withColumn(
  "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
).withColumn(
  "skewed_key_salted",
  F.concat(F.col("skewed_key"), F.lit("_"), F.col("salt").cast("string"))
)

result = large_df_salted.join(small_df_replicated, "skewed_key_salted")

8–10: Quick Wins

  • Use columnar formats: Parquet and ORC with Snappy compression outperform CSV/JSON by 10-50x for analytical queries.
  • Predicate pushdown: Filter early and let Spark/connector push filters to the storage layer. Check your query plan with df.explain("extended").
  • Right-size your executors: For most workloads, 4-8 cores per executor with 16-32GB RAM is a good starting point (sketched below). Several smaller executors often outperform a few large ones thanks to better fault tolerance and GC behavior.
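
As a concrete starting point for those numbers, here is one way to express them when building a session. The values are illustrative, not a recommendation; tune them against your own cluster and workload:

from pyspark.sql import SparkSession

spark = (
  SparkSession.builder
  .appName("right-sized-job")                 # hypothetical app name
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "24g")
  .config("spark.executor.instances", "20")   # or enable dynamic allocation
  .getOrCreate()
)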

Measuring Impact

Always benchmark before and after. The Spark UI is your friend — look at the Stages tab for shuffle read/write sizes, the SQL tab for physical plans, and the Executors tab for GC time. If GC time exceeds 5-10% of task time, you have a memory pressure problem.
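
For quick before/after timing without a separate harness, one option is the noop sink (Spark 3.0+), which executes the full plan but discards the output, so you measure the computation rather than the write. A sketch, not a benchmark suite:

import time

start = time.perf_counter()
df.write.format("noop").mode("overwrite").save()  # runs the plan, writes nothing
print(f"elapsed: {time.perf_counter() - start:.1f}s")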