Parallel Query Execution: Plan Analysis & Tuning Guide #

PostgreSQL distributes heavy workloads across multiple CPU cores through parallel query execution, dramatically reducing latency for analytical scans, large joins, and complex aggregations — but uncontrolled parallelism introduces CPU contention, memory pressure, and unpredictable plan regressions.

When the Optimizer Chooses a Parallel Plan #

The planner enables parallelism when three conditions align: the table exceeds min_parallel_table_scan_size (default 8 MB), the statement is not inside a function marked PARALLEL UNSAFE, and the estimated cost savings outweigh parallel_setup_cost. The cost model compares the serial plan cost against the parallel plan cost, which it reduces by a factor proportional to the number of workers. When cost estimation models produce inaccurate row estimates — due to stale statistics or high-correlation columns — the optimizer may choose the wrong degree of parallelism.

Cardinality triggers matter too: a highly selective predicate that reduces rows to a few thousand may make parallelism counterproductive even on a 100 GB table. The Gather node’s fixed setup cost (parallel_setup_cost, default 1000 planner units) must be recovered by the per-row savings. For foundational context on how PostgreSQL constructs and traverses these plans, see Reading & Interpreting Query Plans.

The sort and hash node analysis page covers how each parallel worker maintains its own independent sort and hash memory budget — understanding that behavior is essential when tuning work_mem for parallel workloads.

Parallel Execution Node Anatomy #

The following diagram shows how a parallel aggregation query decomposes across a coordinator and background workers:

Annotated EXPLAIN Node Breakdown #

Run EXPLAIN (ANALYZE, BUFFERS, VERBOSE) to see the full parallel node output:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT customer_id, SUM(order_total)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY customer_id;

Key fields to inspect in the plan output:

Finalize Aggregate  (cost=98432.50..98433.00 rows=50 width=12)
                    (actual time=1240.821..1241.033 rows=50 loops=1)
  ->  Gather  (cost=98322.08..98332.08 rows=100 width=12)
              (actual time=1238.012..1240.801 rows=150 loops=1)
        Workers Planned: 3          -- ← planner requested 3 workers
        Workers Launched: 3         -- ← all 3 launched; if lower, check system limits
        ->  Partial Aggregate  (cost=97322.08..97323.08 rows=33 width=12)
                               (actual time=1231.045..1231.090 rows=50 loops=3)
              ->  Parallel Seq Scan on orders
                              (cost=0.00..90000.00 rows=2640000 width=16)
                              (actual time=0.052..980.201 rows=2640000 loops=3)
                    Filter: (created_at > '2023-01-01'::date)
                    Rows Removed by Filter: 120000
                    Buffers: shared hit=24000 read=6200
                              -- ↑ 24000 buffer hits vs 6200 disk reads — reasonable ratio

Fields that matter:

Workers Planned vs Workers Launched — a gap means runtime resource exhaustion; check max_parallel_workers and max_worker_processes.
loops=3 on the worker nodes — each worker executed the scan once; multiply per-worker row counts by loops to get total rows processed.
Buffers: shared hit / shared read — per-worker buffer activity; high read counts under parallel load signal cache pressure as workers compete for shared buffer pool space.
Rows Removed by Filter — rows evaluated but discarded per worker; multiplied by loops gives true filter selectivity.

Algorithm Internals: How Parallel Nodes Divide Work #

Parallel Sequential Scan #

The heap is logically partitioned into chunks. Background workers claim chunks through a shared coordination structure (the parallel scan state) — there is no static pre-assignment. This dynamic allocation naturally handles uneven chunk sizes but creates contention on the coordination structure under very high worker counts.

Parallel Hash Join #

The hash join mechanics page explains the build and probe phases in detail. In a parallel hash join, each worker builds a portion of the hash table using its own memory allocation, then the partial tables are merged into a shared structure. Workers probe the shared hash table concurrently. A Parallel Hash node replaces the regular Hash node in the plan output, and the plan will show Batches greater than 1 if the combined hash table exceeds available work_mem.

Gather vs Gather Merge #

Gather collects unordered results from workers — the coordinator makes no attempt to preserve sort order. Use this when the outer query has no ORDER BY requirement or when a downstream Sort node will re-order anyway.

Gather Merge is chosen when the planner can prove workers emit pre-sorted output (e.g., after a parallel index scan or an in-worker sort). The coordinator performs an N-way merge, which avoids a full re-sort at the top level. The plan cost reflects the merge overhead on top of the worker costs.

Partial and Finalize Aggregates #

When parallelizing aggregation, PostgreSQL splits the work into two stages. Worker-level Partial Aggregate nodes compute per-group running totals for their heap chunks. The coordinator’s Finalize Aggregate node combines these partial results into the final output. This only works for decomposable aggregate functions (SUM, COUNT, MIN, MAX, AVG); non-decomposable aggregates force serial execution.

Memory, I/O, and Resource Behavior #

Each parallel worker receives its own independent work_mem allocation. A query running with 4 workers can consume up to 5 × work_mem (4 workers + coordinator). This multiplier effect makes memory planning critical for parallel workloads.

Spill behavior: When a worker’s hash table or sort exceeds its work_mem budget, it spills to temporary files independently of other workers. A query may have some workers spilling while others do not — a sign of data skew. Detecting this requires EXPLAIN (ANALYZE, VERBOSE), which shows per-worker node timings and identifies stragglers.

Cache thrashing: Multiple workers reading different heap sections simultaneously can evict each other’s pages from the shared buffer pool. Monitor Buffers: shared hit ratios; a ratio below 80% under parallel load suggests the combined working set exceeds shared_buffers. For cross-reference with identifying plan bottlenecks, high buffer read counts in parallel nodes are a primary signal of this condition.

Bitmap index scans do not parallelize efficiently. The heap fetch phase requires coordinating which pages each worker will fetch, introducing contention that outweighs the scan savings. Prefer covering index design for queries that currently use bitmap index scans under parallel execution — a covering index eliminates the heap fetch phase entirely, removing the coordination bottleneck.

Step-by-Step Tuning Workflow #

1. Capture a Parallel Baseline #

-- Full parallel plan with per-worker detail
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT customer_id, SUM(order_total)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY customer_id;
-- Record: Workers Launched, actual time at Gather node, Buffers shared hit/read

2. Check Worker Slot Availability #

-- How many parallel workers are currently active system-wide?
SELECT count(*) AS active_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
-- Compare against: SHOW max_parallel_workers;

If active_workers is near max_parallel_workers, new queries will launch fewer workers than planned.

3. Inspect Per-Worker Skew #

-- VERBOSE exposes per-worker timing under each worker node
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT region_id, COUNT(*)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY region_id;
-- Look for: one worker with actual time >> other workers at the same node
-- Cause: data skew, missing statistics, or uneven partition distribution

4. Tune Planner Thresholds with SET LOCAL #

BEGIN;
-- Lower setup cost to encourage parallelism on medium-sized tables
SET LOCAL parallel_setup_cost = 500;        -- default 1000
-- Raise tuple cost to prevent over-parallelization on narrow result sets
SET LOCAL parallel_tuple_cost = 0.1;        -- default 0.1 (leave if adequate)
-- Enforce parallel scans only on tables exceeding your I/O threshold
SET LOCAL min_parallel_table_scan_size = '32MB';  -- default 8MB
-- Allow more workers per gather node
SET LOCAL max_parallel_workers_per_gather = 4;    -- default 2

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(order_total)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY customer_id;
ROLLBACK;

5. Compare Serial vs Parallel #

BEGIN;
-- Serial baseline: force single-threaded execution
SET LOCAL max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(order_total)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY customer_id;
-- Record: actual time at Aggregate node

-- Parallel comparison
SET LOCAL max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(order_total)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY customer_id;
-- Record: actual time at Finalize Aggregate node
-- If parallel is similar or slower on small tables, Gather overhead > benefit
ROLLBACK;

6. Verify Statistics Currency #

-- Check for stale statistics that cause row estimate errors on the partition key
SELECT tablename, attname, n_distinct, correlation, null_frac
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('created_at', 'customer_id', 'region_id');
-- Run ANALYZE if last_analyze is stale:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

Poor row estimates on filter columns cause the planner to over- or under-estimate the benefit of parallelism. Run ANALYZE orders or adjust default_statistics_target on the high-cardinality column before re-testing.

For deeper diagnosis of parallel worker count decisions, see parallel worker allocation strategies, which covers how max_parallel_workers_per_gather interacts with the system-wide max_parallel_workers ceiling.

Common Pitfalls #

Worker starvation at the system level. Workers Planned: 4 but Workers Launched: 2. The system-wide max_parallel_workers pool is exhausted by concurrent queries. Fix: raise max_parallel_workers, or reduce per-query worker counts for OLTP connections.

Gather overhead exceeds scan savings on small tables. Parallel execution on tables under ~8 MB adds fixed coordination costs with little throughput gain. Fix: raise min_parallel_table_scan_size to exclude these tables from parallel consideration.

Transaction isolation and explicit locks block worker launch. Queries inside a transaction with explicit row locks or SERIALIZABLE isolation may silently fall back to serial execution. Fix: restructure the transaction to acquire locks after the heavy analytical query, or run the query outside the locking transaction.

Per-worker memory pressure causes staggered spills. Workers spill to disk independently, causing one worker to lag the others. Fix: increase work_mem for the session or reduce the number of workers so each worker gets a larger effective memory share from the same pool.

Data skew creates straggler workers. When most matching rows fall into one heap region, one worker does the majority of the work. Fix: run ANALYZE to refresh statistics, or investigate table bloat and dead tuple accumulation with VACUUM VERBOSE.

ORM connection pools disable parallelism silently. Connection poolers that set session-level parameters on connection checkout, or ORMs that wrap every statement in a transaction, can prevent worker launch. Fix: use SET LOCAL within the analytical query or configure a dedicated connection pool for reporting queries with max_parallel_workers_per_gather set appropriately.

Frequently Asked Questions #

Why does EXPLAIN show parallel nodes but ANALYZE runs single-threaded? #

The optimizer plans parallelism at parse time using cost estimates. At execution time, available worker slots, transaction isolation levels, or explicit locks may force a serial fallback. Check max_worker_processes, max_parallel_workers, and whether the query runs inside a transaction with explicit locks. The plan output will still show the parallel node structure even when execution was serial.

How do I prevent parallelism from degrading OLTP latency? #

Use a conservative parallel_setup_cost (raise it above 1000) and raise min_parallel_table_scan_size to 32 MB or higher. Cap max_parallel_workers_per_gather at 0 or 1 for OLTP connection pools. Reserve higher worker counts for dedicated analytical connections or read replicas where CPU contention with OLTP traffic is not a concern.

What metrics indicate parallel worker imbalance? #

In EXPLAIN (ANALYZE, VERBOSE), per-worker execution times appear under each worker node. Frequent disk spills in parallel hash or sort nodes signal per-worker memory pressure. One worker consuming disproportionately more time while others finish early points to data skew or stale statistics on the partition key or filter column.

Does increasing `max_parallel_workers_per_gather` always help? #

No. If max_parallel_workers at the system level is already saturated by concurrent queries, raising the per-gather limit results in workers being silently not launched — Workers Launched will be less than Workers Planned. Always check the system-wide ceiling with SHOW max_parallel_workers and monitor pg_stat_activity before tuning the per-query limit.

Parallel Worker Allocation Strategies — controlling worker counts and system-level ceilings
Sort and Hash Node Analysis — per-worker memory budgets and spill behavior
Identifying Plan Bottlenecks — locating the hot node driving total query cost
Hash Join Mechanics — build and probe phase behavior in parallel hash joins
Reading & Interpreting Query Plans — parent section covering the full operator tree model