Execution Plan Fundamentals #

Q: How often should execution plans be re-evaluated?

Re-evaluate after schema changes, significant data volume shifts, or when query latency exceeds SLO thresholds. Also after any PostgreSQL major version upgrade, since planner cost constants and statistics behavior can change between versions.

Q: What is the difference between EXPLAIN and EXPLAIN ANALYZE?

EXPLAIN shows the optimizer's predicted plan without executing the query. EXPLAIN ANALYZE executes the query and reports actual runtime metrics alongside the plan. Use EXPLAIN (ANALYZE, BUFFERS) to also capture buffer cache hit and miss counts — run it inside a transaction you roll back if the query modifies data.

Q: Can execution plans vary between identical queries on different environments?

Yes. Differences in hardware, postgresql.conf settings (work_mem, effective_cache_size, random_page_cost), data distribution, table sizes, and statistics freshness all cause the planner to choose different plans for the same SQL text.

Q: Why do estimated rows differ from actual rows in EXPLAIN ANALYZE?

The planner derives row estimates from pg_statistic, which is updated by ANALYZE. If statistics are stale or the data has a non-uniform distribution (skew, correlation, or multi-column dependencies), estimates diverge from reality. Extended statistics (CREATE STATISTICS) can improve multi-column accuracy.

Q: When should I use SET LOCAL to override planner settings?

SET LOCAL is safe for per-transaction testing: it scopes the override to the current transaction and reverts automatically on commit or rollback. Use it to validate changes (e.g. SET LOCAL work_mem = '256MB') before committing them to postgresql.conf.

An execution plan reveals exactly how the PostgreSQL query engine resolves a SQL statement — translating declarative intent into a tree of physical operations. Reading these plans accurately is the prerequisite for every other tuning activity: you cannot reliably fix what you cannot correctly read.

This section covers the core mental model behind plan trees, the three primary access and join strategies the planner selects between, how parallel execution changes plan shape, and a repeatable diagnostic workflow you can apply to any slow query.

The Operator-Tree Mental Model #

Execution plans are hierarchical trees of operators. Each node represents one physical operation: a scan, filter, join, sort, or aggregation. Execution flows from leaves to root — in PostgreSQL’s text output that means bottom-to-top. The root node is the last operation; leaf nodes are the first data sources.

Each node exposes four core fields:

Estimated cost (cost=startup..total) — a unitless planner metric based on page I/O and CPU constants; not milliseconds
Estimated row count (rows=) — derived from pg_statistic entries populated by ANALYZE
Row width (width=) — average output tuple size in bytes, used by upper nodes to budget memory
Actual metrics (only with EXPLAIN ANALYZE) — real row counts and cumulative wall-clock time in milliseconds

EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at > '2024-01-01';

Seq Scan on orders  (cost=0.00..1250.45 rows=45000 width=20)
                    (actual time=0.018..92.340 rows=45000 loops=1)
  Filter: (created_at > '2024-01-01'::date)
  Rows Removed by Filter: 12000
  Buffers: shared hit=210 read=840
Planning Time: 0.18 ms
Execution Time: 93.10 ms

Reading each field:

Seq Scan — the planner found no usable index on created_at, or estimated that reading all 1050 pages sequentially costs less than random index lookups at this selectivity (45k of 57k rows).
cost=0.00..1250.45 — startup cost is zero (no sort/hash initialisation needed). Total cost is the planner’s weighted estimate of full execution effort.
rows=45000 vs actual rows=45000 — estimates match here. When they diverge by 10× or more, statistics are stale: run ANALYZE orders.
Buffers: shared hit=210 read=840 — 210 pages came from the shared buffer cache (free); 840 required disk reads. A low hit ratio on a frequently run query suggests the working set does not fit in shared_buffers.
Execution Time: 93.10 ms — wall-clock reality. Always tune against actual time, not the unitless cost.

The inline SVG below shows how a three-node plan tree flows:

Execution flows from leaf nodes (data sources) upward to the root node (final output). Cost and row estimates accumulate at each level.

Core Mechanics 1 — Access Paths: Sequential vs Index Scans #

The planner’s first decision for each table is how to access its rows. The choice between a sequential scan and an index scan governs whether every page is read in order or only the pages that satisfy the predicate are fetched via the index.

The planner computes an estimated cost for each option using seq_page_cost (default 1.0) and random_page_cost (default 4.0). When a query retrieves a large fraction of the table, random index lookups multiply quickly and sequential reads win. When selectivity is high (few rows match), index access wins.

-- Force a plan comparison: first without an index, then with one
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 9812;

-- Without an index on customer_id:
Seq Scan on orders  (cost=0.00..2240.00 rows=12 width=128)
                    (actual time=3.420..18.620 rows=12 loops=1)
  Filter: (customer_id = 9812)
  Rows Removed by Filter: 89988
  Buffers: shared hit=15 read=985

-- After: CREATE INDEX ON orders(customer_id);
Index Scan using orders_customer_id_idx on orders
                    (cost=0.43..50.12 rows=12 width=128)
                    (actual time=0.040..0.190 rows=12 loops=1)
  Index Cond: (customer_id = 9812)
  Buffers: shared hit=4 read=1

What changed: The index scan reads 5 pages instead of 1000. Rows Removed by Filter disappears because the index condition eliminates non-matching rows before heap access. When actual time drops by two orders of magnitude, the index is essential — not optional.

A covering index takes this further: if all projected columns are included in the index, PostgreSQL can satisfy the query from the index alone without touching the heap at all (Index Only Scan).

Core Mechanics 2 — Join Algorithms: Hash, Merge, and Nested Loop #

When two tables are joined, the planner selects one of three physical algorithms based on cardinality, available indexes, sort order, and work_mem. Understanding hash join mechanics explains the most common algorithm for large unsorted inputs.

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

Hash Join  (cost=150.00..890.20 rows=50000 width=64)
           (actual time=1.20..15.80 loops=1)
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=640 read=360
  ->  Seq Scan on orders o  (cost=0.00..500.00 rows=100000 width=32)
                             (actual time=0.01..5.20 loops=1)
        Buffers: shared hit=400 read=100
  ->  Hash  (cost=100.00..100.00 rows=4000 width=32)
             (actual time=1.15..1.15 loops=1)
        Buckets: 4096  Batches: 1  Memory Usage: 250kB
        ->  Seq Scan on customers c  (cost=0.00..100.00 rows=4000 width=32)
                                      (actual time=0.01..0.80 loops=1)
              Buffers: shared hit=240 read=260

Algorithm internals:

Build phase — PostgreSQL scans customers (the smaller table) and loads its rows into an in-memory hash table. Memory Usage: 250kB is the hash table size.
Probe phase — PostgreSQL scans orders and probes each row’s customer_id against the hash table.
Batches: 1 means the hash table fit entirely in work_mem. When Batches > 1, the hash table spilled to disk — the number of batches tells you how many passes were needed.

The alternative for sorted or indexed inputs is a merge join or nested loop. A nested loop is optimal when the inner table is small and indexed; a merge join requires sorted inputs from both sides but avoids the memory allocation of a hash build.

Core Mechanics 3 — Parallel Query Execution #

For large scans and aggregations, PostgreSQL can split work across background workers. The plan shape changes: a Gather or Gather Merge node appears above the parallel portion, collecting results from all workers into the serial portion of the plan.

EXPLAIN (ANALYZE, BUFFERS)
SELECT department_id, COUNT(*), SUM(salary)
FROM employees
GROUP BY department_id;

Finalize GroupAggregate  (cost=2100.50..2150.00 rows=50 width=24)
                          (actual time=14.20..14.50 loops=1)
  Group Key: department_id
  ->  Gather Merge  (cost=2100.50..2130.00 rows=100 width=24)
                     (actual time=12.80..13.90 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial HashAggregate  (cost=2000.50..2010.00 rows=25 width=24)
                                    (actual time=9.10..9.40 loops=5)
              Group Key: department_id
              ->  Parallel Seq Scan on employees
                    (cost=0.00..1800.00 rows=200000 width=16)
                    (actual time=0.01..4.20 loops=5)
                    Buffers: shared hit=2200 read=800
Planning Time: 0.35 ms
Execution Time: 15.10 ms

Parallel diagnostics:

Workers Planned: 4 vs Workers Launched: 4 — all workers started. If launched < planned, check max_worker_processes and max_parallel_workers in postgresql.conf.
loops=5 on Parallel Seq Scan — each of 4 workers plus the leader executed that node once. Actual rows and time for parallel nodes are reported per-loop; multiply to get totals.
Finalize GroupAggregate — the leader merges partial aggregates from each worker. If the Gather Merge node shows significantly longer actual time than its children, data skew is causing one worker to take much longer than others. Diagnose with EXPLAIN (ANALYZE, VERBOSE) which reports per-worker timing.
Parallel plans add overhead: worker launch cost and merge cost. For small tables or queries that return a large fraction of rows, a serial plan often wins.

The parallel query execution topic covers worker allocation, parallel_tuple_cost, and the conditions under which PostgreSQL disables parallelism entirely.

Cost Estimation and Statistics #

The planner’s quality depends entirely on the accuracy of its statistical models. When statistics drift from reality, the planner selects suboptimal access paths and memory grants for sort and hash operations become inaccurate, forcing disk spills.

Understanding cost estimation models explains how PostgreSQL translates page counts, tuple fractions, and operator selectivity into the unitless cost figures you see in plan output. The deeper detail of how PostgreSQL calculates individual node costs shows the exact arithmetic behind each estimate.

Key statistics concepts:

pg_stats — per-column statistics: n_distinct, correlation, MCV (most-common-values) list, and histogram. The planner interpolates from these to estimate row counts.
default_statistics_target — how many histogram buckets and MCV entries ANALYZE collects per column (default: 100). Increase for highly selective or skewed columns: ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
Extended statistics — CREATE STATISTICS captures cross-column correlations and functional dependencies that single-column stats cannot express. Essential when two columns are correlated but queried together.

-- Inspect statistics for a column that has poor estimates
SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Check how fresh statistics are
SELECT schemaname, relname, last_analyze, last_autoanalyze, n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

When n_mod_since_analyze is large relative to n_live_tup, statistics are stale. Run ANALYZE orders or lower the autovacuum_analyze_scale_factor threshold.

Systematic Diagnostic Workflow #

Apply this numbered sequence to any slow query. Each step has a concrete action and a decision point.

Capture the baseline plan with buffers.
```
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT /* your query here */;
```
Save the full output. Never tune from memory — you need the original plan to measure improvement.
Identify the hot node.

Scan actual time values from bottom to top. The node whose cumulative time is disproportionately large relative to its children is the bottleneck. For multi-loop nodes, actual time is per-loop; the total cost is actual time × loops.
Check row estimate accuracy.

Compare rows= (estimated) against actual rows= (real). A divergence of 10× or more is a red flag. If estimates are wrong, fix statistics first — adding an index on a column with stale statistics often produces no improvement because the planner will not choose it.
```
ANALYZE your_table;
-- Or for a specific column with skew:
ALTER TABLE your_table ALTER COLUMN your_column SET STATISTICS 500;
ANALYZE your_table;
```

Evaluate buffer hit ratio.

From EXPLAIN (ANALYZE, BUFFERS) output: shared hit pages are free (from cache); shared read pages required I/O.

-- Hit ratio across all queries via pg_stat_statements
SELECT query,
       round(shared_blks_hit::numeric /
             nullif(shared_blks_hit + shared_blks_read, 0) * 100, 1) AS hit_pct,
       total_exec_time, calls
FROM pg_stat_statements
WHERE calls > 50
ORDER BY total_exec_time DESC
LIMIT 20;

A hit ratio below 90% on a frequently executed query suggests the working set does not fit in shared_buffers. Consider increasing shared_buffers or redesigning the query to access fewer pages.

Check for disk spills in sort and hash operations.

Sort and hash node analysis covers interpreting Sort Method: external merge (sort spilled to disk) and Batches > 1 (hash join spilled). Both indicate work_mem is too low for the operation.

-- Identify temp file usage (disk spills)
SELECT query, temp_blks_written, temp_blks_read, calls
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

-- Test with more work_mem (scoped to this transaction)
BEGIN;
SET LOCAL work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT /* your query */;
ROLLBACK;

Iterate and compare.

After each change (new index, ANALYZE, config tweak), run step 1 again and diff the actual time of the hot node. Do not declare victory until the wall-clock improvement is measurable and repeatable.

Common Pitfalls #

1. Tuning against estimated cost, not actual time. Diagnostic: actual time is much higher than expected even when cost looks reasonable. Fix: always use EXPLAIN ANALYZE; EXPLAIN alone shows estimates only.

2. Stale statistics causing misestimation. Diagnostic: rows= in the plan differs from actual rows by 10× or more. Fix: run ANALYZE table_name, check autovacuum_analyze_scale_factor and autovacuum_analyze_threshold in postgresql.conf.

3. Over-indexing creating write amplification. Diagnostic: INSERT/UPDATE/DELETE is slow; pg_stat_user_indexes shows indexes with zero scans. Fix: drop unused indexes with DROP INDEX CONCURRENTLY; monitor idx_scan in pg_stat_user_indexes.

4. Assuming parallel execution always helps. Diagnostic: parallel plan is slower than expected; worker launch overhead dominates. Fix: check Workers Launched < Workers Planned (resource limits), and test SET LOCAL max_parallel_workers_per_gather = 0 to compare serial vs parallel cost on your data size.

5. Ignoring Rows Removed by Filter on index scans. Diagnostic: Index Scan is in the plan but Rows Removed by Filter is high. Fix: the index satisfies the index condition but a heap-level filter removes most rows. Add the filtered column to a partial index or a composite index to push the filter into the index condition.

6. Missing extended statistics for correlated columns. Diagnostic: estimates are wrong on a multi-predicate query even after ANALYZE. Fix: check pg_stats for low n_distinct or high correlation on the predicate columns; create a multi-column statistics object with CREATE STATISTICS.

Frequently Asked Questions #

How often should execution plans be re-evaluated? #

Re-evaluate after schema changes, significant data volume shifts (large bulk loads, partition additions), or when query latency exceeds SLO thresholds. Also after any PostgreSQL major version upgrade — planner cost constants, statistics sampling, and join algorithm eligibility can change between versions, so a plan that was optimal on v14 may not be on v17.

What is the difference between EXPLAIN and EXPLAIN ANALYZE? #

EXPLAIN shows the optimizer’s predicted plan without executing the query — useful for checking which plan the planner intends before committing to execution. EXPLAIN ANALYZE executes the query and reports actual runtime metrics alongside estimates. Use EXPLAIN (ANALYZE, BUFFERS) to also capture buffer cache hit and miss counts. If the query is a DML statement (INSERT, UPDATE, DELETE), wrap it in a transaction you roll back to avoid mutating data.

Can execution plans vary between identical queries on different environments? #

Yes. random_page_cost, effective_cache_size, work_mem, hardware I/O characteristics, data distribution, table sizes, and statistics freshness all affect the planner’s cost model. A plan that is optimal on a production server with 128 GB RAM and NVMe storage may differ from the plan on a development laptop. Always verify tuning changes against production-representative data and config.

Why do estimated rows differ from actual rows in EXPLAIN ANALYZE? #

The planner derives row estimates from pg_statistic, populated by ANALYZE. When statistics are stale, the distribution is non-uniform (heavy skew, temporal correlation), or two columns are queried together with a cross-column dependency the planner does not know about, estimates diverge from reality. Extended statistics (CREATE STATISTICS) can capture multi-column dependencies. For single-column skew, increase statistics_target on the affected column.

When should I use SET LOCAL to override planner settings? #

SET LOCAL is the safe way to test configuration changes: it scopes the override to the current transaction and reverts automatically on commit or rollback. Use it to test work_mem, enable_hashjoin, enable_seqscan, or max_parallel_workers_per_gather in isolation before deciding whether to commit the change to postgresql.conf or ALTER ROLE ... SET.

Sequential vs Index Scans — when the planner chooses a full table scan over index access, and how to diagnose the decision
Hash Join Mechanics — build and probe phases, memory usage, and when Batches > 1 means disk spill
Merge Join vs Nested Loop — sort prerequisites, index eligibility, and the loop cost model
Cost Estimation Models — how PostgreSQL translates statistics into the unitless cost figures in plan output
Identifying Plan Bottlenecks — a companion guide to pinpointing the single most expensive node in complex plans

Back to home