Execution Plan Fundamentals #
An execution plan reveals exactly how the PostgreSQL query engine resolves a SQL statement — translating declarative intent into a tree of physical operations. Reading these plans accurately is the prerequisite for every other tuning activity: you cannot reliably fix what you cannot correctly read.
This section covers the core mental model behind plan trees, the three primary access and join strategies the planner selects between, how parallel execution changes plan shape, and a repeatable diagnostic workflow you can apply to any slow query.
The Operator-Tree Mental Model #
Execution plans are hierarchical trees of operators. Each node represents one physical operation: a scan, filter, join, sort, or aggregation. Execution flows from leaves to root — in PostgreSQL’s text output that means bottom-to-top. The root node is the last operation; leaf nodes are the first data sources.
Each node exposes four core fields:
- Estimated cost (
cost=startup..total) — a unitless planner metric based on page I/O and CPU constants; not milliseconds - Estimated row count (
rows=) — derived frompg_statisticentries populated byANALYZE - Row width (
width=) — average output tuple size in bytes, used by upper nodes to budget memory - Actual metrics (only with
EXPLAIN ANALYZE) — real row counts and cumulative wall-clock time in milliseconds
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at > '2024-01-01';
Seq Scan on orders (cost=0.00..1250.45 rows=45000 width=20)
(actual time=0.018..92.340 rows=45000 loops=1)
Filter: (created_at > '2024-01-01'::date)
Rows Removed by Filter: 12000
Buffers: shared hit=210 read=840
Planning Time: 0.18 ms
Execution Time: 93.10 ms
Reading each field:
Seq Scan— the planner found no usable index oncreated_at, or estimated that reading all 1050 pages sequentially costs less than random index lookups at this selectivity (45k of 57k rows).cost=0.00..1250.45— startup cost is zero (no sort/hash initialisation needed). Total cost is the planner’s weighted estimate of full execution effort.rows=45000vsactual rows=45000— estimates match here. When they diverge by 10× or more, statistics are stale: runANALYZE orders.Buffers: shared hit=210 read=840— 210 pages came from the shared buffer cache (free); 840 required disk reads. A low hit ratio on a frequently run query suggests the working set does not fit inshared_buffers.Execution Time: 93.10 ms— wall-clock reality. Always tune againstactual time, not the unitlesscost.
The inline SVG below shows how a three-node plan tree flows:
Core Mechanics 1 — Access Paths: Sequential vs Index Scans #
The planner’s first decision for each table is how to access its rows. The choice between a sequential scan and an index scan governs whether every page is read in order or only the pages that satisfy the predicate are fetched via the index.
The planner computes an estimated cost for each option using seq_page_cost (default 1.0) and random_page_cost (default 4.0). When a query retrieves a large fraction of the table, random index lookups multiply quickly and sequential reads win. When selectivity is high (few rows match), index access wins.
-- Force a plan comparison: first without an index, then with one
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 9812;
-- Without an index on customer_id:
Seq Scan on orders (cost=0.00..2240.00 rows=12 width=128)
(actual time=3.420..18.620 rows=12 loops=1)
Filter: (customer_id = 9812)
Rows Removed by Filter: 89988
Buffers: shared hit=15 read=985
-- After: CREATE INDEX ON orders(customer_id);
Index Scan using orders_customer_id_idx on orders
(cost=0.43..50.12 rows=12 width=128)
(actual time=0.040..0.190 rows=12 loops=1)
Index Cond: (customer_id = 9812)
Buffers: shared hit=4 read=1
What changed: The index scan reads 5 pages instead of 1000. Rows Removed by Filter disappears because the index condition eliminates non-matching rows before heap access. When actual time drops by two orders of magnitude, the index is essential — not optional.
A covering index takes this further: if all projected columns are included in the index, PostgreSQL can satisfy the query from the index alone without touching the heap at all (Index Only Scan).
Core Mechanics 2 — Join Algorithms: Hash, Merge, and Nested Loop #
When two tables are joined, the planner selects one of three physical algorithms based on cardinality, available indexes, sort order, and work_mem. Understanding hash join mechanics explains the most common algorithm for large unsorted inputs.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
Hash Join (cost=150.00..890.20 rows=50000 width=64)
(actual time=1.20..15.80 loops=1)
Hash Cond: (o.customer_id = c.id)
Buffers: shared hit=640 read=360
-> Seq Scan on orders o (cost=0.00..500.00 rows=100000 width=32)
(actual time=0.01..5.20 loops=1)
Buffers: shared hit=400 read=100
-> Hash (cost=100.00..100.00 rows=4000 width=32)
(actual time=1.15..1.15 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 250kB
-> Seq Scan on customers c (cost=0.00..100.00 rows=4000 width=32)
(actual time=0.01..0.80 loops=1)
Buffers: shared hit=240 read=260
Algorithm internals:
- Build phase — PostgreSQL scans
customers(the smaller table) and loads its rows into an in-memory hash table.Memory Usage: 250kBis the hash table size. - Probe phase — PostgreSQL scans
ordersand probes each row’scustomer_idagainst the hash table. Batches: 1means the hash table fit entirely inwork_mem. WhenBatches > 1, the hash table spilled to disk — the number of batches tells you how many passes were needed.
The alternative for sorted or indexed inputs is a merge join or nested loop. A nested loop is optimal when the inner table is small and indexed; a merge join requires sorted inputs from both sides but avoids the memory allocation of a hash build.
Core Mechanics 3 — Parallel Query Execution #
For large scans and aggregations, PostgreSQL can split work across background workers. The plan shape changes: a Gather or Gather Merge node appears above the parallel portion, collecting results from all workers into the serial portion of the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT department_id, COUNT(*), SUM(salary)
FROM employees
GROUP BY department_id;
Finalize GroupAggregate (cost=2100.50..2150.00 rows=50 width=24)
(actual time=14.20..14.50 loops=1)
Group Key: department_id
-> Gather Merge (cost=2100.50..2130.00 rows=100 width=24)
(actual time=12.80..13.90 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial HashAggregate (cost=2000.50..2010.00 rows=25 width=24)
(actual time=9.10..9.40 loops=5)
Group Key: department_id
-> Parallel Seq Scan on employees
(cost=0.00..1800.00 rows=200000 width=16)
(actual time=0.01..4.20 loops=5)
Buffers: shared hit=2200 read=800
Planning Time: 0.35 ms
Execution Time: 15.10 ms
Parallel diagnostics:
Workers Planned: 4vsWorkers Launched: 4— all workers started. If launched < planned, checkmax_worker_processesandmax_parallel_workersinpostgresql.conf.loops=5onParallel Seq Scan— each of 4 workers plus the leader executed that node once. Actual rows and time for parallel nodes are reported per-loop; multiply to get totals.Finalize GroupAggregate— the leader merges partial aggregates from each worker. If theGather Mergenode shows significantly longer actual time than its children, data skew is causing one worker to take much longer than others. Diagnose withEXPLAIN (ANALYZE, VERBOSE)which reports per-worker timing.- Parallel plans add overhead: worker launch cost and merge cost. For small tables or queries that return a large fraction of rows, a serial plan often wins.
The parallel query execution topic covers worker allocation, parallel_tuple_cost, and the conditions under which PostgreSQL disables parallelism entirely.
Cost Estimation and Statistics #
The planner’s quality depends entirely on the accuracy of its statistical models. When statistics drift from reality, the planner selects suboptimal access paths and memory grants for sort and hash operations become inaccurate, forcing disk spills.
Understanding cost estimation models explains how PostgreSQL translates page counts, tuple fractions, and operator selectivity into the unitless cost figures you see in plan output. The deeper detail of how PostgreSQL calculates individual node costs shows the exact arithmetic behind each estimate.
Key statistics concepts:
pg_stats— per-column statistics:n_distinct,correlation, MCV (most-common-values) list, and histogram. The planner interpolates from these to estimate row counts.default_statistics_target— how many histogram buckets and MCV entriesANALYZEcollects per column (default: 100). Increase for highly selective or skewed columns:ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;- Extended statistics —
CREATE STATISTICScaptures cross-column correlations and functional dependencies that single-column stats cannot express. Essential when two columns are correlated but queried together.
-- Inspect statistics for a column that has poor estimates
SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- Check how fresh statistics are
SELECT schemaname, relname, last_analyze, last_autoanalyze, n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
When n_mod_since_analyze is large relative to n_live_tup, statistics are stale. Run ANALYZE orders or lower the autovacuum_analyze_scale_factor threshold.
Systematic Diagnostic Workflow #
Apply this numbered sequence to any slow query. Each step has a concrete action and a decision point.
-
Capture the baseline plan with buffers.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT /* your query here */;Save the full output. Never tune from memory — you need the original plan to measure improvement.
-
Identify the hot node.
Scan
actual timevalues from bottom to top. The node whose cumulative time is disproportionately large relative to its children is the bottleneck. For multi-loop nodes,actual timeis per-loop; the total cost isactual time × loops. -
Check row estimate accuracy.
Compare
rows=(estimated) againstactual rows=(real). A divergence of 10× or more is a red flag. If estimates are wrong, fix statistics first — adding an index on a column with stale statistics often produces no improvement because the planner will not choose it.ANALYZE your_table; -- Or for a specific column with skew: ALTER TABLE your_table ALTER COLUMN your_column SET STATISTICS 500; ANALYZE your_table; -
Evaluate buffer hit ratio.
From
EXPLAIN (ANALYZE, BUFFERS)output:shared hitpages are free (from cache);shared readpages required I/O.-- Hit ratio across all queries via pg_stat_statements SELECT query, round(shared_blks_hit::numeric / nullif(shared_blks_hit + shared_blks_read, 0) * 100, 1) AS hit_pct, total_exec_time, calls FROM pg_stat_statements WHERE calls > 50 ORDER BY total_exec_time DESC LIMIT 20;A hit ratio below 90% on a frequently executed query suggests the working set does not fit in
shared_buffers. Consider increasingshared_buffersor redesigning the query to access fewer pages. -
Check for disk spills in sort and hash operations.
Sort and hash node analysis covers interpreting
Sort Method: external merge(sort spilled to disk) andBatches > 1(hash join spilled). Both indicatework_memis too low for the operation.-- Identify temp file usage (disk spills) SELECT query, temp_blks_written, temp_blks_read, calls FROM pg_stat_statements WHERE temp_blks_written > 0 ORDER BY temp_blks_written DESC LIMIT 10; -- Test with more work_mem (scoped to this transaction) BEGIN; SET LOCAL work_mem = '256MB'; EXPLAIN (ANALYZE, BUFFERS) SELECT /* your query */; ROLLBACK; -
Iterate and compare.
After each change (new index, ANALYZE, config tweak), run step 1 again and diff the
actual timeof the hot node. Do not declare victory until the wall-clock improvement is measurable and repeatable.
Common Pitfalls #
1. Tuning against estimated cost, not actual time.
Diagnostic: actual time is much higher than expected even when cost looks reasonable. Fix: always use EXPLAIN ANALYZE; EXPLAIN alone shows estimates only.
2. Stale statistics causing misestimation.
Diagnostic: rows= in the plan differs from actual rows by 10× or more. Fix: run ANALYZE table_name, check autovacuum_analyze_scale_factor and autovacuum_analyze_threshold in postgresql.conf.
3. Over-indexing creating write amplification.
Diagnostic: INSERT/UPDATE/DELETE is slow; pg_stat_user_indexes shows indexes with zero scans. Fix: drop unused indexes with DROP INDEX CONCURRENTLY; monitor idx_scan in pg_stat_user_indexes.
4. Assuming parallel execution always helps.
Diagnostic: parallel plan is slower than expected; worker launch overhead dominates. Fix: check Workers Launched < Workers Planned (resource limits), and test SET LOCAL max_parallel_workers_per_gather = 0 to compare serial vs parallel cost on your data size.
5. Ignoring Rows Removed by Filter on index scans.
Diagnostic: Index Scan is in the plan but Rows Removed by Filter is high. Fix: the index satisfies the index condition but a heap-level filter removes most rows. Add the filtered column to a partial index or a composite index to push the filter into the index condition.
6. Missing extended statistics for correlated columns.
Diagnostic: estimates are wrong on a multi-predicate query even after ANALYZE. Fix: check pg_stats for low n_distinct or high correlation on the predicate columns; create a multi-column statistics object with CREATE STATISTICS.
Frequently Asked Questions #
How often should execution plans be re-evaluated? #
Re-evaluate after schema changes, significant data volume shifts (large bulk loads, partition additions), or when query latency exceeds SLO thresholds. Also after any PostgreSQL major version upgrade — planner cost constants, statistics sampling, and join algorithm eligibility can change between versions, so a plan that was optimal on v14 may not be on v17.
What is the difference between EXPLAIN and EXPLAIN ANALYZE? #
EXPLAIN shows the optimizer’s predicted plan without executing the query — useful for checking which plan the planner intends before committing to execution. EXPLAIN ANALYZE executes the query and reports actual runtime metrics alongside estimates. Use EXPLAIN (ANALYZE, BUFFERS) to also capture buffer cache hit and miss counts. If the query is a DML statement (INSERT, UPDATE, DELETE), wrap it in a transaction you roll back to avoid mutating data.
Can execution plans vary between identical queries on different environments? #
Yes. random_page_cost, effective_cache_size, work_mem, hardware I/O characteristics, data distribution, table sizes, and statistics freshness all affect the planner’s cost model. A plan that is optimal on a production server with 128 GB RAM and NVMe storage may differ from the plan on a development laptop. Always verify tuning changes against production-representative data and config.
Why do estimated rows differ from actual rows in EXPLAIN ANALYZE? #
The planner derives row estimates from pg_statistic, populated by ANALYZE. When statistics are stale, the distribution is non-uniform (heavy skew, temporal correlation), or two columns are queried together with a cross-column dependency the planner does not know about, estimates diverge from reality. Extended statistics (CREATE STATISTICS) can capture multi-column dependencies. For single-column skew, increase statistics_target on the affected column.
When should I use SET LOCAL to override planner settings? #
SET LOCAL is the safe way to test configuration changes: it scopes the override to the current transaction and reverts automatically on commit or rollback. Use it to test work_mem, enable_hashjoin, enable_seqscan, or max_parallel_workers_per_gather in isolation before deciding whether to commit the change to postgresql.conf or ALTER ROLE ... SET.
Related #
- Sequential vs Index Scans — when the planner chooses a full table scan over index access, and how to diagnose the decision
- Hash Join Mechanics — build and probe phases, memory usage, and when
Batches > 1means disk spill - Merge Join vs Nested Loop — sort prerequisites, index eligibility, and the loop cost model
- Cost Estimation Models — how PostgreSQL translates statistics into the unitless cost figures in plan output
- Identifying Plan Bottlenecks — a companion guide to pinpointing the single most expensive node in complex plans