Execution Plan Fundamentals #

An execution plan reveals exactly how the PostgreSQL query engine resolves a SQL statement — translating declarative intent into a tree of physical operations. Reading these plans accurately is the prerequisite for every other tuning activity: you cannot reliably fix what you cannot correctly read.

This section covers the core mental model behind plan trees, the three primary access and join strategies the planner selects between, how parallel execution changes plan shape, and a repeatable diagnostic workflow you can apply to any slow query.

The Operator-Tree Mental Model #

Execution plans are hierarchical trees of operators. Each node represents one physical operation: a scan, filter, join, sort, or aggregation. Execution flows from leaves to root — in PostgreSQL’s text output that means bottom-to-top. The root node is the last operation; leaf nodes are the first data sources.

Each node exposes four core fields:

EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at > '2024-01-01';
Seq Scan on orders  (cost=0.00..1250.45 rows=45000 width=20)
                    (actual time=0.018..92.340 rows=45000 loops=1)
  Filter: (created_at > '2024-01-01'::date)
  Rows Removed by Filter: 12000
  Buffers: shared hit=210 read=840
Planning Time: 0.18 ms
Execution Time: 93.10 ms

Reading each field:

The inline SVG below shows how a three-node plan tree flows:

PostgreSQL plan tree execution flow A three-node operator tree. Two Seq Scan leaf nodes feed into a Hash Join root node. Arrows point upward indicating data flows from leaves to root (bottom to top). Hash Join (root) cost=150..890 rows=50000 Seq Scan: orders cost=0..500 rows=100000 Seq Scan: customers cost=0..100 rows=4000 data flows bottom → top
Execution flows from leaf nodes (data sources) upward to the root node (final output). Cost and row estimates accumulate at each level.

Core Mechanics 1 — Access Paths: Sequential vs Index Scans #

The planner’s first decision for each table is how to access its rows. The choice between a sequential scan and an index scan governs whether every page is read in order or only the pages that satisfy the predicate are fetched via the index.

The planner computes an estimated cost for each option using seq_page_cost (default 1.0) and random_page_cost (default 4.0). When a query retrieves a large fraction of the table, random index lookups multiply quickly and sequential reads win. When selectivity is high (few rows match), index access wins.

-- Force a plan comparison: first without an index, then with one
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 9812;
-- Without an index on customer_id:
Seq Scan on orders  (cost=0.00..2240.00 rows=12 width=128)
                    (actual time=3.420..18.620 rows=12 loops=1)
  Filter: (customer_id = 9812)
  Rows Removed by Filter: 89988
  Buffers: shared hit=15 read=985

-- After: CREATE INDEX ON orders(customer_id);
Index Scan using orders_customer_id_idx on orders
                    (cost=0.43..50.12 rows=12 width=128)
                    (actual time=0.040..0.190 rows=12 loops=1)
  Index Cond: (customer_id = 9812)
  Buffers: shared hit=4 read=1

What changed: The index scan reads 5 pages instead of 1000. Rows Removed by Filter disappears because the index condition eliminates non-matching rows before heap access. When actual time drops by two orders of magnitude, the index is essential — not optional.

A covering index takes this further: if all projected columns are included in the index, PostgreSQL can satisfy the query from the index alone without touching the heap at all (Index Only Scan).

Core Mechanics 2 — Join Algorithms: Hash, Merge, and Nested Loop #

When two tables are joined, the planner selects one of three physical algorithms based on cardinality, available indexes, sort order, and work_mem. Understanding hash join mechanics explains the most common algorithm for large unsorted inputs.

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
Hash Join  (cost=150.00..890.20 rows=50000 width=64)
           (actual time=1.20..15.80 loops=1)
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=640 read=360
  ->  Seq Scan on orders o  (cost=0.00..500.00 rows=100000 width=32)
                             (actual time=0.01..5.20 loops=1)
        Buffers: shared hit=400 read=100
  ->  Hash  (cost=100.00..100.00 rows=4000 width=32)
             (actual time=1.15..1.15 loops=1)
        Buckets: 4096  Batches: 1  Memory Usage: 250kB
        ->  Seq Scan on customers c  (cost=0.00..100.00 rows=4000 width=32)
                                      (actual time=0.01..0.80 loops=1)
              Buffers: shared hit=240 read=260

Algorithm internals:

  1. Build phase — PostgreSQL scans customers (the smaller table) and loads its rows into an in-memory hash table. Memory Usage: 250kB is the hash table size.
  2. Probe phase — PostgreSQL scans orders and probes each row’s customer_id against the hash table.
  3. Batches: 1 means the hash table fit entirely in work_mem. When Batches > 1, the hash table spilled to disk — the number of batches tells you how many passes were needed.

The alternative for sorted or indexed inputs is a merge join or nested loop. A nested loop is optimal when the inner table is small and indexed; a merge join requires sorted inputs from both sides but avoids the memory allocation of a hash build.

Core Mechanics 3 — Parallel Query Execution #

For large scans and aggregations, PostgreSQL can split work across background workers. The plan shape changes: a Gather or Gather Merge node appears above the parallel portion, collecting results from all workers into the serial portion of the plan.

EXPLAIN (ANALYZE, BUFFERS)
SELECT department_id, COUNT(*), SUM(salary)
FROM employees
GROUP BY department_id;
Finalize GroupAggregate  (cost=2100.50..2150.00 rows=50 width=24)
                          (actual time=14.20..14.50 loops=1)
  Group Key: department_id
  ->  Gather Merge  (cost=2100.50..2130.00 rows=100 width=24)
                     (actual time=12.80..13.90 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial HashAggregate  (cost=2000.50..2010.00 rows=25 width=24)
                                    (actual time=9.10..9.40 loops=5)
              Group Key: department_id
              ->  Parallel Seq Scan on employees
                    (cost=0.00..1800.00 rows=200000 width=16)
                    (actual time=0.01..4.20 loops=5)
                    Buffers: shared hit=2200 read=800
Planning Time: 0.35 ms
Execution Time: 15.10 ms

Parallel diagnostics:

The parallel query execution topic covers worker allocation, parallel_tuple_cost, and the conditions under which PostgreSQL disables parallelism entirely.

Cost Estimation and Statistics #

The planner’s quality depends entirely on the accuracy of its statistical models. When statistics drift from reality, the planner selects suboptimal access paths and memory grants for sort and hash operations become inaccurate, forcing disk spills.

Understanding cost estimation models explains how PostgreSQL translates page counts, tuple fractions, and operator selectivity into the unitless cost figures you see in plan output. The deeper detail of how PostgreSQL calculates individual node costs shows the exact arithmetic behind each estimate.

Key statistics concepts:

-- Inspect statistics for a column that has poor estimates
SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Check how fresh statistics are
SELECT schemaname, relname, last_analyze, last_autoanalyze, n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

When n_mod_since_analyze is large relative to n_live_tup, statistics are stale. Run ANALYZE orders or lower the autovacuum_analyze_scale_factor threshold.

Systematic Diagnostic Workflow #

Apply this numbered sequence to any slow query. Each step has a concrete action and a decision point.

  1. Capture the baseline plan with buffers.

    EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
    SELECT /* your query here */;

    Save the full output. Never tune from memory — you need the original plan to measure improvement.

  2. Identify the hot node.

    Scan actual time values from bottom to top. The node whose cumulative time is disproportionately large relative to its children is the bottleneck. For multi-loop nodes, actual time is per-loop; the total cost is actual time × loops.

  3. Check row estimate accuracy.

    Compare rows= (estimated) against actual rows= (real). A divergence of 10× or more is a red flag. If estimates are wrong, fix statistics first — adding an index on a column with stale statistics often produces no improvement because the planner will not choose it.

    ANALYZE your_table;
    -- Or for a specific column with skew:
    ALTER TABLE your_table ALTER COLUMN your_column SET STATISTICS 500;
    ANALYZE your_table;
  4. Evaluate buffer hit ratio.

    From EXPLAIN (ANALYZE, BUFFERS) output: shared hit pages are free (from cache); shared read pages required I/O.

    -- Hit ratio across all queries via pg_stat_statements
    SELECT query,
           round(shared_blks_hit::numeric /
                 nullif(shared_blks_hit + shared_blks_read, 0) * 100, 1) AS hit_pct,
           total_exec_time, calls
    FROM pg_stat_statements
    WHERE calls > 50
    ORDER BY total_exec_time DESC
    LIMIT 20;

    A hit ratio below 90% on a frequently executed query suggests the working set does not fit in shared_buffers. Consider increasing shared_buffers or redesigning the query to access fewer pages.

  5. Check for disk spills in sort and hash operations.

    Sort and hash node analysis covers interpreting Sort Method: external merge (sort spilled to disk) and Batches > 1 (hash join spilled). Both indicate work_mem is too low for the operation.

    -- Identify temp file usage (disk spills)
    SELECT query, temp_blks_written, temp_blks_read, calls
    FROM pg_stat_statements
    WHERE temp_blks_written > 0
    ORDER BY temp_blks_written DESC
    LIMIT 10;
    
    -- Test with more work_mem (scoped to this transaction)
    BEGIN;
    SET LOCAL work_mem = '256MB';
    EXPLAIN (ANALYZE, BUFFERS) SELECT /* your query */;
    ROLLBACK;
  6. Iterate and compare.

    After each change (new index, ANALYZE, config tweak), run step 1 again and diff the actual time of the hot node. Do not declare victory until the wall-clock improvement is measurable and repeatable.

Common Pitfalls #

1. Tuning against estimated cost, not actual time. Diagnostic: actual time is much higher than expected even when cost looks reasonable. Fix: always use EXPLAIN ANALYZE; EXPLAIN alone shows estimates only.

2. Stale statistics causing misestimation. Diagnostic: rows= in the plan differs from actual rows by 10× or more. Fix: run ANALYZE table_name, check autovacuum_analyze_scale_factor and autovacuum_analyze_threshold in postgresql.conf.

3. Over-indexing creating write amplification. Diagnostic: INSERT/UPDATE/DELETE is slow; pg_stat_user_indexes shows indexes with zero scans. Fix: drop unused indexes with DROP INDEX CONCURRENTLY; monitor idx_scan in pg_stat_user_indexes.

4. Assuming parallel execution always helps. Diagnostic: parallel plan is slower than expected; worker launch overhead dominates. Fix: check Workers Launched < Workers Planned (resource limits), and test SET LOCAL max_parallel_workers_per_gather = 0 to compare serial vs parallel cost on your data size.

5. Ignoring Rows Removed by Filter on index scans. Diagnostic: Index Scan is in the plan but Rows Removed by Filter is high. Fix: the index satisfies the index condition but a heap-level filter removes most rows. Add the filtered column to a partial index or a composite index to push the filter into the index condition.

6. Missing extended statistics for correlated columns. Diagnostic: estimates are wrong on a multi-predicate query even after ANALYZE. Fix: check pg_stats for low n_distinct or high correlation on the predicate columns; create a multi-column statistics object with CREATE STATISTICS.

Frequently Asked Questions #

How often should execution plans be re-evaluated? #

Re-evaluate after schema changes, significant data volume shifts (large bulk loads, partition additions), or when query latency exceeds SLO thresholds. Also after any PostgreSQL major version upgrade — planner cost constants, statistics sampling, and join algorithm eligibility can change between versions, so a plan that was optimal on v14 may not be on v17.

What is the difference between EXPLAIN and EXPLAIN ANALYZE? #

EXPLAIN shows the optimizer’s predicted plan without executing the query — useful for checking which plan the planner intends before committing to execution. EXPLAIN ANALYZE executes the query and reports actual runtime metrics alongside estimates. Use EXPLAIN (ANALYZE, BUFFERS) to also capture buffer cache hit and miss counts. If the query is a DML statement (INSERT, UPDATE, DELETE), wrap it in a transaction you roll back to avoid mutating data.

Can execution plans vary between identical queries on different environments? #

Yes. random_page_cost, effective_cache_size, work_mem, hardware I/O characteristics, data distribution, table sizes, and statistics freshness all affect the planner’s cost model. A plan that is optimal on a production server with 128 GB RAM and NVMe storage may differ from the plan on a development laptop. Always verify tuning changes against production-representative data and config.

Why do estimated rows differ from actual rows in EXPLAIN ANALYZE? #

The planner derives row estimates from pg_statistic, populated by ANALYZE. When statistics are stale, the distribution is non-uniform (heavy skew, temporal correlation), or two columns are queried together with a cross-column dependency the planner does not know about, estimates diverge from reality. Extended statistics (CREATE STATISTICS) can capture multi-column dependencies. For single-column skew, increase statistics_target on the affected column.

When should I use SET LOCAL to override planner settings? #

SET LOCAL is the safe way to test configuration changes: it scopes the override to the current transaction and reverts automatically on commit or rollback. Use it to test work_mem, enable_hashjoin, enable_seqscan, or max_parallel_workers_per_gather in isolation before deciding whether to commit the change to postgresql.conf or ALTER ROLE ... SET.


Back to home