How PostgreSQL Calculates Node Costs: A Diagnostic Guide #
When EXPLAIN reports (cost=0.00..431.20 rows=12450 width=64), those numbers are not guesses — they are the output of deterministic arithmetic that the planner runs before a single row is fetched. Every node in an execution plan carries a startup_cost and a total_cost, each derived from a linear combination of I/O and CPU constants applied to statistical estimates. Once you can reconstruct that arithmetic yourself, you stop treating EXPLAIN output as a black box and start treating it as a precise diagnostic instrument.
This page focuses on the specific formulas for individual node types. For the broader framework of how PostgreSQL’s cost estimation models govern plan selection, including how these constants interact with buffer-pool behavior and cache-hit probability, see that overview.
The Cost Formula: I/O and CPU Components #
Every node cost is a linear combination of page reads and row operations:
node_cost = (pages_read_sequentially × seq_page_cost)
+ (pages_read_randomly × random_page_cost)
+ (rows_processed × cpu_tuple_cost)
+ (index_tuples_scanned × cpu_index_tuple_cost)
+ (operators_evaluated × cpu_operator_cost)
The planner aggregates these values bottom-up, starting from leaf scans and propagating through joins and aggregations. The result is a dimensionless metric — not milliseconds, not CPU seconds — used solely for comparing candidate paths.
The diagram below shows how cost flows from leaf nodes through a join to the root:
Page-Level Costs: seq_page_cost and random_page_cost #
The two page-level parameters dictate the I/O penalty. Their defaults reflect spinning-disk assumptions:
| Parameter | Default | Meaning |
|---|---|---|
seq_page_cost |
1.0 |
Reference unit — one sequentially read page |
random_page_cost |
4.0 |
4× penalty for a page fetched via random seek |
This 4:1 ratio is the single most commonly misaligned constant on modern hardware. On NVMe or SSD storage, random seeks are 1.1–1.5× the cost of sequential reads, not 4×. When random_page_cost is too high, the planner overestimates the cost of every index lookup, systematically preferring sequential vs index scans even on highly selective queries.
A concrete example: with default constants, a query returning 500 rows from a 100,000-row table may show a sequential scan as cheaper than an index scan, even though the index would actually be 10× faster on SSD. Lowering random_page_cost to 1.2 and re-running EXPLAIN is often sufficient to confirm the misalignment.
Tuple-Level Costs: cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost #
CPU costs scale linearly with estimated row counts:
| Parameter | Default | Charged when |
|---|---|---|
cpu_tuple_cost |
0.01 |
Processing one heap tuple |
cpu_index_tuple_cost |
0.005 |
Evaluating one index entry |
cpu_operator_cost |
0.0025 |
Executing one operator or function call |
These values appear small, but they compound quickly. A query scanning 500,000 rows applies 500000 × 0.01 = 5000 cost units purely from tuple processing. If the WHERE clause contains two predicates, add 500000 × 2 × 0.0025 = 2500 more from operator evaluation.
When EXPLAIN shows unexpectedly high total costs on filter-heavy queries, verify that WHERE clause selectivity aligns with pg_stats. High CPU costs on filter-heavy queries often indicate missing partial indexes or inefficient filter pushdown rather than hardware bottlenecks.
Startup vs Total Cost: What the Planner Optimizes #
Every node reports two values:
startup_cost: work required before the first row is emitted — for example, building a hash table or sorting an input.total_cost: cumulative work to return all estimated rows.
The planner minimizes total_cost by default. It switches to minimizing startup_cost only when LIMIT or cursor-based fetching is detected, because in those cases only the first few rows matter.
A Hash Join node always shows a non-zero startup_cost equal to the cost of materializing the build-side (inner) relation into memory. If that startup cost is very high — say, startup_cost=8200 on a join over a large table — the planner is warning you that building the hash table dominates setup time. That is a signal to check work_mem, since insufficient memory forces disk batching that the planner’s cost model only partially anticipates.
Annotated EXPLAIN Evidence #
The output below isolates the fields that confirm cost arithmetic is working correctly:
EXPLAIN (COSTS ON, FORMAT TEXT)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.created_at > NOW() - INTERVAL '7 days';
Hash Join (cost=62.40..431.20 rows=312 width=36)
-- startup=62.40: hash build cost (customers)
-- total=431.20: build + probe across filtered orders
Hash Cond: (o.customer_id = c.customer_id)
-> Seq Scan on orders (cost=0.00..322.50 rows=1250 width=20)
-- 1250 rows × cpu_tuple_cost + pages × seq_page_cost
Filter: (created_at > (now() - '7 days'::interval))
-> Hash (cost=50.00..50.00 rows=1000 width=20)
-- startup=0: leaf read; total=build cost
-> Seq Scan on customers (cost=0.00..50.00 rows=1000 width=20)
Key diagnostic fields:
cost=62.40..431.20— startup is exactly the hash build cost from theHashchild noderows=312— the planner’s selectivity estimate for the date filter; verify againstpg_statsif this deviates significantly from actual rows inEXPLAIN ANALYZE- The
Seq Scan on orderscost reflects:orders_pages × 1.0 + 1250 × 0.01
Step-by-Step Resolution Workflow #
1. Baseline current constants — query active planner settings before touching anything:
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE '%cost%' OR name = 'effective_cache_size'
ORDER BY name;
-- Captures: cpu_index_tuple_cost, cpu_operator_cost, cpu_tuple_cost,
-- parallel_setup_cost, parallel_tuple_cost,
-- random_page_cost, seq_page_cost, effective_cache_size
2. Extract per-node costs in JSON format — parse startup_cost and total_cost to isolate which node contributes most:
EXPLAIN (COSTS ON, FORMAT JSON)
SELECT * FROM orders
WHERE status = 'pending'
AND created_at > NOW() - INTERVAL '30 days';
-- In the JSON output, look for "Startup Cost" and "Total Cost" per node.
-- The node with the largest Total Cost relative to its parent is the hot spot.
3. Test storage-tier alignment — use SET LOCAL to adjust random_page_cost without persisting:
SET LOCAL random_page_cost = 1.2;
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
-- If the plan switches from Seq Scan to Index Scan, the default 4.0 is
-- overvaluing random I/O for your storage hardware.
4. Compare estimated vs actual page access — run EXPLAIN (ANALYZE, BUFFERS) and confirm the planner’s arithmetic:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE status = 'pending';
-- Check: "Buffers: shared hit=N read=M"
-- If M (blocks read from disk) << planner's page estimate,
-- effective_cache_size is set too low and index scans look artificially expensive.
5. Refresh statistics on stale tables — if actual rows deviates from rows by more than 10×:
ANALYZE VERBOSE orders;
-- Then re-run EXPLAIN to confirm row estimates have converged.
-- Verify histogram quality:
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
6. Persist validated constants — after confirming improvements at the session level:
-- Per-role (targets a specific application account):
ALTER ROLE app_user SET random_page_cost = 1.2;
ALTER ROLE app_user SET effective_cache_size = '8GB';
-- Or per-database:
ALTER DATABASE mydb SET random_page_cost = 1.2;
Before/After Plan Comparison #
The two-line diff below shows the effect of correcting random_page_cost from 4.0 to 1.2 on a selective query:
-- Before (random_page_cost = 4.0):
Seq Scan on orders (cost=0.00..3420.00 rows=85 width=48)
Filter: ((status = 'pending') AND (customer_id = 9142))
-- After (random_page_cost = 1.2):
Index Scan using orders_customer_status_idx on orders (cost=0.29..18.40 rows=85 width=48)
Index Cond: ((customer_id = 9142) AND (status = 'pending'))
The total cost dropped from 3420.00 to 18.40 — a 186× reduction. The plan was always capable of using the index; the misaligned cost constant prevented it.
Common Pitfalls #
Treating cost values as milliseconds. Cost is strictly a unitless relative metric. A node with cost=5000 may execute in 2ms; a node with cost=12 may take 400ms if it triggers a lock wait.
Ignoring pg_stats.correlation. The correlation column in pg_stats tells the planner how physically ordered a column’s data is relative to the heap. A correlation near 1.0 or -1.0 means an index scan can fetch pages sequentially, drastically reducing the random I/O penalty. Treating all index scans as fully random — as the planner does when correlation is near 0 — inflates their cost estimate.
Permanently altering cost constants without workload profiling. Lowering random_page_cost globally improves OLTP queries but may cause regressions on OLAP queries that genuinely benefit from sequential scans of large tables. Validate across your full representative query mix, not just the query you’re currently debugging.
Assuming high total_cost equals slow execution. Actual runtime depends on cache state, parallelism, and I/O concurrency — none of which are directly modeled in the cost formula. Always verify with EXPLAIN (ANALYZE, BUFFERS).
Frequently Asked Questions #
Are PostgreSQL EXPLAIN costs measured in milliseconds? No. Costs are unitless relative estimates — a weighted combination of I/O operations and CPU cycles. They exist solely for the planner to compare candidate paths, not to predict runtime.
How do I safely adjust cost constants without breaking other queries?
Use SET LOCAL at the session level to test. Validate the change across a representative sample of your production query patterns before persisting. Apply constants at the role or database level (ALTER ROLE, ALTER DATABASE) to isolate tuning to specific workloads.
How does effective_cache_size influence node cost calculation?
effective_cache_size does not enter the node cost arithmetic directly. It adjusts the planner’s probability that a randomly accessed page is already in the OS page cache or shared_buffers. A higher value reduces the effective random I/O penalty, making index scans comparatively cheaper — which is why setting it accurately matters on systems with large RAM.
Related
- Cost Estimation Models — how all cost constants interact with statistics and buffer behavior
- Sequential vs Index Scans — when the planner prefers a full scan over an index path
- Reading Cost vs Actual Time in EXPLAIN ANALYZE — diagnosing gaps between the planner’s projection and measured wall-clock time
- Execution Plan Fundamentals — the broader foundation these formulas build on