Sequential vs Index Scans: Execution Plan Diagnostics #
The PostgreSQL optimizer compares the estimated cost of every candidate access path and picks the cheapest one — and a full sequential scan is sometimes genuinely the correct choice. Knowing when each path wins, and how to read the evidence in EXPLAIN (ANALYZE, BUFFERS) output, is the first skill covered in Execution Plan Fundamentals.
When the Optimizer Chooses Each Access Path #
The planner calculates an estimated cost for every candidate access path and picks the lowest. Selectivity — the fraction of rows a predicate returns — is the dominant factor. Understanding how cost estimation models translate statistics into path costs makes every subsequent diagnostic step clearer.
Sequential scan reads all heap pages in storage order. The planner estimates it as seq_page_cost × relpages + cpu_tuple_cost × reltuples. Because OS read-ahead makes sequential I/O very efficient, this path wins whenever a predicate returns a large fraction of rows — typically above 5–15%, though the threshold shifts with random_page_cost and available cache.
Index scan traverses the B-tree to a matching leaf, collects tuple identifiers (TIDs), then fetches each heap page individually. The per-row heap fetch costs random_page_cost in the cold case. For selective predicates returning a small number of rows (under 1–5% of the table), index scans outperform sequential scans decisively. As selectivity rises, the heap-fetch pattern becomes increasingly random across the whole table, and the cumulative random I/O cost surpasses the sequential scan estimate.
Index only scan skips the heap entirely when the index covers every column the query projects. This path requires both a covering index and a current visibility map (kept current by VACUUM). Any non-zero Heap Fetches in the plan output signals that the visibility map has stale entries and a VACUUM is overdue. Covering index design explores how to build indexes that satisfy this requirement.
Bitmap heap scan sits between the two extremes. The Bitmap Index Scan child collects all matching TIDs into an in-memory bitmap, then the Bitmap Heap Scan parent sorts them by physical page number and fetches heap pages in that order. This converts random I/O into near-sequential access for medium-selectivity predicates. When the bitmap exceeds work_mem, PostgreSQL degrades to a “lossy” bitmap that marks whole pages rather than individual rows, requiring a recheck pass (Recheck Cond in the plan).
The random_page_cost GUC is the primary lever that shifts these thresholds. The default value of 4.0 reflects spinning-disk access patterns. On SSDs, setting it to 1.1–1.5 narrows the gap between sequential and random I/O, causing the planner to prefer index scans at significantly higher selectivity levels. Changing this setting per tablespace is safer than an instance-wide session override.
Annotated EXPLAIN Node Breakdown #
Run the canonical diagnostic form to expose every scan metric:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, status FROM orders WHERE created_at > '2024-01-01';
A sequential scan output for a high-selectivity predicate looks like this:
Seq Scan on orders (cost=0.00..12450.50 rows=85000 width=12)
(actual time=0.012..42.150 rows=84120 loops=1)
Filter: (created_at > '2024-01-01'::timestamp)
Rows Removed by Filter: 15880
Buffers: shared hit=210 read=1890
Planning Time: 0.120 ms
Execution Time: 48.310 ms
Field-by-field breakdown:
| Field | What it tells you |
|---|---|
cost=0.00..12450.50 |
Startup cost (0) and total cost — unitless planner metrics, not milliseconds |
rows=85000 |
Planner’s estimate based on table statistics |
actual rows=84120 |
True row count; close to estimate means healthy statistics |
Rows Removed by Filter: 15880 |
Rows read but discarded; ~16% rejection rate confirms the predicate is high-selectivity — sequential scan is correct here |
Buffers: shared hit=210 read=1890 |
210 pages served from shared_buffers (cache), 1890 read from disk; high read count is expected for a cold sequential scan |
loops=1 |
This node executed once; a loops value above 1 on a scan node inside a join indicates nested execution — multiply times accordingly |
For an index scan on a selective predicate:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, status FROM orders WHERE order_id = 99521;
Index Scan using idx_orders_pkey on orders
(cost=0.42..8.44 rows=1 width=12)
(actual time=0.045..0.048 rows=1 loops=1)
Index Cond: (order_id = 99521)
Buffers: shared hit=3
Planning Time: 0.095 ms
Execution Time: 0.072 ms
Buffers: shared hit=3 confirms just three pages were read: the root, an internal node, and a leaf of the B-tree (all already cached). actual rows=1 matches the estimate. This is optimal index scan behaviour.
For an index only scan:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, status FROM orders WHERE status = 'shipped';
Index Only Scan using idx_orders_status_id on orders
(cost=0.42..125.30 rows=1200 width=8)
(actual time=0.045..1.120 rows=1180 loops=1)
Index Cond: (status = 'shipped'::text)
Heap Fetches: 0
Buffers: shared hit=12
Planning Time: 0.080 ms
Execution Time: 1.250 ms
Heap Fetches: 0 confirms every result row was served from the index. If Heap Fetches were, say, 900 out of 1180 rows, the visibility map is stale: run VACUUM orders; and recheck.
Algorithm Internals: Costs, Phases, and Spill Conditions #
Bitmap scan phases and lossy degradation #
The bitmap heap scan runs in two steps the plan tree makes explicit:
Bitmap Heap Scan on orders (cost=42.10..1850.90 rows=3200 width=12)
(actual time=1.2..18.4 rows=3150 loops=1)
Recheck Cond: (created_at BETWEEN '2024-01-01' AND '2024-06-30')
Heap Blocks: exact=820
Buffers: shared hit=120 read=720
-> Bitmap Index Scan on idx_orders_created_at
(cost=0.00..41.30 rows=3200 width=0)
(actual time=0.9..0.9 rows=3150 loops=1)
Index Cond: (created_at BETWEEN '2024-01-01' AND '2024-06-30')
Buffers: shared read=18
Heap Blocks: exact=820 means every page was resolved to exact row positions — the bitmap stayed within work_mem. If you see Heap Blocks: lossy=820, the bitmap overflowed memory and the Recheck Cond filter is performing a true row-level recheck on every tuple in each page, not a no-op. Increasing work_mem for the session eliminates lossy mode.
The scan node that drives the downstream hash join or merge join vs nested loop choice is often a bitmap heap scan: it delivers rows in storage order, which makes a merge join feasible without an extra sort node.
Index access cost model in detail #
The planner estimates index scan total cost as:
startup_cost = cost to descend B-tree to first leaf
run_cost = (index_correlation_factor × seq_page_cost
+ (1 − index_correlation_factor) × random_page_cost)
× pages_fetched
+ cpu_index_tuple_cost × index_rows
+ cpu_tuple_cost × table_rows
index_correlation (visible in pg_stats.correlation) measures how well the index’s physical order matches the heap’s physical order. A correlation of 1.0 means the index and heap rows are in the same order — heap fetches are near-sequential and cost approaches seq_page_cost. A correlation near 0.0 means each heap fetch is a random I/O. Newly built indexes on append-only tables start with high correlation; after many updates it degrades toward 0.
Run this query to check correlation for a column you suspect is causing bad scan selection:
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'created_at';
If correlation is below 0.2 on a heavily read range-scan column, a CLUSTER orders USING idx_orders_created_at; (or a CREATE INDEX CONCURRENTLY rebuild after reordering the heap) can restore near-sequential access patterns for that index.
Memory, I/O, and Resource Behaviour #
Sequential scans bypass the standard work_mem limit — they read pages into the ring buffer (a small circular section of shared_buffers that prevents a large scan from evicting hot data) and are governed by effective_io_concurrency for prefetch. Index scans consume comparatively little memory; the cost is random I/O, not RAM.
Bitmap heap scans depend on work_mem. The bitmap for N matching rows occupies roughly N × 6 bytes. For 1,000,000 rows that is ~6 MB. With the default work_mem = 4MB, any bitmap beyond ~660,000 rows goes lossy. Raise work_mem at session level before a known large bitmap query:
SET LOCAL work_mem = '32MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE created_at >= '2023-01-01';
Index only scans consume minimal memory and produce the fewest buffer reads of any scan type. Their cost is dominated by visibility map freshness: on tables with frequent updates and infrequent VACUUM, Heap Fetches grows toward the index scan level, eliminating the performance advantage.
Step-by-Step Tuning Workflow #
Follow these steps in order when scan selection looks wrong.
-
Capture a baseline. Run
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)on the slow query. Recordactual rows,Buffers: shared read, andExecution Time. -
Check statistics freshness. Look up the last analyze time:
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'orders';If
last_analyzeis NULL or more than a few hours old on an active table, runANALYZE orders;and re-explain. -
Check for type-cast predicate suppression. Run with
VERBOSEto surface implicit casts:EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT id FROM orders WHERE created_at > '2024-01-01';A cast like
(created_at)::text > '...'forces a sequential scan regardless of available indexes. Fix the application query or add a functional index. -
Measure index bloat. Dead tuple accumulation inflates index pages and increases the random I/O cost of index scans:
SELECT * FROM pgstattuple('idx_orders_created_at'); -- dead_tuple_percent > 5% warrants REINDEX CONCURRENTLY -
Check index correlation. For range queries on timestamp or ID columns:
SELECT attname, correlation FROM pg_stats WHERE tablename = 'orders' AND attname = 'created_at';Correlation below 0.3 means heap fetches are nearly random — the index scan cost estimate may be underestimated, or
random_page_costmay need tuning for your storage type. -
Isolate the forced-index plan. Disable sequential scans temporarily to see what the index plan would cost:
SET LOCAL enable_seqscan = off; EXPLAIN (ANALYZE, BUFFERS) SELECT id, status FROM orders WHERE created_at > '2024-01-01'; SET LOCAL enable_seqscan = on;Compare the
Buffersand execution time. If the forced index plan is slower, the optimizer was correct. If it is faster, a statistics or cost-parameter problem is misdirecting the planner.
Further reading on threshold-level selectivity failures is in Why Index Scans Sometimes Underperform.
Common Pitfalls #
Assuming index scans are always faster. For predicates returning more than ~10% of rows, sequential I/O wins because it avoids random heap fetches. Signal: Buffers: shared read is high and the index scan takes longer than the sequential baseline.
Ignoring index bloat. Dead tuple accumulation increases physical index size, adding extra B-tree pages and raising the I/O cost of every index lookup. Signal: pgstattuple shows dead_tuple_percent above 5%. Fix: REINDEX CONCURRENTLY idx_name;.
Implicit type casting disabling index use. A function or implicit cast on the indexed column prevents the planner from matching the index. Signal: EXPLAIN (ANALYZE, VERBOSE) shows a cast in the Filter line rather than an Index Cond. Fix: rewrite the predicate to match the column’s native type, or create a functional index.
Stale statistics after bulk loads. After loading millions of rows, autovacuum may not have run yet. The planner uses the old row count and distribution, producing wildly wrong estimates. Signal: large gap between rows= estimate and actual rows in EXPLAIN output. Fix: ANALYZE table_name; immediately after bulk loads.
Lossy bitmap without memory adjustment. A bitmap scan that goes lossy adds a full-page recheck pass, increasing CPU and I/O. Signal: Heap Blocks: lossy in the plan. Fix: SET LOCAL work_mem = 'XMB'; where X is enough to keep the bitmap exact.
Relying on random_page_cost defaults on SSD storage. The default of 4.0 was tuned for spinning disks. Leaving it at 4.0 on SSDs causes the planner to over-penalise index scans, selecting sequential scans for queries that would run faster with index access. Fix: set random_page_cost = 1.1 (or 1.5 for shared-storage cloud volumes) in postgresql.conf or at tablespace level.
Frequently Asked Questions #
When should I allow a sequential scan instead of forcing an index scan?
Sequential scans are optimal when retrieving more than roughly 5–15% of a table’s rows, when the table fits entirely in shared_buffers, or when the query requires full-table aggregation. Forcing an index scan in those scenarios increases random I/O, adds per-row heap pointer chasing, and degrades throughput. Use SET LOCAL enable_seqscan = off only for diagnostic isolation, not production tuning.
How do I diagnose why an existing index is being ignored?
Check for implicit data type conversions in predicates — a cast on the indexed column disables index use. Verify statistics are current with ANALYZE. Inspect index bloat via pgstattuple. Finally, compare the planner’s row estimate against actual rows in EXPLAIN (ANALYZE) output; a large mismatch points to stale or skewed statistics.
What is the difference between an Index Scan and an Index Only Scan?
An Index Scan reads the index B-tree to locate row pointers (TIDs) and then fetches the actual tuple from the heap — two I/O operations per row in the cold case. An Index Only Scan retrieves every requested column directly from the index, skipping the heap entirely, provided the index covers all projected columns and the visibility map is current. Heap Fetches: 0 in EXPLAIN output confirms a true index-only execution.
How does lowering random_page_cost change scan selection?
The optimizer weights random I/O at random_page_cost (default 4.0) relative to sequential I/O at seq_page_cost (default 1.0). On SSDs where random reads cost nearly as much as sequential reads, setting random_page_cost to 1.1–1.5 causes the planner to prefer index scans at much higher selectivity levels. Always benchmark before changing this setting cluster-wide; per-tablespace overrides are safer than session-level changes.
Related #
- Why Index Scans Sometimes Underperform — selectivity thresholds and visibility map failures explained in depth
- Cost Estimation Models — how the planner converts statistics into unitless cost values
- Hash Join Mechanics — how scan node output feeds the build and probe phases of a hash join
- Merge Join vs Nested Loop — how scan ordering affects join algorithm selection
- Covering Index Design — building indexes that satisfy index-only scan eligibility