Why does Postgres use a filter instead of an index scan when there is a non-negligible number of null records?
This is my table:
learning_postgres=# create table r(id serial primary key, x int not null, y int);
CREATE TABLE
learning_postgres=# create index r_x_btree_idx on r using btree (x);
CREATE INDEX
learning_postgres=# create index r_y_btree_idx on r using btree (y);
CREATE INDEX
learning_postgres=# \d+ r
Table "public.r"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+-------------------------------+---------+-------------+--------------+-------------
id | integer | | not null | nextval('r_id_seq'::regclass) | plain | | |
x | integer | | not null | | plain | | |
y | integer | | | | plain | | |
Indexes:
"r_pkey" PRIMARY KEY, btree (id)
"r_x_btree_idx" btree (x)
"r_y_btree_idx" btree (y)
Access method: heap
After inserting 5 million records, plus a handful of records with a null y:
learning_postgres=# insert into r(x, y) select random() * 1000000, random() * 1000000 from generate_series(0, 5000000);
INSERT 0 5000001
learning_postgres=# insert into r(x) select random() * 1000000 from generate_series(0, 10);
INSERT 0 11
learning_postgres=# analyze r;
ANALYZE
learning_postgres=# select attname, null_frac from pg_stats where tablename = 'r';
attname | null_frac
---------+-----------
id | 0
x | 0
y | 0
(3 rows)
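Note that ANALYZE only inspects a sample of the table (300 rows per unit of statistics target, so 30,000 rows at the default target of 100), which presumably is why the 11 NULLs were missed entirely and null_frac reads 0. Below is a sketch for cross-checking the real null fraction against the sampled estimate; the catalogs are standard PostgreSQL, but the exact numbers will vary from run to run:

-- Cross-check: the true null fraction of y versus the sampled
-- estimate that ANALYZE stored in pg_stats.
select (count(*) filter (where y is null))::numeric / count(*) as actual_null_frac
from r;

select null_frac as sampled_null_frac
from pg_stats
where tablename = 'r' and attname = 'y';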
The query planner uses an index scan to find the rows where y IS NULL:
learning_postgres=# explain analyze select * from r where y is null order by x asc limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.46..8.46 rows=1 width=12) (actual time=0.031..0.032 rows=1 loops=1)
-> Sort (cost=8.46..8.46 rows=1 width=12) (actual time=0.030..0.030 rows=1 loops=1)
Sort Key: x
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using r_y_btree_idx on r (cost=0.43..8.45 rows=1 width=12) (actual time=0.016..0.018 rows=11 loops=1)
Index Cond: (y IS NULL)
Planning Time: 0.383 ms
Execution Time: 0.066 ms
(8 rows)
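If I understand the estimation machinery correctly, the rows=1 above (versus 11 actual) follows directly from the stored statistics: the planner estimates reltuples * null_frac matching rows and clamps the result to a minimum of one row, so a null_frac of 0 still yields an estimate of one. A sketch of that arithmetic, assuming the clamp-to-one behaviour; pg_class and pg_stats are standard catalogs:

-- Reproduce the planner's row estimate for "y IS NULL":
-- estimated rows = reltuples * null_frac, clamped to at least one row.
select greatest(1, round(c.reltuples * s.null_frac)) as estimated_rows
from pg_class c
join pg_stats s on s.tablename = c.relname
where c.relname = 'r' and s.attname = 'y';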
But after inserting more rows with a null y, the query planner chooses a filter over an index scan:
learning_postgres=# insert into r(x) select random() * 1000000 from generate_series(0, 1000);
INSERT 0 1001
learning_postgres=# analyze r;
ANALYZE
learning_postgres=# select attname, null_frac from pg_stats where tablename = 'r';
attname | null_frac
---------+-----------
x | 0
id | 0
y | 0.0002
(3 rows)
learning_postgres=# explain analyze select * from r where y is null order by x asc limit 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..220.45 rows=1 width=12) (actual time=40.041..40.043 rows=1 loops=1)
-> Index Scan using r_x_btree_idx on r (cost=0.43..220017.03 rows=1000 width=12) (actual time=40.040..40.040 rows=1 loops=1)
Filter: (y IS NULL)
Rows Removed by Filter: 7962
Planning Time: 0.354 ms
Execution Time: 40.068 ms
(6 rows)
Why is this the case? Why doesn't it use r_y_btree_idx with an index condition anymore?
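For what it's worth, one experiment that should force a comparison of the two plans is the well-known ORDER BY x + 0 trick: the result order is unchanged, but the expression stops r_x_btree_idx from satisfying the ORDER BY, so the planner has to fall back to the r_y_btree_idx plan. This is only a sketch; I have not included its output here:

-- "x + 0" keeps the same ordering but prevents the planner from using
-- r_x_btree_idx to provide it, pushing it back toward the y-index plan.
explain analyze
select * from r where y is null order by x + 0 asc limit 1;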
Asked by Amir reza Riahi
(155 rep)
Mar 27, 2025, 09:32 AM