
Why does Postgres use filter instead of index scan when there are non-negligible number of null records?

This is my table:
learning_postgres=# create table r(id serial primary key, x int not null, y int);
CREATE TABLE
learning_postgres=# create index r_x_btree_idx on r using btree (x);
CREATE INDEX
learning_postgres=# create index r_y_btree_idx on r using btree (y);
CREATE INDEX
learning_postgres=# \d+ r
                                                       Table "public.r"
 Column |  Type   | Collation | Nullable |            Default            | Storage | Compression | Stats target | Description 
--------+---------+-----------+----------+-------------------------------+---------+-------------+--------------+-------------
 id     | integer |           | not null | nextval('r_id_seq'::regclass) | plain   |             |              | 
 x      | integer |           | not null |                               | plain   |             |              | 
 y      | integer |           |          |                               | plain   |             |              | 
Indexes:
    "r_pkey" PRIMARY KEY, btree (id)
    "r_x_btree_idx" btree (x)
    "r_y_btree_idx" btree (y)
Access method: heap
After inserting 5 million records, plus a very small number of records with a null y:
learning_postgres=# insert into r(x, y) select random() * 1000000, random() * 1000000 from generate_series(0, 5000000);
INSERT 0 5000001
learning_postgres=# insert into r(x) select random() * 1000000 from generate_series(0, 10);
INSERT 0 11
learning_postgres=# analyze r;
ANALYZE
learning_postgres=# select attname, null_frac from pg_stats where tablename = 'r';
 attname | null_frac 
---------+-----------
 id      |         0
 x       |         0
 y       |         0
(3 rows)
The query planner uses an index scan to find the rows where y is NULL:
learning_postgres=# explain analyze select * from r where y is null order by x asc limit 1;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.46..8.46 rows=1 width=12) (actual time=0.031..0.032 rows=1 loops=1)
   ->  Sort  (cost=8.46..8.46 rows=1 width=12) (actual time=0.030..0.030 rows=1 loops=1)
         Sort Key: x
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using r_y_btree_idx on r  (cost=0.43..8.45 rows=1 width=12) (actual time=0.016..0.018 rows=11 loops=1)
               Index Cond: (y IS NULL)
 Planning Time: 0.383 ms
 Execution Time: 0.066 ms
(8 rows)
But after inserting more records with a null y, the query planner chooses a filter over an index scan on y:
learning_postgres=# insert into r(x) select random() * 1000000 from generate_series(0, 1000);
INSERT 0 1001
learning_postgres=# analyze r;
ANALYZE
learning_postgres=# select attname, null_frac from pg_stats where tablename = 'r';
 attname | null_frac 
---------+-----------
 x       |         0
 id      |         0
 y       |    0.0002
(3 rows)

learning_postgres=# explain analyze select * from r where y is null order by x asc limit 1;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..220.45 rows=1 width=12) (actual time=40.041..40.043 rows=1 loops=1)
   ->  Index Scan using r_x_btree_idx on r  (cost=0.43..220017.03 rows=1000 width=12) (actual time=40.040..40.040 rows=1 loops=1)
         Filter: (y IS NULL)
         Rows Removed by Filter: 7962
 Planning Time: 0.354 ms
 Execution Time: 40.068 ms
(6 rows)
Why is this the case? Why doesn't it use an index scan on y?
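For reference, one common way to compare against the y-index plan (a hypothetical diagnostic, not part of the original session) is to wrap the sort key in a no-op expression so the planner can no longer use r_x_btree_idx to produce pre-sorted output:

```sql
-- "x + 0" defeats the index-ordered plan on r_x_btree_idx, so the planner
-- falls back to scanning r_y_btree_idx and sorting, which exposes that
-- plan's estimated cost for comparison.
explain analyze
select * from r
where y is null
order by x + 0 asc
limit 1;
```

Comparing the estimated costs of the two plans side by side makes the planner's trade-off visible.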
Asked by Amir reza Riahi (155 rep)
Mar 27, 2025, 09:32 AM