Improving distinct values estimates in Postgres

3 votes · 2 answers · 2437 views
Full counts in Postgres can be slow, for reasons that are well understood and much discussed. So I've been using estimation techniques instead, where possible. For row counts, pg_stats seems fine; for views, extracting the row estimate returned by `EXPLAIN` works okay.
https://www.cybertec-postgresql.com/en/count-made-fast/
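For context, here's roughly what that looks like. The catalog lookup is standard; the `row_estimator` function is a sketch along the lines of the approach in the Cybertec article above, and the view name in the example call is just a placeholder:

```sql
-- Quick row-count estimate straight from the catalog
-- (assumes a reasonably recent ANALYZE or autovacuum pass):
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'assembly_prods'::regclass;

-- Pull the planner's row estimate out of EXPLAIN for an arbitrary query or view.
CREATE OR REPLACE FUNCTION row_estimator(query text)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    plan jsonb;
BEGIN
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$;

-- Example call (placeholder view name):
SELECT row_estimator('SELECT * FROM assembly_prods_v');
```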
But what about distinct values? Here, I've had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by a factor of 2 or 20. Truncated tables, in particular, seem to have badly stale estimates.
I just ran this test and have provided some results:
analyze assembly_prods; -- Doing an ANALYZE to give pg_stats every help.
select 'count(*) distinct' as method,
       count(*) as count
  from (select distinct assembly_id
          from assembly_prods) d
union all
select 'n_distinct from pg_stats' as method,
       n_distinct as count
  from pg_stats
 where tablename = 'assembly_prods' and
       attname   = 'assembly_id';
The results:
| method                   | count |
|--------------------------|------:|
| count(*) distinct        | 28088 |
| n_distinct from pg_stats | 13805 |
That's only off by a factor of 2, but I've seen _much_ worse in my data, to the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?
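One aside that can also throw the comparison off: `n_distinct` isn't always a plain count. When ANALYZE decides the number of distinct values grows with the table, it stores a *negative* value, minus the ratio of distinct values to rows, so a fair comparison has to scale by the row count. A sketch (assuming the table name is unique enough for a simple `relname` join):

```sql
-- Turn n_distinct into an absolute estimate, whichever form it is stored in:
--   >= 0 : an absolute count of distinct values
--   <  0 : -(distinct values / rows), so scale by the estimated row count
SELECT CASE
           WHEN s.n_distinct >= 0 THEN s.n_distinct
           ELSE -s.n_distinct * c.reltuples
       END::bigint AS estimated_distinct
FROM   pg_stats s
JOIN   pg_class c ON c.relname = s.tablename
WHERE  s.tablename = 'assembly_prods'
AND    s.attname   = 'assembly_id';
```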
# Follow-up #
I hadn't ever experimented with `SET STATISTICS` before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've taken a quick look. Here's a useful passage from the documentation:
https://www.postgresql.org/docs/current/planner-stats.html
> The amount of information stored in `pg_statistic` by `ANALYZE`, in
> particular the maximum number of entries in the `most_common_vals` and
> `histogram_bounds` arrays for each column, can be set on a
> column-by-column basis using the `ALTER TABLE SET STATISTICS` command,
> or globally by setting the `default_statistics_target` configuration
> variable. The default limit is presently 100 entries. Raising the
> limit might allow more accurate planner estimates to be made,
> particularly for columns with irregular data distributions, at the
> price of consuming more space in `pg_statistic` and slightly more time
> to compute the estimates. Conversely, a lower limit might be
> sufficient for columns with simple data distributions.
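To see what that limit actually did for a given column, you can count how many entries ANALYZE kept. The double cast is just a trick to turn the `anyarray` columns into something countable:

```sql
-- How many most-common-value and histogram entries ANALYZE kept for the column:
SELECT cardinality(most_common_vals::text::text[]) AS n_most_common_vals,
       cardinality(histogram_bounds::text::text[]) AS n_histogram_bounds
FROM   pg_stats
WHERE  tablename = 'assembly_prods'
AND    attname   = 'assembly_id';
```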
I often have tables with a few common values and a lot of rare values, or the other way around, so the right threshold will depend. For those who haven't used `SET STATISTICS`, it lets you raise the per-column statistics target, i.e. the number of entries ANALYZE keeps (it also scales how many rows ANALYZE samples). The default is 100, so 1000 should give higher fidelity. Here's what that looks like:
ALTER TABLE assembly_prods
ALTER COLUMN assembly_id
SET STATISTICS 1000;
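One thing to remember: the new target only applies to subsequent statistics collections, so re-run `ANALYZE` after changing it:

```sql
ANALYZE assembly_prods;
```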
You can use `SET STATISTICS` on table columns and, for expression indexes, on index columns. Here's an interesting piece on indexes:
https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/
Note that the current documentation *does* list `SET STATISTICS` on indexes.
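For completeness, the index form looks like this. The expression index here is hypothetical, purely to illustrate the syntax; index columns are addressed by position rather than by name, and only expression columns can have their target changed:

```sql
-- Hypothetical expression index, just to show the syntax:
CREATE INDEX assembly_prods_assembly_id_txt_idx
    ON assembly_prods ((assembly_id::text));

-- Index columns are referenced by 1-based position in ALTER INDEX:
ALTER INDEX assembly_prods_assembly_id_txt_idx
    ALTER COLUMN 1
    SET STATISTICS 1000;
```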
So I tried out thresholds of 1, 10, 100, 1000, and 10,000 and got these results out of a table with 467,767 rows and 28,088 distinct values:
| Target | Estimate | Difference | Missing |
|-------:|---------:|-----------:|--------:|
|      1 |   13,657 |     14,431 |     51% |
|     10 |   13,867 |     14,221 |     51% |
|    100 |   13,759 |     14,329 |     51% |
|  1,000 |   24,746 |      3,342 |     12% |
| 10,000 |   28,088 |          0 |      0% |
Obviously you can't draw any general conclusions from one case, but `SET STATISTICS` looks pretty darn useful, and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit in general, as I suspect it would help in many of the cases in our system.
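If I do raise it in general, the global knob is `default_statistics_target`. A rough sketch (the 500 is a placeholder, not a recommendation, and this needs superuser rights):

```sql
-- Raise the global default; per-column SET STATISTICS values still override it.
ALTER SYSTEM SET default_statistics_target = 500;
SELECT pg_reload_conf();

-- A fresh ANALYZE is needed before the larger samples show up in pg_statistic.
ANALYZE;
```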
Asked by Morris de Oryx (939 rep)
Oct 1, 2019, 12:04 AM
Last activity: Apr 27, 2023, 09:44 PM