
Improving distinct values estimates in Postgres

3 votes
2 answers
2437 views
Full counts in Postgres can be slow, for reasons that are well understood and much discussed, so I've been using estimation techniques instead where possible. For row counts, pg_stats seems fine; for views, extracting the estimate returned by EXPLAIN works okay: https://www.cybertec-postgresql.com/en/count-made-fast/

But what about distinct values? Here I've had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by a factor of 2 or 20. Truncated tables, in particular, seem to have badly stale estimates (?).

I just ran this test and have provided the results:

```sql
analyze assembly_prods; -- Doing an ANALYZE to give pg_stats every help.

select 'count(*) distinct'        as method,
       count(*)                   as count
  from (select distinct assembly_id
          from assembly_prods) d
union all
select 'n_distinct from pg_stats' as method,
       n_distinct                 as count
  from pg_stats
 where tablename = 'assembly_prods'
   and attname   = 'assembly_id';
```

The results:

```
method                      count
count(*) distinct           28088
n_distinct from pg_stats    13805
```

That's only off by a factor of 2, but I've seen _much_ worse in my data, to the point where I won't use the estimates. Is there something else that I can try? Is this something that PG 12 improves?

# Follow-up

I hadn't ever experimented with SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've taken a quick look. Here's a useful comment from the documentation: https://www.postgresql.org/docs/current/planner-stats.html

> The amount of information stored in pg_statistic by ANALYZE, in
> particular the maximum number of entries in the most_common_vals and
> histogram_bounds arrays for each column, can be set on a
> column-by-column basis using the ALTER TABLE SET STATISTICS command,
> or globally by setting the default_statistics_target configuration
> variable. The default limit is presently 100 entries. Raising the
> limit might allow more accurate planner estimates to be made,
> particularly for columns with irregular data distributions, at the
> price of consuming more space in pg_statistic and slightly more time
> to compute the estimates. Conversely, a lower limit might be
> sufficient for columns with simple data distributions.

I often have tables with a few common values and a lot of rare values, or the other way around, so the right target will depend on the column.

For those who haven't used SET STATISTICS, it sets the per-column statistics target as a number of entries, which also controls how large a sample ANALYZE takes. The default is 100, so 1000 should give higher fidelity. Here's what that looks like:

```sql
ALTER TABLE assembly_prods
    ALTER COLUMN assembly_id
    SET STATISTICS 1000;
```

You can use SET STATISTICS on a table column or on an index. Here's an interesting piece on indexes: https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/

Note that the current documentation *does* list SET STATISTICS on indexes.

So I tried out targets of 1, 10, 100, 1,000, and 10,000 against a table with 467,767 rows and 28,088 distinct values, and got these results:

```
Target    Estimate    Difference    Missing
     1      13,657        14,431        51%
    10      13,867        14,221        51%
   100      13,759        14,329        51%
 1,000      24,746         3,342        12%
10,000      28,088             0         0%
```

Obviously you can't draw general conclusions from one case, but SET STATISTICS looks pretty darn useful and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit across the board, as I suspect it would help in many of the cases in our system.
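If I do end up raising the target across the board rather than per column, my understanding is that it comes down to default_statistics_target. A minimal sketch of that, where 300 is only a placeholder value and not a recommendation:

```sql
-- Sketch: raise the global statistics target instead of per-column.
ALTER SYSTEM SET default_statistics_target = 300;  -- placeholder value
SELECT pg_reload_conf();

-- The new target only affects statistics once tables are re-analyzed:
ANALYZE assembly_prods;
```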
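One caveat for anyone running a comparison like the one at the top: pg_stats.n_distinct is only an absolute count when it's positive. When ANALYZE decides the number of distinct values scales with the table size, it stores a negative value, the negated fraction of rows that are distinct. My column happened to come back positive, so the straight comparison was fair, but here's a rough sketch of normalizing either form into an absolute estimate (same example table and column as above):

```sql
-- Sketch: turn pg_stats.n_distinct into an absolute estimate,
-- whether it's stored as a count (>= 0) or as a fraction of rows (< 0).
select s.schemaname,
       s.tablename,
       s.attname,
       case
           when s.n_distinct >= 0 then s.n_distinct          -- already an absolute count
           else -s.n_distinct * c.reltuples                  -- fraction of the estimated row count
       end as estimated_distinct
  from pg_stats s
  join pg_namespace n on n.nspname      = s.schemaname
  join pg_class     c on c.relnamespace = n.oid
                     and c.relname      = s.tablename
 where s.tablename = 'assembly_prods'
   and s.attname   = 'assembly_id';
```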
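And since I mentioned pulling row estimates out of EXPLAIN for views at the top: for anyone curious, a minimal sketch of that technique, along the lines of the Cybertec article linked above, could look like this. The function name is arbitrary, not something from my schema, and it isn't hardened against untrusted input.

```sql
-- Sketch: get the planner's row estimate for an arbitrary query
-- by parsing EXPLAIN (FORMAT JSON) output.
CREATE OR REPLACE FUNCTION row_estimate(query text)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    plan jsonb;
BEGIN
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$;

-- Usage: estimate the rows in a view without counting them.
SELECT row_estimate('SELECT * FROM some_view');
```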
Asked by Morris de Oryx (939 rep)
Oct 1, 2019, 12:04 AM
Last activity: Apr 27, 2023, 09:44 PM