Improving distinct values estimates in Postgres

3 votes · 2 answers · 2437 views
Full counts in Postgres can be slow, for reasons that are well understood and much discussed. So I've been using estimation techniques instead, where possible. For row counts, pg_stats seems fine; for views, extracting the row estimate returned by `EXPLAIN` works okay.
https://www.cybertec-postgresql.com/en/count-made-fast/
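For context, here's roughly what that looks like. The catalog lookup is standard; the `row_estimator` function is a sketch along the lines of the approach in the Cybertec article above, and the view name in the example call is just a placeholder:

```sql
-- Quick row-count estimate straight from the catalog
-- (assumes a reasonably recent ANALYZE or autovacuum pass):
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'assembly_prods'::regclass;

-- Pull the planner's row estimate out of EXPLAIN for an arbitrary query or view.
CREATE OR REPLACE FUNCTION row_estimator(query text)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    plan jsonb;
BEGIN
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$;

-- Example call (placeholder view name):
SELECT row_estimator('SELECT * FROM assembly_prods_v');
```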
But what about distinct values? Here, I've had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by a factor of 2 or 20. Truncated tables, in particular, seem to have badly stale estimates.
I just ran this test and have provided some results:
analyze assembly_prods; -- Doing an ANALYZE to give pg_stats every help.
select 'count(*) distinct' as method,
       count(*) as count
  from (select distinct assembly_id
          from assembly_prods) d
union all
select 'n_distinct from pg_stats' as method,
       n_distinct as count
  from pg_stats
 where tablename = 'assembly_prods' and
       attname   = 'assembly_id';
The results:
| method                   | count |
|--------------------------|------:|
| count(*) distinct        | 28088 |
| n_distinct from pg_stats | 13805 |
That's only off by a factor of 2, but I've seen _much_ worse in my data, to the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?
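One aside that can also throw the comparison off: `n_distinct` isn't always a plain count. When ANALYZE decides the number of distinct values grows with the table, it stores a *negative* value, minus the ratio of distinct values to rows, so a fair comparison has to scale by the row count. A sketch (assuming the table name is unique enough for a simple `relname` join):

```sql
-- Turn n_distinct into an absolute estimate, whichever form it is stored in:
--   >= 0 : an absolute count of distinct values
--   <  0 : -(distinct values / rows), so scale by the estimated row count
SELECT CASE
           WHEN s.n_distinct >= 0 THEN s.n_distinct
           ELSE -s.n_distinct * c.reltuples
       END::bigint AS estimated_distinct
FROM   pg_stats s
JOIN   pg_class c ON c.relname = s.tablename
WHERE  s.tablename = 'assembly_prods'
AND    s.attname   = 'assembly_id';
```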
# Follow-up #
I hadn't ever experimented with `SET STATISTICS` before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've taken a quick look. Here's a useful passage from the documentation:
https://www.postgresql.org/docs/current/planner-stats.html
> The amount of information stored in `pg_statistic` by `ANALYZE`, in
> particular the maximum number of entries in the `most_common_vals` and
> `histogram_bounds` arrays for each column, can be set on a
> column-by-column basis using the `ALTER TABLE SET STATISTICS` command,
> or globally by setting the `default_statistics_target` configuration
> variable. The default limit is presently 100 entries. Raising the
> limit might allow more accurate planner estimates to be made,
> particularly for columns with irregular data distributions, at the
> price of consuming more space in `pg_statistic` and slightly more time
> to compute the estimates. Conversely, a lower limit might be
> sufficient for columns with simple data distributions.
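To see what that limit actually did for a given column, you can count how many entries ANALYZE kept. The double cast is just a trick to turn the `anyarray` columns into something countable:

```sql
-- How many most-common-value and histogram entries ANALYZE kept for the column:
SELECT cardinality(most_common_vals::text::text[]) AS n_most_common_vals,
       cardinality(histogram_bounds::text::text[]) AS n_histogram_bounds
FROM   pg_stats
WHERE  tablename = 'assembly_prods'
AND    attname   = 'assembly_id';
```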
I often have tables with a few common values and a lot of rare values, or the other way around, so the right threshold will depend. For those who haven't used `SET STATISTICS`, it lets you raise the per-column statistics target, i.e. the number of entries ANALYZE keeps (it also scales how many rows ANALYZE samples). The default is 100, so 1000 should give higher fidelity. Here's what that looks like:
ALTER TABLE assembly_prods
ALTER COLUMN assembly_id
SET STATISTICS 1000;
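One thing to remember: the new target only applies to subsequent statistics collections, so re-run `ANALYZE` after changing it:

```sql
ANALYZE assembly_prods;
```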
You can use `SET STATISTICS` on table columns and, for expression indexes, on index columns. Here's an interesting piece on indexes:
https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/
Note that the current documentation *does* list `SET STATISTICS` on indexes.
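For completeness, the index form looks like this. The expression index here is hypothetical, purely to illustrate the syntax; index columns are addressed by position rather than by name, and only expression columns can have their target changed:

```sql
-- Hypothetical expression index, just to show the syntax:
CREATE INDEX assembly_prods_assembly_id_txt_idx
    ON assembly_prods ((assembly_id::text));

-- Index columns are referenced by 1-based position in ALTER INDEX:
ALTER INDEX assembly_prods_assembly_id_txt_idx
    ALTER COLUMN 1
    SET STATISTICS 1000;
```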
So I tried out thresholds of 1, 10, 100, 1000, and 10,000 and got these results out of a table with 467,767 rows and 28,088 distinct values:
| Target | Estimate | Difference | Missing |
|-------:|---------:|-----------:|--------:|
|      1 |   13,657 |     14,431 |     51% |
|     10 |   13,867 |     14,221 |     51% |
|    100 |   13,759 |     14,329 |     51% |
|  1,000 |   24,746 |      3,342 |     12% |
| 10,000 |   28,088 |          0 |      0% |
Obviously you can't draw any general conclusions from one case, but `SET STATISTICS` looks pretty darn useful, and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit in general, as I suspect it would help in many of the cases in our system.
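If I do raise it in general, the global knob is `default_statistics_target`. A rough sketch (the 500 is a placeholder, not a recommendation, and this needs superuser rights):

```sql
-- Raise the global default; per-column SET STATISTICS values still override it.
ALTER SYSTEM SET default_statistics_target = 500;
SELECT pg_reload_conf();

-- A fresh ANALYZE is needed before the larger samples show up in pg_statistic.
ANALYZE;
```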
Asked by Morris de Oryx (939 rep)
Oct 1, 2019, 12:04 AM
Last activity: Apr 27, 2023, 09:44 PM