Slow COUNT DISTINCT on a large table in PostgreSQL

In PostgreSQL 14, I have a table with around 10 million rows and the following structure:
CREATE TABLE search_stats (
	id bigserial NOT NULL,
	search_date timestamp NOT NULL,
	user_id varchar NULL,
	search_query varchar NOT NULL
)
For each search_query, I want to retrieve the total number of searches and the number of distinct user_id values associated with it. This is the query I have come up with:
SELECT search_query, count(*) AS total_number_of_searches, 
	count(DISTINCT user_id) AS total_number_of_users
FROM search_stats 
WHERE search_date >= '2023-01-01' AND search_date < '2023-07-01'
[...]
  ->  WindowAgg  (cost=100614.65..353591.31 rows=1552612 width=68) (actual time=14207.366..14836.972 rows=1656788 loops=1)
        Buffers: shared hit=15953 read=29111, temp read=21361 written=15188
        ->  GroupAggregate  (cost=100614.65..334183.66 rows=1552612 width=60) (actual time=7578.995..13339.283 rows=1656788 loops=1)
              Group Key: search_stats.search_query
              Buffers: shared hit=15953 read=29111
              ->  Gather Merge  (cost=100614.65..305804.74 rows=1713707 width=109) (actual time=7578.971..10803.777 rows=1714043 loops=1)
                    Workers Planned: 4
                    Workers Launched: 4
                    Buffers: shared hit=15953 read=29111
                    ->  Sort  (cost=99614.59..100685.65 rows=428426 width=109) (actual time=5440.355..5521.071 rows=342809 loops=5)
                          Sort Key: search_stats.search_query
                          Sort Method: quicksort  Memory: 113434kB
                          Buffers: shared hit=15953 read=29111
                          Worker 0:  Sort Method: quicksort  Memory: 93953kB
                          Worker 1:  Sort Method: quicksort  Memory: 95874kB
                          Worker 2:  Sort Method: quicksort  Memory: 43844kB
                          Worker 3:  Sort Method: quicksort  Memory: 38459kB
                          ->  Parallel Append  (cost=0.00..59538.15 rows=428426 width=109) (actual time=12.168..142.869 rows=342809 loops=5)
                                Buffers: shared hit=15785 read=29111
                                ->  Parallel Seq Scan on mdr_querystats_part_may search_stats_5  (cost=0.00..8218.54 rows=119279 width=110) (actual time=0.018..87.771 rows=286326 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_jun search_stats_6  (cost=0.00..8212.33 rows=119198 width=110) (actual time=0.018..86.707 rows=286133 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_jul search_stats_7  (cost=0.00..8206.25 rows=1 width=109) (actual time=59.912..59.912 rows=0 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_jan search_stats_1  (cost=0.00..8198.10 rows=118983 width=109) (actual time=0.405..52.659 rows=142808 loops=2)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_mar search_stats_3  (cost=0.00..8196.36 rows=119000 width=109) (actual time=0.499..21.592 rows=57132 loops=5)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_apr search_stats_4  (cost=0.00..8187.51 rows=118877 width=109) (actual time=0.006..45.467 rows=142681 loops=2)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_feb search_stats_2  (cost=0.00..8176.93 rows=118705 width=109) (actual time=0.061..83.882 rows=284948 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_aug search_stats_8  (cost=0.00..0.00 rows=1 width=64) (actual time=0.000..0.000 rows=0 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_sep search_stats_9  (cost=0.00..0.00 rows=1 width=64) (actual time=0.001..0.001 rows=0 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_oct search_stats_10  (cost=0.00..0.00 rows=1 width=64) (actual time=0.000..0.000 rows=0 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_nov search_stats_11  (cost=0.00..0.00 rows=1 width=64) (actual time=0.000..0.001 rows=0 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
                                ->  Parallel Seq Scan on mdr_querystats_part_dec search_stats_12  (cost=0.00..0.00 rows=1 width=64) (actual time=0.002..0.002 rows=0 loops=1)
                                      Filter: ((search_date >= '2023-01-01 00:00:00'::timestamp without time zone) AND (search_date < '2023-07-01 00:00:00'::timestamp without time zone))
Planning:
  Buffers: shared hit=20
Planning Time: 0.602 ms
Execution Time: 16105.124 ms
How can I improve the performance of the query?
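One rewrite that may be worth testing is to aggregate in two steps: first one row per (search_query, user_id) pair, then one row per search_query. As far as I know, PostgreSQL 14 cannot use hashed or parallel partial aggregation for an aggregate that carries DISTINCT, which is consistent with the Sort, Gather Merge and GroupAggregate nodes in the plan above; plain count(*) and sum() have no such restriction. The sketch below is a starting point under assumptions, not a measured fix: the alias per_user and the column searches are names made up for illustration, the date range is taken from the Filter line of the plan, and whatever part of the original query produced the WindowAgg node is not reproduced.

```sql
-- Sketch only: two-level aggregation in place of count(DISTINCT user_id).
-- "per_user" and "searches" are illustrative names, not from the original query.
SELECT search_query,
       sum(searches)  AS total_number_of_searches,  -- adds up every row, including rows with NULL user_id
       count(user_id) AS total_number_of_users      -- one inner row per distinct user; the NULL group is skipped
FROM (
    -- Inner step: collapse to one row per (search_query, user_id) pair.
    SELECT search_query, user_id, count(*) AS searches
    FROM search_stats
    WHERE search_date >= '2023-01-01' AND search_date < '2023-07-01'
    GROUP BY search_query, user_id
) AS per_user
GROUP BY search_query;
```

Because count(user_id) ignores NULL just as count(DISTINCT user_id) does, both columns should match the original query's results. Note that sum() over a bigint returns numeric, so a cast may be needed if the caller expects bigint. Whether this is actually faster depends on the data, so it is worth comparing both forms with EXPLAIN (ANALYZE, BUFFERS).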
Asked by drew458 (1 rep)
Jul 18, 2023, 02:51 PM
Last activity: Jun 23, 2025, 11:04 PM