Poor performance bucketing historical data with DISTINCT

1 vote
1 answer
64 views
I'm running into performance problems with a query that buckets a set of historical records of daily counts into a form that can be rendered easily as a daily graph.

## Context

Essentially, the table in question is a record of daily audit reports, in which the numbers of problems of varying seriousness are reported for a set of location IDs. Each record contains the counts of problems, of several severities, found during the audit at one location on one day.
CREATE TABLE IF NOT EXISTS problem_reports (
    internal_id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    day              timestamp         NOT NULL,
    location_id      varchar(16)       NOT NULL,
    bad              integer DEFAULT 0 NOT NULL,
    awful            integer DEFAULT 0 NOT NULL,
    catastrophic     integer DEFAULT 0 NOT NULL
);
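To make that concrete, here is a small, hypothetical data set (the location IDs and counts are invented purely for illustration); note that rows exist only for days on which an audit actually ran:

INSERT INTO problem_reports (day, location_id, bad, awful, catastrophic) VALUES
    ('2025-03-01', 'loc-A', 2, 1, 0),  -- audit at loc-A finds problems
    ('2025-03-02', 'loc-B', 1, 0, 1),  -- first audit at loc-B
    ('2025-03-03', 'loc-A', 0, 0, 0),  -- "all clear" audit: loc-A's counts reset to zero
    ('2025-03-05', 'loc-B', 3, 0, 0);  -- loc-B re-audited; no audits anywhere on Mar 4, 6, 7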
Since audits are not performed at every location on every day, I save space by recording rows only for days on which an audit was performed. If an audit finds no problems, I record a row with all-zero counts for that location, to indicate that the number of problems of each severity is zero as of that date.

The output I need from the query is a running daily total of all problems found at all locations, by severity. So if no audit was performed at a location on a given day, the query needs to "look back" to find the most recent audit for that location and include that in the aggregation. I achieved this using a DISTINCT ON clause:
WITH days AS (
    SELECT GENERATE_SERIES(
        DATE_TRUNC('day', '2025-03-01'::date),
        DATE_TRUNC('day', '2025-03-07'::date),
        '1 day'::interval
    ) AS day
),
counts AS (
    SELECT DISTINCT ON (days.day, pr.location_id)
        days.day, pr.location_id, pr.bad, pr.awful, pr.catastrophic
    FROM problem_reports pr
        RIGHT JOIN days ON pr.day <= days.day
    ORDER BY days.day, pr.location_id, pr.day DESC, pr.internal_id DESC
)
SELECT day,
       SUM(bad)          AS bad,
       SUM(awful)        AS awful,
       SUM(catastrophic) AS catastrophic
FROM counts
GROUP BY day
ORDER BY day;

Here is the relevant portion of the EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output:

  ->  Unique  (cost=46930.59..48855.59 rows=40000 width=86) (actual time=0.109..0.120 rows=18 loops=1)
        Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
        Buffers: shared hit=1
        ->  Sort  (cost=46930.59..47572.26 rows=256667 width=86) (actual time=0.108..0.111 rows=32 loops=1)
              Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
              Sort Key: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.day DESC, pr.internal_id DESC
              Sort Method: quicksort  Memory: 27kB
              Buffers: shared hit=1
              ->  Nested Loop Left Join  (cost=0.00..11584.65 rows=256667 width=86) (actual time=0.048..0.077 rows=32 loops=1)
                    Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
                    Join Filter: (pr.day <= (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)))
                    ->  ProjectSet  (cost=0.00..5.03 rows=1000 width=8) (actual time=0.024..0.027 rows=7 loops=1)
                          Output: generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)
                          ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
                    ->  Materialize  (cost=0.00..21.55 rows=770 width=78) (actual time=0.003..0.004 rows=8 loops=7)
                          Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
                          Buffers: shared hit=1
                          ->  Seq Scan on sto.problem_reports pr  (cost=0.00..17.70 rows=770 width=78) (actual time=0.016..0.019 rows=8 loops=1)
                                Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
                                Buffers: shared hit=1
Planning Time: 0.186 ms
Execution Time: 0.188 ms
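With the hypothetical sample rows from earlier, the query's output would look like this (timestamps abbreviated to dates):

 day        | bad | awful | catastrophic
------------+-----+-------+--------------
 2025-03-01 |   2 |     1 |            0
 2025-03-02 |   3 |     1 |            1
 2025-03-03 |   1 |     0 |            1
 2025-03-04 |   1 |     0 |            1
 2025-03-05 |   3 |     0 |            0
 2025-03-06 |   3 |     0 |            0
 2025-03-07 |   3 |     0 |            0

Note how 2025-03-04 still carries loc-B's 2025-03-02 counts forward, and how loc-A's "all clear" audit on 2025-03-03 drops its contribution to zero.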
## The Problem

This works fine for moderate numbers of days and locations, but as those numbers grow, the DISTINCT ON in the counts CTE starts to perform very poorly.

I considered recording a row for every location on every day, to eliminate the need for the DISTINCT ON "look-back", but that would quickly produce more rows than I'd like to deal with, as the number of locations is expected to grow exponentially (for example, 10,000 locations over a year would already mean roughly 3.7 million rows). And in some cases I need to query for only a subset of locations and render a graph of just those, so recording a preprocessed aggregate of the daily counts to the DB wouldn't work.

Is there a more efficient way of storing and querying this type of historical data to produce the graph data? Any suggestions appreciated!
Asked by hampercm (111 rep)
Apr 1, 2025, 06:49 PM
Last activity: Apr 4, 2025, 09:45 PM