Poor performance bucketing historical data with DISTINCT
1 vote · 1 answer · 64 views
I'm running into performance problems with a query that buckets a set of historical records of daily counts into a form that can be rendered easily as a daily graph.
## Context
Essentially, the table in question is a record of daily audit reports, in which the number of problems of varying seriousness is reported for a set of location IDs. Each record holds the counts of problems, by severity, found during the audit at one location on one day.
CREATE TABLE IF NOT EXISTS problem_reports (
    internal_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    day timestamp NOT NULL,
    location_id varchar(16) NOT NULL,
    bad integer DEFAULT 0 NOT NULL,
    awful integer DEFAULT 0 NOT NULL,
    catastrophic integer DEFAULT 0 NOT NULL
);
Since audits are not performed at every location on every day, I save space by only recording rows for days on which an audit was actually performed. If an audit found zero problems, I record a row with all-zero counts for that location, to indicate that the count for each severity is zero as of that date.
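To make the layout concrete, here are a few made-up rows (the location IDs and counts are invented purely for illustration):
INSERT INTO problem_reports (day, location_id, bad, awful, catastrophic) VALUES
    ('2025-03-01', 'loc-0001', 3, 1, 0),  -- audit at loc-0001 found problems
    ('2025-03-02', 'loc-0002', 0, 2, 1),  -- audit at loc-0002
    ('2025-03-05', 'loc-0001', 0, 0, 0);  -- later audit at loc-0001 found nothing: counts drop to zero
-- No rows exist for loc-0001 on Mar 2-4; its counts for those days are still the ones from the Mar 1 audit.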
The output I need from the query is a running daily total of all problems found at all locations, by severity. So, if an audit wasn't performed at a location on a given day, the query needs to "look back" to find the most recent audit for that location and include that in the aggregation. I achieved this using a DISTINCT ON clause:
WITH days AS (
    SELECT GENERATE_SERIES(
        DATE_TRUNC('day', '2025-03-01'::date),
        DATE_TRUNC('day', '2025-03-07'::date),
        '1 day'::interval
    ) AS day
),
counts AS (
    SELECT DISTINCT ON (days.day, pr.location_id)
        days.day, pr.location_id, pr.bad, pr.awful, pr.catastrophic
    FROM problem_reports pr
    RIGHT JOIN days ON pr.day <= days.day
    ORDER BY days.day, pr.location_id, pr.day DESC, pr.internal_id DESC
)
SELECT day, SUM(bad) AS bad, SUM(awful) AS awful, SUM(catastrophic) AS catastrophic
FROM counts
GROUP BY day
ORDER BY day;
Here is the relevant portion of the EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output against a small test dataset:
Unique  (cost=46930.59..48855.59 rows=40000 width=86) (actual time=0.109..0.120 rows=18 loops=1)
  Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
  Buffers: shared hit=1
  ->  Sort  (cost=46930.59..47572.26 rows=256667 width=86) (actual time=0.108..0.111 rows=32 loops=1)
        Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
        Sort Key: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.day DESC, pr.internal_id DESC
        Sort Method: quicksort  Memory: 27kB
        Buffers: shared hit=1
        ->  Nested Loop Left Join  (cost=0.00..11584.65 rows=256667 width=86) (actual time=0.048..0.077 rows=32 loops=1)
              Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
              Join Filter: (pr.day <= (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)))
              ->  ProjectSet  (cost=0.00..5.03 rows=1000 width=8) (actual time=0.024..0.027 rows=7 loops=1)
                    Output: generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)
                    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
              ->  Materialize  (cost=0.00..21.55 rows=770 width=78) (actual time=0.003..0.004 rows=8 loops=7)
                    Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
                    Buffers: shared hit=1
                    ->  Seq Scan on sto.problem_reports pr  (cost=0.00..17.70 rows=770 width=78) (actual time=0.016..0.019 rows=8 loops=1)
                          Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id
                          Buffers: shared hit=1
Planning Time: 0.186 ms
Execution Time: 0.188 ms
## The Problem
This works fine for moderate numbers of days and locations, but as both grow, the DISTINCT ON in the counts CTE starts to perform very poorly. I considered recording a row for every location on every day, to eliminate the need for the "look-back", but that would quickly produce more rows than I'd like to deal with, since the number of locations is expected to grow exponentially.
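For clarity, the "dense" alternative I considered would amount to a backfill along these lines (a rough sketch I have not implemented; the date bounds and location IDs are just illustrative):
-- Backfill one row per location per day, carrying the latest audit forward.
INSERT INTO problem_reports (day, location_id, bad, awful, catastrophic)
SELECT d.day, l.location_id, latest.bad, latest.awful, latest.catastrophic
FROM GENERATE_SERIES('2025-03-01'::timestamp,
                     '2025-03-07'::timestamp,
                     '1 day'::interval) AS d(day)
CROSS JOIN (SELECT DISTINCT location_id FROM problem_reports) AS l
CROSS JOIN LATERAL (
    -- most recent audit at or before this day for this location
    SELECT pr.bad, pr.awful, pr.catastrophic
    FROM problem_reports pr
    WHERE pr.location_id = l.location_id
      AND pr.day <= d.day
    ORDER BY pr.day DESC, pr.internal_id DESC
    LIMIT 1
) AS latest
WHERE NOT EXISTS (
    -- skip days that already have a real audit row
    SELECT 1 FROM problem_reports e
    WHERE e.location_id = l.location_id AND e.day = d.day
);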
In some cases I need to query for only a subset of locations and render a graph of just those, so writing a preprocessed aggregate of only the daily totals to the DB wouldn't work.
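For example, a graph for just a couple of locations would mean filtering the counts CTE roughly like this (the IDs are placeholders):
counts AS (
    SELECT DISTINCT ON (days.day, pr.location_id)
        days.day, pr.location_id, pr.bad, pr.awful, pr.catastrophic
    FROM problem_reports pr
    RIGHT JOIN days
        ON pr.day <= days.day
       AND pr.location_id IN ('loc-0001', 'loc-0002')  -- hypothetical subset
    ORDER BY days.day, pr.location_id, pr.day DESC, pr.internal_id DESC
)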
Is there a more efficient way of storing and querying this type of historical data to produce the graph data? Any suggestions appreciated!
Asked by hampercm (111 rep) on Apr 1, 2025, 06:49 PM
Last activity: Apr 4, 2025, 09:45 PM