
How to use an expression like date_trunc() in GROUP BY?

1 vote
2 answers
240 views
I have a couple of joined tables with industrial data in them:

~~~pgsql
create table v2.tag (
    tag_id integer generated always as identity,
    tag    text not null,
    primary key (tag_id),
    unique (tag)
);

create table v2.state (
    tag_id      integer not null,
    "timestamp" timestamp without time zone not null,
    value       float not null,
    primary key (tag_id, timestamp),
    foreign key (tag_id) references v2.tag (tag_id)
) partition by range (timestamp);
~~~

The `state` table holds about 50 million rows of time series data from the last 6 months, partitioned monthly, and I need to run a benchmark on it with various queries. The query I tried simply counts the number of data points per day and tag, which any actual TSDB can do without breaking a sweat on such a small dataset:

~~~pgsql
SELECT count(*) as points, date_trunc('day', timestamp) as timestamp, tag.tag
FROM v2.state
JOIN v2.tag USING (tag_id)
GROUP BY timestamp, tag
ORDER BY timestamp ASC;
~~~

The thing is, for some reason this query makes the DB take up almost 3 GB of RAM and return a bunch of duplicates, like this:

~~~none
2024-02-01 00:00:00 | /Heave    | 1
2024-02-01 00:00:00 | /Pitch    | 1
2024-02-01 00:00:00 | /Roll     | 1
2024-02-01 00:00:00 | /Velocity | 1
2024-02-01 00:00:00 | /Heave    | 1
...
~~~

And so on, all within the same day; I could not even scroll over to the next one, it just kept repeating these rows instead of counting them up per tag like I expected. So instead of counting the number of data points per day/tag, the query seems to produce one output row for each of the ~50 million rows in the database, which means something is not working in the aggregation. I would expect this query to return around 12K rows (65 tags × 30 days × 6 months), but it returns millions instead, causing the Jupyter notebook I am trying to load the result into to get OOM-killed.
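In case it helps anyone poke at this, below is a minimal self-contained sketch of my setup. The tags and readings are made up, and I have left out the monthly partitioning on the assumption that it is unrelated to the problem:

~~~pgsql
create schema if not exists v2;

create table v2.tag (
    tag_id integer generated always as identity,
    tag    text not null,
    primary key (tag_id),
    unique (tag)
);

-- Same as the real table, minus the partitioning (assumed irrelevant here).
create table v2.state (
    tag_id      integer not null,
    "timestamp" timestamp without time zone not null,
    value       float not null,
    primary key (tag_id, timestamp),
    foreign key (tag_id) references v2.tag (tag_id)
);

-- Two made-up tags with two readings each on the same day.
insert into v2.tag (tag) values ('/Heave'), ('/Pitch');

insert into v2.state (tag_id, "timestamp", value)
select t.tag_id, v.ts, random()
from v2.tag t
cross join (values (timestamp '2024-02-01 00:00:00'),
                   (timestamp '2024-02-01 00:00:01')) v(ts);

-- I would expect 2 result rows (one per tag, points = 2),
-- but this comes back with 4 (one per raw reading, points = 1),
-- i.e. the same behaviour as on the full dataset.
SELECT count(*) as points, date_trunc('day', timestamp) as timestamp, tag.tag
FROM v2.state
JOIN v2.tag USING (tag_id)
GROUP BY timestamp, tag
ORDER BY timestamp ASC;
~~~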
I tried to run this with EXPLAIN ANALYZE, but since I am a noob with Postgres, it doesn't really... explain anything:

~~~none
Sort  (cost=700769.72..700798.22 rows=11400 width=78) (actual time=80503.260..83825.211 rows=47499969 loops=1)
  Sort Key: (date_trunc('day'::text, state."timestamp"))
  Sort Method: external merge  Disk: 4703296kB
  ->  Finalize GroupAggregate  (cost=697027.86..700001.55 rows=11400 width=78) (actual time=35609.801..64328.719 rows=47499969 loops=1)
        Group Key: state."timestamp", tag.tag
        ->  Gather Merge  (cost=697027.86..699688.05 rows=22800 width=70) (actual time=35609.453..55143.276 rows=47499969 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Sort  (cost=696027.84..696056.34 rows=11400 width=70) (actual time=34526.070..42018.956 rows=15833323 loops=3)
                    Sort Key: state."timestamp", tag.tag
                    Sort Method: external merge  Disk: 1414088kB
                    Worker 0:  Sort Method: external merge  Disk: 1446832kB
                    Worker 1:  Sort Method: external merge  Disk: 1470664kB
                    ->  Partial HashAggregate  (cost=695145.67..695259.67 rows=11400 width=70) (actual time=8690.289..20138.661 rows=15833323 loops=3)
                          Group Key: state."timestamp", tag.tag
                          Batches: 1029  Memory Usage: 8241kB  Disk Usage: 1694608kB
                          Worker 0:  Batches: 901  Memory Usage: 8241kB  Disk Usage: 1727928kB
                          Worker 1:  Batches: 773  Memory Usage: 8241kB  Disk Usage: 1748528kB
                          ->  Hash Join  (cost=2.28..652834.40 rows=5641502 width=62) (actual time=138.598..4142.702 rows=15833323 loops=3)
                                Hash Cond: (state.tag_id = tag.tag_id)
                                ->  Parallel Append  (cost=0.00..599769.83 rows=19794743 width=12) (actual time=138.383..2665.699 rows=15833323 loops=3)
                                      ->  Parallel Seq Scan on state_y2024m04 state_4  (cost=0.00..221214.31 rows=8745831 width=12) (actual time=39.308..827.302 rows=6996457 loops=3)
                                      ->  Parallel Seq Scan on state_y2024m02 state_2  (cost=0.00..172317.34 rows=6809434 width=12) (actual time=58.866..1102.604 rows=8171318 loops=2)
                                      ->  Parallel Seq Scan on state_y2024m03 state_3  (cost=0.00..78305.04 rows=3095204 width=12) (actual time=0.766..694.493 rows=7428501 loops=1)
                                      ->  Parallel Seq Scan on state_y2024m05 state_5  (cost=0.00..28879.42 rows=1141442 width=12) (actual time=180.418..416.467 rows=2739461 loops=1)
                                      ->  Parallel Seq Scan on state_y2024m01 state_1  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.000..0.001 rows=0 loops=1)
                                      ->  Parallel Seq Scan on state_y2024m06 state_6  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.000..0.001 rows=0 loops=1)
                                      ->  Parallel Seq Scan on state_y2024m07 state_7  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.000..0.001 rows=0 loops=1)
                                      ->  Parallel Seq Scan on state_y2024m08 state_8  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.002..0.002 rows=0 loops=1)
                                ->  Hash  (cost=1.57..1.57 rows=57 width=58) (actual time=0.149..0.268 rows=65 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 14kB
                                      ->  Seq Scan on tag  (cost=0.00..1.57 rows=57 width=58) (actual time=0.031..0.036 rows=65 loops=3)
Planning Time: 2.447 ms
JIT:
  Functions: 96
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 14.487 ms, Inlining 105.515 ms, Optimization 203.723 ms, Emission 143.355 ms, Total 467.081 ms
Execution Time: 86164.911 ms
~~~

So what's wrong with my query? Why is it not aggregating?
Asked by Megakoresh (113 rep)
Jul 31, 2024, 05:53 PM
Last activity: Aug 1, 2024, 12:37 PM