How to efficiently do partial updates on huge dataset in Postgresql
0
votes
0
answers
414
views
**Disclaimer**: I am a total "newb" in regards to PgSql, but due to some unfortunate aligment of circumstances I am in charge of this project which I have to rewrite/migrate away from SqlServer (alone), so there's that.
The process is fairly straightforward: I have multiple sources of unique data which I must check every day for updates, import the changes into table *Items* and aggregate the results into *Aggregates*.
The tables are as following:
CREATE TABLE public."Items" (
"Md5Hash" bytea NOT NULL,
"SourceId" int4 NOT NULL,
"LastModifiedAt" timestamp NULL,
"PropA" int4 NOT NULL,
"PropB" timestamp NULL,
"PropC" bool NULL,
...
CONSTRAINT "PK_Items" PRIMARY KEY ("Md5Hash", "SourceId"),
CONSTRAINT "FK_Items_Sources_SourceId" FOREIGN KEY ("SourceId") REFERENCES "Sources"("Id") ON DELETE RESTRICT
);
CREATE INDEX "IX_Items_SourceId" ON public."Items" USING btree ("SourceId");
CREATE TABLE public."Aggregates" (
"Id" int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
"UniqueIdentifier" varchar(150) NOT NULL,
"Md5Hash" bytea NOT NULL, Merge Join (cost=2.27..18501202.86 rows=1 width=409) (actual time=1323.481..625667.850 rows=262679 loops=1)
Merge Cond: ("Aggregates"."Md5Hash" = agg."Md5Hash")
Buffers: shared hit=189969175 read=2738925 dirtied=258646 written=237386
-> Sort (cost=1.70..1.70 rows=1 width=103) (actual time=1321.782..1444.692 rows=262679 loops=1)
Sort Key: "Aggregates"."Md5Hash"
Sort Method: quicksort Memory: 50901kB
Buffers: shared hit=121000 read=35
-> Index Scan using "IX_Aggregates_Aggregated" on "Aggregates" (cost=0.57..1.69 rows=1 width=103) (actual time=0.137..1080.727 rows=262679 loops=1)
Index Cond: ("Aggregated" = false)
Buffers: shared hit=121000 read=35
-> Subquery Scan on agg (cost=0.57..18432053.72 rows=27658963 width=362) (actual time=0.103..615067.274 rows=67537041 loops=1)
Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
-> GroupAggregate (cost=0.57..18155464.09 rows=27658963 width=169) (actual time=0.081..559315.620 rows=67537041 loops=1)
Group Key: "Items"."Md5Hash"
Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
-> Index Scan using "PK_Items" on "Items" (cost=0.57..7571149.04 rows=196337627 width=107) (actual time=0.052..331107.259 rows=191030895 loops=1)
Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
Planning Time: 0.370 ms
Execution Time: 656803.036 ms
Asked by ConfusedUser
(1 rep)
Apr 20, 2020, 09:15 AM
Last activity: Apr 20, 2020, 03:45 PM
Last activity: Apr 20, 2020, 03:45 PM