
How to efficiently do partial updates on a huge dataset in PostgreSQL

0 votes
0 answers
414 views
**Disclaimer**: I am a total "newb" when it comes to PgSql, but due to an unfortunate alignment of circumstances I am in charge of this project, which I have to rewrite/migrate away from SQL Server (alone), so there's that. The process is fairly straightforward: I have multiple sources of unique data which I must check every day for updates, import the changes into the table *Items*, and aggregate the results into *Aggregates*. The tables are as follows:
CREATE TABLE public."Items" (
	"Md5Hash" bytea NOT NULL,
	"SourceId" int4 NOT NULL,
	"LastModifiedAt" timestamp NULL,
	"PropA" int4 NOT NULL,
	"PropB" timestamp NULL,
	"PropC" bool NULL,
	...
	CONSTRAINT "PK_Items" PRIMARY KEY ("Md5Hash", "SourceId"),
	CONSTRAINT "FK_Items_Sources_SourceId" FOREIGN KEY ("SourceId") REFERENCES "Sources"("Id") ON DELETE RESTRICT
);
CREATE INDEX "IX_Items_SourceId" ON public."Items" USING btree ("SourceId");
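
The daily import step described above could be written as a single upsert keyed on the primary key; a minimal sketch, assuming a staging table `"ItemsStaging"` with the same columns (the staging table name and the change-detection condition are assumptions, not part of the original schema):

```sql
-- Hypothetical daily import: upsert changed rows from a staging table
-- ("ItemsStaging" is an assumed name, loaded e.g. via COPY).
INSERT INTO public."Items" ("Md5Hash", "SourceId", "LastModifiedAt", "PropA", "PropB", "PropC")
SELECT s."Md5Hash", s."SourceId", s."LastModifiedAt", s."PropA", s."PropB", s."PropC"
FROM public."ItemsStaging" s
ON CONFLICT ("Md5Hash", "SourceId") DO UPDATE
SET "LastModifiedAt" = EXCLUDED."LastModifiedAt",
    "PropA" = EXCLUDED."PropA",
    "PropB" = EXCLUDED."PropB",
    "PropC" = EXCLUDED."PropC"
-- Only touch rows that actually changed, to avoid rewriting unchanged tuples:
WHERE "Items"."LastModifiedAt" IS DISTINCT FROM EXCLUDED."LastModifiedAt";
```

The `WHERE` clause on the `DO UPDATE` branch skips no-op updates, which matters at this table size because every updated row produces a new tuple version.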

CREATE TABLE public."Aggregates" (
	"Id" int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
	"UniqueIdentifier" varchar(150) NOT NULL,
	"Md5Hash" bytea NOT NULL,
	...
);

Here is the `EXPLAIN (ANALYZE, BUFFERS)` output of the aggregation step:

Merge Join  (cost=2.27..18501202.86 rows=1 width=409) (actual time=1323.481..625667.850 rows=262679 loops=1)
        Merge Cond: ("Aggregates"."Md5Hash" = agg."Md5Hash")
        Buffers: shared hit=189969175 read=2738925 dirtied=258646 written=237386
        ->  Sort  (cost=1.70..1.70 rows=1 width=103) (actual time=1321.782..1444.692 rows=262679 loops=1)
              Sort Key: "Aggregates"."Md5Hash"
              Sort Method: quicksort  Memory: 50901kB
              Buffers: shared hit=121000 read=35
              ->  Index Scan using "IX_Aggregates_Aggregated" on "Aggregates"  (cost=0.57..1.69 rows=1 width=103) (actual time=0.137..1080.727 rows=262679 loops=1)
                    Index Cond: ("Aggregated" = false)
                    Buffers: shared hit=121000 read=35
        ->  Subquery Scan on agg  (cost=0.57..18432053.72 rows=27658963 width=362) (actual time=0.103..615067.274 rows=67537041 loops=1)
              Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
              ->  GroupAggregate  (cost=0.57..18155464.09 rows=27658963 width=169) (actual time=0.081..559315.620 rows=67537041 loops=1)
                    Group Key: "Items"."Md5Hash"
                    Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
                    ->  Index Scan using "PK_Items" on "Items"  (cost=0.57..7571149.04 rows=196337627 width=107) (actual time=0.052..331107.259 rows=191030895 loops=1)
                          Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
Planning Time: 0.370 ms
Execution Time: 656803.036 ms
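
For reference, the plan above appears to come from an update of roughly this shape (a reconstruction from the plan nodes only; the actual aggregate expressions and `SET` list are unknown, so `count(*)` below is a placeholder):

```sql
-- Hypothetical query matching the plan: join the not-yet-aggregated rows
-- against a per-hash aggregate computed over all of "Items".
UPDATE public."Aggregates" a
SET "Aggregated" = true            -- plus the real aggregated columns (unknown here)
FROM (
    SELECT "Md5Hash",
           count(*) AS item_count  -- placeholder; real aggregate expressions unknown
    FROM public."Items"
    GROUP BY "Md5Hash"             -- the GroupAggregate node, scanning all ~191M rows
) agg
WHERE a."Aggregated" = false       -- the Index Cond on "IX_Aggregates_Aggregated"
  AND a."Md5Hash" = agg."Md5Hash";
```

Per the actual row counts in the plan, the subquery aggregates all ~67.5M hash groups even though only 262,679 of them are joined, which is where most of the 656 seconds go.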
Asked by ConfusedUser (1 rep)
Apr 20, 2020, 09:15 AM
Last activity: Apr 20, 2020, 03:45 PM