Aggregate the last two entries per item efficiently
I'd like your help optimising an increasingly slow query - or, better yet, help understanding what the problem is so I can be pointed in the right direction.
Every day I scrape 3 supermarkets and record their prices in a Django app. Each scrape results in a new `Retrieval` record, a few thousand `Price` records, and maybe a few `Item` records (most items already exist, so I only record their prices for a given day).

Once this is done, a process runs which computes the price changes for all of today's items by retrieving each item's **last two** prices. These last two prices are not always from consecutive days, because some items come and go.
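Conceptually a price change is just "latest price minus the one before it". As a sketch (not what actually runs in the app), that could be written with a window function over the tables shown below, but it touches every price row for every item, which is exactly what I want to avoid:

```sql
-- Sketch only: delta between consecutive prices of the same item,
-- ordered by the timestamp of the retrieval each price belongs to.
SELECT p.item_id,
       r."timestamp",
       p.per_item - lag(p.per_item) OVER (
           PARTITION BY p.item_id
           ORDER BY r."timestamp"
       ) AS change
FROM main_price p
JOIN main_retrieval r ON r.id = p.retrieval_id;
```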
The 3 tables that we're interested in and their indexes look like this:
CREATE TABLE public.main_item (
    id uuid NOT NULL,
    name character varying(512) NOT NULL
);

CREATE TABLE public.main_price (
    id uuid NOT NULL,
    per_item double precision NOT NULL,
    item_id uuid NOT NULL,
    retrieval_id uuid NOT NULL
);

CREATE TABLE public.main_retrieval (
    id uuid NOT NULL,
    "timestamp" timestamp with time zone NOT NULL
);

CREATE INDEX name_index ON public.main_item USING btree (name);
CREATE INDEX timestamp_index ON public.main_retrieval USING btree ("timestamp");
I have a query which returns a result like the following, where `id` is the `Item` id and `latest_prices` is an array of JSON objects containing the last 2 prices for that item; I then process the results in Python.
| id | latest_prices |
|----|---------------|
| 0003db22-3c8a-4f21-aea1-667361ebe377 | {"{\"per_item\": 2.44, \"price_id\": \"24dc5524-35d5-472b-8f16-5840308a9cc4\"}","{\"per_item\": 2.44, \"price_id\": \"a415d740-0e50-43ba-b33f-3d6c9328a319\"}"} |
| 0011cc73-07ca-415d-85e0-1c6782e0b041 | {"{\"per_item\": 3.48, \"price_id\": \"e754cc25-9fb5-4e88-8689-55878e47f7dc\"}","{\"per_item\": 3.48, \"price_id\": \"553a6cf2-2c6e-421c-b7e0-c43d5c0cbf85\"}"} |
However, the query is getting increasingly slow. My assumption is that this is because of the size of the `Price` table, which is by now ~16 million rows.
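(The ~16 million is a catalog estimate rather than an exact count; something along these lines is enough to get the scale:)

```sql
-- Approximate row counts and on-disk sizes from the planner's catalog statistics.
SELECT relname,
       reltuples::bigint                           AS approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('main_item', 'main_price', 'main_retrieval');
```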
Running an `EXPLAIN ANALYZE` on that query, I see that the majority of the time is spent in a Bitmap Heap Scan on the `main_price` table: https://explain.depesz.com/s/ZX78#stats
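(To narrow things down, the per-item subquery can also be explained in isolation; this is a sketch of that, using one of the item ids from the sample output above, not the exact command I ran:)

```sql
-- Sketch: plan for the inner, per-item part of the query on its own,
-- for a single sample item id taken from the output above.
EXPLAIN (ANALYZE, BUFFERS)
SELECT U0."id", U0."per_item"
FROM "main_price" U0
INNER JOIN "main_retrieval" U2 ON U0."retrieval_id" = U2."id"
WHERE U0."item_id" = '0003db22-3c8a-4f21-aea1-667361ebe377'
  AND U2."timestamp" >= '2024-09-23 00:00:00+00:00'
  AND U2."timestamp" < '2024-09-24 00:00:00+00:00';
```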
Below you can find the query that the Django ORM has generated for today:
SELECT "main_item"."id", ARRAY(
SELECT JSONB_BUILD_OBJECT(('price_id')::text, U0."id", ('per_item')::text, U0."per_item") AS "json"
FROM "main_price" U0
INNER JOIN "main_retrieval" U2
ON (U0."retrieval_id" = U2."id")
WHERE (U0."item_id" = ("main_item"."id") AND U2."timestamp" = '2024-09-23 00:00:00+00:00'
AND V0."timestamp" < '2024-09-24 00:00:00+00:00'
AND NOT (
EXISTS(
SELECT 1 AS "a"
FROM "main_retrieval" U0
LEFT OUTER JOIN "main_price" U1
ON (U0."id" = U1."retrieval_id")
WHERE (U1."item_id" IS NULL AND U0."id" = (V0."id"))
LIMIT 1))))
### Note
I'm more than happy to ditch the query generated by the ORM and write it by hand.
However, I'm struggling to find an optimal way to fetch the 2 most recent prices for each `Item`. Fetching the most recent is easy, but fetching the previous one as well is surprisingly difficult.
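For example, one obvious hand-written candidate is a `LATERAL` subquery per item; the sketch below is untested at scale, and its performance presumably hinges on an index covering `main_price (item_id)`, which is not among the indexes shown above:

```sql
-- Sketch: the two most recent prices per item via a LATERAL subquery.
-- LEFT JOIN ... ON true keeps items that have no prices at all.
SELECT i.id AS item_id, lp.price_id, lp.per_item
FROM main_item i
LEFT JOIN LATERAL (
    SELECT p.id AS price_id, p.per_item
    FROM main_price p
    JOIN main_retrieval r ON r.id = p.retrieval_id
    WHERE p.item_id = i.id
    ORDER BY r."timestamp" DESC
    LIMIT 2
) AS lp ON true;
```

But I don't know whether that, or a `row_number()` window over the whole `Price` table, is the right direction at ~16 million rows, hence the question.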
Asked by Sakis Vtdk (103 rep) on Sep 27, 2024, 04:58 PM
Last activity: Oct 2, 2024, 08:13 PM