Aggregate the last two entries per item efficiently
I'd like your help optimising an increasingly slow query - or, better yet, help understanding what the problem is so I can be pointed in the right direction.
Every day I scrape 3 supermarkets and record their prices in a Django app. Each scrape results in a new `Retrieval` record, a few thousand `Price` records, and maybe a few `Item` records (most items already exist, so I only record their prices for a given day).

Once this is done, a process runs which computes the price changes for all of today's items by retrieving each item's **last two** prices. These last two prices are not always from consecutive days, because some items come and go.
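Conceptually a price change is just "latest price minus the one before it". As a sketch (not what actually runs in the app), that could be written with a window function over the tables shown below, but it touches every price row for every item, which is exactly what I want to avoid:

```sql
-- Sketch only: delta between consecutive prices of the same item,
-- ordered by the timestamp of the retrieval each price belongs to.
SELECT p.item_id,
       r."timestamp",
       p.per_item - lag(p.per_item) OVER (
           PARTITION BY p.item_id
           ORDER BY r."timestamp"
       ) AS change
FROM main_price p
JOIN main_retrieval r ON r.id = p.retrieval_id;
```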
The 3 tables that we're interested in and their indexes look like this:
CREATE TABLE public.main_item (
    id uuid NOT NULL,
    name character varying(512) NOT NULL
);

CREATE TABLE public.main_price (
    id uuid NOT NULL,
    per_item double precision NOT NULL,
    item_id uuid NOT NULL,
    retrieval_id uuid NOT NULL
);

CREATE TABLE public.main_retrieval (
    id uuid NOT NULL,
    "timestamp" timestamp with time zone NOT NULL
);

CREATE INDEX name_index ON public.main_item USING btree (name);
CREATE INDEX timestamp_index ON public.main_retrieval USING btree ("timestamp");
I have a query which returns a result like the following, where `id` is the `Item` id and `latest_prices` is an array of JSON objects containing the last 2 prices for that item; I then process the results in Python.
| id | latest_prices |
|----|---------------|
| 0003db22-3c8a-4f21-aea1-667361ebe377 | {"{\"per_item\": 2.44, \"price_id\": \"24dc5524-35d5-472b-8f16-5840308a9cc4\"}","{\"per_item\": 2.44, \"price_id\": \"a415d740-0e50-43ba-b33f-3d6c9328a319\"}"} |
| 0011cc73-07ca-415d-85e0-1c6782e0b041 | {"{\"per_item\": 3.48, \"price_id\": \"e754cc25-9fb5-4e88-8689-55878e47f7dc\"}","{\"per_item\": 3.48, \"price_id\": \"553a6cf2-2c6e-421c-b7e0-c43d5c0cbf85\"}"} |
However, the query is getting increasingly slow. My assumption is that this is because of the size of the `Price` table, which is by now ~16 million rows.
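(The ~16 million is a catalog estimate rather than an exact count; something along these lines is enough to get the scale:)

```sql
-- Approximate row counts and on-disk sizes from the planner's catalog statistics.
SELECT relname,
       reltuples::bigint                           AS approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('main_item', 'main_price', 'main_retrieval');
```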
Running an `EXPLAIN ANALYZE` on that query, I see that the majority of the time is spent in a Bitmap Heap Scan on the `main_price` table: https://explain.depesz.com/s/ZX78#stats
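(To narrow things down, the per-item subquery can also be explained in isolation; this is a sketch of that, using one of the item ids from the sample output above, not the exact command I ran:)

```sql
-- Sketch: plan for the inner, per-item part of the query on its own,
-- for a single sample item id taken from the output above.
EXPLAIN (ANALYZE, BUFFERS)
SELECT U0."id", U0."per_item"
FROM "main_price" U0
INNER JOIN "main_retrieval" U2 ON U0."retrieval_id" = U2."id"
WHERE U0."item_id" = '0003db22-3c8a-4f21-aea1-667361ebe377'
  AND U2."timestamp" >= '2024-09-23 00:00:00+00:00'
  AND U2."timestamp" < '2024-09-24 00:00:00+00:00';
```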
Below you can find the query that the Django ORM has generated for today:
SELECT "main_item"."id", ARRAY(
SELECT JSONB_BUILD_OBJECT(('price_id')::text, U0."id", ('per_item')::text, U0."per_item") AS "json"
FROM "main_price" U0
INNER JOIN "main_retrieval" U2
ON (U0."retrieval_id" = U2."id")
WHERE (U0."item_id" = ("main_item"."id") AND U2."timestamp" = '2024-09-23 00:00:00+00:00'
AND V0."timestamp" < '2024-09-24 00:00:00+00:00'
AND NOT (
EXISTS(
SELECT 1 AS "a"
FROM "main_retrieval" U0
LEFT OUTER JOIN "main_price" U1
ON (U0."id" = U1."retrieval_id")
WHERE (U1."item_id" IS NULL AND U0."id" = (V0."id"))
LIMIT 1))))
### Note
I'm more than happy to ditch the query generated by the ORM and write it by hand.
However, I'm struggling to find an optimal way to fetch the 2 most recent prices for each `Item`. Fetching the most recent is easy, but fetching the previous one as well is surprisingly difficult.
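For example, one obvious hand-written candidate is a `LATERAL` subquery per item; the sketch below is untested at scale, and its performance presumably hinges on an index covering `main_price (item_id)`, which is not among the indexes shown above:

```sql
-- Sketch: the two most recent prices per item via a LATERAL subquery.
-- LEFT JOIN ... ON true keeps items that have no prices at all.
SELECT i.id AS item_id, lp.price_id, lp.per_item
FROM main_item i
LEFT JOIN LATERAL (
    SELECT p.id AS price_id, p.per_item
    FROM main_price p
    JOIN main_retrieval r ON r.id = p.retrieval_id
    WHERE p.item_id = i.id
    ORDER BY r."timestamp" DESC
    LIMIT 2
) AS lp ON true;
```

But I don't know whether that, or a `row_number()` window over the whole `Price` table, is the right direction at ~16 million rows, hence the question.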
Asked by Sakis Vtdk (103 rep) on Sep 27, 2024, 04:58 PM
Last activity: Oct 2, 2024, 08:13 PM