Sample Header Ad - 728x90

Limit results of complex query to most frequent values in given attribute

1 vote
1 answer
120 views
Say that you have the following tables for a multi store platform
CREATE TABLE orders (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  store_id BIGINT NOT NULL,
  ordered_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX ON orders (store_id);
CREATE TABLE order_lines (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  order_id BIGINT NOT NULL REFERENCES orders (id) ON DELETE CASCADE,
  item_id BIGINT NOT NULL REFERENCES items (id),
  quantity INT
);

CREATE INDEX ON order_lines (order_id);
CREATE INDEX ON order_lines (item_id);
CREATE TABLE items (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  store_id BIGINT NOT NULL,
  color VARCHAR(255) NOT NULL,
  size VARCHAR(255) NOT NULL,
  category VARCHAR(255)
);
CREATE TABLE returns (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  order_line_id BIGINT NOT NULL REFERENCES order_lines (id) ON DELETE CASCADE,
  returned_at TIMESTAMPTZ NOT NULL,
  quantity INT NOT NULL,
  reason VARCHAR(255)
);

CREATE INDEX ON returns (order_line_id);
From this I would like to get a list of all orders during a date range and calculate certain metrics, like how many items were bought, how many was returned etc. I would also like to do this for a subset of the items based on color or size, but I also want them to be "ranked". For example, I would like to show these metrics for the items that had a color that was part of the most returned colors, over all. What I have come up with so far is to do this in two queries. First one is to go through all of the returns and group them by color and sum the items, return, like below:
select
  i.color,
  trunc(sum(r.quantity)::numeric / sum(ol.quantity)::numeric, 2) as return_rate 
from orders o
inner join
  order_lines ol on ol.order_id = o.id
inner join
  items i on i.id = ol.item_id
left outer join
  returns r on r.order_line_id = ol.id
group by i.color
order by return_rate desc nulls last 
limit 4;

    color     | return_rate 
--------------+-------------
 Black        |       0.43
 Blue         |       0.41
 White        |       0.40
 Yellow       |       0.39
Based on this query I would then do a new one that groups all orders by day(date) and then sums up the total return rate and the return rate for the top returned colors during a time range. I would also like to be able to filter this by for example size, category etc on the items. It will be used for a dynamic report where people can see the total number of orders, returns, rate as well as a line chart of the average return rate, top returned colors over a selected time range. Is there any better ways of doing this? It feels a bit wrong to traverse and do all of the joins twice. Read some about window functions but wasn't able to figure out if it was applicable here. Started on a fiddle here with my naive approach as well as one with CTE to try to reuse the bulk data. https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10700 It obviously performs a bit different locally with around 100k rows, here's a explain analyze plan of the CTE query being run locally.
QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=50730.74..50809.91 rows=2262 width=68) (actual time=350.787..351.912 rows=722 loops=1)
   Group Key: ((order_return_items.date)::date), order_return_items.color
   CTE order_return_items
     ->  HashAggregate  (cost=33884.69..39240.19 rows=226239 width=33) (actual time=211.390..258.041 rows=194309 loops=1)
           Group Key: o.ordered_at, i.color
           Planned Partitions: 4  Batches: 5  Memory Usage: 9265kB  Disk Usage: 7736kB
           ->  Hash Left Join  (cost=5740.52..14725.07 rows=226239 width=25) (actual time=44.080..158.610 rows=226902 loops=1)
                 Hash Cond: (ol.id = r.order_line_id)
                 ->  Hash Join  (cost=3754.69..9321.56 rows=226239 width=29) (actual time=31.592..107.385 rows=226239 loops=1)
                       Hash Cond: (ol.item_id = i.id)
                       ->  Hash Join  (cost=3572.53..8544.81 rows=226239 width=28) (actual time=29.966..79.287 rows=226239 loops=1)
                             Hash Cond: (ol.order_id = o.id)
                             ->  Seq Scan on order_lines ol  (cost=0.00..4378.39 rows=226239 width=28) (actual time=0.012..10.556 rows=226239 loops=1)
                             ->  Hash  (cost=2363.90..2363.90 rows=96690 width=16) (actual time=29.931..29.931 rows=96690 loops=1)
                                   Buckets: 131072  Batches: 1  Memory Usage: 5557kB
                                   ->  Seq Scan on orders o  (cost=0.00..2363.90 rows=96690 width=16) (actual time=0.005..13.289 rows=96690 loops=1)
                       ->  Hash  (cost=137.63..137.63 rows=3563 width=17) (actual time=1.620..1.621 rows=4462 loops=1)
                             Buckets: 8192 (originally 4096)  Batches: 1 (originally 1)  Memory Usage: 300kB
                             ->  Seq Scan on items i  (cost=0.00..137.63 rows=3563 width=17) (actual time=0.014..0.754 rows=4462 loops=1)
                 ->  Hash  (cost=1248.70..1248.70 rows=58970 width=12) (actual time=12.385..12.385 rows=61098 loops=1)
                       Buckets: 65536  Batches: 1  Memory Usage: 3376kB
                       ->  Seq Scan on returns r  (cost=0.00..1248.70 rows=58970 width=12) (actual time=0.007..5.392 rows=61098 loops=1)
   ->  Sort  (cost=11490.55..11496.20 rows=2262 width=52) (actual time=350.765..351.008 rows=6567 loops=1)
         Sort Key: ((order_return_items.date)::date), order_return_items.color
         Sort Method: quicksort  Memory: 626kB
         ->  Hash Join  (cost=6227.62..11364.52 rows=2262 width=52) (actual time=323.014..349.528 rows=6567 loops=1)
               Hash Cond: (order_return_items.color = most_returned_colors.color)
               ->  CTE Scan on order_return_items  (cost=0.00..4524.78 rows=226239 width=56) (actual time=211.393..223.913 rows=194309 loops=1)
               ->  Hash  (cost=6227.60..6227.60 rows=2 width=32) (actual time=111.569..111.571 rows=2 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     ->  Subquery Scan on most_returned_colors  (cost=6227.57..6227.60 rows=2 width=32) (actual time=111.564..111.566 rows=2 loops=1)
                           ->  Limit  (cost=6227.57..6227.58 rows=2 width=64) (actual time=111.564..111.565 rows=2 loops=1)
                                 ->  Sort  (cost=6227.57..6228.07 rows=200 width=64) (actual time=111.562..111.563 rows=2 loops=1)
                                       Sort Key: (trunc((sum(order_return_items_1.r_q) / sum(order_return_items_1.ol_q)), 3)) DESC NULLS LAST
                                       Sort Method: top-N heapsort  Memory: 25kB
                                       ->  HashAggregate  (cost=6221.57..6225.57 rows=200 width=64) (actual time=111.538..111.550 rows=44 loops=1)
                                             Group Key: order_return_items_1.color
                                             Batches: 1  Memory Usage: 48kB
                                             ->  CTE Scan on order_return_items order_return_items_1  (cost=0.00..4524.78 rows=226239 width=48) (actual time=0.000..85.098 rows=194309 loops=1)
Would also appreciate feedback on the overall structure as well as the indices if any!
Asked by nbon (113 rep)
Oct 30, 2023, 01:33 PM
Last activity: Nov 1, 2023, 01:40 AM