PostgreSQL query performance issue

1 vote
3 answers
269 views
We are sometimes getting poor performance (~14s) when using the following query (PostgreSQL 9.6) to fetch rows from the table items whose ID is present in the table items_categories:

SELECT items.*
FROM items
WHERE EXISTS (
    SELECT item_id
    FROM items_categories
    WHERE item_id = items.id
      AND category_id = 626
)
AND items.active = TRUE
-- possibly some other "AND" clauses here to apply more filters on "items",
-- but not considered for this question
ORDER BY modified_at DESC
LIMIT 10

Relevant parts of our schema:

                           Table "public.items"
    Column    |            Type             |                     Modifiers
--------------+-----------------------------+----------------------------------------------------
 id           | integer                     | not null default nextval('items_id_seq'::regclass)
 active       | boolean                     | default true
 modified_at  | timestamp without time zone | default now()
Indexes:
    "items_pkey" PRIMARY KEY, btree (id)
    "active_idx" btree (active)
    "aggregate_idx" btree (id)
    "items_modified_at_idx" btree (modified_at)

   Table "public.items_categories"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 item_id     | integer | not null
 category_id | integer | not null
Indexes:
    "unique_cat_item_assoc" UNIQUE CONSTRAINT, btree (item_id, category_id)
    "items_categories_1_idx" btree (category_id)
    "items_categories_2_idx" btree (item_id)
Foreign-key constraints:
    "items_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id)
    "items_categories_item_id_fkey" FOREIGN KEY (item_id) REFERENCES items(id)

The table items contains ~2 M rows and the table items_categories contains ~4 M rows.

When we ask for 10 items (i.e. LIMIT 10 at the end of the above query) and 10 or more rows match in items_categories, performance is good (~10ms). But when fewer than 10 rows match in items_categories, the query takes ~14s because it does a backward index scan on items.modified_at, looking at each of the ~2 M rows.

Query plan when fewer than 10 rows match in items_categories (poor performance):

Limit  (cost=0.86..11696.68 rows=10 width=1797) (actual time=168.376..14484.854 rows=7 loops=1)
  ->  Nested Loop Semi Join  (cost=0.86..2746178.23 rows=2348 width=1797) (actual time=168.376..14484.836 rows=7 loops=1)
        ->  Index Scan Backward using items_modified_at_idx on items  (cost=0.43..1680609.95 rows=2243424 width=1797) (actual time=0.054..7611.300 rows=2251395 loops=1)
              Filter: active
              Rows Removed by Filter: 2467
        ->  Index Only Scan using unique_cat_item_assoc on items_categories  (cost=0.43..0.47 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=2251395)
              Index Cond: ((item_id = items.id) AND (category_id = 626))
              Heap Fetches: 7
Planning time: 3.082 ms
Execution time: 14485.057 ms

Query plan when more than 10 rows match in items_categories (good performance):

Limit  (cost=0.86..24.07 rows=10 width=1857) (actual time=3.575..3.757 rows=10 loops=1)
  ->  Nested Loop Semi Join  (cost=0.86..2763459.56 rows=1190819 width=1857) (actual time=3.574..3.752 rows=10 loops=1)
        ->  Index Scan Backward using items_modified_at_idx on items  (cost=0.43..1684408.22 rows=2246967 width=1857) (actual time=0.013..2.205 rows=751 loops=1)
              Filter: active
        ->  Index Only Scan using unique_cat_item_assoc on items_categories  (cost=0.43..0.47 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=751)
              Index Cond: ((item_id = items.id) AND (category_id = 20))
              Heap Fetches: 10
Planning time: 1.650 ms
Execution time: 3.868 ms

How can we tune this query to handle both situations, i.e. get good performance no matter how many rows of items_categories match?
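For reference, the plans above are standard EXPLAIN ANALYZE output; both cases can be reproduced by swapping the sparse category (626, 7 matches) for a dense one (20):

EXPLAIN ANALYZE
SELECT items.*
FROM items
WHERE EXISTS (
    SELECT item_id
    FROM items_categories
    WHERE item_id = items.id
      AND category_id = 626  -- 20 (or any dense category) gives the fast case
)
AND items.active = TRUE
ORDER BY modified_at DESC
LIMIT 10;

If we read the Limit node correctly, the planner assumes the estimated 2348 matching rows are spread evenly through the index, so it prices the limited scan at roughly 2746178.23 × 10 / 2348 ≈ 11697; when only 7 rows actually match, the backward scan ends up walking the whole index instead.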
I have a POC working where I first count the number of matching rows in items_categories (a separate query); if that count is low, I use a CTE to work on the small subset of items instead of all of its rows. But it's really a dirty temporary hack, in my opinion: if the count is high, the CTE takes too long and it's more efficient NOT to use it (i.e. the above query performs better). Thank you!
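For what it's worth, here is a rough sketch of that POC (the threshold check lives in application code, and the hard-coded category ID is a placeholder; it relies on the CTE being an optimization fence in 9.6, so the planner materializes the few matching IDs first instead of walking items_modified_at_idx backwards):

-- Step 1 (separate query): count the associations for this category
SELECT count(*) FROM items_categories WHERE category_id = 626;

-- Step 2, only when the count is below the LIMIT: start from the matching
-- item IDs instead of from the modified_at index. The unique constraint on
-- (item_id, category_id) guarantees the join produces no duplicate items.
WITH matching AS (
    SELECT item_id
    FROM items_categories
    WHERE category_id = 626
)
SELECT items.*
FROM items
JOIN matching ON matching.item_id = items.id
WHERE items.active = TRUE
ORDER BY items.modified_at DESC
LIMIT 10;

When the count is high, the CTE materializes far too many rows and the original query wins, hence the ugly branching in application code.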
Asked by xav (113 rep)
Jul 26, 2021, 11:31 AM
Last activity: May 29, 2025, 11:01 AM