Sample Header Ad - 728x90

Is this DELETE ... WHERE EXISTS query plan O(n*2) or O(n^2)?

5 votes
2 answers
494 views
I'm trying to perform a common task, deleting duplicates from a table with the aim of adding a unique constraint.
CREATE TABLE IF NOT EXISTS item_identifier (
    pk       BIGSERIAL PRIMARY KEY,
    prefix   INTEGER NOT NULL,
    suffix   VARCHAR(1024) NOT NULL
);
CREATE INDEX temp_prefix_suffix_idx ON item_identifier (prefix, suffix);
I want to delete duplicates using a common query that can be found in many answers on this site. I think the rate of duplicates runs to about 1%, so there are not many to remove. Index is provided purely to help this de-duplicate and will be dropped later. Though, as you see, it isn't even used by PostgreSQL! There are 2,759,559,168 rows. The temp_prefix_suffix_idx index itself is ~ 100 GB. The CREATE INDEX took 12 hours so I don't expect the DELETE to be quick. But from a 10% sample set I extrapolated that it would take 20 hours, and it's already taken 40 hours. It's probably still within the margin of error for my sample method, but I am worried that this will take exponential time due to it not using indexes. This EXPLAIN has Seq Scan on item_identifier a and Seq Scan on item_identifier b.
EXPLAIN DELETE FROM item_identifier a
  WHERE EXISTS
    (SELECT FROM item_identifier b
       WHERE a.prefix = b.prefix
       AND a.suffix = b.suffix
       AND a.pk > b.pk);
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Delete on item_identifier a  (cost=1168440103.12..1233224771.45 rows=0 width=0)
   ->  Merge Semi Join  (cost=1168440103.12..1233224771.45 rows=919853056 width=12)
         Merge Cond: ((a.prefix = b.prefix) AND ((a.suffix)::text = (b.suffix)::text))
         Join Filter: (a.pk > b.pk)
         ->  Sort  (cost=584220051.56..591118949.48 rows=2759559168 width=52)
               Sort Key: a.prefix, a.suffix
               ->  Seq Scan on item_identifier a  (cost=0.00..57175596.68 rows=2759559168 width=52)
         ->  Materialize  (cost=584220051.56..598017847.40 rows=2759559168 width=52)
               ->  Sort  (cost=584220051.56..591118949.48 rows=2759559168 width=52)
                     Sort Key: b.prefix, b.suffix
                     ->  Seq Scan on item_identifier b  (cost=0.00..57175596.68 rows=2759559168 width=52)
Can I assume that PostgreSQL is making the right choice with not using an index? As another point of reference, another commonly suggested method gives similar results:
explain DELETE FROM item_identifier
WHERE pk IN (SELECT pk FROM (
        SELECT pk, row_number() OVER w as rnum
        FROM item_identifier
        WINDOW w AS (
            PARTITION BY prefix, suffix
            ORDER BY pk)
    ) t
WHERE t.rnum > 1);
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Delete on item_identifier  (cost=833491464.98..955347491.91 rows=0 width=0)
   ->  Merge Semi Join  (cost=833491464.98..955347491.91 rows=919853056 width=38)
         Merge Cond: (item_identifier.pk = t.pk)
         ->  Index Scan using item_identifier_pkey on item_identifier  (cost=0.58..101192612.10 rows=2759559168 width=14)
         ->  Sort  (cost=833476299.40..835775932.04 rows=919853056 width=40)
               Sort Key: t.pk
               ->  Subquery Scan on t  (cost=574787964.56..671372535.44 rows=919853056 width=40)
                     Filter: (t.rnum > 1)
                     ->  WindowAgg  (cost=574787964.56..636878045.84 rows=2759559168 width=54)
                           ->  Sort  (cost=574787964.56..581686862.48 rows=2759559168 width=46)
                                 Sort Key: item_identifier_1.prefix, item_identifier_1.suffix, item_identifier_1.pk
                                 ->  Seq Scan on item_identifier item_identifier_1  (cost=0.00..57175596.68 rows=2759559168 width=46)
The EXISTS method has a cost of 1,168,440,103 .. 1,233,224,771. The PARTITION method has a cost of 833,000,000 .. 955,000,000 (and uses the index, though not the one I thought would be uesful for the purpose!). They are close enough that I think PostgreSQL isn't making an error in its analysis of EXISTS. And is this doing a one-off doble table-scan of approx O(n*2) or is it nesting them, resulting in something more like O(n^2)?
Asked by Joe (1655 rep)
Jul 4, 2022, 10:24 AM
Last activity: Jul 4, 2022, 05:10 PM