PostgreSQL - Index efficiently on REGEXP_REPLACE()

5 votes

1 answer

1107 views

I have a query that searches addresses for duplicates using REGEXP_REPLACE. Running EXPLAIN shows a sequential scan on the user_property table with a filter on the regex, so I am trying to create an index on the regex expression.

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)  with user_detail AS (
        SELECT user_id,
            max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
            max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
            max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
            max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
        FROM PUBLIC.user_property cp
        JOIN PUBLIC.user c using (user_id)
        WHERE c.user_group_id= '38'
        AND cp.user_property_is_active
        GROUP BY user_id
    ),
    duplicate as (
        SELECT COALESCE(MAX(
                CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
                AND (
                    COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
                    OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
                )
                AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
                AND address_country_code = 'GB'
            THEN 1 ELSE 0 END), 0) AS dup_name_address,
            COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
        FROM
            user_detail cd
        LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
        WHERE  (
          (REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
                  AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
            OR
                  (REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
                  AND REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Len')
            )
            AND cd.user_id != '2589384'
    ), dup_dob_address AS (
        SELECT
            COALESCE(MAX(CASE WHEN
                (cd.DOB IS NOT NULL AND cd.DOB = '1982-06-14 00:00:00') OR (cd.BIRTH_NUMBER IS NOT NULL AND cd.BIRTH_NUMBER = null )
             THEN 1 ELSE 0 END), 0) AS dob
        FROM
            user_detail cd
        LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
        WHERE (
                REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
                AND (
                    COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
                    OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
                )
                AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
                AND address_country_code = 'GB'
            )
        AND cd.user_id != '2589384'
    )
    SELECT * FROM duplicate, dup_dob_address;

Explain result:

Nested Loop  (cost=492738.45..492738.50 rows=1 width=12) (actual time=7589.136..7590.933 rows=1 loops=1)
  Output: (COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0))
  Buffers: shared hit=931500 read=103761
  CTE user_detail
    ->  Finalize HashAggregate  (cost=423105.99..426854.87 rows=374888 width=104) (actual time=6110.633..6172.107 rows=115625 loops=1)
          Output: cp.user_id, max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
          Group Key: cp.user_id
          Buffers: shared hit=908203 read=103761
          ->  Gather  (cost=335007.31..413733.79 rows=749776 width=104) (actual time=6024.383..6062.501 rows=115625 loops=1)
                Output: cp.user_id, (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7))), (PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33)))
                Workers Planned: 2
                Workers Launched: 2
                Buffers: shared hit=908203 read=103761
                ->  Partial HashAggregate  (cost=334007.31..337756.19 rows=374888 width=104) (actual time=6017.847..6037.215 rows=38542 loops=3)
                      Output: cp.user_id, PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
                      Group Key: cp.user_id
                      Buffers: shared hit=908203 read=103761
                      Worker 0: actual time=6017.372..6035.986 rows=37261 loops=1
                        Buffers: shared hit=292969 read=33275
                      Worker 1: actual time=6012.321..6032.378 rows=40788 loops=1
                        Buffers: shared hit=320593 read=35787
                      ->  Nested Loop  (cost=1630.78..321001.76 rows=520222 width=30) (actual time=48.770..5900.888 rows=434730 loops=3)
                            Output: cp.user_id, cp.user_property_value, cp.user_property_type_id
                            Buffers: shared hit=908203 read=103761
                            Worker 0: actual time=45.466..5905.504 rows=420402 loops=1
                              Buffers: shared hit=292969 read=33275
                            Worker 1: actual time=44.758..5889.927 rows=459654 loops=1
                              Buffers: shared hit=320593 read=35787
                            ->  Parallel Bitmap Heap Scan on public.user c  (cost=1630.22..22201.58 rows=48268 width=4) (actual time=26.536..39.410 rows=38542 loops=3)
                                  Output: c.user_id, c.currency_code, c.user_group_id, c.user_created_on, c.user_status_id, c.user_max_credit, c.user_last_updated_on, c.user_version
                                  Recheck Cond: (c.user_group_id = 38)
                                  Heap Blocks: exact=2249
                                  Buffers: shared hit=6896 read=319
                                  Worker 0: actual time=22.735..35.486 rows=37261 loops=1
                                    Buffers: shared hit=2303
                                  Worker 1: actual time=22.766..36.418 rows=40788 loops=1
                                    Buffers: shared hit=2343
                                  ->  Bitmap Index Scan on idx_user_user_group_id  (cost=0.00..1601.26 rows=115844 width=0) (actual time=33.224..33.224 rows=115625 loops=1)
                                        Index Cond: (c.user_group_id = 38)
                                        Buffers: shared hit=1 read=319
                            ->  Index Scan using idx_user_id_user_property on public.user_property cp  (cost=0.56..5.51 rows=68 width=30) (actual time=0.036..0.150 rows=11 loops=115625)
                                  Output: cp.user_id, cp.user_property_type_id, cp.user_property_created_on, cp.user_property_is_active, cp.user_property_value, cp.user_property_upper_value, cp.user_property_version
                                  Index Cond: (cp.user_id = c.user_id)
                                  Buffers: shared hit=901307 read=103442
                                  Worker 0: actual time=0.038..0.156 rows=11 loops=37261
                                    Buffers: shared hit=290666 read=33275
                                  Worker 1: actual time=0.034..0.142 rows=11 loops=40788
                                    Buffers: shared hit=318250 read=35787
  ->  Aggregate  (cost=19766.95..19766.96 rows=1 width=8) (actual time=6882.602..6882.605 rows=1 loops=1)
        Output: COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0), COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)
        Buffers: shared hit=908203 read=103761
        ->  Nested Loop Left Join  (cost=0.42..19766.22 rows=21 width=110) (actual time=6882.596..6882.597 rows=0 loops=1)
              Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode
              Buffers: shared hit=908203 read=103761
              ->  CTE Scan on user_detail cd  (cost=0.00..19681.62 rows=19 width=4) (actual time=6882.595..6882.595 rows=0 loops=1)
                    Output: cd.user_id, cd.first_name, cd.last_name, cd.dob, cd.birth_number
                    Filter: ((cd.user_id <> 2589384) AND (((regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text)) OR ((regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text))))
                    Rows Removed by Filter: 115625
                    Buffers: shared hit=908203 read=103761
              ->  Index Scan using address_idx_01 on public.address ad  (cost=0.42..4.44 rows=1 width=114) (never executed)
                    Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode, ad.user_id
                    Index Cond: (ad.user_id = cd.user_id)
  ->  Aggregate  (cost=46116.63..46116.64 rows=1 width=4) (actual time=706.525..707.941 rows=1 loops=1)
        Output: COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0)
        Buffers: shared hit=23297
        ->  Hash Join  (cost=36282.83..46116.62 rows=1 width=36) (actual time=706.520..707.934 rows=0 loops=1)
              Output: cd_1.dob, cd_1.birth_number
              Hash Cond: (cd_1.user_id = ad_1.user_id)
              Buffers: shared hit=23297
              ->  CTE Scan on user_detail cd_1  (cost=0.00..8434.98 rows=373014 width=40) (actual time=0.002..0.003 rows=1 loops=1)
                    Output: cd_1.user_id, cd_1.first_name, cd_1.last_name, cd_1.dob, cd_1.birth_number
                    Filter: (cd_1.user_id <> 2589384)
              ->  Hash  (cost=36282.82..36282.82 rows=1 width=4) (actual time=706.499..707.911 rows=0 loops=1)
                    Output: ad_1.user_id
                    Buckets: 1024  Batches: 1  Memory Usage: 8kB
                    Buffers: shared hit=23297
                    ->  Gather  (cost=1000.00..36282.82 rows=1 width=4) (actual time=706.496..707.907 rows=0 loops=1)
                          Output: ad_1.user_id
                          Workers Planned: 2
                          Workers Launched: 2
                          Buffers: shared hit=23297
                          ->  Parallel Seq Scan on public.address ad_1  (cost=0.00..35282.72 rows=1 width=4) (actual time=701.969..701.970 rows=0 loops=3)
                                Output: ad_1.user_id
                                Filter: (((ad_1.address_country_code)::text = 'GB'::text) AND (regexp_replace((ad_1.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND (regexp_replace((ad_1.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((COALESCE(regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)))
                                Rows Removed by Filter: 295033
                                Buffers: shared hit=23297
                                Worker 0: actual time=699.642..699.644 rows=0 loops=1
                                  Buffers: shared hit=7331
                                Worker 1: actual time=700.498..700.499 rows=0 loops=1
                                  Buffers: shared hit=7984
Planning Time: 17.292 ms
Execution Time: 7601.934 ms

The plan is also available at https://explain.depesz.com/s/cbmv#html

I looked at a similar post about using the pg_trgm extension, but creating the following index made no difference:

create index concurrently on address using gin (address_place gin_trgm_ops);
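
One possible reason the trigram index is not picked up: it is built on the raw address_place column, while the filter in the plan compares the result of regexp_replace(address_place, ...) for equality. PostgreSQL can generally use an index for a predicate on an expression only when the index is defined on that same expression. A minimal sketch of a B-tree expression index is shown here; the index name address_normalised_idx is hypothetical, and the expressions are copied from the query so that the planner has a chance of matching them.

```sql
-- Sketch only: the index name is hypothetical; column names and the
-- regular expression are copied from the query in the question.
-- regexp_replace() is IMMUTABLE, so it is allowed in an index expression.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY address_normalised_idx
    ON address (
        address_country_code,
        (REGEXP_REPLACE(address_line1, E'\\_|\\W', '', 'g')),
        (REGEXP_REPLACE(address_place, E'\\_|\\W', '', 'g'))
    );

-- Collect statistics on the indexed expressions so the planner can
-- estimate how selective they are.
ANALYZE address;
```

Whether the planner actually prefers this index over the parallel sequential scan on address would need to be confirmed by re-running EXPLAIN. The address_line2 condition is not covered by this sketch and would remain a filter on the rows the index returns.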

The user_property table has approximately 2.5 million rows, and the address table is also fairly small at under 1.7 million rows. Is there an efficient way to index on REGEXP_REPLACE, or would the query need to be redesigned? Any help is much appreciated.
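
On the redesign question, the plan suggests that most of the time (roughly 6.2 of the 7.6 seconds) goes into building the user_detail CTE, and the name filter is only applied afterwards, discarding all 115625 aggregated rows. One direction, sketched here under assumptions, is to find candidate users by normalised name first using a partial expression index on user_property. The index name is hypothetical, and the search literals are assumed to be upper-cased because the query applies UPPER() before comparing.

```sql
-- Sketch only: hypothetical index name. Covers the normalised value of
-- first-name and last-name properties (type ids 6 and 7 in the question).
CREATE INDEX CONCURRENTLY user_property_name_norm_idx
    ON user_property (
        (REGEXP_REPLACE(UPPER(user_property_value), E'\\_|\\W', '', 'g'))
    )
    WHERE user_property_type_id IN (6, 7)
      AND user_property_is_active;

-- Candidate users with a first or last name that normalises to either
-- search value. Assumption: literals are upper-cased to match UPPER().
SELECT DISTINCT cp.user_id
FROM user_property cp
WHERE cp.user_property_type_id IN (6, 7)
  AND cp.user_property_is_active
  AND REGEXP_REPLACE(UPPER(cp.user_property_value), E'\\_|\\W', '', 'g')
      IN ('CLYDE', 'LEN');
```

This candidate list is a superset of the matches in the original query, which also requires the first name and last name to pair up and the user to belong to user_group_id 38. Those conditions would still have to be applied, for example by restricting the user_detail CTE to the candidate user_id values.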

Asked by rdbmsNoob (459 rep)

Dec 9, 2022, 09:58 PM
Last activity: Dec 15, 2022, 01:18 AM
