PostgreSQL - Index efficiently on REGEXP_REPLACE()
I have a query that searches addresses for duplicates using REGEXP_REPLACE. I am trying to index the regex expression because, according to EXPLAIN, the plan does a sequential scan on the address table with a filter on the regex:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) with user_detail AS (
SELECT user_id,
max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
FROM PUBLIC.user_property cp
JOIN PUBLIC.user c using (user_id)
WHERE c.user_group_id= '38'
AND cp.user_property_is_active
GROUP BY user_id
),
duplicate as (
SELECT COALESCE(MAX(
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
THEN 1 ELSE 0 END), 0) AS dup_name_address,
COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
(REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
OR
(REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Len')
)
AND cd.user_id != '2589384'
), dup_dob_address AS (
SELECT
COALESCE(MAX(CASE WHEN
(cd.DOB IS NOT NULL AND cd.DOB = '1982-06-14 00:00:00') OR (cd.BIRTH_NUMBER IS NOT NULL AND cd.BIRTH_NUMBER = null )
THEN 1 ELSE 0 END), 0) AS dob
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
)
AND cd.user_id != '2589384'
)
SELECT * FROM duplicate, dup_dob_address;
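To make the normalisation step concrete, here is a minimal sketch of what the expression used throughout the query returns. The input literals are made up for illustration and are not taken from the tables.

```sql
-- Illustration only: the sample strings below are invented.
-- The pattern E'\\_|\\W' matches underscores and every non-word character,
-- which includes spaces and punctuation.
SELECT REGEXP_REPLACE('Flat 25, Arliss_Court 24', E'\\_|\\W', '', 'g') AS normalised_line1,
       REGEXP_REPLACE(UPPER('wd17 1jy'), E'\\_|\\W', '', 'g')          AS normalised_postcode;
-- normalised_line1    = 'Flat25ArlissCourt24'
-- normalised_postcode = 'WD171JY'
```

Because the expression strips spaces, its result can only equal a literal that contains no spaces, and the UPPER() variants can only equal an upper-case literal. If the comparison values shown in the query are the real ones rather than anonymised placeholders, they would presumably need the same normalisation applied before comparing.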
Explain result:
Nested Loop (cost=492738.45..492738.50 rows=1 width=12) (actual time=7589.136..7590.933 rows=1 loops=1)
Output: (COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0))
Buffers: shared hit=931500 read=103761
CTE user_detail
-> Finalize HashAggregate (cost=423105.99..426854.87 rows=374888 width=104) (actual time=6110.633..6172.107 rows=115625 loops=1)
Output: cp.user_id, max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
Group Key: cp.user_id
Buffers: shared hit=908203 read=103761
-> Gather (cost=335007.31..413733.79 rows=749776 width=104) (actual time=6024.383..6062.501 rows=115625 loops=1)
Output: cp.user_id, (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7))), (PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33)))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=908203 read=103761
-> Partial HashAggregate (cost=334007.31..337756.19 rows=374888 width=104) (actual time=6017.847..6037.215 rows=38542 loops=3)
Output: cp.user_id, PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
Group Key: cp.user_id
Buffers: shared hit=908203 read=103761
Worker 0: actual time=6017.372..6035.986 rows=37261 loops=1
Buffers: shared hit=292969 read=33275
Worker 1: actual time=6012.321..6032.378 rows=40788 loops=1
Buffers: shared hit=320593 read=35787
-> Nested Loop (cost=1630.78..321001.76 rows=520222 width=30) (actual time=48.770..5900.888 rows=434730 loops=3)
Output: cp.user_id, cp.user_property_value, cp.user_property_type_id
Buffers: shared hit=908203 read=103761
Worker 0: actual time=45.466..5905.504 rows=420402 loops=1
Buffers: shared hit=292969 read=33275
Worker 1: actual time=44.758..5889.927 rows=459654 loops=1
Buffers: shared hit=320593 read=35787
-> Parallel Bitmap Heap Scan on public.user c (cost=1630.22..22201.58 rows=48268 width=4) (actual time=26.536..39.410 rows=38542 loops=3)
Output: c.user_id, c.currency_code, c.user_group_id, c.user_created_on, c.user_status_id, c.user_max_credit, c.user_last_updated_on, c.user_version
Recheck Cond: (c.user_group_id = 38)
Heap Blocks: exact=2249
Buffers: shared hit=6896 read=319
Worker 0: actual time=22.735..35.486 rows=37261 loops=1
Buffers: shared hit=2303
Worker 1: actual time=22.766..36.418 rows=40788 loops=1
Buffers: shared hit=2343
-> Bitmap Index Scan on idx_user_user_group_id (cost=0.00..1601.26 rows=115844 width=0) (actual time=33.224..33.224 rows=115625 loops=1)
Index Cond: (c.user_group_id = 38)
Buffers: shared hit=1 read=319
-> Index Scan using idx_user_id_user_property on public.user_property cp (cost=0.56..5.51 rows=68 width=30) (actual time=0.036..0.150 rows=11 loops=115625)
Output: cp.user_id, cp.user_property_type_id, cp.user_property_created_on, cp.user_property_is_active, cp.user_property_value, cp.user_property_upper_value, cp.user_property_version
Index Cond: (cp.user_id = c.user_id)
Buffers: shared hit=901307 read=103442
Worker 0: actual time=0.038..0.156 rows=11 loops=37261
Buffers: shared hit=290666 read=33275
Worker 1: actual time=0.034..0.142 rows=11 loops=40788
Buffers: shared hit=318250 read=35787
-> Aggregate (cost=19766.95..19766.96 rows=1 width=8) (actual time=6882.602..6882.605 rows=1 loops=1)
Output: COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0), COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)
Buffers: shared hit=908203 read=103761
-> Nested Loop Left Join (cost=0.42..19766.22 rows=21 width=110) (actual time=6882.596..6882.597 rows=0 loops=1)
Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode
Buffers: shared hit=908203 read=103761
-> CTE Scan on user_detail cd (cost=0.00..19681.62 rows=19 width=4) (actual time=6882.595..6882.595 rows=0 loops=1)
Output: cd.user_id, cd.first_name, cd.last_name, cd.dob, cd.birth_number
Filter: ((cd.user_id <> 2589384) AND (((regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text)) OR ((regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text))))
Rows Removed by Filter: 115625
Buffers: shared hit=908203 read=103761
-> Index Scan using address_idx_01 on public.address ad (cost=0.42..4.44 rows=1 width=114) (never executed)
Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode, ad.user_id
Index Cond: (ad.user_id = cd.user_id)
-> Aggregate (cost=46116.63..46116.64 rows=1 width=4) (actual time=706.525..707.941 rows=1 loops=1)
Output: COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0)
Buffers: shared hit=23297
-> Hash Join (cost=36282.83..46116.62 rows=1 width=36) (actual time=706.520..707.934 rows=0 loops=1)
Output: cd_1.dob, cd_1.birth_number
Hash Cond: (cd_1.user_id = ad_1.user_id)
Buffers: shared hit=23297
-> CTE Scan on user_detail cd_1 (cost=0.00..8434.98 rows=373014 width=40) (actual time=0.002..0.003 rows=1 loops=1)
Output: cd_1.user_id, cd_1.first_name, cd_1.last_name, cd_1.dob, cd_1.birth_number
Filter: (cd_1.user_id <> 2589384)
-> Hash (cost=36282.82..36282.82 rows=1 width=4) (actual time=706.499..707.911 rows=0 loops=1)
Output: ad_1.user_id
Buckets: 1024 Batches: 1 Memory Usage: 8kB
Buffers: shared hit=23297
-> Gather (cost=1000.00..36282.82 rows=1 width=4) (actual time=706.496..707.907 rows=0 loops=1)
Output: ad_1.user_id
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=23297
-> Parallel Seq Scan on public.address ad_1 (cost=0.00..35282.72 rows=1 width=4) (actual time=701.969..701.970 rows=0 loops=3)
Output: ad_1.user_id
Filter: (((ad_1.address_country_code)::text = 'GB'::text) AND (regexp_replace((ad_1.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND (regexp_replace((ad_1.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((COALESCE(regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)))
Rows Removed by Filter: 295033
Buffers: shared hit=23297
Worker 0: actual time=699.642..699.644 rows=0 loops=1
Buffers: shared hit=7331
Worker 1: actual time=700.498..700.499 rows=0 loops=1
Buffers: shared hit=7984
Planning Time: 17.292 ms
Execution Time: 7601.934 ms
https://explain.depesz.com/s/cbmv#html
I looked at a similar post about using the pg_trgm extension, but creating a trigram index made no difference:
create index concurrently on address using gin (address_place gin_trgm_ops);
The user_property table has approximately 2.5 million rows, and the address table is also small, at fewer than 1.7 million rows.
Is there an efficient way to index on REGEXP_REPLACE, or would the query need to be redesigned?
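For reference, one option that might be worth testing is an expression index. The trigram index above is built on the raw address_place column, so it does not apply to an equality test on REGEXP_REPLACE(address_place, ...). The sketch below is untested against this schema: the index name is hypothetical, and it assumes the predicate in the query repeats the indexed expression exactly.

```sql
-- Hypothetical index name; the indexed expression is copied from the query.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY address_line1_normalised_idx
    ON address (REGEXP_REPLACE(address_line1, E'\\_|\\W', '', 'g'));

-- Collect statistics on the indexed expression so the planner can estimate it.
ANALYZE address;

-- A predicate written with the identical expression is eligible to use the index.
-- The literal is an invented, already-normalised value.
EXPLAIN
SELECT user_id
FROM address
WHERE REGEXP_REPLACE(address_line1, E'\\_|\\W', '', 'g') = 'Flat25ArlissCourt24';
```

Whether the planner actually chooses the index depends on the data and on how selective the normalised value is, so the EXPLAIN output would need to be checked.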
Any help is much appreciated.
Asked by rdbmsNoob
(459 rep)
Dec 9, 2022, 09:58 PM
Last activity: Dec 15, 2022, 01:18 AM