How can I speed up GIST index query for most similar n items?
2
votes
0
answers
44
views
I have the following tables and indices:
CREATE TABLE IF NOT EXISTS users
(
id NUMERIC(20, 0) NOT NULL DEFAULT NEXTVAL('users_sequence') PRIMARY KEY,
first_name VARCHAR(512) DEFAULT NULL NULL,
last_name VARCHAR(512) DEFAULT NULL NULL,
full_name VARCHAR(1024) GENERATED ALWAYS AS
(CASE
WHEN first_name IS NULL AND
last_name IS NULL THEN NULL
ELSE
(TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, ''))) END) STORED,
deleted_at TIMESTAMP DEFAULT NULL NULL,
-- Some ~20 columns
);
CREATE TABLE IF NOT EXISTS user_aliases
(
id NUMERIC(20, 0) NOT NULL DEFAULT NEXTVAL('user_aliases_sequence') PRIMARY KEY,
entry_id NUMERIC(20, 0) NOT NULL,
first_name VARCHAR(512) DEFAULT NULL NULL,
last_name VARCHAR(512) DEFAULT NULL NULL,
full_name VARCHAR(1024) GENERATED ALWAYS AS
(CASE
WHEN first_name IS NULL AND
last_name IS NULL THEN NULL
ELSE
(TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, ''))) END) STORED,
deleted_at TIMESTAMP DEFAULT NULL NULL,
CONSTRAINT fk_user_aliases_entry_id FOREIGN KEY (entry_id) REFERENCES users (id) ON UPDATE CASCADE ON DELETE CASCADE
);
CREATE INDEX users_full_name_idx ON users USING GIST (full_name gist_trgm_ops) WHERE deleted_at IS NULL AND full_name IS NOT NULL;
CREATE INDEX user_aliases_full_name_idx ON user_aliases USING GIST (full_name gist_trgm_ops) WHERE deleted_at IS NULL AND full_name IS NOT NULL;
and my query:
SELECT id
FROM ((SELECT e.id,
e.full_name
FROM users e
WHERE e.full_name % 'some name here'
AND e.deleted_at IS NULL
LIMIT 10)
UNION
(SELECT a.entry_id AS id,
a.full_name
FROM user_aliases a
WHERE a.full_name % 'some name here'
AND a.deleted_at IS NULL
LIMIT 10)) filter_table
LIMIT 10
and this takes around ~300ms on average on a table with ~500K records, with the following execution plan:
Limit (cost=88.71..88.91 rows=10 width=536) (actual time=322.001..322.009 rows=2 loops=1)
" Output: e.id, e.full_name"
Buffers: shared hit=44390
-> HashAggregate (cost=88.71..88.91 rows=20 width=536) (actual time=321.998..322.004 rows=2 loops=1)
" Output: e.id, e.full_name"
" Group Key: e.id, e.full_name"
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=44390
-> Append (cost=0.41..88.61 rows=20 width=536) (actual time=158.361..321.966 rows=2 loops=1)
Buffers: shared hit=44390
-> Limit (cost=0.41..43.92 rows=10 width=27) (actual time=158.358..206.304 rows=2 loops=1)
" Output: e.id, e.full_name"
Buffers: shared hit=30910
-> Index Scan using users_full_name_idx on master.users e (cost=0.41..209.25 rows=48 width=27) (actual time=158.355..206.290 rows=2 loops=1)
" Output: e.id, e.full_name"
Index Cond: ((e.full_name)::text % 'some name here'::text)
Buffers: shared hit=30910
-> Limit (cost=0.41..44.39 rows=10 width=37) (actual time=115.641..115.643 rows=0 loops=1)
" Output: a.entry_id, a.full_name"
Buffers: shared hit=13480
-> Index Scan using user_aliases_full_name_idx on master.user_aliases a (cost=0.41..92.77 rows=21 width=37) (actual time=115.633..115.633 rows=0 loops=1)
" Output: a.entry_id, a.full_name"
Index Cond: ((a.full_name)::text % 'some name here'::text)
Buffers: shared hit=13480
"Settings: effective_cache_size = '16GB', search_path = 'master', work_mem = '32MB'"
Planning Time: 1.210 ms
Execution Time: 322.081 ms
Now it is not the ideal query, since it LIMIT
s arbitrary records. When I update the query to:
SELECT id
FROM ((SELECT e.id,
e.full_name
FROM users e
WHERE e.full_name % 'some name here'
AND e.deleted_at IS NULL
ORDER BY e.full_name 'some name here'
LIMIT 10)
UNION
(SELECT a.entry_id AS id,
a.full_name
FROM user_aliases a
WHERE a.full_name % 'some name here'
AND a.deleted_at IS NULL
ORDER BY a.full_name 'some name here'
LIMIT 10)) filter_table
LIMIT 10
it now performs as it should, but the average execution time rises to ~700ms with the following plan:
Limit (cost=88.81..89.01 rows=10 width=536) (actual time=324.831..324.839 rows=2 loops=1)
" Output: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
Buffers: shared hit=44390
-> HashAggregate (cost=88.81..89.01 rows=20 width=536) (actual time=324.827..324.834 rows=2 loops=1)
" Output: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
" Group Key: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=44390
-> Append (cost=0.41..88.71 rows=20 width=536) (actual time=232.209..324.809 rows=2 loops=1)
Buffers: shared hit=44390
" -> Subquery Scan on ""*SELECT* 1"" (cost=0.41..44.07 rows=10 width=27) (actual time=232.207..232.219 rows=2 loops=1)"
" Output: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
Buffers: shared hit=30910
-> Limit (cost=0.41..43.97 rows=10 width=31) (actual time=232.204..232.214 rows=2 loops=1)
" Output: e.id, e.full_name, (((e.full_name)::text 'some name here'::text))"
Buffers: shared hit=30910
-> Index Scan using users_full_name_idx on master.users e (cost=0.41..209.49 rows=48 width=31) (actual time=232.201..232.209 rows=2 loops=1)
" Output: e.id, e.full_name, ((e.full_name)::text 'some name here'::text)"
Index Cond: ((e.full_name)::text % 'some name here'::text)
Order By: ((e.full_name)::text 'some name here'::text)
Buffers: shared hit=30910
" -> Subquery Scan on ""*SELECT* 2"" (cost=0.41..44.54 rows=10 width=37) (actual time=92.579..92.581 rows=0 loops=1)"
" Output: ""*SELECT* 2"".id, ""*SELECT* 2"".full_name"
Buffers: shared hit=13480
-> Limit (cost=0.41..44.44 rows=10 width=41) (actual time=92.577..92.577 rows=0 loops=1)
" Output: a.entry_id, a.full_name, (((a.full_name)::text 'some name here'::text))"
Buffers: shared hit=13480
-> Index Scan using user_aliases_full_name_idx on master.user_aliases a (cost=0.41..92.88 rows=21 width=41) (actual time=92.572..92.573 rows=0 loops=1)
" Output: a.entry_id, a.full_name, ((a.full_name)::text 'some name here'::text)"
Index Cond: ((a.full_name)::text % 'some name here'::text)
Order By: ((a.full_name)::text 'some name here'::text)
Buffers: shared hit=13480
"Settings: effective_cache_size = '16GB', search_path = 'master', work_mem = '32MB'"
Planning Time: 1.115 ms
Execution Time: 324.906 ms
How do I get the most similar records, without generating the extra execution time overhead?
Asked by Hasan Can Saral
(175 rep)
Jun 11, 2025, 10:41 AM