How can I speed up GIST index query for most similar n items?

2 votes

0 answers

44 views

I have the following tables and indices:

CREATE TABLE IF NOT EXISTS users
(
    id                  NUMERIC(20, 0)                           NOT NULL DEFAULT NEXTVAL('users_sequence') PRIMARY KEY,
    first_name          VARCHAR(512)   DEFAULT NULL              NULL,
    last_name           VARCHAR(512)   DEFAULT NULL              NULL,
    full_name VARCHAR(1024) GENERATED ALWAYS AS
        (CASE
        WHEN first_name IS NULL AND
             last_name IS NULL THEN NULL
        ELSE
             (TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, ''))) END) STORED,
    deleted_at          TIMESTAMP      DEFAULT NULL              NULL,
    -- Some ~20 columns
);

CREATE TABLE IF NOT EXISTS user_aliases
(
    id         NUMERIC(20, 0)            NOT NULL DEFAULT NEXTVAL('user_aliases_sequence') PRIMARY KEY,
    entry_id   NUMERIC(20, 0)            NOT NULL,
    first_name VARCHAR(512) DEFAULT NULL NULL,
    last_name  VARCHAR(512) DEFAULT NULL NULL,
    full_name  VARCHAR(1024) GENERATED ALWAYS AS
        (CASE
            WHEN first_name IS NULL AND
                 last_name IS NULL THEN NULL
            ELSE
                 (TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, ''))) END) STORED,
    deleted_at TIMESTAMP      DEFAULT NULL NULL,
    CONSTRAINT fk_user_aliases_entry_id FOREIGN KEY (entry_id) REFERENCES users (id) ON UPDATE CASCADE ON DELETE CASCADE
);

CREATE INDEX users_full_name_idx ON users USING GIST (full_name gist_trgm_ops) WHERE deleted_at IS NULL AND full_name IS NOT NULL;

CREATE INDEX user_aliases_full_name_idx ON user_aliases USING GIST (full_name gist_trgm_ops) WHERE deleted_at IS NULL AND full_name IS NOT NULL;

and my query:

SELECT id
FROM ((SELECT e.id,
              e.full_name
       FROM users e
       WHERE e.full_name % 'some name here'
         AND e.deleted_at IS NULL
       LIMIT 10)
      UNION
      (SELECT a.entry_id AS id,
              a.full_name
       FROM user_aliases a
       WHERE a.full_name % 'some name here'
         AND a.deleted_at IS NULL
       LIMIT 10)) filter_table
LIMIT 10

and this takes around ~300ms on average on a table with ~500K records, with the following execution plan:

Limit  (cost=88.71..88.91 rows=10 width=536) (actual time=322.001..322.009 rows=2 loops=1)
"  Output: e.id, e.full_name"
  Buffers: shared hit=44390
  ->  HashAggregate  (cost=88.71..88.91 rows=20 width=536) (actual time=321.998..322.004 rows=2 loops=1)
"        Output: e.id, e.full_name"
"        Group Key: e.id, e.full_name"
        Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=44390
        ->  Append  (cost=0.41..88.61 rows=20 width=536) (actual time=158.361..321.966 rows=2 loops=1)
              Buffers: shared hit=44390
              ->  Limit  (cost=0.41..43.92 rows=10 width=27) (actual time=158.358..206.304 rows=2 loops=1)
"                    Output: e.id, e.full_name"
                    Buffers: shared hit=30910
                    ->  Index Scan using users_full_name_idx on master.users e  (cost=0.41..209.25 rows=48 width=27) (actual time=158.355..206.290 rows=2 loops=1)
"                          Output: e.id, e.full_name"
                          Index Cond: ((e.full_name)::text % 'some name here'::text)
                          Buffers: shared hit=30910
              ->  Limit  (cost=0.41..44.39 rows=10 width=37) (actual time=115.641..115.643 rows=0 loops=1)
"                    Output: a.entry_id, a.full_name"
                    Buffers: shared hit=13480
                    ->  Index Scan using user_aliases_full_name_idx on master.user_aliases a  (cost=0.41..92.77 rows=21 width=37) (actual time=115.633..115.633 rows=0 loops=1)
"                          Output: a.entry_id, a.full_name"
                          Index Cond: ((a.full_name)::text % 'some name here'::text)
                          Buffers: shared hit=13480
"Settings: effective_cache_size = '16GB', search_path = 'master', work_mem = '32MB'"
Planning Time: 1.210 ms
Execution Time: 322.081 ms

Now it is not the ideal query, since it LIMITs arbitrary records. When I update the query to:

SELECT id
FROM ((SELECT e.id,
              e.full_name
       FROM users e
       WHERE e.full_name % 'some name here'
         AND e.deleted_at IS NULL
       ORDER BY e.full_name  'some name here'
       LIMIT 10)
      UNION
      (SELECT a.entry_id AS id,
              a.full_name
       FROM user_aliases a
       WHERE a.full_name % 'some name here'
         AND a.deleted_at IS NULL
       ORDER BY a.full_name  'some name here'
       LIMIT 10)) filter_table
LIMIT 10

it now performs as it should, but the average execution time rises to ~700ms with the following plan:

Limit  (cost=88.81..89.01 rows=10 width=536) (actual time=324.831..324.839 rows=2 loops=1)
"  Output: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
  Buffers: shared hit=44390
  ->  HashAggregate  (cost=88.81..89.01 rows=20 width=536) (actual time=324.827..324.834 rows=2 loops=1)
"        Output: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
"        Group Key: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
        Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=44390
        ->  Append  (cost=0.41..88.71 rows=20 width=536) (actual time=232.209..324.809 rows=2 loops=1)
              Buffers: shared hit=44390
"              ->  Subquery Scan on ""*SELECT* 1""  (cost=0.41..44.07 rows=10 width=27) (actual time=232.207..232.219 rows=2 loops=1)"
"                    Output: ""*SELECT* 1"".id, ""*SELECT* 1"".full_name"
                    Buffers: shared hit=30910
                    ->  Limit  (cost=0.41..43.97 rows=10 width=31) (actual time=232.204..232.214 rows=2 loops=1)
"                          Output: e.id, e.full_name, (((e.full_name)::text  'some name here'::text))"
                          Buffers: shared hit=30910
                          ->  Index Scan using users_full_name_idx on master.users e  (cost=0.41..209.49 rows=48 width=31) (actual time=232.201..232.209 rows=2 loops=1)
"                                Output: e.id, e.full_name, ((e.full_name)::text  'some name here'::text)"
                                Index Cond: ((e.full_name)::text % 'some name here'::text)
                                Order By: ((e.full_name)::text  'some name here'::text)
                                Buffers: shared hit=30910
"              ->  Subquery Scan on ""*SELECT* 2""  (cost=0.41..44.54 rows=10 width=37) (actual time=92.579..92.581 rows=0 loops=1)"
"                    Output: ""*SELECT* 2"".id, ""*SELECT* 2"".full_name"
                    Buffers: shared hit=13480
                    ->  Limit  (cost=0.41..44.44 rows=10 width=41) (actual time=92.577..92.577 rows=0 loops=1)
"                          Output: a.entry_id, a.full_name, (((a.full_name)::text  'some name here'::text))"
                          Buffers: shared hit=13480
                          ->  Index Scan using user_aliases_full_name_idx on master.user_aliases a  (cost=0.41..92.88 rows=21 width=41) (actual time=92.572..92.573 rows=0 loops=1)
"                                Output: a.entry_id, a.full_name, ((a.full_name)::text  'some name here'::text)"
                                Index Cond: ((a.full_name)::text % 'some name here'::text)
                                Order By: ((a.full_name)::text  'some name here'::text)
                                Buffers: shared hit=13480
"Settings: effective_cache_size = '16GB', search_path = 'master', work_mem = '32MB'"
Planning Time: 1.115 ms
Execution Time: 324.906 ms

How do I get the most similar records, without generating the extra execution time overhead?

Asked by Hasan Can Saral (175 rep)

Jun 11, 2025, 10:41 AM

How can I speed up GIST index query for most similar n items?

Related Questions