Sample Header Ad - 728x90

Postgresql strange behavior with non-ascii characters with triagram index present

1 vote
1 answer
263 views
I see some strange behavior when using either gin_trgm_ops or gist_trgm_ops indexes. There seems to be quite a difference in plans when using say ILIKE or ~ and search for ascii phrases vs multi-byte char phrases. As if there is higher cost when the operand is a non-ascii operand. Is what I am seeing expected? What is the reason for it? I tried it on Postgreql 12 and 13 latest of each. Here is a scenario:
CREATE DATABASE postgres
    WITH
    OWNER = postgres
    ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.utf8'
    LC_CTYPE = 'en_US.utf8'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;

-- snip

CREATE TABLE test_table (
    id uuid PRIMARY KEY,
    label varchar
);

-- insert 1m rows

VACUUM ANALYZE test_table;
In the data set I have 10 labels containing 'acl' and 10 containing '定す'. When using GIN index
CREATE INDEX test_table_label_gin_idx
    ON test_table USING gin
    (label gin_trgm_ops);
I see the following.
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;

Limit  (cost=1000.00..16573.18 rows=100 width=52) (actual time=392.153..395.095 rows=10 loops=1)
  ->  Gather  (cost=1000.00..16728.91 rows=101 width=52) (actual time=392.135..394.830 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Parallel Seq Scan on test_table  (cost=0.00..15718.81 rows=42 width=52) (actual time=382.922..388.082 rows=3 loops=3)
              Filter: ((label)::text ~~* '%定す%'::text)
              Rows Removed by Filter: 338417
Planning Time: 0.656 ms
Execution Time: 395.233 ms


EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;

Limit  (cost=28.78..400.51 rows=100 width=52) (actual time=0.072..0.406 rows=10 loops=1)
  ->  Bitmap Heap Scan on test_table  (cost=28.78..404.23 rows=101 width=52) (actual time=0.053..0.197 rows=10 loops=1)
        Recheck Cond: ((label)::text ~~* '%acl%'::text)
        Heap Blocks: exact=10
        ->  Bitmap Index Scan on test_table_label_gin_idx  (cost=0.00..28.76 rows=101 width=0) (actual time=0.025..0.034 rows=10 loops=1)
              Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.231 ms
Execution Time: 0.542 ms
With GIST
DROP INDEX test_table_label_gin_idx;

CREATE INDEX test_table_label_gist_idx
    ON test_table USING gist
    (label gist_trgm_ops);
I see
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;

Limit  (cost=13.19..384.92 rows=100 width=52) (actual time=303.772..1557.498 rows=10 loops=1)
  ->  Bitmap Heap Scan on test_table  (cost=13.19..388.64 rows=101 width=52) (actual time=303.752..1557.286 rows=10 loops=1)
        Recheck Cond: ((label)::text ~~* '%定す%'::text)
        Rows Removed by Index Recheck: 1015250
        Heap Blocks: exact=10431
        ->  Bitmap Index Scan on test_table_label_gist_idx  (cost=0.00..13.17 rows=101 width=0) (actual time=301.046..301.053 rows=1015260 loops=1)
              Index Cond: ((label)::text ~~* '%定す%'::text)
Planning Time: 0.215 ms
Execution Time: 1557.643 ms


EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;

Limit  (cost=13.19..384.92 rows=100 width=52) (actual time=257.385..257.751 rows=10 loops=1)
  ->  Bitmap Heap Scan on test_table  (cost=13.19..388.64 rows=101 width=52) (actual time=257.366..257.551 rows=10 loops=1)
        Recheck Cond: ((label)::text ~~* '%acl%'::text)
        Heap Blocks: exact=10
        ->  Bitmap Index Scan on test_table_label_gist_idx  (cost=0.00..13.17 rows=101 width=0) (actual time=257.319..257.328 rows=10 loops=1)
              Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.377 ms
Execution Time: 257.948 ms
Just changing the the characters of the operand change the plan quite a lot. **Edit**
SELECT show_trgm('定す');

"{0x145ed8,0x6628fa,0x6cb12d}"

SELECT encode('定す', 'escape')

\345\256\232\343\201\231
This issue seems similar https://dba.stackexchange.com/questions/63193/postgresql-not-using-gin-trigram-index-when-performing-non-ascii-like-query
Asked by Pawel Zieminski (125 rep)
Oct 6, 2023, 08:38 PM
Last activity: Oct 7, 2023, 03:06 AM