Postgresql strange behavior with non-ascii characters with triagram index present
1
vote
1
answer
263
views
I see some strange behavior when using either
gin_trgm_ops
or gist_trgm_ops
indexes. There seems to be quite a difference in plans when using say ILIKE
or ~
and search for ascii phrases vs multi-byte char phrases. As if there is higher cost when the operand is a non-ascii operand.
Is what I am seeing expected? What is the reason for it?
I tried it on Postgreql 12 and 13 latest of each.
Here is a scenario:
CREATE DATABASE postgres
WITH
OWNER = postgres
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.utf8'
LC_CTYPE = 'en_US.utf8'
TABLESPACE = pg_default
CONNECTION LIMIT = -1
IS_TEMPLATE = False;
-- snip
CREATE TABLE test_table (
id uuid PRIMARY KEY,
label varchar
);
-- insert 1m rows
VACUUM ANALYZE test_table;
In the data set I have 10 labels containing 'acl'
and 10 containing '定す'
.
When using GIN index
CREATE INDEX test_table_label_gin_idx
ON test_table USING gin
(label gin_trgm_ops);
I see the following.
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;
Limit (cost=1000.00..16573.18 rows=100 width=52) (actual time=392.153..395.095 rows=10 loops=1)
-> Gather (cost=1000.00..16728.91 rows=101 width=52) (actual time=392.135..394.830 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test_table (cost=0.00..15718.81 rows=42 width=52) (actual time=382.922..388.082 rows=3 loops=3)
Filter: ((label)::text ~~* '%定す%'::text)
Rows Removed by Filter: 338417
Planning Time: 0.656 ms
Execution Time: 395.233 ms
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;
Limit (cost=28.78..400.51 rows=100 width=52) (actual time=0.072..0.406 rows=10 loops=1)
-> Bitmap Heap Scan on test_table (cost=28.78..404.23 rows=101 width=52) (actual time=0.053..0.197 rows=10 loops=1)
Recheck Cond: ((label)::text ~~* '%acl%'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on test_table_label_gin_idx (cost=0.00..28.76 rows=101 width=0) (actual time=0.025..0.034 rows=10 loops=1)
Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.231 ms
Execution Time: 0.542 ms
With GIST
DROP INDEX test_table_label_gin_idx;
CREATE INDEX test_table_label_gist_idx
ON test_table USING gist
(label gist_trgm_ops);
I see
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;
Limit (cost=13.19..384.92 rows=100 width=52) (actual time=303.772..1557.498 rows=10 loops=1)
-> Bitmap Heap Scan on test_table (cost=13.19..388.64 rows=101 width=52) (actual time=303.752..1557.286 rows=10 loops=1)
Recheck Cond: ((label)::text ~~* '%定す%'::text)
Rows Removed by Index Recheck: 1015250
Heap Blocks: exact=10431
-> Bitmap Index Scan on test_table_label_gist_idx (cost=0.00..13.17 rows=101 width=0) (actual time=301.046..301.053 rows=1015260 loops=1)
Index Cond: ((label)::text ~~* '%定す%'::text)
Planning Time: 0.215 ms
Execution Time: 1557.643 ms
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;
Limit (cost=13.19..384.92 rows=100 width=52) (actual time=257.385..257.751 rows=10 loops=1)
-> Bitmap Heap Scan on test_table (cost=13.19..388.64 rows=101 width=52) (actual time=257.366..257.551 rows=10 loops=1)
Recheck Cond: ((label)::text ~~* '%acl%'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on test_table_label_gist_idx (cost=0.00..13.17 rows=101 width=0) (actual time=257.319..257.328 rows=10 loops=1)
Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.377 ms
Execution Time: 257.948 ms
Just changing the the characters of the operand change the plan quite a lot.
**Edit**
SELECT show_trgm('定す');
"{0x145ed8,0x6628fa,0x6cb12d}"
SELECT encode('定す', 'escape')
\345\256\232\343\201\231
This issue seems similar https://dba.stackexchange.com/questions/63193/postgresql-not-using-gin-trigram-index-when-performing-non-ascii-like-query
Asked by Pawel Zieminski
(125 rep)
Oct 6, 2023, 08:38 PM
Last activity: Oct 7, 2023, 03:06 AM
Last activity: Oct 7, 2023, 03:06 AM