
Why is my Mac install of Postgres not tokenising Thai words?

On my Mac install of PG:
=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols   |  สวัสดี | {}             |              |
(2 rows)
On my Linux install of PG:
=# select to_tsvector('english', 'abcd สวัสดี');
    to_tsvector
-------------------
 'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII   | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |       | {}             |              |
 word      | Word, all letters | สวัสดี  | {english_stem} | english_stem | {สวัสดี}
(3 rows)
So the two installs are clearly tokenising the same input differently. My question is: how do I find out what differs, and how do I make my Mac install of PG behave like the Linux one? On both installs:
# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)
So somehow the Mac install thinks that Thai letters are spaces. How do I debug this and fix the "Space symbols" classification here? Interestingly, this install works with Armenian but falls over when we reach Hebrew:
=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)
The only significant difference I can see is that one build is compiled with clang and the other with gcc:
PostgreSQL 11.2 on x86_64-apple-darwin18.2.0, compiled by Apple LLVM version 10.0.0 (clang-1000.11.45.5), 64-bit

vs.

 PostgreSQL 11.2 (Ubuntu 11.2-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit
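
One way to narrow this down is to run the same three queries on both installs and compare the results. This is only a sketch: it assumes that the regex class `[[:alpha:]]` is answered by the same locale data that the default text search parser consults, which I have not confirmed. The sample letters are picked arbitrarily from the scripts tested above.

```sql
-- Confirm the encoding and ctype the current database was created with.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- Probe one letter from each script with POSIX character classes.
-- Assumption: these classes reflect the same locale data the parser uses.
SELECT ch,
       ch ~ '^[[:alpha:]]$' AS is_alpha,
       ch ~ '^[[:space:]]$' AS is_space
FROM (VALUES ('a'), ('Ե'), ('א'), ('ส')) AS t(ch);

-- Show the raw token types from the default parser, before any dictionary runs.
SELECT p.tokid, tt.alias, p.token
FROM ts_parse('default', 'hello สวัสดี') AS p
JOIN ts_token_type('default') AS tt ON tt.tokid = p.tokid;
```

If the second query disagrees between the two machines for the Hebrew and Thai letters, that would suggest the difference comes from the operating system's locale data rather than from the text search configuration.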
Asked by Sam Saffron (1114 rep)
Feb 26, 2019, 09:18 PM
Last activity: Jul 4, 2025, 12:03 AM