PostgreSQL unaccent and full text search for Arabic/Persian

4 votes
1 answer
538 views
I'm using an application that uses PostgreSQL as its database, and it relies on the unaccent extension to normalize text. I want to improve its search functionality by modifying the unaccent.rules file. I edited /usr/share/postgresql/16/tsearch_data/unaccent.rules and added some rules for the Arabic Unicode block (U+0600 to U+06FF):
۰	0
۱	1
۲	2
۳	3
۴	4
۵	5
۶	6
۷	7
۸	8
۹	9
َ
and it's working fine.
SELECT unaccent('سَلام ۱۳۲');
 unaccent
----------
 سلام 132
(1 row)
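
Bidirectional rendering makes it easy to misread the order of digits in a mixed Arabic/Latin string, so the same kind of check can also be written with explicit code points. This is only a small sketch, assuming the digit rules above use the Extended Arabic-Indic digits (U+06F0 to U+06F9), a UTF8 database, and standard_conforming_strings = on (the default):

```sql
-- U+06F1, U+06F3, U+06F2 are the Extended Arabic-Indic digits one, three, two.
-- U&'...' is PostgreSQL's Unicode-escape string syntax: each backslash is
-- followed by exactly four hex digits.
SELECT unaccent(U&'\06F1\06F3\06F2') AS digits;
```

Written this way, the input no longer depends on how the editor or terminal displays right-to-left text.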
# Problem 1:

The problem is with the Zero Width Non-Joiner (ZWNJ, U+200C). It should be replaced with a space (U+0020):

سَلام‌علیکم -> سلام علیکم

### What I Tried:

I tried these rows, but each one either gives an error or has no effect:

- "‌" " ": invalid syntax: more than two strings in unaccent rule (warning), and it didn't work.
- ‌ " ": invalid syntax: more than two strings in unaccent rule (warning), and it didn't work.
- \u200C \u0020: suggested by ChatGPT, but it didn't work.
- \u200C " ": suggested by ChatGPT, but it didn't work.

> Note.1:
>
> In the first two lines above there is an invisible ZWNJ character, which is shown as <200c> in VIM, but it's not visible in this post.
>
> [Image: vim showing zwnj as 200c]

> Note.2:
>
> I didn't add all these lines at the same time; I tried them one by one.

> Note.3:
>
> There is no other rule for ZWNJ in unaccent.rules.

# Problem 2:

Is there a way to add a new rule file instead of editing the default one? I can't edit the application source code or change its queries.

Would adding something like /usr/share/postgresql/16/tsearch_data/arabic.stop or /usr/share/postgresql/16/tsearch_data/arabic.rules and restarting the service make PostgreSQL pick it up? Is it required to run some query to reload the file? Is it required to change the way the application requests the search?
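
For reference, the unaccent documentation describes switching the rules file of the existing dictionary with ALTER TEXT SEARCH DICTIONARY, instead of editing unaccent.rules in place. The following is a minimal sketch, not a confirmed fix for the ZWNJ rule. It assumes a file named arabic.rules (a hypothetical name) sits in the same tsearch_data directory, that the session role owns the unaccent dictionary, and that the dictionary is reachable through the search_path:

```sql
-- Assumption: arabic.rules is a hypothetical file in the same tsearch_data
-- directory as unaccent.rules. It replaces the default rules rather than
-- extending them, so it must also contain any default rules still needed.

-- Point the dictionary used by unaccent(text) at the new file.
-- RULES takes the base name, without the .rules extension.
ALTER TEXT SEARCH DICTIONARY unaccent (RULES = 'arabic');

-- Check the ZWNJ handling with an unambiguous input: x, U+200C, y.
-- The hex dump makes the invisible character visible:
--   78e2808c79 means the ZWNJ is still present,
--   782079     means it was replaced by a space.
SELECT encode(convert_to(unaccent(U&'x\200Cy'), 'UTF8'), 'hex') AS bytes;
```

According to the PostgreSQL text search documentation, a session reads a dictionary file only once, on first use, and an ALTER TEXT SEARCH DICTIONARY command (even one that re-sets the same value) makes existing sessions pick up the new contents. The one-argument unaccent(text) uses the dictionary named unaccent in the same schema as the function, so the application's queries should not need to change under these assumptions.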
Asked by M.A. Heshmat Khah (145 rep)
Oct 8, 2024, 04:55 PM
Last activity: Oct 8, 2024, 08:10 PM