PostgreSQL unaccent and full text search for Arabic/Persian
I'm using an application that uses PostgreSQL as its database, and it uses the unaccent extension to normalize text.
> Note.2:
>
> I didn't add all of the rules listed under "What I Tried" at the same time; I tried them one by one.
> Note.3:
>
> There is no other rule for ZWNJ in the unaccent.rules file.
I want to improve its search functionality by modifying the unaccent.rules
file.
I edited /usr/share/postgresql/16/tsearch_data/unaccent.rules and added some rules for the Arabic Unicode block (U+0600 to U+06FF):
۰ 0
۱ 1
۲ 2
۳ 3
۴ 4
۵ 5
۶ 6
۷ 7
۸ 8
۹ 9
َ
and it's working fine.
SELECT unaccent('سَلام ۱۳۲');
unaccent
----------
سلام 132
(1 row)
# Problem 1:
The problem is with the Zero Width Non-Joiner (ZWNJ, U+200C). It should be replaced with a space (U+0020).
سَلامعلیکم
-> سلام علیکم
### What I Tried:
I tried these rows, but each one either did not work or produced an error:
- "‌" " ": invalid syntax: more than two strings in unaccent rule (warning), and it didn't work.
- ‌ " ": invalid syntax: more than two strings in unaccent rule (warning), and it didn't work.
- \u200C \u0020: suggested by ChatGPT, but it didn't work.
- \u200C " ": suggested by ChatGPT, but it didn't work.
> Note.1:
>
> In the first two lines above, there is an invisible ZWNJ character, which is shown as <200c> in Vim, but it's not visible in this post.
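Because ZWNJ is invisible, it can help to confirm what a string actually contains and what the desired result looks like before touching the rules file. The following is a minimal sketch, assuming a UTF8 database, where chr(8204) is U+200C. The replace() call only illustrates the target output in plain SQL; it is not a rule for unaccent.rules and it would require changing the query.

```sql
-- List each character with its code point (hex); ZWNJ appears as 200c.
SELECT pos, to_hex(ascii(ch)) AS code_point
FROM regexp_split_to_table('سلام' || chr(8204) || 'علیکم', '')
     WITH ORDINALITY AS t(ch, pos)
ORDER BY pos;

-- Desired normalization expressed in SQL instead of the rules file:
-- let unaccent apply its rules first, then turn ZWNJ into a space.
SELECT replace(unaccent('سَلام' || chr(8204) || 'علیکم'), chr(8204), ' ');
```

As far as I can tell, quoted targets such as " " in rule files are only documented from PostgreSQL 17 onward, which would be consistent with the "more than two strings" warning on version 16. That is worth checking against the release notes rather than taking as given.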
# Problem 2:
Is there a way to add a new rule file instead of editing the default one? I can't edit the application source code or change its queries.
Would adding a file such as /usr/share/postgresql/16/tsearch_data/arabic.stop or /usr/share/postgresql/16/tsearch_data/arabic.rules and restarting the service make PostgreSQL pick it up?
Is it required to run some query to reload the file?
Is it required to change the way search is requested from the application?
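For reference, a minimal sketch of how a separate rules file is usually wired in, based on the documented RULES option of the unaccent dictionary. It assumes a file named arabic.rules (the name suggested above) already exists in the tsearch_data directory and contains every rule that is needed, since a dictionary reads only one rules file. The dictionary name arabic_unaccent is hypothetical.

```sql
-- Option A: point the existing "unaccent" dictionary at another rules file.
-- The application keeps calling unaccent(text) unchanged.
-- RULES takes the base file name only, without the .rules suffix.
ALTER TEXT SEARCH DICTIONARY unaccent (RULES = 'arabic');

-- Option B: leave the default dictionary alone and create a separate one
-- (arabic_unaccent is a made-up name). Queries must then name it explicitly.
CREATE TEXT SEARCH DICTIONARY arabic_unaccent (
    TEMPLATE = unaccent,
    RULES = 'arabic'
);
SELECT unaccent('arabic_unaccent', 'سَلام ۱۳۲');

-- After editing a rules file that sessions have already loaded, re-issuing
-- ALTER with a no-op option makes existing sessions re-read the file.
ALTER TEXT SEARCH DICTIONARY unaccent (dummy);
```

Option A is the one that fits the "can't change the queries" constraint, because the one-argument unaccent() keeps using the dictionary named unaccent. A .stop file is a stop-word list read by other dictionary templates, so it would probably not affect unaccent at all.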
Asked by M.A. Heshmat Khah
(145 rep)
Oct 8, 2024, 04:55 PM
Last activity: Oct 8, 2024, 08:10 PM