Matching Japanese regex (simple ranges) in bash doesn't work as intended
I am pretty sure my regexes are fine, but they don't behave as expected in bash. I crafted them myself using https://unicode.org/charts/. As you will see, they work properly with awk.
Here are the ranges to spare you the need to check them yourself, especially if you don't know Japanese:
- hiragana [ぁ-ゟ]
- ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ>゙>゚__ゝゞゟ
- katakana [゠-ヿㇰ-ㇿ!-○]
- ゠ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ
- ㇰㇱㇲㇳㇴㇵㇶㇷㇸㇹㇺㇻㇼㇽㇾㇿ
- !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワン゙゚ᄀᄁᆪᄂᆬᆭᄃᄄᄅᆰᆱᆲᆳᆴᆵᄚᄆᄇᄈᄡᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ¢£¬ ̄¦¥₩│←↑→↓■○
I also have a regex to find kanji:
[一-龥]
but this one works as intended in bash.
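The ">>> wrong!" markers below are comments I added to pinpoint where the problems are. Each command is followed by its output; an awk command with no line after it printed nothing.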
[[ "する" =~ [ぁ-ゟ] ]] && echo 'is hiragana' || echo 'is not hiragana'
is hiragana
echo 'する' | awk '/[ぁ-ゟ]/ {print "is hiragana"}'
is hiragana
[[ "スル" =~ [ぁ-ゟ] ]] && echo 'is hiragana' || echo 'is not hiragana'
is hiragana >>> wrong!
echo 'スル' | awk '/[ぁ-ゟ]/ {print "is hiragana"}'
[[ "僕" =~ [ぁ-ゟ] ]] && echo 'is hiragana' || echo 'is not hiragana'
is not hiragana
echo '僕' | awk '/[ぁ-ゟ]/ {print "is hiragana"}'
[[ "する" =~ [゠-ヿㇰ-ㇿ!-○] ]] && echo 'is katakana' || echo 'is not katakana'
is katakana >>> wrong!
echo 'する' | awk '/[゠-ヿㇰ-ㇿ!-○]/ {print "is katakana"}'
[[ "スル" =~ [゠-ヿㇰ-ㇿ!-○] ]] && echo 'is katakana' || echo 'is not katakana'
is katakana
echo 'スル' | awk '/[゠-ヿㇰ-ㇿ!-○]/ {print "is katakana"}'
is katakana
[[ "僕" =~ [゠-ヿㇰ-ㇿ!-○] ]] && echo 'is katakana' || echo 'is not katakana'
is not katakana
echo '僕' | awk '/[゠-ヿㇰ-ㇿ!-○]/ {print "is katakana"}'
It's as if bash considers hiragana and katakana to be equivalent, as though it converts one into the other before matching. Is that what is happening?
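One likely explanation, not confirmed anywhere in the question, is that bash hands the pattern to the C library's regex functions, which may interpret a bracket range such as [ぁ-ゟ] through the current locale's collation order rather than by code point. As a way to test that idea, here is a minimal sketch that sidesteps the locale entirely by matching on the UTF-8 bytes of the string. It assumes the input is valid UTF-8; the function names (to_hex, has_hiragana, has_katakana) are made up for illustration, and the halfwidth/fullwidth range (!-○) is left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch: detect kana by inspecting UTF-8 bytes, so the result does not
# depend on the locale's collation order.
# Assumption: the input is valid UTF-8. All function names are hypothetical.

# Print the bytes of $1 as space-separated lowercase hex, e.g. " e3 81 99 ".
to_hex() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' '
}

# Hiragana U+3041..U+309F  ->  E3 81 81..BF  and  E3 82 80..9F
hiragana_re='e3 (81 (8[1-9a-f]|[9ab][0-9a-f])|82 [89][0-9a-f])'

# Katakana U+30A0..U+30FF  ->  E3 82 A0..BF  and  E3 83 80..BF
# Katakana Phonetic Extensions U+31F0..U+31FF  ->  E3 87 B0..BF
katakana_re='e3 (82 [ab][0-9a-f]|83 [89ab][0-9a-f]|87 b[0-9a-f])'

# Succeed if $1 contains at least one character in the given block(s).
has_hiragana() { [[ "$(to_hex "$1")" =~ $hiragana_re ]]; }
has_katakana() { [[ "$(to_hex "$1")" =~ $katakana_re ]]; }

for word in 'する' 'スル' '僕'; do
  has_hiragana "$word" && h='yes' || h='no'
  has_katakana "$word" && k='yes' || k='no'
  printf '%s: hiragana=%s katakana=%s\n' "$word" "$h" "$k"
done
```

With the three test words from the question, this reports する as hiragana only, スル as katakana only, and 僕 as neither. Because the pattern only contains ASCII hex digits, the comparison no longer involves any Japanese range expression, at the cost of a less readable regex.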
Asked by Some_user
(63 rep)
Apr 15, 2023, 02:25 PM
Last activity: Dec 31, 2023, 12:34 PM