Convert between Unicode Normalization Forms on the unix command-line
In Unicode, some character combinations have more than one representation.
For example, the character *ä* can be represented as
* "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
* "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).
According to the Unicode standard, the two representations are equivalent but in different "normalization forms"; see UAX #15: Unicode Normalization Forms.
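To make the example concrete, here is a minimal sketch in Python that shows both representations and converts between them with the standard library's `unicodedata.normalize`. The helper names `utf8_hex` and `convert` are made up for illustration, and this is only meant to demonstrate the equivalence, not to suggest a particular command-line tool.

```python
import unicodedata

# The two representations of "ä" from the example above.
composed = "\u00e4"      # single codepoint U+00E4
decomposed = "a\u0308"   # U+0061 followed by combining diaeresis U+0308


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


def convert(text, form="NFC"):
    """Normalize text to the given form (NFC, NFD, NFKC or NFKD)."""
    return unicodedata.normalize(form, text)


print(utf8_hex(composed))                       # c3 a4
print(utf8_hex(decomposed))                     # 61 cc 88
print(convert(decomposed, "NFC") == composed)   # True
print(convert(composed, "NFD") == decomposed)   # True
```

The two strings compare unequal as they stand, and become equal only after both are brought into the same normalization form.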
The Unix toolbox has all kinds of text transformation tools; *sed*, *tr*, *iconv*, and Perl come to mind. How can I do quick and easy NF conversion on the command line?
Asked by glts
(602 rep)
Sep 10, 2013, 06:47 PM
Last activity: Mar 22, 2025, 02:58 PM