Convert between Unicode Normalization Forms on the unix command-line

42 votes
8 answers
11627 views
In Unicode, some character combinations have more than one representation. For example, the character *ä* can be represented as

* "ä", that is, the single codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
* "ä", that is, the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

According to the Unicode standard, the two representations are equivalent but in different "normalization forms"; see UAX #15: Unicode Normalization Forms. The unix toolbox has all kinds of text transformation tools; *sed*, *tr*, *iconv*, and Perl come to mind. How can I do quick and easy NF conversion on the command-line?
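To make the two representations concrete, here is a minimal sketch in Python using the standard library's `unicodedata.normalize`. It is only an illustration of the byte sequences described in the question, not a proposed answer; the helper names `utf8_hex` and `to_form` are made up for this example.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


def to_form(text: str, form: str = "NFC") -> str:
    """Convert text to the given normalization form (NFC, NFD, NFKC or NFKD)."""
    return unicodedata.normalize(form, text)


composed = "\u00e4"     # single codepoint U+00E4
decomposed = "a\u0308"  # U+0061 followed by combining diaeresis U+0308

print(utf8_hex(composed))    # c3 a4
print(utf8_hex(decomposed))  # 61 cc 88

# Round trip between the composed (NFC) and decomposed (NFD) forms.
print(to_form(decomposed, "NFC") == composed)  # True
print(to_form(composed, "NFD") == decomposed)  # True
```

The two strings compare unequal as raw codepoint sequences, which is why a conversion step is needed before byte-wise tools such as *sed* or *tr* treat them alike.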
Asked by glts (602 rep)
Sep 10, 2013, 06:47 PM
Last activity: Mar 22, 2025, 02:58 PM