
What is "length" of a string in Bourne shell compatibles' `${#string}`?

3 votes
1 answer
1058 views
Arising from [this](https://unix.stackexchange.com/questions/685602/count-bytes-of-filename/685603?noredirect=1#comment1295723_685603) discussion: When I have (zsh 5.8, bash 5.1.0)
var="ASCII"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
the answer is simple: these are 5 characters, occupying five bytes. Now, var=Müller yields
Müller has the length 6, and is 7 bytes long
Which suggests the ${#} operator counts codepoints, not bytes. This is a bit unclear [in POSIX](https://pubs.opengroup.org/onlinepubs/9699919799.2016edition/utilities/V3_chap02.html#tag_18_06_02), where they say it counts "characters". This would be clearer if characters in POSIX C weren't normally octets. Anyways: Nice! Kind of good, seeing that LANG=en_US.utf8. Now,
var='🧜🏿‍♀️'
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
🧜🏿‍♀️ has the length 5, and is 17 bytes long
Soooo, we decompose "Mermaid of dark skin color" into the Unicode codepoints
1. Merperson
2. Dark skin tone
3. Zero-Width Joiner
4. Female
5. Print the previous character as emoji

Fine, so we're really counting Unicode codepoints!
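To cross-check the two counts so far without relying on the locale of the running shell, here is a minimal sketch. The helper names `byte_count` and `codepoint_count` are made up for illustration, and the strings are rebuilt from octal escapes on the assumption that they are valid UTF-8. It counts bytes with `wc -c`, and counts codepoints by deleting UTF-8 continuation bytes first.

```shell
# Build the test strings from octal escapes so the script itself stays ASCII.
muller=$(printf 'M\303\274ller')   # "Müller": the u-umlaut is the two bytes C3 BC
# U+1F9DC, U+1F3FF, U+200D, U+2640, U+FE0F encoded as UTF-8
mermaid=$(printf '\360\237\247\234\360\237\217\277\342\200\215\342\231\200\357\270\217')

# Hypothetical helper: number of bytes in "$1".
# The arithmetic expansion strips the padding some wc implementations add.
byte_count() {
    printf '%s\n' "$(( $(printf '%s' "$1" | wc -c) ))"
}

# Hypothetical helper: number of codepoints in "$1", assuming valid UTF-8.
# Continuation bytes are 0x80-0xBF (octal 200-277); deleting them leaves
# exactly one byte per codepoint.
codepoint_count() {
    printf '%s\n' "$(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c) ))"
}

echo "$muller: $(codepoint_count "$muller") codepoints, $(byte_count "$muller") bytes"
echo "$mermaid: $(codepoint_count "$mermaid") codepoints, $(byte_count "$mermaid") bytes"
```

This reproduces the 6/7 and 5/17 pairs from above, which is at least consistent with `${#var}` counting codepoints in a UTF-8 locale.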
var="e\xcc\x81"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
(Of course, my console font decided that the ´ combines with the following space, not the preceding e; the latter would be correct. But let's leave my rage about that for another time.) Um, a slight "wat" is in order here.
> printf "e\xcc\x81"|wc -c
3
> printf "%s" "${var}" |wc -c
9
> echo -n ${var} |wc -c
3
> echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
> printf "%s" "${var}" |xxd
00000000: 655c 7863 635c 7838 31                   e\xcc\x81
Here's where I give up. echo $var, echo ${var} and echo "${var}" all "correctly" emit three bytes. However, echo ${#var} tells me it's 9 characters. Where is this documented/standardized, and what are the rules for all this?
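A small sketch that isolates the discrepancy, offered as an observation rather than an authoritative answer: inside double quotes, a backslash before `x` is kept literally, so the variable holds nine ASCII characters, which matches the xxd dump. The three-byte output would then come from `echo` itself decoding the escapes (zsh's builtin does so by default; this assumes the transcript above was run in zsh). The variable names below are illustrative only.

```shell
# Double quotes keep the backslashes: nine literal ASCII characters.
literal="e\xcc\x81"

# printf's format string does decode escapes; octal is the portable form.
# \314\201 is CC 81, the UTF-8 encoding of U+0301 COMBINING ACUTE ACCENT.
decoded=$(printf 'e\314\201')

# Arithmetic expansion strips the padding some wc implementations add.
literal_bytes=$(( $(printf '%s' "$literal" | wc -c) ))
decoded_bytes=$(( $(printf '%s' "$decoded" | wc -c) ))

echo "literal: ${literal_bytes} bytes, decoded: ${decoded_bytes} bytes"
```

With the decoded variant, the byte count is 3, and `${#decoded}` should report 2 in a UTF-8 locale, since the combining accent is its own codepoint.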
Asked by Marcus Müller (47127 rep)
Jan 9, 2022, 12:36 PM
Last activity: Jan 30, 2023, 04:42 PM