What is the "length" of a string in Bourne-compatible shells' `${#string}`?
Arising from [this](https://unix.stackexchange.com/questions/685602/count-bytes-of-filename/685603?noredirect=1#comment1295723_685603) discussion:
When I have (zsh 5.8, bash 5.1.0)
var="ASCII"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
the answer is simple: these are 5 characters, occupying five bytes.
Now, `var=Müller` yields
Müller has the length 6, and is 7 bytes long
Which suggests the `${#}` operator counts code points, not bytes. This is a bit unclear [in POSIX](https://pubs.opengroup.org/onlinepubs/9699919799.2016edition/utilities/V3_chap02.html#tag_18_06_02), where they say it counts "characters". This would be clearer if characters in POSIX C weren't normally octets.
Anyway: nice! That seems sensible, given that `LANG=en_US.utf8`.
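To make the byte-versus-character distinction concrete, here is a minimal sketch, assuming a shell whose `${#var}` follows the active locale (as the output above suggests for bash and zsh). The helper name `byte_len` is made up for illustration; it forces the C locale inside a subshell so that the length is counted in bytes.

```shell
# Hypothetical helper: length of $1 in bytes. The function body is a
# subshell, so the caller's locale settings are left untouched.
byte_len() (
    LC_ALL=C
    printf '%s\n' "${#1}"
)

# Build "Müller" from explicit UTF-8 bytes (ü = 0xC3 0xBC, octal 303 274)
# so the example does not depend on the encoding of the script file.
var=$(printf 'M\303\274ller')

echo "length in current locale: ${#var}"    # depends on shell and locale
echo "length in bytes: $(byte_len "$var")"  # 7
```

The first number depends on the shell and on `LC_ALL`/`LC_CTYPE`/`LANG`; the second should not.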
Now,
var='🧜🏿♀️'
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
🧜🏿♀️ has the length 5, and is 17 bytes long
Soooo, we decompose "Mermaid of dark skin color" into the Unicode code points:
1. Merperson
2. Dark skin tone
3. Zero-Width Joiner
4. Female
5. Variation Selector-16: render the previous character as emoji
Fine, so we're really counting Unicode codepoints!
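The five-versus-seventeen figure can be cross-checked without relying on the shell's own `${#}`. The following is a sketch under the assumption that the input is valid UTF-8: every code point contributes exactly one byte that is not a continuation byte (`0x80` to `0xBF`), so deleting continuation bytes and counting what remains gives the code point count. The function names are hypothetical.

```shell
# Hypothetical helpers; both assume valid UTF-8 input.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}
codepoint_count() {
    # Drop UTF-8 continuation bytes (octal 200-277), count the rest.
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# The sequence listed above, spelled out as UTF-8 bytes in octal:
# U+1F9DC merperson, U+1F3FF dark skin tone, U+200D zero-width joiner,
# U+2640 female sign, U+FE0F variation selector-16
var=$(printf '\360\237\247\234\360\237\217\277\342\200\215\342\231\200\357\270\217')

echo "bytes: $(byte_count "$var")"             # 4+4+3+3+3 = 17
echo "code points: $(codepoint_count "$var")"  # 5
```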
var="e\xcc\x81"
echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
(of course, my console font decided that the `´` combines with the following space, not the preceding `e`. The latter would be correct. But let's leave my rage about that for some other time.)
Um, a slight "wat" is in order here.
> printf "e\xcc\x81"|wc -c
3
> printf "%s" "${var}" |wc -c
9
> echo -n ${var} |wc -c
3
> echo "${var} has the length ${#var}, and is $(printf "%s" "$var"| wc -c) bytes long"
é has the length 9, and is 9 bytes long
> printf "%s" "${var}" |xxd
00000000: 655c 7863 635c 7838 31 e\xcc\x81
Here's where I give up.
`echo $var`, `echo ${var}` and `echo "${var}"` all "correctly" emit three bytes. However, `echo ${#var}` tells me it's 9 characters.
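One reading consistent with the `xxd` dump above: inside double quotes, `\x` is not an escape, so the variable holds the nine literal characters `e\xcc\x81`, and the three-byte result only appears when something expands the escapes at output time (a `printf` format string, or an `echo` that interprets backslashes, whose behavior varies between shells). A minimal sketch comparing the two cases, with a hypothetical `byte_count` helper and POSIX octal escapes in the `printf` format to produce the real bytes:

```shell
# Hypothetical helper: number of bytes printf '%s' emits for $1.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Double quotes keep the backslashes: 9 literal characters.
literal="e\xcc\x81"

# Let printf expand the escapes first, then store the resulting bytes:
# 'e' followed by U+0301 (combining acute accent, UTF-8 0xCC 0x81).
real=$(printf 'e\314\201')

echo "literal: $(byte_count "$literal") bytes"  # 9
echo "real:    $(byte_count "$real") bytes"     # 3
```

With `real`, a multibyte-aware shell in a UTF-8 locale would be expected to report `${#real}` as 2 (the base letter plus the combining mark), though that again depends on shell and locale.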
Where is this documented or standardized, and what are the rules for all this?
Asked by Marcus Müller
(47127 rep)
Jan 9, 2022, 12:36 PM
Last activity: Jan 30, 2023, 04:42 PM