Terminal: Help understanding behavior with UTF-8 text

I am trying to understand the following behavior I am observing on my Ubuntu system. Consider the following two files:

$ hexdump -C 1.txt
00000000  d9 82 d8 a8 d8 a7 d9 86  d9 8a 5e d9 84 d9 86 d8  |..........^.....|
00000010  b2 d8 a7 d8 b1 5d 31                              |.....]1|
00000017

and

$ hexdump -C 2.txt
00000000  d9 82 d8 a8 d8 a7 d9 86  d9 8a 5e d9 84 d9 86 d8  |..........^.....|
00000010  b2 d8 a7 d8 b1 5d 20                              |.....] |
00000017

The two files differ only in their last byte (0x31, the digit 1, versus 0x20, a space), as cmp confirms:

$ cmp 1.txt 2.txt
1.txt 2.txt differ: byte 23, line 1
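
The same check can be reproduced without the files. This is a minimal sketch that rebuilds both byte sequences from the hexdumps above and lists every differing offset (1-based, like cmp). The names `file1`, `file2` and `diffs` are only illustrative.

```python
# Bytes copied from the two hexdumps above.
COMMON = "d9 82 d8 a8 d8 a7 d9 86 d9 8a 5e d9 84 d9 86 d8 b2 d8 a7 d8 b1 5d"
file1 = bytes.fromhex(COMMON + " 31")  # 1.txt ends with the digit '1'
file2 = bytes.fromhex(COMMON + " 20")  # 2.txt ends with a space

# 1-based offsets of every byte that differs, as cmp would number them.
diffs = [i + 1 for i, (a, b) in enumerate(zip(file1, file2)) if a != b]

print(len(file1), len(file2), diffs)
```

Unlike cmp, which stops at the first mismatch, this lists all of them, so it shows directly that byte 23 is the only difference.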

However, here is what I see in my terminal:

$ echo $LANG
C.UTF-8
$ cat 1.txt
قباني^لنزار]1
$ cat 2.txt
قباني^لنزار]

I really do not understand this behavior. I see neither an ALM (ARABIC LETTER MARK) nor an RLM (RIGHT-TO-LEFT MARK) Unicode character in the UTF-8 stream. For reference:

* ALM (U+061C) in UTF-8 is d89c
* RLM (U+200F) in UTF-8 is e2808f

Could someone explain the behavior I am seeing? For reference, here is my environment:

$ head -3 /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"

and

$ echo $TERM
xterm-256color
$ echo $SHELL
/bin/bash
$ bash --version
GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)
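
To double-check the claim that no ALM or RLM appears in the stream, and to see which Unicode bidirectional class each character carries, something like the following can be used. It is only an inspection aid built on Python's standard `unicodedata` module, not an explanation of what the terminal does. The helper name `bidi_classes` is made up for this sketch.

```python
import unicodedata

# Bytes copied from the hexdumps of 1.txt and 2.txt above.
COMMON = "d9 82 d8 a8 d8 a7 d9 86 d9 8a 5e d9 84 d9 86 d8 b2 d8 a7 d8 b1 5d"
text1 = bytes.fromhex(COMMON + " 31").decode("utf-8")
text2 = bytes.fromhex(COMMON + " 20").decode("utf-8")

ALM = "\u061c"  # ARABIC LETTER MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

def bidi_classes(text):
    """Return (code point, bidi class) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]

# Confirm the UTF-8 encodings quoted in the question.
print(ALM.encode("utf-8").hex(), RLM.encode("utf-8").hex())

# Confirm that neither mark occurs in either file.
print(any(mark in t for mark in (ALM, RLM) for t in (text1, text2)))

# Show the bidi class of every character in both files.
for name, text in (("1.txt", text1), ("2.txt", text2)):
    print(name, bidi_classes(text))
```

The Arabic letters are reported as class AL, the `^` and `]` as ON (other neutral), the trailing `1` as EN (European number) and the trailing space as WS (whitespace). So the two files differ not only in one byte but also in the bidi class of their last character, which may matter to anything that applies the Unicode Bidirectional Algorithm when displaying the line.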

Asked by malat (3429 rep)

Feb 20, 2024, 02:57 PM
Last activity: Feb 20, 2024, 04:40 PM
