iconv fails to detect valid utf-8 character as utf-8

5 votes

2 answers

684 views

unicode

My input data is as follows (as generated by hexdump):

    000000f0  69 61 6e e2 80 99 73 20  65 79 65 73 20 61 62 72  |ian...s eyes abr|

When I open this HTML file in Firefox, it displays these characters as:

    ian’s eyes abr

According to the link https://superuser.com/questions/1237545/characters-in-email-displayed-like-e2-80-99 , "E2 80 99 is the sequence of hex values that encode a right single quotation mark (’) in UTF-8".

This website concurs: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128 
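As an independent check of that claim, here is a minimal Python sketch that decodes the bytes from the hexdump line strictly as UTF-8 and looks up the name of the fourth character. The variable names are only illustrative, and the bytes are typed in from the dump rather than read from the real file.

```python
import unicodedata

# Bytes copied by hand from the hexdump line above (offset 0xf0 onwards).
raw = bytes.fromhex("69 61 6e e2 80 99 73 20 65 79 65 73 20 61 62 72")

# The default error handler is "strict", so invalid UTF-8 would raise
# UnicodeDecodeError here instead of returning a string.
text = raw.decode("utf-8")

# The three bytes e2 80 99 decode to a single character at index 3.
quote = text[3]

print(text)
print(f"U+{ord(quote):04X}", unicodedata.name(quote))
```

If the decode succeeds, the bytes are well-formed UTF-8 as far as Python's decoder is concerned, which matches what the two links say.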

When I run this iconv command on the file containing these characters:

    iconv -f UTF-8 -t ISO-8859-15 test_chapter.html > blah.html

I get the output:

    iconv: illegal input sequence at position 243

and the content of "blah.html" is truncated exactly where the apostrophe would be.
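To narrow down which step is failing, the sketch below does two things. It checks that position 243 really is the first byte of the e2 80 99 sequence, and it separates reading the bytes as UTF-8 from writing them as ISO-8859-15. It uses Python's own codecs rather than iconv, so it is only suggestive of what iconv does internally. The variable names are made up for illustration.

```python
# Offset of the hexdump line shown earlier, and the bytes on that line.
line_offset = 0xF0
raw = bytes.fromhex("69 61 6e e2 80 99 73 20 65 79 65 73 20 61 62 72")

# Absolute file position of the first byte of the e2 80 99 sequence.
position = line_offset + raw.index(b"\xe2\x80\x99")

# Step 1: read the bytes as UTF-8 (raises if they are not valid UTF-8).
text = raw.decode("utf-8")

# Step 2: try to write the same text as ISO-8859-15, using Python's codec
# as a stand-in for iconv's target encoding.
failing_char = None
try:
    text.encode("iso8859-15")
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    failing_char = text[exc.start]

print("position:", position)
print("encodable as ISO-8859-15:", encodable)
if failing_char is not None:
    print(f"first unencodable character: U+{ord(failing_char):04X}")
```

The computed position matches the one in the iconv message. In Python the decode step succeeds and the encode step fails, because ISO-8859-15 has no code point for U+2019. If iconv behaves the same way, the message may be about the output side of the conversion and not about the input being invalid UTF-8.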

So, to summarise, the internet says that this is a valid sequence of bytes for UTF-8, but iconv disagrees.

Can anyone please help me understand what is going on? Is this a bug in iconv?

As a side note, when I use this html file with kindlegen to generate an AZW file, the character is not displayed correctly. All the internet can tell me is that I need to convert the file to UTF-8, but as far as I can tell, it already is!

Asked by AlastairG (213 rep)

Jan 6, 2025, 03:43 PM
Last activity: Jan 11, 2025, 12:48 AM
