ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36

1 vote
3 answers
176 views
When running a \copy (either pgadmin or aws_s3.table_import_from_s3) of a 1.6GB file into an AWS Aurora Postgres-compatible database, I'm getting the following error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xdc 0x36
CONTEXT:  COPY staging, line 99779: "L24000403170365 ACTIVEZONE LLC                                                                      ..."
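When COPY reports a context line like this, it can help to pull that exact physical line and print its `repr`, so any non-ASCII bytes show up as `\x` escapes (a sketch; `raw_line` is my own helper, and note that COPY counts logical rows, which only match physical lines when no field contains an embedded newline):

```python
import itertools

def raw_line(path, lineno):
    """Return the raw bytes of physical line `lineno` (1-based), or None."""
    with open(path, "rb") as f:
        return next(itertools.islice(f, lineno - 1, lineno), None)

# Hypothetical usage against the file from the question:
# print(repr(raw_line("C:/data.txt", 99779)))
```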
EDIT: Here's what I could pull for the table definition (but let me know if you want more):

| column_name | data_type | character_maximum_length | is_nullable | column_default |
| ----------- | --------- | ------------------------ | ----------- | -------------- |
| raw         | text      | [null]                   | YES         | [null]         |

EDIT: I also tried changing the column to bytea, with no effect. The source is supposed to be ASCII, but I get the same error with explicit encodings like utf8, latin1, win1251, and win1252.

EDIT: As requested in a reply, here's more information about the import commands. In pgadmin4, I'm right-click importing into the table, which shows the following under the covers:
--command " "\\copy public.staging (\"raw\") FROM 'C:/data.txt' DELIMITER '|' ENCODING 'UTF8';""
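Since the source is supposed to be ASCII, one quick sanity check before any COPY attempt is to count every byte outside the 7-bit range across the whole file (a sketch using only the stdlib; the function name is mine):

```python
from collections import Counter

def non_ascii_byte_counts(path):
    """Map each byte value >= 0x80 to how often it occurs, scanning in chunks."""
    counts = Counter()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):  # 1 MiB chunks keep memory flat on a 1.6GB file
            counts.update(b for b in chunk if b >= 0x80)
    return dict(counts)
```

An empty result would confirm the bytes on disk are pure ASCII, which would point away from the file itself and toward something in the transfer or import path.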
I also use pgadmin4 to trigger the s3 table import by calling the query:
SELECT aws_s3.table_import_from_s3(
   'staging',
   '', 
   '(DELIMITER ''|'')',
   aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
Under the covers, table_import_from_s3 calls the command:
copy staging from '/rdsdbdata/extensions/aws_s3/{{internal filename}}' with (DELIMITER '|')
The answer to similar questions is to clean up the source data, so I pulled up Python and tried to find the offending character. I couldn't find any evidence of an unusual character at or around the referenced line. For the sake of argument, I believe the following will scan the entire file (results shown inline):
>>> def charinfile(filename, bytechar):
...     with open(filename, 'rb') as file:
...         byteline = file.readline()
...         while byteline:  # readline returns empty bytes (b'') at EOF
...             if byteline.find(bytechar) != -1:
...                 print("found!")
...                 return byteline
...             byteline = file.readline()
...         else:  # while-else: runs only when the loop exhausts without returning
...             print("not found")
...
>>> charinfile(filename, b'\xdc')
not found
>>> charinfile(filename, b'\xdc36')  # NB: three bytes (0xdc, '3', '6'); the pair 0xdc 0x36 is b'\xdc\x36', i.e. b'\xdc6'
not found
>>> charinfile(filename, b'6')  # make sure the code is working
found!
I've also tried versions that use strings instead of bytes, with the same results. I can confirm there are no blank lines before EOF (I've used line counters to verify that I reach ~1M rows). What am I missing?
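A variant of the scan above that doesn't depend on guessing the byte sequence is to try decoding each line as UTF-8 and report the first one that fails (a sketch; `first_invalid_utf8_line` is my own name, and keep in mind that COPY's "line 99779" counts logical rows, which can drift from physical line numbers if any row contains an embedded newline):

```python
import io

def first_invalid_utf8_line(stream):
    """Return (1-based line number, raw bytes) of the first line that is
    not valid UTF-8, or None if every line decodes cleanly."""
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return lineno, raw
    return None

# Demo with an in-memory file; line 3 carries the 0xdc 0x36 pair:
data = io.BytesIO(b"ok line 1\nok line 2\nbad \xdc6 line\n")
print(first_invalid_utf8_line(data))  # -> (3, b'bad \xdc6 line\n')
```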
Asked by claytond (123 rep)
Mar 12, 2025, 06:42 PM
Last activity: Mar 24, 2025, 05:13 PM