ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36
1 vote · 3 answers · 176 views
When running a `\copy` (via either pgAdmin or `aws_s3.table_import_from_s3`) of a 1.6GB file into an AWS Aurora PostgreSQL-compatible database, I'm getting the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36
CONTEXT: COPY staging, line 99779: "L24000403170365 ACTIVEZONE LLC ..."
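For reference, here is the kind of helper I use to eyeball the raw bytes of the line COPY complains about (a minimal sketch; it assumes the local file path and that the error's line number is 1-based):

```python
def dump_line(path, target):
    """Return the raw bytes of 1-based line `target` from `path`,
    or None if the file has fewer lines."""
    with open(path, 'rb') as f:
        for lineno, raw in enumerate(f, start=1):
            if lineno == target:
                return raw
    return None

# e.g. dump_line('C:/data.txt', 99779).hex(' ') to inspect every byte
```

Printing the result with `.hex(' ')` makes any non-ASCII byte stand out immediately.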
EDIT: Here's what I could pull for table definition (but let me know if you want more):
| column_name | data_type | character_maximum_length | is_nullable | column_default |
| ----------- | --------- | ------------------------ | ----------- | -------------- |
| raw | text | [null] | YES | [null] |
EDIT: I also tried changing the column to `bytea`, with no effect.
The source is supposed to be ASCII, but I get the same error with explicit encodings like `utf8`, `latin1`, `win1251`, and `win1252`.
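To double-check the ASCII claim directly, a quick sketch that scans for the first byte above 0x7F (the chunk size is arbitrary):

```python
def first_non_ascii(path, chunk=1 << 20):
    """Return (offset, byte) of the first byte > 0x7F in the file,
    or None if the file is pure ASCII."""
    with open(path, 'rb') as f:
        offset = 0
        while True:
            block = f.read(chunk)
            if not block:          # EOF reached without finding one
                return None
            for i, b in enumerate(block):
                if b > 0x7F:
                    return offset + i, b
            offset += len(block)
```

A pure-ASCII file returns `None`; otherwise the offset pinpoints exactly where the file stops being ASCII.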
EDIT: As requested in a reply, here's more information about the import commands. In pgAdmin 4, I'm right-click importing into the table, which runs the following under the covers:
--command " "\\copy public.staging (\"raw\") FROM 'C:/data.txt' DELIMITER '|' ENCODING 'UTF8';""
I also use pgadmin4 to trigger the s3 table import by calling the query:
SELECT aws_s3.table_import_from_s3(
'staging',
'',
'(DELIMITER ''|'')',
aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
Under the covers, `table_import_from_s3` calls the command:
copy staging from '/rdsdbdata/extensions/aws_s3/{{internal filename}}' with (DELIMITER '|')
The usual answer to similar questions is to clean up the source data, so I pulled up Python and tried to find the offending character. I couldn't find any evidence of an unusual character at or around the referenced line. For the sake of argument, I believe the following will scan the entire file (results shown inline):
>>> def charinfile(filename, bytechar):
... with open(filename, 'rb') as file:
... byteline = file.readline()
... while byteline: # readline returns empty bytes (b'') at EOF
... if byteline.find(bytechar) != -1:
... print("found!")
... return byteline
... byteline = file.readline()
... else:
... print("not found")
...
>>> charinfile(filename, b'\xdc')
not found
>>> charinfile(filename, b'\xdc\x36') # the exact two-byte sequence from the error
not found
>>> charinfile(filename, b'6') # make sure the code is working
found!
I've also tried versions using strings instead of bytes, with the same results. I can confirm that there are no blank lines before EOF (I've used line counters to verify that I reach ~1M rows).
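For completeness, a chunked whole-file search (a sketch; the chunk size is arbitrary) removes any dependence on line splitting entirely, since it overlaps chunks so a match can never fall across a read boundary:

```python
def find_bytes(path, needle, chunk=1 << 20):
    """Scan a file for `needle` without splitting on newlines.

    Carries the last len(needle) - 1 bytes of each chunk into the next
    read, so a sequence straddling a chunk boundary is still found.
    Returns the byte offset of the first match, or -1.
    """
    with open(path, 'rb') as f:
        offset = 0   # file position of the start of the current block
        tail = b''   # carry-over from the previous block
        while True:
            block = f.read(chunk)
            if not block:
                return -1
            buf = tail + block
            pos = buf.find(needle)
            if pos != -1:
                # buf[0] sits at file offset (offset - len(tail))
                return offset - len(tail) + pos
            tail = buf[-(len(needle) - 1):] if len(needle) > 1 else b''
            offset += len(block)
```

Calling `find_bytes(filename, b'\xdc\x36')` should give the same verdict as the line-based scan, just without any assumptions about line structure.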
What am I missing?
Asked by claytond
(123 rep)
Mar 12, 2025, 06:42 PM
Last activity: Mar 24, 2025, 05:13 PM