UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte #23

anjanesh · 2023-03-23T05:10:07Z

After writing the table to CSV, I tried to open the generated CSV on my Windows 11 machine and found that it starts with the byte 0xff:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

So I had to open it as UTF-16:

with open('tables-imported.csv', 'r', encoding="utf-16") as f:
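A byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), which would explain why opening the file as UTF-16 works. Rather than hard-coding the encoding, one option is to sniff the BOM first. The following is a minimal sketch, not part of mysqldump_to_csv.py; the helper name `sniff_encoding` and the UTF-8 fallback are assumptions made for illustration.

```python
import codecs


def sniff_encoding(path, default="utf-8"):
    """Guess a text encoding from the file's byte order mark (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: fall back to the caller's default (an assumption, not a detection).
    return default
```

Usage would then look like `open(path, 'r', encoding=sniff_encoding(path))`. A file without a BOM is not detected at all, so the default still has to be a sensible guess.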

cirosantilli · 2023-10-10T13:14:56Z

Slightly more precise repro:

python mysqldump-to-csv/mysqldump_to_csv.py <enwiki-latest-categorylinks.sql

blows up with:

Traceback (most recent call last):
  File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 114, in <module>
    main()
  File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 104, in main
    for line in fileinput.input():
  File "/usr/lib/python3.11/fileinput.py", line 251, in __next__
    line = self._readline()
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/fileinput.py", line 372, in _readline
    return self._readline()
           ^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1980: invalid continuation byte

The likely reason is that the file contains binary data in the third column; it's a dumpster fire:

INSERT INTO `categorylinks` VALUES (10,'Redirects_from_moves','*..2NN:,@2.FBHRP:D6^A^W^Aܽ<DC>^L','2014-10-26 04:50:23','','uca-default-u-kn','page'),

enwiki-latest-page.sql still works.

cirosantilli · 2023-10-10T13:25:03Z

Not entirely sure why, but the solution at #17 worked for me. It likely just treats things more byte-wise; it could be buggy on print, but at least it does not blow up.
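For readers who want a byte-wise workaround without digging through #17, here is one possible sketch. It is not the change from #17, whose contents are not shown in this thread. It assumes the input is mostly UTF-8 with stray binary bytes, reads raw bytes, and decodes each line leniently; the name `lenient_lines` is hypothetical.

```python
import sys


def lenient_lines(binary_stream, encoding="utf-8", errors="replace"):
    """Yield decoded lines from a binary stream without raising on bad bytes.

    With errors="replace", each undecodable byte sequence becomes U+FFFD, so the
    binary column is lossy. errors="surrogateescape" would instead keep the
    original bytes recoverable, at the cost of strings that cannot be printed
    as strict UTF-8.
    """
    for raw in binary_stream:  # binary streams iterate line by line, split on b"\n"
        yield raw.decode(encoding, errors=errors)


if __name__ == "__main__":
    # Read stdin as bytes, so the strict text decoding of sys.stdin is bypassed.
    for line in lenient_lines(sys.stdin.buffer):
        pass  # hand each line to the existing parsing logic here
```

Note that the traceback above fails while decoding stdin, so the fix has to be applied where stdin is decoded, which is why this sketch reads `sys.stdin.buffer` directly.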
