UnicodeDecodeError when Prepare Data and Vocab #4

youngornever · 2020-08-14T17:47:30Z

I get a UnicodeDecodeError when I run segment.py.
The error appears to be caused by the data rather than the code; for example, see line 87.
Counting with the snippet below the traceback gives valid lines: 5495620, error lines: 12109, total lines: 5507729.

Please check the dataset.

  File "preprocess_zh/segment.py", line 119, in
    lines = [ x.decode('utf8') for x in open(where).readlines() ]
  File "/home/user/xxxx/anaconda3/envs/tf2sks/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3696: invalid continuation byte

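# Count how many lines of the input file fail to decode as UTF-8.
# `where` is the input path, as used in segment.py.
lines = []
err_line_ids = []  # 1-based numbers of the lines that are not valid UTF-8
with open(where, "rb") as fp:  # read raw bytes so each line is decoded separately
    for ii, x in enumerate(fp, 1):
        try:
            lines.append(x.decode('utf8'))
        except UnicodeDecodeError:
            err_line_ids.append(ii)
print("valid lines:{}, error lines:{}, total lines:{}".format(len(lines), len(err_line_ids), len(lines)+len(err_line_ids)))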
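
Until the dataset is fixed, one possible workaround is to read the file tolerantly instead of letting the first bad byte abort the run. The following is only a minimal sketch: the function name `read_utf8_lines` and its `on_error` parameter are hypothetical and not part of segment.py, and whether the bad lines should be skipped or kept with replacement characters is an assumption that depends on how the segmented output is used later.

```python
def read_utf8_lines(path, on_error="skip"):
    """Read `path` line by line, tolerating invalid UTF-8.

    Hypothetical helper, not part of segment.py.
    on_error="skip" drops lines that fail to decode;
    on_error="replace" keeps them, with U+FFFD in place of the bad bytes.
    Returns (lines, bad_line_numbers); line numbers are 1-based.
    """
    lines, bad = [], []
    with open(path, "rb") as fp:  # raw bytes, so one bad line cannot abort the read
        for lineno, raw in enumerate(fp, 1):
            try:
                lines.append(raw.decode("utf-8"))
            except UnicodeDecodeError:
                bad.append(lineno)
                if on_error == "replace":
                    lines.append(raw.decode("utf-8", errors="replace"))
    return lines, bad


# Possible use in place of line 119 of segment.py (assumption):
# lines, bad = read_utf8_lines(where)
```

Skipping would lose the 12109 affected lines (about 0.2% of the file), while replacing keeps every line but leaves U+FFFD characters in the text.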

