Running segment.py raises a UnicodeDecodeError (traceback below).
The error appears to be caused by the data rather than the code. For example, see line 87.
Decoding the file line by line with the script below the traceback gives valid lines: 5495620, error lines: 12109, total lines: 5507729.
Please check the dataset.
File "preprocess_zh/segment.py", line 119, in
lines = [ x.decode('utf8') for x in open(where).readlines() ]
File "/home/user/xxxx/anaconda3/envs/tf2sks/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3696: invalid continuation byte
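lines = []
err_line_ids = []
# Read raw bytes so that each line can be decoded (and fail) independently.
with open(where, "rb") as fp:
    for ii, x in enumerate(fp, 1):
        try:
            lines.append(x.decode('utf8'))
        except UnicodeDecodeError:
            # Record the 1-based number of every line that is not valid UTF-8.
            err_line_ids.append(ii)
            # pdb.set_trace()
print("valid lines:{}, error lines:{}, total lines:{}".format(len(lines), len(err_line_ids), len(lines)+len(err_line_ids)))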
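
Until the dataset is fixed, one possible workaround is to make the loader tolerant of the bad lines. The sketch below is only an illustration and is not part of segment.py: the helper name `read_utf8_lines` and its `on_error` parameter are hypothetical. It reads the file in binary mode, decodes each line separately, and either skips undecodable lines or keeps them with the invalid bytes replaced by U+FFFD.

```python
def read_utf8_lines(path, on_error="skip"):
    """Decode a file line by line as UTF-8.

    Hypothetical helper, not part of segment.py.
    Returns (lines, bad_line_numbers), with 1-based line numbers.
    on_error="skip"    drops lines that are not valid UTF-8.
    on_error="replace" keeps them, substituting U+FFFD for invalid bytes.
    """
    lines, bad_line_numbers = [], []
    with open(path, "rb") as fp:
        for lineno, raw in enumerate(fp, 1):
            try:
                lines.append(raw.decode("utf-8"))
            except UnicodeDecodeError:
                bad_line_numbers.append(lineno)
                if on_error == "replace":
                    lines.append(raw.decode("utf-8", errors="replace"))
    return lines, bad_line_numbers
```

Calling `read_utf8_lines(where)` would give the decoded lines plus the numbers of the lines that failed, which can then be inspected in the raw file. Whether skipping or replacing is acceptable depends on how the downstream segmentation uses the text.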