wrong result. actual johab - expected latin1 #279

milahu · 2023-04-08T14:32:32Z

chardet returns "Johab with confidence 0.99" when it should return latin1

downstream issue: alexanderwink/subdl#37

repro

cd $(mktemp -d)
sub_id=3431287
wget https://dl.opensubtitles.org/en/download/sub/$sub_id
unzip -B $sub_id
chardetect *.srt 
# Irreversible.2002.DVDRip.XviD.AC3-DK.EN.srt: Johab with confidence 0.99
iconv -f johab -t utf8 *.srt >/dev/null 
# iconv: illegal input sequence at position 13187
dd if=$(ls *.srt) bs=1 skip=$((13187 - 8)) count=16 status=none | hexdump -C
# 00000000  79 21 0d 0a 2d 20 57 65  b4 72 65 20 67 6f 69 6e  |y!..- We.re goin|

# "johab" sounds weird. let's try latin1
iconv -f latin1 -t utf8 *.srt >/dev/null && echo ok
# ok
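
The iconv check above can also be done in Python with strict decoding. This is only a sketch: `decode_strict` and the candidate list are hypothetical names, not part of chardet or libmagic. The detector's guess would go first in the list, and latin-1 must go last because every byte sequence decodes as latin-1.

```python
def decode_strict(data: bytes, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            # errors="strict" is the default, so invalid bytes raise
            return enc, data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # wrong guess or unknown codec name: try the next candidate
            continue
    raise ValueError("no candidate encoding decoded the input")


sample = "hellö".encode("latin-1")
print(decode_strict(sample, ["utf-8", "latin-1"]))
```

A clean decode only shows that the bytes are valid in that encoding, not that the guess is right, so this rejects some wrong guesses but cannot confirm a correct one.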

zip file: opensubtitles-org-3431287.zip

libmagic gives the correct result:

file -i *.srt 
# Irreversible.2002.DVDRip.XviD.AC3-DK.EN.srt: application/x-subrip; charset=iso-8859-1


milahu · 2023-04-13T18:37:51Z

same problem with this file: Enemy.2013.en.1953477413.srt.zip

file -i Enemy.2013.en.1953477413.srt 
# Enemy.2013.en.1953477413.srt: application/x-subrip; charset=iso-8859-1
iconv -f iso-8859-1 -t utf8 Enemy.2013.en.1953477413.srt >/dev/null && echo ok
# ok

chardetect Enemy.2013.en.1953477413.srt
# Enemy.2013.en.1953477413.srt: CP949 with confidence 0.99
iconv -f CP949 -t utf8 Enemy.2013.en.1953477413.srt >/dev/null && echo ok
# iconv: illegal input sequence at position 2570

dd if=Enemy.2013.en.1953477413.srt bs=1 skip=$((2570 - 8)) count=16 status=none | hexdump -C
# 00000000  2c 32 32 32 0d 0a 2d 20  b6 20 74 65 65 6e 61 67  |,222..- . teenag|

head -c$((2570 + 8)) Enemy.2013.en.1953477413.srt | tail -c16 | iconv -f iso-8859-1 -t utf8
# ,222
# - ¶ teenag

offending character is ¶ = pilcrow
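
A rough Python equivalent of the `dd | hexdump` step, for looking at the bytes around the offset that iconv reports. `context_at` is a hypothetical helper, and the sample bytes are copied from the hexdump above rather than read from the file.

```python
def context_at(data: bytes, pos: int, width: int = 8) -> bytes:
    """Return the bytes from pos - width up to (not including) pos + width."""
    start = max(pos - width, 0)
    return data[start:pos + width]


# 16 bytes from the hexdump above; the offending byte 0xb6 is at index 8
chunk = b",222\r\n- \xb6 teenag"
window = context_at(chunk, 8)
print(window.hex(" "))           # hex view, like hexdump (Python 3.8+)
print(window.decode("latin-1"))  # latin-1 maps every byte to a character
```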

workaround: use libmagic

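# detect_from_content comes from libmagic's own python bindings (PyPI: file-magic)
import magic
s = "hellö".encode("latin1")
encoding = magic.detect_from_content(s).encoding
# skip input that is already ascii/utf-8 or that libmagic could not classify
if encoding not in {"us-ascii", "utf-8", "unknown-8bit", "binary"}:
    # convert to utf8
    s = s.decode(encoding).encode("utf8")  # bytes -> str -> bytes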

dan-blanchard · 2023-04-14T15:54:59Z

chardet is probabilistic and sometimes wrong, especially for shorter strings or when there is only one non-ascii character
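
One way to act on this is to measure how much non-ASCII evidence the input contains before trusting any detector. This is a minimal sketch: `non_ascii_evidence` is a hypothetical helper and not part of chardet, and where to draw the line is left to the caller.

```python
def non_ascii_evidence(data: bytes):
    """Return (count, ratio) of bytes >= 0x80 in data."""
    count = sum(1 for b in data if b >= 0x80)
    ratio = count / len(data) if data else 0.0
    return count, ratio


# the 16 bytes from the first hexdump in this issue: one non-ASCII byte (0xb4)
count, ratio = non_ascii_evidence(b"y!\r\n- We\xb4re goin")
print(count, ratio)
```

With only a handful of non-ASCII bytes, a high reported confidence says little, and a conservative fallback may be the safer choice.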

milahu · 2023-04-14T18:42:32Z

> chardet is probabilistic and sometimes wrong

yes, that is the problem here

even if chardet is faster than libmagic, a wrong result is 100% useless

milahu · 2023-04-15T08:15:59Z

> chardet is probabilistic and sometimes wrong

also, if this is a known limitation of chardet, it should be in the readme

dan-blanchard closed this as completed Apr 14, 2023
