wrong result. actual johab - expected latin1 #279
Comments
same problem with this file: Enemy.2013.en.1953477413.srt.zip
file -i Enemy.2013.en.1953477413.srt
# Enemy.2013.en.1953477413.srt: application/x-subrip; charset=iso-8859-1
iconv -f iso-8859-1 -t utf8 Enemy.2013.en.1953477413.srt >/dev/null && echo ok
# ok
chardetect Enemy.2013.en.1953477413.srt
# Enemy.2013.en.1953477413.srt: CP949 with confidence 0.99
iconv -f CP949 -t utf8 Enemy.2013.en.1953477413.srt >/dev/null && echo ok
# iconv: illegal input sequence at position 2570
dd if=Enemy.2013.en.1953477413.srt bs=1 skip=$((2570 - 8)) count=16 status=none | hexdump -C
# 00000000 2c 32 32 32 0d 0a 2d 20 b6 20 74 65 65 6e 61 67 |,222..- . teenag|
head -c$((2570 + 8)) Enemy.2013.en.1953477413.srt | tail -c16 | iconv -f iso-8859-1 -t utf8
# ,222
# - ¶ teenag
offending character is ¶ = pilcrow
workaround: use libmagic
import magic
s = "hellö".encode("latin1")
encoding = magic.detect_from_content(s).encoding
if encoding not in {"us-ascii", "utf-8", "unknown-8bit", "binary"}:
    # convert to utf8
    s = s.decode(encoding).encode("utf8") # bytes -> str -> bytes
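Whatever detector is used, its guess can be sanity-checked the same way the iconv runs above do it: attempt a strict decode and see whether it fails. A minimal sketch using only the standard library; the helper name first_decode_error is made up for illustration, and the sample bytes are the 16 bytes copied from the hexdump above.

```python
from typing import Optional

# 16 bytes around the offending position, copied from the hexdump above
SAMPLE = bytes.fromhex("2c3232320d0a2d20b6207465656e6167")


def first_decode_error(data: bytes, encoding: str) -> Optional[int]:
    """Return the offset where a strict decode fails, or None if it succeeds."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


for candidate in ("cp949", "iso-8859-1"):
    offset = first_decode_error(SAMPLE, candidate)
    status = "ok" if offset is None else f"illegal input at offset {offset}"
    print(f"{candidate}: {status}")
```

A clean decode does not prove the guess is right (iso-8859-1 accepts every byte sequence), but a failed decode does prove the guess is wrong, which is enough to reject the CP949 result here.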
chardet is probabilistic and sometimes wrong, especially for shorter strings or when there is only one non-ASCII character
yes, that is the problem here
even if chardet is faster than libmagic, a wrong result is 100% useless
also, if this is a known limitation of chardet, it should be in the readme
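To see how little evidence a detector has in a case like this, the non-ASCII bytes can simply be listed. A rough sketch, assuming only the 16-byte window from the hexdump earlier in the thread (the rest of the file is not inspected here); non_ascii_offsets is a hypothetical helper name.

```python
from typing import List, Tuple

# same 16-byte window as in the hexdump earlier in the thread
SAMPLE = bytes.fromhex("2c3232320d0a2d20b6207465656e6167")


def non_ascii_offsets(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]


hits = non_ascii_offsets(SAMPLE)
print(f"{len(hits)} non-ASCII byte(s) out of {len(SAMPLE)}")
for offset, value in hits:
    print(f"offset {offset}: 0x{value:02x}")
```

In this window the only byte a detector can learn anything from is the single 0xb6, so any statistical guess rests on very thin evidence.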
chardet returns "Johab with confidence 0.99" when it should return latin1
downstream issue: alexanderwink/subdl#37
repro
zip file: opensubtitles-org-3431287.zip
libmagic gives the correct result: