wrong result. actual johab - expected latin1 #279

Closed
milahu opened this issue Apr 8, 2023 · 4 comments

milahu commented Apr 8, 2023

chardet returns "Johab with confidence 0.99" when it should return latin1

downstream issue: alexanderwink/subdl#37

repro

cd $(mktemp -d)
sub_id=3431287
wget https://dl.opensubtitles.org/en/download/sub/$sub_id
unzip -B $sub_id
chardetect *.srt 
# Irreversible.2002.DVDRip.XviD.AC3-DK.EN.srt: Johab with confidence 0.99
iconv -f johab -t utf8 *.srt >/dev/null 
# iconv: illegal input sequence at position 13187
dd if=$(ls *.srt) bs=1 skip=$((13187 - 8)) count=16 status=none | hexdump -C
# 00000000  79 21 0d 0a 2d 20 57 65  b4 72 65 20 67 6f 69 6e  |y!..- We.re goin|

# "johab" sounds weird. let's try latin1
iconv -f latin1 -t utf8 *.srt >/dev/null && echo ok
# ok

zip file: opensubtitles-org-3431287.zip
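The `iconv` check above can also be reproduced in Python. This is a minimal sketch, not part of chardet: `first_decode_error` is a hypothetical helper name, and it relies only on the standard `bytes.decode` and `UnicodeDecodeError.start`.

```python
from typing import Optional


def first_decode_error(data: bytes, encoding: str) -> Optional[int]:
    """Return the byte offset of the first strict-decoding failure, or None.

    Hypothetical helper for illustration; it is not part of chardet.
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The 16 bytes from the hexdump above (0xb4 is the only non-ASCII byte).
sample = bytes.fromhex("79 21 0d 0a 2d 20 57 65 b4 72 65 20 67 6f 69 6e")

print(first_decode_error(sample, "latin-1"))  # None: latin-1 maps every byte
print(first_decode_error(sample, "ascii"))    # 8: the offset of the 0xb4 byte
```

To test a detector's guess against the whole file, pass the file's bytes and the guessed name (for example `"johab"`). Python ships a `johab` codec, but its tables may not reject exactly the same sequences as `iconv` does, so the reported offset can differ.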

libmagic gives the correct result:

file -i *.srt 
# Irreversible.2002.DVDRip.XviD.AC3-DK.EN.srt: application/x-subrip; charset=iso-8859-1

milahu commented Apr 13, 2023

same problem with this file: Enemy.2013.en.1953477413.srt.zip

file -i Enemy.2013.en.1953477413.srt 
# Enemy.2013.en.1953477413.srt: application/x-subrip; charset=iso-8859-1
iconv -f iso-8859-1 -t utf8 Enemy.2013.en.1953477413.srt >/dev/null && echo ok
# ok

chardetect Enemy.2013.en.1953477413.srt
# Enemy.2013.en.1953477413.srt: CP949 with confidence 0.99
iconv -f CP949 -t utf8 Enemy.2013.en.1953477413.srt >/dev/null && echo ok
# iconv: illegal input sequence at position 2570

dd if=Enemy.2013.en.1953477413.srt bs=1 skip=$((2570 - 8)) count=16 status=none | hexdump -C
# 00000000  2c 32 32 32 0d 0a 2d 20  b6 20 74 65 65 6e 61 67  |,222..- . teenag|

head -c$((2570 + 8)) Enemy.2013.en.1953477413.srt | tail -c16 | iconv -f iso-8859-1 -t utf8
# ,222
# - ¶ teenag

the offending byte is 0xb6, which is ¶ (pilcrow) in iso-8859-1
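A small sketch for inspecting such bytes without `dd` and `hexdump`: it lists every non-ASCII byte in a buffer together with the character it maps to. `describe_non_ascii` is a hypothetical helper, and it assumes a single-byte encoding such as latin-1, where each byte is one character.

```python
import unicodedata


def describe_non_ascii(data: bytes, encoding: str = "latin-1"):
    """List (offset, byte, character, Unicode name) for each non-ASCII byte.

    Assumes a single-byte encoding, so each byte maps to one character.
    Bytes the encoding does not define are shown as U+FFFD.
    """
    rows = []
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            char = bytes([byte]).decode(encoding, errors="replace")
            rows.append((offset, hex(byte), char, unicodedata.name(char, "UNKNOWN")))
    return rows


# The 16 bytes from the hexdump above.
context = bytes.fromhex("2c 32 32 32 0d 0a 2d 20 b6 20 74 65 65 6e 61 67")
for row in describe_non_ascii(context):
    print(row)  # (8, '0xb6', '¶', 'PILCROW SIGN')
```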

workaround: use libmagic

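import magic  # libmagic's bundled Python bindings (file-magic on PyPI)

s = "hellö".encode("latin1")
encoding = magic.detect_from_content(s).encoding
# skip results that need no conversion or that are not decodable text
if encoding not in {"us-ascii", "utf-8", "unknown-8bit", "binary"}:
    # convert to utf8: bytes -> str -> bytes
    s = s.decode(encoding).encode("utf8")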
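If libmagic is not available, another option is to treat any detector's answer as a candidate and verify it by strict decoding. The sketch below is an assumption-laden illustration, not chardet's behavior: `decode_first_valid` is a hypothetical helper, the candidate order is arbitrary, and chardet is only consulted if it happens to be installed.

```python
from typing import Iterable, Tuple


def decode_first_valid(data: bytes, candidates: Iterable[str]) -> Tuple[str, str]:
    """Decode with the first candidate encoding that succeeds strictly.

    Hypothetical helper: unknown codec names (LookupError) and decoding
    failures (UnicodeDecodeError) both move on to the next candidate.
    latin-1 is tried last because it accepts every byte sequence.
    """
    for encoding in list(candidates) + ["latin-1"]:
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            continue
    raise AssertionError("unreachable: latin-1 decodes any bytes")


raw = "hellö".encode("latin1")

try:
    import chardet  # optional; only used if installed
    guess = chardet.detect(raw)["encoding"]  # may be None
except ImportError:
    guess = None

candidates = ["utf-8"] + ([guess] if guess else [])
text, used = decode_first_valid(raw, candidates)
print(used, text)
```

Strict decoding only rules out guesses that are impossible for the input. A wrong single-byte guess can still decode without error, so this reduces bad results but does not eliminate them.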

dan-blanchard (Member) commented

chardet is probabilistic and sometimes wrong, especially for shorter strings or when there is only one non-ascii character
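One way a caller could act on this is to measure how much non-ASCII evidence the input contains before trusting a guess. This is only a sketch: `non_ascii_count` and `trust_guess` are hypothetical helpers, and both thresholds are arbitrary illustrative values, not chardet defaults.

```python
def non_ascii_count(data: bytes) -> int:
    """Number of bytes >= 0x80, which plain ASCII text never contains."""
    return sum(1 for byte in data if byte >= 0x80)


def trust_guess(data: bytes, confidence: float,
                min_non_ascii: int = 20, min_confidence: float = 0.9) -> bool:
    """Hypothetical gate: accept a detector's guess only if the input has
    enough non-ASCII bytes and the reported confidence is high.

    Both thresholds are arbitrary illustrative values, not chardet defaults.
    """
    return non_ascii_count(data) >= min_non_ascii and confidence >= min_confidence


line = b"- We\xb4re going"
print(non_ascii_count(line))    # 1
print(trust_guess(line, 0.99))  # False: one non-ASCII byte is too little evidence
```

The thread does not establish how many non-ASCII bytes the two subtitle files contain, so whether such a gate would have caught them is an open question.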


milahu commented Apr 14, 2023

> chardet is probabilistic and sometimes wrong

yes, that is the problem here

even if chardet is faster than libmagic, a wrong result is 100% useless


milahu commented Apr 15, 2023

> chardet is probabilistic and sometimes wrong

also, if this is a known limitation of chardet, it should be in the readme
