The scraper would benefit a lot from having default encoding aliases already embedded. In many situations, we can be 99% sure of the proper encoding to use.
For now, the defaults should be:
u => utf-8
unicode => utf-8
65001 => utf-8
urf-8 => utf-8
utf-f => utf-8
utf-08 => utf-8
utp-8 => utf-8
windows-8859-1 => iso-8859-1
ansi => windows-1252
uft-8 => utf-8
iso-utf-8 => utf-8
iso88591 => iso-8859-1
All of these bad encodings have been found in the wild, and it is always a bit hard to fail a scrape "just for that" when there is no ambiguity in these cases.
Setting the encoding aliases CLI argument would not drop these defaults, but complement/enrich them. The CLI value must, however, supersede the default (i.e. if unicode is redefined in the --encoding-aliases CLI argument, it is the value from the CLI, and not the default, which is used).
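The merge behaviour described above could be sketched roughly as follows. This is only an illustration, not the scraper's actual code: the function names (`parse_aliases`, `build_aliases`, `resolve_encoding`) are hypothetical, and the `alias=encoding,alias=encoding` syntax assumed for the `--encoding-aliases` value is a guess that may not match the real argument format.

```python
# Default aliases proposed in this issue (keys are lowercased declared encodings).
DEFAULT_ENCODING_ALIASES = {
    "u": "utf-8",
    "unicode": "utf-8",
    "65001": "utf-8",
    "urf-8": "utf-8",
    "utf-f": "utf-8",
    "utf-08": "utf-8",
    "utp-8": "utf-8",
    "windows-8859-1": "iso-8859-1",
    "ansi": "windows-1252",
    "uft-8": "utf-8",
    "iso-utf-8": "utf-8",
    "iso88591": "iso-8859-1",
}


def parse_aliases(raw):
    """Parse an 'alias=encoding,alias=encoding' string (assumed CLI format)."""
    aliases = {}
    for item in raw.split(","):
        if not item.strip():
            continue  # tolerate empty entries such as a trailing comma
        alias, sep, target = item.partition("=")
        if not sep or not alias.strip() or not target.strip():
            raise ValueError(f"invalid encoding alias: {item!r}")
        aliases[alias.strip().lower()] = target.strip().lower()
    return aliases


def build_aliases(cli_value=None):
    """Start from the defaults, then let CLI-provided aliases override them."""
    merged = dict(DEFAULT_ENCODING_ALIASES)
    if cli_value:
        merged.update(parse_aliases(cli_value))
    return merged


def resolve_encoding(declared, aliases):
    """Map a declared encoding to its alias target, or return it unchanged."""
    key = declared.strip().lower()
    return aliases.get(key, key)
```

With this sketch, `build_aliases("unicode=utf-16")` keeps every default except `unicode`, which takes the CLI value. Copying the defaults before updating means the CLI can only add or override entries, never silently drop them.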