Provide default encoding aliases #416

Open
benoit74 opened this issue Nov 5, 2024 · 0 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)
benoit74 (Collaborator) commented Nov 5, 2024

The scraper would benefit a lot from shipping with default encoding aliases built in. In many situations, we can be 99% sure of the proper encoding to use.

For now, the defaults should be:

  • u => utf-8
  • unicode => utf-8
  • 65001 => utf-8
  • urf-8 => utf-8
  • utf-f => utf-8
  • utf-08 => utf-8
  • utp-8 => utf-8
  • windows-8859-1 => iso-8859-1
  • ansi => windows-1252
  • uft-8 => utf-8
  • iso-utf-8 => utf-8
  • iso88591 => iso-8859-1

All of these bad encodings have been found in the wild, and it is always a bit hard to fail a scrape "just for that"; there is no ambiguity in these cases.
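As a rough illustration, the defaults listed above could be embedded as a plain mapping plus a small lookup helper. This is only a sketch: the names `DEFAULT_ENCODING_ALIASES` and `resolve_encoding` are hypothetical, and the case-insensitive, whitespace-trimmed matching is an assumption, not something the scraper is known to do.

```python
# Hypothetical sketch: the names below are illustrative, not the scraper's API.
# The mapping itself is copied from the list of defaults in this issue.
DEFAULT_ENCODING_ALIASES = {
    "u": "utf-8",
    "unicode": "utf-8",
    "65001": "utf-8",
    "urf-8": "utf-8",
    "utf-f": "utf-8",
    "utf-08": "utf-8",
    "utp-8": "utf-8",
    "windows-8859-1": "iso-8859-1",
    "ansi": "windows-1252",
    "uft-8": "utf-8",
    "iso-utf-8": "utf-8",
    "iso88591": "iso-8859-1",
}


def resolve_encoding(declared, aliases=DEFAULT_ENCODING_ALIASES):
    """Return the encoding to use for a declared (possibly bad) encoding name.

    Assumption: matching ignores case and surrounding whitespace.
    Names without an alias are returned normalized but otherwise unchanged.
    """
    key = declared.strip().lower()
    return aliases.get(key, key)
```

With this sketch, `resolve_encoding("UTF-08")` gives `"utf-8"`, while a name without an alias, such as `"utf-16"`, passes through untouched.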

Setting the encoding aliases CLI arg would not drop these defaults, but complement/enrich them. The CLI value must, however, supersede the default (i.e. if unicode is redefined in the --encoding-aliases CLI argument, the value from the CLI, and not the default, is used).
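The merge behaviour described here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names are hypothetical, only an excerpt of the defaults is shown, and the `alias=encoding,alias=encoding` value format for `--encoding-aliases` is assumed rather than taken from the scraper.

```python
# Hypothetical sketch: names and the "alias=encoding,..." format are assumptions.
# Only an excerpt of the defaults is shown here.
DEFAULT_ENCODING_ALIASES = {
    "unicode": "utf-8",
    "ansi": "windows-1252",
}


def parse_encoding_aliases(value):
    """Parse an assumed 'alias=encoding,alias=encoding' CLI value into a dict."""
    aliases = {}
    for item in value.split(","):
        if not item.strip():
            continue  # tolerate empty entries such as a trailing comma
        alias, sep, target = item.partition("=")
        if not sep or not alias.strip() or not target.strip():
            raise ValueError(f"invalid encoding alias definition: {item!r}")
        aliases[alias.strip().lower()] = target.strip().lower()
    return aliases


def effective_encoding_aliases(cli_value=None):
    """Start from the defaults, then let CLI entries add to or override them."""
    merged = dict(DEFAULT_ENCODING_ALIASES)  # copy, so the defaults stay intact
    if cli_value:
        merged.update(parse_encoding_aliases(cli_value))
    return merged
```

Here, passing `unicode=utf-16,foo=latin-1` would keep the default for `ansi`, add `foo`, and replace the default for `unicode` with `utf-16`.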

benoit74 added the enhancement and good first issue labels Nov 5, 2024
benoit74 added this to the 2.2.0 milestone Nov 5, 2024