Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce default_encoding parameter to [set|autodetect] the encoding if the charset is missing from the headers #284

Merged
merged 2 commits into from
Apr 10, 2024

Conversation

deedy5
Copy link
Contributor

@deedy5 deedy5 commented Mar 31, 2024

PR summary:

  1. Added default_encoding parameter
  2. Updated github actions

Using the default encoding

When requests are made to a site without explicit character set information from the server, but the encoding is known, it's advisable to explicitly set the default encoding.

from curl_cffi import requests

resp = requests.get(..., default_encoding="shift-jis")
print(response.encoding)  # Shows charset given in the Content-Type charset, or else "shift-jis".
print(response.text)  # The text decoded with the Content-Type charset, or using "shift-jis".

Using character set auto-detection

When the server doesn't provide character set information reliably, and the encoding is unknown, enabling auto-detection allows for a best-effort guess when converting bytes to text. To activate auto-detection, set the default_encoding parameter to a function that takes input bytes and returns the appropriate character set for decoding.

Let's take a look at autodetection using charset-normalizer

from curl_cffi import requests
from charset_normalizer import detect

def autodetect(content):
    return detect(content).get("encoding")

# Using a session with character-set autodetection enabled.
s = requests.Session(default_encoding=autodetect)
resp = s.get(...)
print(resp.encoding)  # Shows charset given in the Content-Type charset, or else autodetect() result.
print(resp.text)  # The text decoded with the Content-Type charset, or encoding, detected via autodetect.

@T-256
Copy link
Contributor

T-256 commented Mar 31, 2024

fyi httpx uses utf-8 with errors=replace as default fallback. it uses charset_normalizer as optional feature.

@deedy5 deedy5 force-pushed the charset_normalizer branch from fb021fb to bd90ef4 Compare March 31, 2024 23:26
@T-256
Copy link
Contributor

T-256 commented Mar 31, 2024

requests - fallback to auto-detected encoding

Needs decision. (I prefer httpx way, to not force new dependency.)

@deedy5 deedy5 force-pushed the charset_normalizer branch 5 times, most recently from e2072f7 to c8cafcb Compare March 31, 2024 23:58
@deedy5
Copy link
Contributor Author

deedy5 commented Apr 1, 2024

httpx PR with useful information
encode/httpx#2165

@perklet
Copy link
Collaborator

perklet commented Apr 1, 2024

I also prefer to add this as an optional feature.

@deedy5 deedy5 force-pushed the charset_normalizer branch from 3c23431 to 1aa8c23 Compare April 1, 2024 08:06
@deedy5 deedy5 changed the title Charset_normalizer for response encoding detection Add default_encoding parameter to set | autodetect the encoding if the charset is missing from the headers Apr 1, 2024
@deedy5
Copy link
Contributor Author

deedy5 commented Apr 1, 2024

@yifeikong Update the documentation where you see fit, I've added a description in the first post.

@deedy5 deedy5 force-pushed the charset_normalizer branch from b0a6c14 to 4d7aaaa Compare April 1, 2024 15:03
Copy link
Contributor

@T-256 T-256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, Thank you!

@deedy5 deedy5 force-pushed the charset_normalizer branch 3 times, most recently from 1c9e80e to 95407e6 Compare April 1, 2024 19:29
@deedy5 deedy5 force-pushed the charset_normalizer branch 9 times, most recently from 87116bc to bdd2783 Compare April 1, 2024 22:02
@deedy5 deedy5 force-pushed the charset_normalizer branch 3 times, most recently from 7822b17 to e480780 Compare April 1, 2024 22:22
@deedy5
Copy link
Contributor Author

deedy5 commented Apr 1, 2024

The problem in guthub workflow check Build bdist wheels and test (macos-12) occurred when installing pipx.
https://github.com/yifeikong/curl_cffi/blob/418e452c99dee5da176f0b0a768337cd5509c4c5/.github/workflows/build-and-test.yaml#L70
But pipx already installed:
macos-12
macos-14
So I removed that action and all works.

@deedy5 deedy5 force-pushed the charset_normalizer branch from da5da98 to ba3eb25 Compare April 2, 2024 00:04
@deedy5 deedy5 changed the title Add default_encoding parameter to set | autodetect the encoding if the charset is missing from the headers Add default_encoding parameter to [set|autodetect] the encoding if the charset is missing from the headers + update github actions + update requirements Apr 2, 2024
@deedy5 deedy5 changed the title Add default_encoding parameter to [set|autodetect] the encoding if the charset is missing from the headers + update github actions + update requirements Add default_encoding parameter to [set|autodetect] the encoding if the charset is missing from the headers + update github actions + update requirements + update Ruff linter rules Apr 2, 2024
@coletdjnz
Copy link
Contributor

coletdjnz commented Apr 2, 2024

I recommend splitting out unrelated changes into another PR(s), as it helps with review (and for downstream users to audit)

@deedy5 deedy5 force-pushed the charset_normalizer branch from 12c83e5 to dc69172 Compare April 3, 2024 01:53
@deedy5 deedy5 changed the title Add default_encoding parameter to [set|autodetect] the encoding if the charset is missing from the headers + update github actions + update requirements + update Ruff linter rules Introduce default_encoding parameter to [set|autodetect] the encoding if the charset is missing from the headers Apr 3, 2024
@deedy5
Copy link
Contributor Author

deedy5 commented Apr 3, 2024

Split the PR into several, in this one I left only adding the default_encoder parameter and updating the github actions to remove the bug in the tests.

@perklet
Copy link
Collaborator

perklet commented Apr 7, 2024

Generally, I think the encoding/charset properies should be kept, other wise it would be a breaking change, which should be introduced only when the major version bumps.

You can add new properties or change the underlying implementation, though.

Thanks for fixing the CI!

@deedy5 deedy5 force-pushed the charset_normalizer branch from 799e566 to 3f04ced Compare April 7, 2024 14:26
@deedy5
Copy link
Contributor Author

deedy5 commented Apr 7, 2024

@yifeikong 🏁

@perklet perklet merged commit 96d4c52 into lexiforest:main Apr 10, 2024
7 checks passed
@deedy5 deedy5 deleted the charset_normalizer branch April 10, 2024 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants