Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Unicode encoding issues in Bazel's use of Starlark #24417

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Nov 20, 2024

Bazel internally uses String as a container for raw bytes assumed to be UTF-8, which differs from ordinary usage of String as a container for UTF-16 characters. This requires special implementations of certain Starlark functions that care about the notion of a "character":

  • {l,r,}strip must not strip non-ASCII whitespace as it may be part of a UTF-8-encoded non-whitespace character.
  • json.decode has to emit UTF-8 bytes rather than UTF-16 characters.

Compatibility is verified by running all script-based tests both parsed as UTF-8 and using Bazel's internal encoding.

@fmeum fmeum force-pushed the 23859-unicode-starlark branch 3 times, most recently from c62a5eb to 0435c09 Compare November 25, 2024 15:49
@fmeum fmeum marked this pull request as ready for review November 25, 2024 15:49
@fmeum fmeum requested review from tjgq and removed request for tetromino and brandjon November 25, 2024 15:49
@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Nov 25, 2024
@tjgq tjgq requested a review from tetromino November 25, 2024 21:33
@iancha1992 iancha1992 added the team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols label Nov 25, 2024
@tetromino tetromino self-assigned this Jan 14, 2025
@tetromino
Copy link
Contributor

To be clear, are all string values that Bazel currently passes to split encoded consistently as raw-UTF-bytes? Or is there a possibility that we have a mixture of encodings?

@fmeum fmeum force-pushed the 23859-unicode-starlark branch from 0435c09 to 3728e16 Compare January 16, 2025 08:40
@fmeum fmeum force-pushed the 23859-unicode-starlark branch from 3728e16 to 5de090d Compare January 16, 2025 08:40
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 16, 2025

To be clear, are all string values that Bazel currently passes to split encoded consistently as raw-UTF-bytes? Or is there a possibility that we have a mixture of encodings?

All strings fed into and obtained from Starlark, in fact every string retained by Bazel, should now be raw UTF-8 bytes. Strings with other encodings are only produced or consumed at I/O boundaries such as in FileSystem implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants