Fix Unicode encoding issues in Bazel's use of Starlark #24417

fmeum · 2024-11-20T17:06:42Z

Bazel internally uses String as a container for raw bytes assumed to be UTF-8, which differs from the ordinary use of String as a sequence of UTF-16 code units. Starlark functions that depend on the notion of a "character" therefore need special implementations:

{l,r,}strip must not strip non-ASCII whitespace, because a byte that looks like such whitespace may be part of a UTF-8-encoded non-whitespace character.
json.decode has to emit UTF-8 bytes rather than UTF-16 code units.
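
Both points can be illustrated with a small Python simulation. This is only a sketch, not Bazel's Java code: it assumes the raw-bytes convention amounts to storing each UTF-8 byte as one character in the Latin-1 range, and the names `to_internal`, `from_internal`, `decode_json_string` and `ASCII_WHITESPACE` are hypothetical.

```python
import json

# Hypothetical: the whitespace set an ASCII-only strip would remove.
ASCII_WHITESPACE = " \t\n\x0b\x0c\r"


def to_internal(text: str) -> str:
    """Simulate the raw-bytes convention: one char per UTF-8 byte."""
    return text.encode("utf-8").decode("latin-1")


def from_internal(raw: str) -> str:
    """Inverse of to_internal; raises UnicodeDecodeError on broken UTF-8."""
    return raw.encode("latin-1").decode("utf-8")


# "à" is encoded as the bytes C3 A0, and byte A0 read on its own is NBSP.
raw = to_internal("  voilà  ")

# A Unicode-aware strip treats the trailing A0 byte as whitespace and
# removes it, leaving a dangling C3 byte that is no longer valid UTF-8.
naive = raw.strip()

# Stripping only ASCII whitespace keeps the two-byte sequence intact.
safe = raw.strip(ASCII_WHITESPACE)


def decode_json_string(doc: str) -> str:
    """Decode a JSON string literal, then re-express it as UTF-8 bytes."""
    return to_internal(json.loads(doc))


# The escape \u00e0 becomes the two bytes C3 A0, not the single char U+00E0.
decoded = decode_json_string('"\\u00e0"')

print(repr(naive), repr(safe), repr(decoded))
```

In this simulation `from_internal(safe)` round-trips to "voilà", while `from_internal(naive)` fails because the second byte of "à" was stripped.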

Compatibility is verified by running all script-based tests both parsed as UTF-8 and using Bazel's internal encoding.

tetromino · 2025-01-15T23:04:49Z

To be clear, are all string values that Bazel currently passes to split encoded consistently as raw-UTF-bytes? Or is there a possibility that we have a mixture of encodings?

fmeum · 2025-01-16T08:44:32Z

> To be clear, are all string values that Bazel currently passes to split encoded consistently as raw-UTF-bytes? Or is there a possibility that we have a mixture of encodings?

All strings fed into and obtained from Starlark, in fact every string retained by Bazel, should now be raw UTF-8 bytes. Strings with other encodings are only produced or consumed at I/O boundaries such as in FileSystem implementations.
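
As a rough illustration of converting only at the boundary, here is a Python sketch. It assumes the retained form holds one character per UTF-8 byte; `from_platform` and `to_platform` are hypothetical names, not Bazel's FileSystem API, and UTF-16LE stands in for whatever encoding the platform uses.

```python
def from_platform(data: bytes, encoding: str) -> str:
    """Boundary in: decode platform bytes, keep one char per UTF-8 byte."""
    return data.decode(encoding).encode("utf-8").decode("latin-1")


def to_platform(raw: str, encoding: str) -> bytes:
    """Boundary out: recover the text, then encode it for the platform."""
    return raw.encode("latin-1").decode("utf-8").encode(encoding)


# Hypothetical file name as a platform that uses UTF-16LE would report it.
name_on_disk = "café.txt".encode("utf-16-le")

# Everything between the two boundary calls sees only raw UTF-8 bytes.
internal = from_platform(name_on_disk, "utf-16-le")
round_tripped = to_platform(internal, "utf-16-le")

print(repr(internal), round_tripped == name_on_disk)
```

Because every retained string has the same form, code between the two boundary calls never needs to know which encoding the platform used.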

fmeum force-pushed the 23859-unicode-starlark branch 3 times, most recently from c62a5eb to 0435c09 Compare November 25, 2024 15:49

fmeum marked this pull request as ready for review November 25, 2024 15:49

fmeum requested review from brandjon and tetromino as code owners November 25, 2024 15:49

fmeum requested review from tjgq and removed request for tetromino and brandjon November 25, 2024 15:49

github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Nov 25, 2024

fmeum mentioned this pull request Nov 25, 2024

Avoid char array allocation in Starlark format #23763

Open

tjgq requested a review from tetromino November 25, 2024 21:33

iancha1992 added the team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols label Nov 25, 2024

tetromino self-assigned this Jan 14, 2025

fmeum force-pushed the 23859-unicode-starlark branch from 0435c09 to 3728e16 Compare January 16, 2025 08:40

Fix Unicode encoding issues in Bazel's use of Starlark

5de090d

fmeum force-pushed the 23859-unicode-starlark branch from 3728e16 to 5de090d Compare January 16, 2025 08:40
