Python: Fix string encoding dataset check failure #17807

tausbn · 2024-10-18T15:06:46Z

Fixes a dataset check failure for the py_cobjectnames relation seen on python/cpython.

Pull Request checklist

All query authors

A change note is added if necessary. See the documentation in this repository.
All new queries have appropriate .qhelp. See the documentation in this repository.
QL tests are added if necessary. See Testing custom queries in the GitHub documentation.
New and changed queries have correct query metadata. See the documentation in this repository.

Internal query authors only

Autofixes generated based on these changes are valid. This is only needed if this PR makes significant changes to .ql, .qll, or .qhelp files. See the documentation (internal access required).
Changes are validated at scale (internal access required).
Adding a new query? Consider also adding the query to autofix.

Note that this test checks that the current setup creates dataset check violations. A later commit will fix this (and flip the negation in the test).

Here's an example of one of these errors:

```
INVALID_KEY predicate py_cobjectnames(@py_cobject obj, string name)
The key set {obj} does not functionally determine all fields.
Here is a pair of tuples that agree on the key set but differ at index 1:
Tuple 1 in row 63874: (72088,"u'<X>'")
Tuple 2 in row 63875: (72088,"u'<?>'")
```

(Here, the substring `X` should really be the Unicode character U+FFFD, but for some reason I'm not allowed to put that in this commit message.)

Inside the extractor, we assign IDs based on the string type (bytestring or Unicode) and a hash of the UTF-8 encoded content of the string. In this case, however, certain _different_ strings were receiving the same hash, due to replacement characters in the encoding process. In particular, we were converting unencodable characters to question marks in one place, and to U+FFFD in another place. This caused a discrepancy that led to the dataset check error.

To fix this, we put in a custom error handler that always puts the U+FFFD character in place of unencodable characters. With this, the strings now agree, and hence there is no clash.
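The discrepancy and the fix described above can be illustrated with a minimal sketch. This is not the extractor's actual code: the handler name `fffd_replace` and the helper `encode_consistently` are hypothetical, chosen only for illustration. It relies on Python's standard `codecs.register_error`, and on the fact that the built-in `"replace"` handler emits `?` when encoding but U+FFFD when decoding.

```python
import codecs

# Hypothetical handler name; the real extractor may use a different one.
HANDLER_NAME = "fffd_replace"


def _replace_with_fffd(exc):
    """Substitute U+FFFD for every unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        # Return bytes so the encoder copies them verbatim into the output.
        replacement = "\ufffd".encode("utf-8") * (exc.end - exc.start)
        return replacement, exc.end
    raise exc


codecs.register_error(HANDLER_NAME, _replace_with_fffd)


def encode_consistently(text):
    """Encode to UTF-8, mapping unencodable characters to U+FFFD."""
    return text.encode("utf-8", HANDLER_NAME)


# A lone surrogate cannot be encoded as UTF-8.
lone_surrogate = "a\udc80b"

# The built-in "replace" handler yields a question mark when encoding ...
question_mark_form = lone_surrogate.encode("utf-8", "replace")

# ... whereas the custom handler yields the UTF-8 bytes of U+FFFD.
fffd_form = encode_consistently(lone_surrogate)
```

With a handler like this used wherever a string is encoded, the content that is hashed and the name that is stored would both be derived from the same U+FFFD form, which is the property the fix aims for.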

This test should now validate that we no longer have dataset check errors even when there are unencodable characters.

yoff

LGTM

tausbn · 2024-10-23T12:26:41Z

Had to rerun the experiment, as the first time round didn't actually run the dataset check. The second run shows that we're now free from the string encoding error. 💪

github-actions bot added the Python label Oct 18, 2024

Python: Add test for string encoding dataset check

d01593e

Note that this test checks that the current setup creates dataset check violations. A later commit will fix this (and flip the negation in the test).

tausbn force-pushed the tausbn/python-fix-string-encoding-dataset-check-failure branch from f8d3419 to d01593e Compare October 21, 2024 12:08

tausbn added 2 commits October 21, 2024 15:31

Python: Flip test expectation

ae4a4bb

This test should now validate that we no longer have dataset check errors even when there are unencodable characters.

tausbn added the no-change-note-required label (This PR does not need a change note) Oct 21, 2024

tausbn marked this pull request as ready for review October 21, 2024 15:53

tausbn requested a review from a team as a code owner October 21, 2024 15:53

yoff approved these changes Oct 23, 2024

tausbn merged commit e1e3568 into main Oct 23, 2024
10 checks passed

tausbn deleted the tausbn/python-fix-string-encoding-dataset-check-failure branch October 23, 2024 12:26
