Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Fix string encoding dataset check failure #17807

Merged
merged 3 commits into from
Oct 23, 2024

Conversation

tausbn
Copy link
Contributor

@tausbn tausbn commented Oct 18, 2024

Fixes a dataset check failure for the py_cobjectnames relation seen on python/cpython.

Pull Request checklist

All query authors

Internal query authors only

  • Autofixes generated based on these changes are valid, only needed if this PR makes significant changes to .ql, .qll, or .qhelp files. See the documentation (internal access required).
  • Changes are validated at scale (internal access required).
  • Adding a new query? Consider also adding the query to autofix.

Note that this test checks that the current setup creates dataset check
violations. A later commit will fix this (and flip the negation in the
test).
@tausbn tausbn force-pushed the tausbn/python-fix-string-encoding-dataset-check-failure branch from f8d3419 to d01593e Compare October 21, 2024 12:08
Here's an example of one of these errors:
```
INVALID_KEY predicate py_cobjectnames(@py_cobject obj, string name)

The key set {obj} does not functionally determine all fields. Here is a
pair of tuples that agree on the key set but differ at index 1: Tuple 1
in row 63874: (72088,"u'<X>'") Tuple 2 in row 63875: (72088,"u'<?>'")
```
(Here, the substring `X` should really be the Unicode character U+FFFD,
but for some reason I'm not allowed to put that in this commit message.)

Inside the extractor, we assign IDs based on the string type (bytestring
or Unicode) and a hash of the UTF-8 encoded content of the string. In
this case, however, certain _different_ strings were receiving the same
hash, due to replacement characters in the encoding process.

In particular, we were converting unencodable characters to question
marks in one place, and to U+FFFD in another place. This caused a
discrepancy that lead to the dataset check error.

To fix this, we put in a custom error handler that always puts the
U+FFFD character in place of unencodable characters. With this, the
strings now agree, and hence there is no clash.
This test should now validate that we no longer have dataset check
errors even when there are unencodable characters.
@tausbn tausbn added the no-change-note-required This PR does not need a change note label Oct 21, 2024
@tausbn tausbn marked this pull request as ready for review October 21, 2024 15:53
@tausbn tausbn requested a review from a team as a code owner October 21, 2024 15:53
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tausbn
Copy link
Contributor Author

tausbn commented Oct 23, 2024

Had to rerun the experiment, as the first time round didn't actually run the dataset check. The second run shows that we're now free from the string encoding error. 💪

@tausbn tausbn merged commit e1e3568 into main Oct 23, 2024
10 checks passed
@tausbn tausbn deleted the tausbn/python-fix-string-encoding-dataset-check-failure branch October 23, 2024 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants