protoparse: ParseFiles doesn't escape non-ascii characters in string values #441
Comments
I think your example file may have been malformed when you pasted it into a GitHub comment. What I see is that � is the Unicode replacement character (U+FFFD), not Since protobuf source is supposed to be UTF-8, I'm surprised it would accept a literal Protoc definitely does not encode such values with an escape code. That is your diff library showing you the raw byte value as an escape sequence. The actual issue is that protoparse sees the invalid UTF-8 value and silently converts it to the replacement character (because that is usually how UTF-8 handling is done). Apparently, protoc does not, and it allows unescaped data that is invalid UTF-8. (TBH, that smells like a bug in protoc's handling of input -- it should probably use the Unicode replacement character or outright reject the input file due to bad encoding.) In any event, I think I know a reasonably simple fix to make protoparse match protoc. But I'll actually file a bug against protoc first, to see what they think the correct behavior should be.
You're right about it being the diff doing the escaping. That explains some inconsistent behavior I saw. Given that, it really does seem like it's protoc doing the wrong thing here and protoreflect shouldn't try to emulate it.
I've filed a bug with the protobuf project: protocolbuffers/protobuf#9175 Depending on the response to that, I could update
I think all of the runtimes reject string data if it is not UTF-8, so definitely strange that
Comparing ParseFiles and protoc for the file below shows this difference:
protoc seems to replace characters above 0x7f in option values with their \x escape code, but ParseFiles does not. The � above is 0xbc.