-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xkbcommon chokes on UTF-8 BOM #393
Comments
Thnak you for your bug report. Some context first: the encoding of the files processed by xkbcommon is unspecified: it processes streams of bytes. It expects ASCII encoding for bytes < @whot @bluetech maybe we should add a specific message for error on the first token of a file, suggesting to check the encoding? So that leaves UTF-8 with heading BOM. Yes it’s legal, but I thought it was a deprecated practice. Anyway, the fix is quite simple: see #394. |
Thank you! For the record, as far as "deprecated" goes, I ran down the Wikipedia cite, I find the UTF-8 BOM documented at the top of page 941 of the current standard, they say the sequence "can" serve as a UTF-8 signature. There is no mention of deprecation. "@whot @bluetech maybe we should add a specific message for error on the first token of a file, suggesting to check the encoding?" Some feedback as a user: I think the single piece of information that would have been most useful to me in the message I got was an indication that this first character was not a regular printable character. When I got a "first character bad" I opened up the file in vi and saw nothing wrong, so I assumed there was some kind of misunderstanding about what format to use. It didn't occur to me to check there was a binary invisible messing me up, but if the error message had gently encouraged me in that direction, I would have figured it out faster. |
@mcclure thanks for the feedback. I agree logging that the token is less than ideal without printing it. But then printing it correctly depends of the encoding of your terminal. And detecting character property is going too far. I would have also spent some time before figuring out this bug. I think the most useful error message is suggesting encoding error with explanation of what is expected. No deprecation for BOM in UTF-8 then. But I still think using BOM in UTF-8 files is very Window-ish, that's why we did not encounter this bug before. Anyway, let's be more portable and standard! Thanks for reporting. |
Yes, not that it matters, but the way I made this mistake in the first place was I copied my custom XCompose directly from Windows, where I use an XCompose emulator extension called WinCompose. |
I think this is a very valid use case! |
I am on Ubuntu 23.10. I am developing with Rust and Cargo. I am not using xkbcommon directly, so I am not sure how to tell the exact version of libxkbcommon I am using, but if you tell me I can look it up.
I compile and run (with
cargo run
) any Rust program using winit. For example https://github.com/mcclure/webgpu-tutorial-rs or example "Editor" from https://github.com/iced-rs/iced.For any such app, at boot it prints the line:
Additionally, I find XCompose sequences do not work.
The error is perplexing to me because my .XCompose begins with
<Multi_key> <minus> <minus>: "—"
. I try changing the first line toinclude "%L"
on a recommendation; this does nothing for this particular error..Then it hits me. I run:
Oh no.
I use Sublime Text to remove the BOM from ~/.XCompose.
Immediately, the problem goes away and XCompose sequences work.
The BOM, as you may already know, is an element of the Unicode Standard for identifying which Unicode representation a particular file is using (UTF-8, UTF-16 LE, or UTF-16 BE). This BOM is conventional in UTF-16, but relatively rare in UTF-8. However it is legal in UTF-8 (Wikipedia, with cites: "The Unicode Standard permits the BOM in UTF-8[4] but does not require or recommend its use.[5]"), and some editors will add BOMs to UTF-8 files as a matter of course.
Because a .XCompose file will often contain UTF-8, a UTF-8 BOM is a reasonable thing to include. In my testing, a UTF-8 BOM is supported and correctly parsed (which is to say: it is ignored) in .XCompose by many other applications, including Firefox and Chrome.
My "Expected Behavior" is that xkbcommon, on encountering a UTF-8 BOM file, should silently ignore the BOM and parse the file as UTF-8.
For extra credit, if xkbcommon encounters a UTF-16 BOM, it should print a useful error message; currently it prints a bunch of "unrecognized token"s, but a better error message would be "ERROR: File is UTF-16".
The text was updated successfully, but these errors were encountered: