Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xkbcommon chokes on UTF-8 BOM #393

Closed
mcclure opened this issue Oct 29, 2023 · 5 comments · Fixed by #394
Closed

xkbcommon chokes on UTF-8 BOM #393

mcclure opened this issue Oct 29, 2023 · 5 comments · Fixed by #394
Labels
bug Indicates an unexpected problem or unintended behavior compose Indicates a need for improvements or additions to Compose handling

Comments

@mcclure
Copy link

mcclure commented Oct 29, 2023

I am on Ubuntu 23.10. I am developing with Rust and Cargo. I am not using xkbcommon directly, so I am not sure how to tell the exact version of libxkbcommon I am using, but if you tell me I can look it up.

I compile and run (with cargo run) any Rust program using winit. For example https://github.com/mcclure/webgpu-tutorial-rs or example "Editor" from https://github.com/iced-rs/iced.

For any such app, at boot it prints the line:

xkbcommon: ERROR: /home/mcc/.XCompose:1:1: unrecognized token

Additionally, I find XCompose sequences do not work.

The error is perplexing to me because my .XCompose begins with <Multi_key> <minus> <minus>: "—". I try changing the first line to include "%L" on a recommendation; this does nothing for this particular error..

Then it hits me. I run:

mcc@Anthy:~/work/r/other/iced/examples/editor$ file ~/.XCompose 
/home/mcc/.XCompose: C source, Unicode text, UTF-8 (with BOM) text

Oh no.
I use Sublime Text to remove the BOM from ~/.XCompose.
Immediately, the problem goes away and XCompose sequences work.

The BOM, as you may already know, is an element of the Unicode Standard for identifying which Unicode representation a particular file is using (UTF-8, UTF-16 LE, or UTF-16 BE). This BOM is conventional in UTF-16, but relatively rare in UTF-8. However it is legal in UTF-8 (Wikipedia, with cites: "The Unicode Standard permits the BOM in UTF-8[4] but does not require or recommend its use.[5]"), and some editors will add BOMs to UTF-8 files as a matter of course.

Because a .XCompose file will often contain UTF-8, a UTF-8 BOM is a reasonable thing to include. In my testing, a UTF-8 BOM is supported and correctly parsed (which is to say: it is ignored) in .XCompose by many other applications, including Firefox and Chrome.

My "Expected Behavior" is that xkbcommon, on encountering a UTF-8 BOM file, should silently ignore the BOM and parse the file as UTF-8.

For extra credit, if xkbcommon encounters a UTF-16 BOM, it should print a useful error message; currently it prints a bunch of "unrecognized token"s, but a better error message would be "ERROR: File is UTF-16".

@wismill
Copy link
Member

wismill commented Oct 29, 2023

Thnak you for your bug report.

Some context first: the encoding of the files processed by xkbcommon is unspecified: it processes streams of bytes. It expects ASCII encoding for bytes < 0x80. So UTF-8 encoding works, but UTF-16 and UTF-32 will fail. So I do not think we should have a specific UTF-16 detection using the BOM, as it may be missing.

@whot @bluetech maybe we should add a specific message for error on the first token of a file, suggesting to check the encoding?

So that leaves UTF-8 with heading BOM. Yes it’s legal, but I thought it was a deprecated practice. Anyway, the fix is quite simple: see #394.

@wismill wismill added bug Indicates an unexpected problem or unintended behavior compose Indicates a need for improvements or additions to Compose handling labels Oct 29, 2023
@mcclure
Copy link
Author

mcclure commented Oct 29, 2023

Thank you!

For the record, as far as "deprecated" goes, I ran down the Wikipedia cite, I find the UTF-8 BOM documented at the top of page 941 of the current standard, they say the sequence "can" serve as a UTF-8 signature. There is no mention of deprecation.

"@whot @bluetech maybe we should add a specific message for error on the first token of a file, suggesting to check the encoding?"

Some feedback as a user: I think the single piece of information that would have been most useful to me in the message I got was an indication that this first character was not a regular printable character. When I got a "first character bad" I opened up the file in vi and saw nothing wrong, so I assumed there was some kind of misunderstanding about what format to use. It didn't occur to me to check there was a binary invisible messing me up, but if the error message had gently encouraged me in that direction, I would have figured it out faster.

@wismill
Copy link
Member

wismill commented Oct 29, 2023

@mcclure thanks for the feedback. I agree logging that the token is less than ideal without printing it. But then printing it correctly depends of the encoding of your terminal. And detecting character property is going too far.

I would have also spent some time before figuring out this bug. I think the most useful error message is suggesting encoding error with explanation of what is expected.

No deprecation for BOM in UTF-8 then. But I still think using BOM in UTF-8 files is very Window-ish, that's why we did not encounter this bug before. Anyway, let's be more portable and standard! Thanks for reporting.

@mcclure
Copy link
Author

mcclure commented Oct 29, 2023

But I still think using BOM in utf8 files is very Window-ish, that's why we did not encounter this bug before.

Yes, not that it matters, but the way I made this mistake in the first place was I copied my custom XCompose directly from Windows, where I use an XCompose emulator extension called WinCompose.

@wismill
Copy link
Member

wismill commented Oct 29, 2023

I think this is a very valid use case!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior compose Indicates a need for improvements or additions to Compose handling
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants