xkbcommon chokes on UTF-8 BOM #393

mcclure · 2023-10-29T01:59:50Z

I am on Ubuntu 23.10. I am developing with Rust and Cargo. I am not using xkbcommon directly, so I am not sure how to tell the exact version of libxkbcommon I am using, but if you tell me I can look it up.

I compile and run (with cargo run) any Rust program using winit. For example https://github.com/mcclure/webgpu-tutorial-rs or example "Editor" from https://github.com/iced-rs/iced.

For any such app, at boot it prints the line:

xkbcommon: ERROR: /home/mcc/.XCompose:1:1: unrecognized token

Additionally, I find XCompose sequences do not work.

The error is perplexing to me because my .XCompose begins with <Multi_key> <minus> <minus>: "—". I try changing the first line to include "%L" on a recommendation; this does nothing for this particular error..

Then it hits me. I run:

mcc@Anthy:~/work/r/other/iced/examples/editor$ file ~/.XCompose 
/home/mcc/.XCompose: C source, Unicode text, UTF-8 (with BOM) text

Oh no.
I use Sublime Text to remove the BOM from ~/.XCompose.
Immediately, the problem goes away and XCompose sequences work.

The BOM, as you may already know, is an element of the Unicode Standard for identifying which Unicode representation a particular file is using (UTF-8, UTF-16 LE, or UTF-16 BE). This BOM is conventional in UTF-16, but relatively rare in UTF-8. However it is legal in UTF-8 (Wikipedia, with cites: "The Unicode Standard permits the BOM in UTF-8[4] but does not require or recommend its use.[5]"), and some editors will add BOMs to UTF-8 files as a matter of course.

Because a .XCompose file will often contain UTF-8, a UTF-8 BOM is a reasonable thing to include. In my testing, a UTF-8 BOM is supported and correctly parsed (which is to say: it is ignored) in .XCompose by many other applications, including Firefox and Chrome.

My "Expected Behavior" is that xkbcommon, on encountering a UTF-8 BOM file, should silently ignore the BOM and parse the file as UTF-8.

For extra credit, if xkbcommon encounters a UTF-16 BOM, it should print a useful error message; currently it prints a bunch of "unrecognized token"s, but a better error message would be "ERROR: File is UTF-16".

The text was updated successfully, but these errors were encountered:

wismill · 2023-10-29T07:02:07Z

Thnak you for your bug report.

Some context first: the encoding of the files processed by xkbcommon is unspecified: it processes streams of bytes. It expects ASCII encoding for bytes < 0x80. So UTF-8 encoding works, but UTF-16 and UTF-32 will fail. So I do not think we should have a specific UTF-16 detection using the BOM, as it may be missing.

@whot @bluetech maybe we should add a specific message for error on the first token of a file, suggesting to check the encoding?

So that leaves UTF-8 with heading BOM. Yes it’s legal, but I thought it was a deprecated practice. Anyway, the fix is quite simple: see #394.

mcclure · 2023-10-29T16:06:03Z

Thank you!

For the record, as far as "deprecated" goes, I ran down the Wikipedia cite, I find the UTF-8 BOM documented at the top of page 941 of the current standard, they say the sequence "can" serve as a UTF-8 signature. There is no mention of deprecation.

"@whot @bluetech maybe we should add a specific message for error on the first token of a file, suggesting to check the encoding?"

Some feedback as a user: I think the single piece of information that would have been most useful to me in the message I got was an indication that this first character was not a regular printable character. When I got a "first character bad" I opened up the file in vi and saw nothing wrong, so I assumed there was some kind of misunderstanding about what format to use. It didn't occur to me to check there was a binary invisible messing me up, but if the error message had gently encouraged me in that direction, I would have figured it out faster.

wismill · 2023-10-29T16:36:15Z

@mcclure thanks for the feedback. I agree logging that the token is less than ideal without printing it. But then printing it correctly depends of the encoding of your terminal. And detecting character property is going too far.

I would have also spent some time before figuring out this bug. I think the most useful error message is suggesting encoding error with explanation of what is expected.

No deprecation for BOM in UTF-8 then. But I still think using BOM in UTF-8 files is very Window-ish, that's why we did not encounter this bug before. Anyway, let's be more portable and standard! Thanks for reporting.

mcclure · 2023-10-29T16:39:44Z

But I still think using BOM in utf8 files is very Window-ish, that's why we did not encounter this bug before.

Yes, not that it matters, but the way I made this mistake in the first place was I copied my custom XCompose directly from Windows, where I use an XCompose emulator extension called WinCompose.

wismill · 2023-10-29T17:13:46Z

I think this is a very valid use case!

wismill mentioned this issue Oct 29, 2023

Skip heading UTF-8 encoded BOM #394

Merged

wismill added bug Indicates an unexpected problem or unintended behavior compose Indicates a need for improvements or additions to Compose handling labels Oct 29, 2023

wismill closed this as completed in #394 Oct 30, 2023

wismill mentioned this issue Oct 30, 2023

Detect invalid file encoding early #397

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xkbcommon chokes on UTF-8 BOM #393

xkbcommon chokes on UTF-8 BOM #393

mcclure commented Oct 29, 2023

wismill commented Oct 29, 2023

mcclure commented Oct 29, 2023

wismill commented Oct 29, 2023 •

edited

Loading

mcclure commented Oct 29, 2023

wismill commented Oct 29, 2023

xkbcommon chokes on UTF-8 BOM #393

xkbcommon chokes on UTF-8 BOM #393

Comments

mcclure commented Oct 29, 2023

wismill commented Oct 29, 2023

mcclure commented Oct 29, 2023

wismill commented Oct 29, 2023 • edited Loading

mcclure commented Oct 29, 2023

wismill commented Oct 29, 2023

wismill commented Oct 29, 2023 •

edited

Loading