keysyms: Fast and complete case mappings (Unicode 15.1)

The current code to handle keysym case mappings is quite complex and slow. It is also incomplete, as it does not cover recent versions of the Unicode database. Finally, it does not handle title case correctly.

It would be easier to use only a lookup table, but a trivial implementation would lead to a huge array: the cased characters range from `U+0041` to `U+1F189`, i.e. a span of 127 304 elements. Thus we need some tricks to compress the lookup table. We based our work on the post:

https://github.com/apankrat/notes/blob/3c551cb028595fd34046c5761fd12d1692576003/fast-case-conversion/README.md

The compression algorithm is roughly:

1. Compute the delta between the characters and their mappings.
2. Split the delta array into chunks of a given size.
3. Rearrange the order of the chunks to maximize the overlap between consecutive chunks.
4. Create a data table with the reordered chunks and an index table that maps the original chunk index to its offset in the data table.

The compression algorithm is then applied a second time, to the index table produced by the first pass. The complete algorithm optimizes the two chunk sizes in order to get the lowest total data size.

The mappings were generated using CPython 3.12.4, PyICU 2.13, PyYaml 6.0.1 and ICU 75.1.

Also:

- Added explicit list of named keysyms and their case mappings.
- Added benchmark for case mappings.
- Reworked ICU tests.

Note: 13b30f4 introduced a fix for sharp S `U+00DF`. With the new implementation, the *conversion* functions `xkb_keysym_to_{lower,upper}` leave it *unchanged*, while the *predicate* functions `xkb_keysym_is_{lower,upper_or_title}` produce the expected results:

```c
xkb_keysym_to_upper(XKB_KEY_ssharp) == XKB_KEY_ssharp;
xkb_keysym_to_lower(XKB_KEY_ssharp) == XKB_KEY_ssharp;
xkb_keysym_to_lower(XKB_KEY_Ssharp) == XKB_KEY_ssharp;
xkb_keysym_is_lower(XKB_KEY_ssharp) == true;
xkb_keysym_is_upper_or_title(XKB_KEY_Ssharp) == true;
```
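
To make the table layout described in this message more concrete, here is a minimal sketch of a single compression pass. It is not the generated code from this commit: it covers only `U+0000`..`U+007F`, uses a hypothetical chunk size of 16, keeps the chunks in their original order (a greedy stand-in for the reordering of step 3), and skips the second pass over the index table. All names (`build_tables`, `to_lower_sketch`, `chunk_offset`, ...) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy parameters: assumptions for this sketch, not the upstream values. */
#define RANGE      128u  /* cover U+0000..U+007F only */
#define CHUNK_SIZE 16u
#define N_CHUNKS   (RANGE / CHUNK_SIZE)

static int8_t  data[RANGE];            /* packed, overlapping chunks of deltas */
static size_t  data_len;
static uint8_t chunk_offset[N_CHUNKS]; /* chunk index -> offset in data[] */

/* Step 1: delta between a character and its lower-case mapping. */
static int8_t lower_delta(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') ? (int8_t)('a' - 'A') : 0;
}

/* Steps 2 and 4, with chunks packed greedily in their original order. */
static void build_tables(void)
{
    data_len = 0;
    for (size_t c = 0; c < N_CHUNKS; c++) {
        int8_t chunk[CHUNK_SIZE];
        for (size_t i = 0; i < CHUNK_SIZE; i++)
            chunk[i] = lower_delta((uint32_t)(c * CHUNK_SIZE + i));

        /* Reuse the chunk if it already appears somewhere in data[]. */
        size_t off = data_len; /* sentinel: "not found" */
        for (size_t p = 0; p + CHUNK_SIZE <= data_len; p++) {
            if (memcmp(&data[p], chunk, CHUNK_SIZE) == 0) {
                off = p;
                break;
            }
        }
        if (off == data_len) {
            /* Otherwise overlap the head of the chunk with the tail of data[]. */
            size_t k = data_len < CHUNK_SIZE ? data_len : CHUNK_SIZE;
            while (k > 0 && memcmp(&data[data_len - k], chunk, k) != 0)
                k--;
            off = data_len - k;
            memcpy(&data[data_len], &chunk[k], CHUNK_SIZE - k);
            data_len += CHUNK_SIZE - k;
        }
        assert(off <= UINT8_MAX);
        chunk_offset[c] = (uint8_t)off;
    }
}

/* Lookup: one index access, one data access, one addition. */
static uint32_t to_lower_sketch(uint32_t cp)
{
    if (cp >= RANGE)
        return cp; /* outside the toy range: returned unchanged */
    size_t off = chunk_offset[cp / CHUNK_SIZE] + cp % CHUNK_SIZE;
    return (uint32_t)((int32_t)cp + data[off]);
}
```

After calling `build_tables()` once, `to_lower_sketch('A')` yields `'a'`, while non-cased characters such as `'['` come back unchanged. Most chunks in this range are all zeros and share one slot in `data[]`, so the packed table is smaller than the 128-entry delta array; the same effect, at a much larger scale, is what makes the full-range table practical.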
xkbcommon · Jul 27, 2024 · e83d08d
1 parent b8b872d
Showing 12 changed files with 2,948 additions and 597 deletions.
diff --git a/bench/keysym-case-mappings.c b/bench/keysym-case-mappings.c
@@ -53,7 +53,7 @@ struct TestedFunction {
 static const struct TestedFunction functions[] = {
     { {.toLower = xkb_keysym_to_lower, .toUpper = xkb_keysym_to_upper},
       "to_lower & to_upper" },
-    { {.isLower = xkb_keysym_is_lower, .isUpper = xkb_keysym_is_upper},
+    { {.isLower = xkb_keysym_is_lower, .isUpper = xkb_keysym_is_upper_or_title},
       "is_lower & is_upper" },
 };
 

diff --git a/changes/api/+unicode-15.breaking.md b/changes/api/+unicode-15.breaking.md
@@ -0,0 +1,20 @@
+Updated keysyms case mappings to cover full **[Unicode 15.1]**. This change
+provides a *consistent behavior* with respect to case mappings, and affects
+the following:
+
+- `xkb_keysym_to_lower()` and `xkb_keysym_to_upper()` give different output
+  for keysyms not covered previously and handle *title*-cased keysyms.
+
+  Example of title-cased keysym: `0x10001f2` (`U+01F2` “ǲ”):
+  - `xkb_keysym_to_lower(0x10001f2) == 0x10001f3` (`U+01F3` “ǳ”)
+  - `xkb_keysym_to_upper(0x10001f2) == 0x10001f1` (`U+01F1` “Ǳ”)
+- *Implicit* alphabetic key types are better detected, because they use the
+  latest Unicode case mappings and now handle the *title*-cased keysyms the
+  same way as upper-case ones.
+
+Note: As before, only *simple* case mappings (i.e. one-to-one) are supported.
+For example, the full upper case of `U+01F0` “ǰ” is “J̌” (2 characters: `U+004A`
+and `U+030C`), which would require 2 keysyms, which is not supported by the
+current API.
+
+[Unicode 15.1]: https://www.unicode.org/versions/Unicode15.1.0/
diff --git a/data/keysyms.yaml b/data/keysyms.yaml
@@ -2260,7 +2260,7 @@
 0x08f6:
   name: function
   code point: 0x0192
-  upper: 0x1000191
+  upper: 0x1000191 # U0191
 0x08fb:
   name: leftarrow
   code point: 0x2190
@@ -4874,120 +4874,159 @@
 0x10010d0:
   name: Georgian_an
   code point: 0x10D0
+  upper: 0x1001c90 # U1C90
 0x10010d1:
   name: Georgian_ban
   code point: 0x10D1
+  upper: 0x1001c91 # U1C91
 0x10010d2:
   name: Georgian_gan
   code point: 0x10D2
+  upper: 0x1001c92 # U1C92
 0x10010d3:
   name: Georgian_don
   code point: 0x10D3
+  upper: 0x1001c93 # U1C93
 0x10010d4:
   name: Georgian_en
   code point: 0x10D4
+  upper: 0x1001c94 # U1C94
 0x10010d5:
   name: Georgian_vin
   code point: 0x10D5
+  upper: 0x1001c95 # U1C95
 0x10010d6:
   name: Georgian_zen
   code point: 0x10D6
+  upper: 0x1001c96 # U1C96
 0x10010d7:
   name: Georgian_tan
   code point: 0x10D7
+  upper: 0x1001c97 # U1C97
 0x10010d8:
   name: Georgian_in
   code point: 0x10D8
+  upper: 0x1001c98 # U1C98
 0x10010d9:
   name: Georgian_kan
   code point: 0x10D9
+  upper: 0x1001c99 # U1C99
 0x10010da:
   name: Georgian_las
   code point: 0x10DA
+  upper: 0x1001c9a # U1C9A
 0x10010db:
   name: Georgian_man
   code point: 0x10DB
+  upper: 0x1001c9b # U1C9B
 0x10010dc:
   name: Georgian_nar
   code point: 0x10DC
+  upper: 0x1001c9c # U1C9C
 0x10010dd:
   name: Georgian_on
   code point: 0x10DD
+  upper: 0x1001c9d # U1C9D
 0x10010de:
   name: Georgian_par
   code point: 0x10DE
+  upper: 0x1001c9e # U1C9E
 0x10010df:
   name: Georgian_zhar
   code point: 0x10DF
+  upper: 0x1001c9f # U1C9F
 0x10010e0:
   name: Georgian_rae
   code point: 0x10E0
+  upper: 0x1001ca0 # U1CA0
 0x10010e1:
   name: Georgian_san
   code point: 0x10E1
+  upper: 0x1001ca1 # U1CA1
 0x10010e2:
   name: Georgian_tar
   code point: 0x10E2
+  upper: 0x1001ca2 # U1CA2
 0x10010e3:
   name: Georgian_un
   code point: 0x10E3
+  upper: 0x1001ca3 # U1CA3
 0x10010e4:
   name: Georgian_phar
   code point: 0x10E4
+  upper: 0x1001ca4 # U1CA4
 0x10010e5:
   name: Georgian_khar
   code point: 0x10E5
+  upper: 0x1001ca5 # U1CA5
 0x10010e6:
   name: Georgian_ghan
   code point: 0x10E6
+  upper: 0x1001ca6 # U1CA6
 0x10010e7:
   name: Georgian_qar
   code point: 0x10E7
+  upper: 0x1001ca7 # U1CA7
 0x10010e8:
   name: Georgian_shin
   code point: 0x10E8
+  upper: 0x1001ca8 # U1CA8
 0x10010e9:
   name: Georgian_chin
   code point: 0x10E9
+  upper: 0x1001ca9 # U1CA9
 0x10010ea:
   name: Georgian_can
   code point: 0x10EA
+  upper: 0x1001caa # U1CAA
 0x10010eb:
   name: Georgian_jil
   code point: 0x10EB
+  upper: 0x1001cab # U1CAB
 0x10010ec:
   name: Georgian_cil
   code point: 0x10EC
+  upper: 0x1001cac # U1CAC
 0x10010ed:
   name: Georgian_char
   code point: 0x10ED
+  upper: 0x1001cad # U1CAD
 0x10010ee:
   name: Georgian_xan
   code point: 0x10EE
+  upper: 0x1001cae # U1CAE
 0x10010ef:
   name: Georgian_jhan
   code point: 0x10EF
+  upper: 0x1001caf # U1CAF
 0x10010f0:
   name: Georgian_hae
   code point: 0x10F0
+  upper: 0x1001cb0 # U1CB0
 0x10010f1:
   name: Georgian_he
   code point: 0x10F1
+  upper: 0x1001cb1 # U1CB1
 0x10010f2:
   name: Georgian_hie
   code point: 0x10F2
+  upper: 0x1001cb2 # U1CB2
 0x10010f3:
   name: Georgian_we
   code point: 0x10F3
+  upper: 0x1001cb3 # U1CB3
 0x10010f4:
   name: Georgian_har
   code point: 0x10F4
+  upper: 0x1001cb4 # U1CB4
 0x10010f5:
   name: Georgian_hoe
   code point: 0x10F5
+  upper: 0x1001cb5 # U1CB5
 0x10010f6:
   name: Georgian_fi
   code point: 0x10F6
+  upper: 0x1001cb6 # U1CB6
 0x1001e02:
   name: Babovedot
   code point: 0x1E02

diff --git a/include/xkbcommon/xkbcommon.h b/include/xkbcommon/xkbcommon.h
@@ -551,17 +551,29 @@ xkb_utf32_to_keysym(uint32_t ucs);
  *
  * If there is no such form, the keysym is returned unchanged.
  *
- * The conversion rules may be incomplete; prefer to work with the Unicode
- * representation instead, when possible.
+ * The conversion rules are the *simple* (i.e. one-to-one) Unicode case
+ * mappings and do not depend on the locale. If you need the special
+ * case mappings (i.e. not one-to-one or locale-dependent), prefer to
+ * work with the Unicode representation instead, when possible.
+ *
+ * @since 0.8.0: Initial implementation, based on `libX11`.
+ * @since 1.8.0: Use Unicode 15.1 mappings for complete Unicode coverage.
  */
 xkb_keysym_t
 xkb_keysym_to_upper(xkb_keysym_t ks);
 
 /**
  * Convert a keysym to its lowercase form.
  *
- * The conversion rules may be incomplete; prefer to work with the Unicode
- * representation instead, when possible.
+ * If there is no such form, the keysym is returned unchanged.
+ *
+ * The conversion rules are the *simple* (i.e. one-to-one) Unicode case
+ * mappings and do not depend on the locale. If you need the special
+ * case mappings (i.e. not one-to-one or locale-dependent), prefer to
+ * work with the Unicode representation instead, when possible.
+ *
+ * @since 0.8.0: Initial implementation, based on `libX11`.
+ * @since 1.8.0: Use Unicode 15.1 mappings for complete Unicode coverage.
  */
 xkb_keysym_t
 xkb_keysym_to_lower(xkb_keysym_t ks);

diff --git a/meson.build b/meson.build
@@ -223,6 +223,7 @@ libxkbcommon_sources = [
     'src/darray.h',
     'src/keysym.c',
     'src/keysym.h',
+    'src/keysym-case-mappings.c',
     'src/keysym-utf.c',
     'src/ks_tables.h',
     'src/keymap.c',