keysyms: Fast and complete case mappings (Unicode 15.0)
The current code that handles keysym case mappings is complex and
slow. It is also incomplete: it does not cover recent versions of the
Unicode database.

It would be easier if we were to use only a lookup table, but a trivial
implementation would lead to a huge array: the cased characters range
from `U+0041` to `U+1E943`, i.e. a span of 125 187 elements.
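
As a quick check of the figure above, the span and a rough size for a flat
table can be computed directly. The 4-byte entry width is an assumption made
for illustration only; it is not taken from this commit.

```python
# Bounds of the cased range quoted in the commit message.
FIRST_CASED = 0x0041
LAST_CASED = 0x1E943

# Number of entries a flat, directly indexed table would need.
span = LAST_CASED - FIRST_CASED + 1
print(span)  # 125187

# Assumption: one 4-byte delta per entry (hypothetical entry width).
naive_bytes = span * 4
print(naive_bytes)  # 500748
```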

Thus we need some tricks to compress the lookup table. We base our work
on the post:
https://github.com/apankrat/notes/blob/3c551cb028595fd34046c5761fd12d1692576003/fast-case-conversion/README.md

The compression algorithm is roughly:
1. Compute the delta between the characters and their mappings.
2. Split the delta array into chunks of a given size.
3. Reorder the chunks to maximize the overlap between consecutive
   chunks.
4. Create a data table with the reordered chunks and an index table that
   maps the original chunk index to its offset in the data table.
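
The four steps above can be sketched as follows. This is a minimal
illustration, not the generator used for this commit: the greedy ordering in
step 3 is a simple stand-in heuristic, the names `lower_delta`, `overlap` and
`compress` are hypothetical, and Python's built-in `str.lower()` replaces ICU
as the source of mappings.

```python
FIRST, LAST = 0x0041, 0x1E943
CHUNK = 64  # arbitrary chunk size chosen for the demonstration


def lower_delta(cp):
    """Step 1: delta between a code point and its lower-case mapping."""
    lowered = chr(cp).lower()
    # Assumption: mappings that expand to several characters are ignored.
    return ord(lowered) - cp if len(lowered) == 1 else 0


def overlap(data, chunk):
    """Length of the longest suffix of `data` that is a prefix of `chunk`."""
    for n in range(min(len(data), len(chunk)), 0, -1):
        if tuple(data[-n:]) == chunk[:n]:
            return n
    return 0


def compress(values, chunk_size):
    """Return (data, index) such that, for every i,
    values[i] == data[index[i // chunk_size] + i % chunk_size]."""
    # Step 2: split into fixed-size chunks, padding the last one with zeros.
    padded = list(values) + [0] * (-len(values) % chunk_size)
    chunks = [tuple(padded[i:i + chunk_size])
              for i in range(0, len(padded), chunk_size)]
    # Step 3 (greedy stand-in): repeatedly append the distinct chunk that
    # overlaps most with the current tail of the data table.
    remaining = list(dict.fromkeys(chunks))  # distinct chunks, stable order
    data, offsets = [], {}
    while remaining:
        best = max(remaining, key=lambda c: overlap(data, c))
        n = overlap(data, best)
        offsets[best] = len(data) - n
        data.extend(best[n:])
        remaining.remove(best)
    # Step 4: the index table maps each original chunk to its data offset.
    index = [offsets[c] for c in chunks]
    return data, index


deltas = [lower_delta(cp) for cp in range(FIRST, LAST + 1)]
data, index = compress(deltas, CHUNK)

# Every original delta must be recoverable from the two tables.
assert all(deltas[i] == data[index[i // CHUNK] + i % CHUNK]
           for i in range(len(deltas)))
print(len(deltas), len(data), len(index))
```

The printed sizes depend on the Unicode version bundled with the Python
interpreter, so no expected output is given here.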

The compression algorithm is then applied a second time to the previous
index table.

The complete algorithm optimizes the two chunk sizes in order to get the lowest
total data size.
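
A minimal sketch of the two-level scheme and of the chunk-size search,
under several assumptions: `compress` here only deduplicates chunks (no
overlap search), the total size is counted in entries rather than bytes, the
candidate sizes are arbitrary powers of two, and Python's `str.lower()` stands
in for ICU. All function names are hypothetical.

```python
from itertools import product

FIRST, LAST = 0x0041, 0x1E943


def lower_delta(cp):
    lowered = chr(cp).lower()
    # Assumption: mappings that expand to several characters are ignored.
    return ord(lowered) - cp if len(lowered) == 1 else 0


def compress(values, chunk_size):
    """Deduplicating stand-in for the chunk packer (no overlap search)."""
    padded = list(values) + [0] * (-len(values) % chunk_size)
    data, index, offsets = [], [], {}
    for i in range(0, len(padded), chunk_size):
        chunk = tuple(padded[i:i + chunk_size])
        if chunk not in offsets:
            offsets[chunk] = len(data)
            data.extend(chunk)
        index.append(offsets[chunk])
    return data, index


def compress_twice(values, size1, size2):
    """Compress the values, then compress the resulting index table."""
    data, index1 = compress(values, size1)
    offsets1, index2 = compress(index1, size2)
    return data, offsets1, index2


def lookup(tables, size1, size2, i):
    """Recover values[i] from the three tables."""
    data, offsets1, index2 = tables
    chunk = i // size1
    offset = offsets1[index2[chunk // size2] + chunk % size2]
    return data[offset + i % size1]


def best_sizes(values, candidates=(4, 8, 16, 32, 64, 128, 256)):
    """Exhaustively pick the (size1, size2) pair with the fewest entries."""
    def total(sizes):
        return sum(len(table) for table in compress_twice(values, *sizes))
    return min(product(candidates, repeat=2), key=total)


deltas = [lower_delta(cp) for cp in range(FIRST, LAST + 1)]
size1, size2 = best_sizes(deltas)
tables = compress_twice(deltas, size1, size2)

# The two-level lookup must reproduce every original delta.
assert all(lookup(tables, size1, size2, i) == deltas[i]
           for i in range(len(deltas)))
print(size1, size2, [len(table) for table in tables])
```

A lookup costs two extra table reads compared with a flat array, in exchange
for a much smaller total footprint.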

The mappings were generated using CPython 3.12, PyICU 2.12, PyYAML 6.0.1
and ICU 73.2.
wismill committed Jan 11, 2024
1 parent d2fdd68 commit 95bc4a4
Showing 6 changed files with 2,342 additions and 544 deletions.
41 changes: 40 additions & 1 deletion data/keysyms.yaml
@@ -2260,7 +2260,7 @@
 0x08f6:
   name: function
   code point: 0x0192
-  upper: 0x1000191
+  upper: 0x1000191 # U0191
 0x08fb:
   name: leftarrow
   code point: 0x2190
@@ -4866,120 +4866,159 @@
 0x10010d0:
   name: Georgian_an
   code point: 0x10D0
+  upper: 0x1001c90 # U1C90
 0x10010d1:
   name: Georgian_ban
   code point: 0x10D1
+  upper: 0x1001c91 # U1C91
 0x10010d2:
   name: Georgian_gan
   code point: 0x10D2
+  upper: 0x1001c92 # U1C92
 0x10010d3:
   name: Georgian_don
   code point: 0x10D3
+  upper: 0x1001c93 # U1C93
 0x10010d4:
   name: Georgian_en
   code point: 0x10D4
+  upper: 0x1001c94 # U1C94
 0x10010d5:
   name: Georgian_vin
   code point: 0x10D5
+  upper: 0x1001c95 # U1C95
 0x10010d6:
   name: Georgian_zen
   code point: 0x10D6
+  upper: 0x1001c96 # U1C96
 0x10010d7:
   name: Georgian_tan
   code point: 0x10D7
+  upper: 0x1001c97 # U1C97
 0x10010d8:
   name: Georgian_in
   code point: 0x10D8
+  upper: 0x1001c98 # U1C98
 0x10010d9:
   name: Georgian_kan
   code point: 0x10D9
+  upper: 0x1001c99 # U1C99
 0x10010da:
   name: Georgian_las
   code point: 0x10DA
+  upper: 0x1001c9a # U1C9A
 0x10010db:
   name: Georgian_man
   code point: 0x10DB
+  upper: 0x1001c9b # U1C9B
 0x10010dc:
   name: Georgian_nar
   code point: 0x10DC
+  upper: 0x1001c9c # U1C9C
 0x10010dd:
   name: Georgian_on
   code point: 0x10DD
+  upper: 0x1001c9d # U1C9D
 0x10010de:
   name: Georgian_par
   code point: 0x10DE
+  upper: 0x1001c9e # U1C9E
 0x10010df:
   name: Georgian_zhar
   code point: 0x10DF
+  upper: 0x1001c9f # U1C9F
 0x10010e0:
   name: Georgian_rae
   code point: 0x10E0
+  upper: 0x1001ca0 # U1CA0
 0x10010e1:
   name: Georgian_san
   code point: 0x10E1
+  upper: 0x1001ca1 # U1CA1
 0x10010e2:
   name: Georgian_tar
   code point: 0x10E2
+  upper: 0x1001ca2 # U1CA2
 0x10010e3:
   name: Georgian_un
   code point: 0x10E3
+  upper: 0x1001ca3 # U1CA3
 0x10010e4:
   name: Georgian_phar
   code point: 0x10E4
+  upper: 0x1001ca4 # U1CA4
 0x10010e5:
   name: Georgian_khar
   code point: 0x10E5
+  upper: 0x1001ca5 # U1CA5
 0x10010e6:
   name: Georgian_ghan
   code point: 0x10E6
+  upper: 0x1001ca6 # U1CA6
 0x10010e7:
   name: Georgian_qar
   code point: 0x10E7
+  upper: 0x1001ca7 # U1CA7
 0x10010e8:
   name: Georgian_shin
   code point: 0x10E8
+  upper: 0x1001ca8 # U1CA8
 0x10010e9:
   name: Georgian_chin
   code point: 0x10E9
+  upper: 0x1001ca9 # U1CA9
 0x10010ea:
   name: Georgian_can
   code point: 0x10EA
+  upper: 0x1001caa # U1CAA
 0x10010eb:
   name: Georgian_jil
   code point: 0x10EB
+  upper: 0x1001cab # U1CAB
 0x10010ec:
   name: Georgian_cil
   code point: 0x10EC
+  upper: 0x1001cac # U1CAC
 0x10010ed:
   name: Georgian_char
   code point: 0x10ED
+  upper: 0x1001cad # U1CAD
 0x10010ee:
   name: Georgian_xan
   code point: 0x10EE
+  upper: 0x1001cae # U1CAE
 0x10010ef:
   name: Georgian_jhan
   code point: 0x10EF
+  upper: 0x1001caf # U1CAF
 0x10010f0:
   name: Georgian_hae
   code point: 0x10F0
+  upper: 0x1001cb0 # U1CB0
 0x10010f1:
   name: Georgian_he
   code point: 0x10F1
+  upper: 0x1001cb1 # U1CB1
 0x10010f2:
   name: Georgian_hie
   code point: 0x10F2
+  upper: 0x1001cb2 # U1CB2
 0x10010f3:
   name: Georgian_we
   code point: 0x10F3
+  upper: 0x1001cb3 # U1CB3
 0x10010f4:
   name: Georgian_har
   code point: 0x10F4
+  upper: 0x1001cb4 # U1CB4
 0x10010f5:
   name: Georgian_hoe
   code point: 0x10F5
+  upper: 0x1001cb5 # U1CB5
 0x10010f6:
   name: Georgian_fi
   code point: 0x10F6
+  upper: 0x1001cb6 # U1CB6
 0x1001e02:
   name: Babovedot
   code point: 0x1E02
1 change: 1 addition & 0 deletions meson.build
@@ -216,6 +216,7 @@ libxkbcommon_sources = [
     'src/darray.h',
     'src/keysym.c',
     'src/keysym.h',
+    'src/keysym-case-mappings.c',
     'src/keysym-utf.c',
     'src/ks_tables.h',
     'src/keymap.c',