How to create Engram for Russian? #49

Open
binarybottle opened this issue Nov 26, 2022 · 17 comments

@binarybottle
Owner

I received an email last week:

[Russian has] 33 letters. So I think the only option is to use Home and Page Up for ъ and ь (these keys are used for typing in the Japanese layout for the Kinesis keyboard). I made a preliminary version of the layout (for an ergonomic ortholinear keyboard like the new Kinesis Advantage360):
[ № = − + / ’ ° § % * ]
Ф 1 2 3 4 5 6 7 8 9 0 Э

    БУКЛ  –(  —) ДГЗХЦ
    ЫВАЕ  ,;    .:   НОСТЁ
    ЙЧИЯ  -_  ?!   РМПЖ
    Ш                         ЩЮ
              Ъ@  Ь#
               «„     »“

I made some changes in the typographical symbols scheme according to Russian typographic tradition. Letters are arranged by frequency, including bigrams and trigrams. But I am still not sure whether it is good or not.

@binarybottle
Owner Author

@iandoug -- The person who wrote to me is pointing me to the Leipzig corpus for Russian (as mentioned here: http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/russian-letter-frequencies/). Do you know if it's reliable enough to optimize a Russian key layout?

@iandoug

iandoug commented Nov 26, 2022

Hi Arno

I usually differentiate between lowercase and uppercase (because it affects Shift key usage). The crypto sites don't.

On the other hand, you also work unicase, so if they've run the numbers then it should be okay. :-)
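To make the case-sensitive versus unicase distinction concrete, here is a minimal Python sketch. The function name `letter_counts` and the sample sentence are illustrative assumptions, not anything taken from the corpus tools discussed in this thread.

```python
from collections import Counter

def letter_counts(text: str, unicase: bool = False) -> Counter:
    """Count alphabetic characters, optionally folding case first."""
    if unicase:
        text = text.lower()
    return Counter(ch for ch in text if ch.isalpha())

sample = "Москва и Минск"  # illustrative sample, not corpus data

cased = letter_counts(sample)                 # 'М' and 'м' counted separately
folded = letter_counts(sample, unicase=True)  # everything folded to lowercase

# Number of uppercase letters: a rough proxy for Shift usage on letters.
shifted = sum(n for ch, n in cased.items() if ch.isupper())
```

With case-sensitive counts the uppercase total gives a rough idea of Shift usage; with unicase counts that information is folded into the lowercase letters.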

For me to do it would require trying to first clean up all the non-Russian and other crud in their files. I used to try and do it the hard way by fixing each errant character, but that is tedious in the extreme. So lately I adopted a more brute-force method: have a list of valid characters, and any line containing a character NOT in that set is simply dropped ... the whole line. This may slightly skew the distribution of the valid characters but is vastly more practical.
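The whole-line filter described here could be sketched roughly as follows in Python. This is not the actual script used; the names `VALID` and `keep_clean_lines` and the small punctuation set are assumptions chosen for illustration.

```python
ALPHABET = "АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя"

# Illustrative whitelist (assumption): letters, digits, space, a little punctuation.
VALID = set(ALPHABET + "0123456789" + " .,!?-")

def keep_clean_lines(lines, valid=VALID):
    """Yield only lines whose every character (ignoring the newline) is valid."""
    for line in lines:
        body = line.rstrip("\r\n")
        if all(ch in valid for ch in body):
            yield line

raw = ["Привет, мир!\n", "Hello мир\n"]
clean = list(keep_clean_lines(raw))  # the second line contains Latin letters
```

Because `keep_clean_lines` is a generator, it can be fed an open file object and will process a large corpus one line at a time.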

The layout above seems to be missing some chars from US ANSI that are needed in programming. The same goes for the WP image.

Wikipedia shows the big-ass Enter variant; I would have thought they would at least use ISO for the extra key.
https://en.wikipedia.org/wiki/Keyboard_layout#/media/File:KB_Russian.svg

WP says "Keyboards in Russia always have Cyrillic letters on the keytops as well as Latin letters. Usually Cyrillic and Latin letters are labeled with different colors. "

I wonder if you can access the English chars while using the Russian layout?

Let me take a look at how big the Russian corpus is ...

Cheers, Ian

@iandoug

iandoug commented Nov 26, 2022

I see WP also has a letter frequency chart (towards bottom)

https://en.wikipedia.org/wiki/Russian_alphabet

Can we get a canonical list of which chars must be on the keyboard?

I would probably want to put the brackets, #, | etc back :-)

I've grabbed the Russian files from Leipzig ... they have other-Soviet-non-Russian sources which I am ignoring for now.

Cheers, Ian

@iandoug

iandoug commented Nov 26, 2022

Leipzig has some stats, but I can't see letter frequencies:

https://cls.corpora.uni-leipzig.de/en/rus_mixed_2013

but here is punctuation:
https://cls.corpora.uni-leipzig.de/en/rus_mixed_2013/2.1.4_Special%20Characters.html

Contains characters that I would typically ignore.

@binarybottle
Owner Author

Thank you for taking a look, @iandoug!

I have no idea what the complete list of characters should contain...

@iandoug

iandoug commented Nov 26, 2022

Russian alphabet (WP order)
АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя

Other characters on current Russian (Windows) keyboard:
1234567890!"№;%:?*₽()-_=+/.,

Others on US ANSI:
@#$^&[]{}|`~'<>
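As a quick sanity check, the character lists above can be compared against the preliminary layout from the first comment. The Python sketch below is only an illustration: the layout is transcribed by hand, and it assumes the square brackets around the top row are notation rather than keys.

```python
ALPHABET_UPPER = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
US_ANSI_EXTRAS = "@#$^&[]{}|`~'<>"

# Hand transcription of the proposed layout; the brackets enclosing the
# top row are treated as notation and left out (assumption).
LAYOUT = """
№ = − + / ’ ° § % *
Ф 1 2 3 4 5 6 7 8 9 0 Э
БУКЛ –( —) ДГЗХЦ
ЫВАЕ ,; .: НОСТЁ
ЙЧИЯ -_ ?! РМПЖ
Ш ЩЮ
Ъ@ Ь#
«„ »“
"""

present = {ch for ch in LAYOUT if not ch.isspace()}

# Letters of the alphabet that the layout does not place anywhere.
missing_letters = [ch for ch in ALPHABET_UPPER if ch not in present]

# US ANSI extras that the layout does not place anywhere.
missing_ansi = [ch for ch in US_ANSI_EXTRAS if ch not in present]

print("missing letters:", "".join(missing_letters) or "none")
print("missing US ANSI:", "".join(missing_ansi) or "none")
```

Note that the typographic apostrophe ’ in the layout is a different character from the ASCII apostrophe in the US ANSI list, so the check reports the latter as missing.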

@iandoug

iandoug commented Nov 26, 2022

russian-dirty.txt : 1,697,253,979 bytes, 963,432,405 characters according to wc
russian-clean1.txt: 1,219,345,880 bytes, 670,203,855 characters according to wc
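The gap between the byte and character counts is expected if the files are UTF-8 (an assumption here), since Cyrillic letters take two bytes each while ASCII digits, spaces and punctuation take one. A small Python illustration, with the ratios computed from the figures above:

```python
text = "Привет, мир!\n"  # illustrative sample: 9 Cyrillic letters, 4 ASCII characters

n_chars = len(text)                  # code points
n_bytes = len(text.encode("utf-8"))  # bytes when encoded as UTF-8

# Bytes per character for the two files, from the counts quoted above.
ratio_dirty = 1_697_253_979 / 963_432_405
ratio_clean = 1_219_345_880 / 670_203_855

print(n_chars, n_bytes)
print(f"{ratio_dirty:.2f} {ratio_clean:.2f}")
```

The cleaned file comes out at roughly 1.82 bytes per character versus roughly 1.76 for the original, which is consistent with the cleaned text containing a higher share of Cyrillic letters.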

I used the following chars:
АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя1234567890!№;%:?*₽()-_=+/.,«»" plus space, tab and enter
I included «» because they are quite common and appear in the layout above.
I fixed assorted dashes and ellipses.
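How the dashes and ellipses were fixed is not stated. One plausible approach, sketched below in Python, is to map the typographic variants onto characters that are already in the valid set before filtering. The specific mappings and the names `PUNCT_MAP` and `normalize_punctuation` are assumptions.

```python
# Hypothetical normalization table: typographic variants -> characters
# that are already part of the valid set.
PUNCT_MAP = str.maketrans({
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2212": "-",    # minus sign
    "\u2026": "...",  # horizontal ellipsis
})

def normalize_punctuation(line: str) -> str:
    """Replace typographic dashes and ellipses with plain equivalents."""
    return line.translate(PUNCT_MAP)

fixed = normalize_punctuation("Да\u2026 нет \u2014 да")
```

Normalizing first means fewer lines are dropped by the whole-line filter, at the cost of merging characters that a typographically careful layout might want to count separately.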

Char freq and bigram counts attached.
russianfreq1.txt
rawfollow-russian1.csv

Busy generating 1MB of chained bigrams, but it's struggling a bit, so I don't know how good the final result will be.

Sample of ignored text attached.
ignored

@iandoug

iandoug commented Nov 26, 2022

generating

@iandoug

iandoug commented Nov 26, 2022

Chained bigrams.
russianmonkeytest-1MB.zip
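The thread does not say how the chained-bigram file was produced. One plausible reading is a random walk in which each next character is drawn in proportion to how often it follows the current one. The Python sketch below illustrates that idea only; `bigram_table` and `chain` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def bigram_table(text):
    """Map each character to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        table[a][b] += 1
    return table

def chain(table, start, length, seed=0):
    """Walk the table, picking each follower with probability ~ its count."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = table.get(out[-1])
        if not followers:  # dead end: this character was never followed
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

table = bigram_table("абаб")
generated = chain(table, "а", 6)
```

Text generated this way reproduces the corpus's bigram statistics without its words, which is what matters when a layout is scored on bigrams.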

@binarybottle
Owner Author

Amazing -- Thank you for cleaning up the corpus! You know, I would write to those who maintain the original corpus so that they host your cleaned version and credit you accordingly.

@7orlum

7orlum commented Nov 30, 2022

It would be better to generate Щ on a double press of Ш, and Ъ on a double press of Ь. These two letters are used very rarely, unlike the Home and Page Up keys.
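The double-press idea can be sketched as a small resolver over timestamped key events. This is a simulation in Python, not keyboard firmware; the 200 ms window and the names `DOUBLE_TAP` and `resolve_taps` are assumptions.

```python
DOUBLE_TAP = {"ш": "щ", "ь": "ъ"}  # base letter -> letter produced by a double press
WINDOW_MS = 200                     # hypothetical double-press window

def resolve_taps(events, window_ms=WINDOW_MS):
    """Turn (key, time_ms) events into text, merging quick double presses."""
    out = []
    last_key, last_time = None, None
    for key, t in events:
        if key in DOUBLE_TAP and key == last_key and t - last_time <= window_ms:
            out[-1] = DOUBLE_TAP[key]  # upgrade the letter already emitted
            last_key = None            # a third press starts over
        else:
            out.append(key)
            last_key, last_time = key, t
    return "".join(out)

quick = resolve_taps([("ш", 0), ("ш", 100), ("и", 300)])
slow = resolve_taps([("ш", 0), ("ш", 500)])
```

The trade-off is that typing the base letter twice in a row requires waiting out the window between presses.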

@asim215

asim215 commented Feb 27, 2023

I think all of Engram's punctuation in the middle must be preserved.
You can also combine letters on one key if needed: (е, ё), (и, й), (ь, ъ), (ш, щ), ... .
I would prefer the Engram layout for syntax symbols, with QWERTY on the free number keys (Engram over QWERTY). It would be some kind of hybrid. Also, Home and Page Up / Down must be on keys separate from the character/number layout.
Right now I use an ErgoDone with Engram and Russian (QWERTY), so the Russian layout follows QWERTY on this keyboard.

@binarybottle
Owner Author

binarybottle commented May 6, 2023

@iandoug -- Any thoughts on the above comments, or do you think your corpus and tables are ready?

What do the two columns of numbers represent in russianfreq1.txt?

I am having difficulty parsing the rawfollow-russian1.csv file. To create an engram-russian layout, I just need bigrams and bigram frequencies. Would you be able to create a table with these?
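For reference, a table of bigrams and bigram frequencies of the kind requested here could be built from cleaned text along the following lines. This Python sketch makes its own choices, all assumptions: unicase counting, letter-only bigrams, and no bigrams spanning spaces or punctuation. It does not describe the format of rawfollow-russian1.csv.

```python
from collections import Counter

LETTERS = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")

def bigram_frequencies(text):
    """Return (bigram, count, relative frequency) rows, most common first."""
    text = text.lower()  # unicase: fold uppercase into lowercase
    counts = Counter(
        a + b
        for a, b in zip(text, text[1:])
        if a in LETTERS and b in LETTERS  # skip pairs touching non-letters
    )
    total = sum(counts.values())
    return [(bg, n, n / total) for bg, n in counts.most_common()]

rows = bigram_frequencies("Мама мыла")
for bigram, count, freq in rows:
    print(f"{bigram}\t{count}\t{freq:.4f}")
```

Each row is tab-separated as bigram, raw count, and share of all counted bigrams, which is straightforward to load into a spreadsheet or an optimization script.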

No rush on responding to any of the above, as I have plenty to do, and I think #58 is a more pressing challenge.

@iandoug

iandoug commented May 7, 2023

Those are probably wrong, will send revised versions in the week.

@binarybottle
Owner Author

Thank you, but again no rush -- I just want to make sure I have the right data to work with when I get to this in the future.

@SaphireLattice

A particularly annoying problem is the Ё. This letter has been basically an afterthought for quite a while, and I wonder if the corpus might be contaminated by people substituting it with Е for basically a century at this point.
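One rough way to gauge this on a given corpus is to measure how often ё appears relative to е. The Python sketch below uses a hypothetical helper, `yo_share`, and a made-up sample; a low share would only hint at substitution, since judging it properly needs a reference value from text known to use ё consistently.

```python
def yo_share(text: str) -> float:
    """Share of ё among all ё and е occurrences (case-insensitive)."""
    text = text.lower()
    yo = text.count("ё")
    ye = text.count("е")
    return yo / (yo + ye) if yo + ye else 0.0

# Made-up sample: "елку" is written with е where ё would be the careful spelling.
share = yo_share("Ёж ест елку")
print(f"{share:.3f}")
```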

I'm also quite curious how to fit the "extra" letters that the Russian alphabet requires on, say, an ortholinear split keyboard. Even the base English version requires some care.

This probably needs another issue opened, but it's the problem that made me look at this page in the first place. I've been trying to work out how to fit Engram on my Sofle (mirrored split, each side having a 4x6 main key area and 5 keys below that, of which 2 are thumb keys), and I've realized that it would require shuffling quite a bit around. That would have been fine if I didn't also need to keep ЙЦУКЕН (JCUKEN) around, which is mapped with a QWERTY keyboard in mind. I suppose on my home desktop I can do whatever it takes to make things work okay, but it still makes for an awkward setup.
