How to create engram for Polish? #46

AKmatiAK · 2022-09-08T10:39:32Z

Is it hard to create an Engram layout for a language like Polish? I created my layout based on Engram, but maybe it can be optimized further.

AKmatiAK · 2022-09-09T21:12:44Z

I updated my layout, here it is:

The AltGr layer remains the same.

binarybottle · 2022-09-09T23:03:32Z

Please take a look at the optimized layout for Spanish (https://github.com/binarybottle/engram-es) that we created.

If there is very accurate and representative 1-gram and bigram frequency data for the Polish language (including symbols), then we could apply a modified version of the code to generate an Engram layout optimized for Polish.

AKmatiAK · 2022-09-10T19:30:14Z

Hi. I contacted the creators of the Polish corpus and got data for up to 5-grams. It's available here: n-grams pl data. Do I need to process it further, or is it enough?

Also, here is a list of available resources that might be useful: link

binarybottle · 2022-09-12T15:45:27Z

This corpus looks pretty official! I like that it has a broad variety of book and news sources. Too bad it doesn't include spoken transcripts or social media sources. Anyway, I would be happy to help with this but it will be a couple of months before I can get to it -- buried with projects right now.

AKmatiAK · 2022-09-12T20:18:38Z

OK, thank you for the help ;)

binarybottle · 2022-09-30T02:08:31Z

@iandoug -- Given your experience helping to clean up the Spanish corpus, do you have any concerns about the proposed Polish corpus?: http://zil.ipipan.waw.pl/NKJPNGrams

iandoug · 2022-09-30T06:37:27Z

Hi Arno

For keyboard layout use, I prefer to strip texts not normally typed on a computer keyboard (like spoken transcripts or tweets) because that will mess up the character frequencies and n-grams.

"Each unigram is maximum continuous chunk of non-whitespace lower-case characters."

That is the normal way of doing it. Ian of course is not normal and does it Case Sensitive ... :-) Because typing Th is different to th.

It looks like they only have "1-million-word subcorpus" available to download.

Is this typical Polish text?

Masz ty duszę? Powiedz!
Tak jest. Od ręki.
To chyba dobra formuła.

Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.
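The dash-stripping idea above can be sketched in a few lines of Python. This is only an illustration, not the cleanup program used in this thread: the function names are made up, the set of dash characters is an assumption, and the sample lines have a leading "- " added by hand to mimic how dialogue appears in the corpus.

```python
import re
from collections import Counter

# Leading dash variants (hyphen, en dash, em dash) plus surrounding spaces.
# Which dash the corpus really uses is an assumption; adjust as needed.
LEADING_DASH = re.compile(r"^\s*[-\u2013\u2014]+\s*")


def strip_dialogue_dash(line: str) -> str:
    """Remove a dialogue marker at the start of a line; keep dashes elsewhere."""
    return LEADING_DASH.sub("", line)


def char_counts(lines, case_sensitive=True):
    """Count non-whitespace characters after stripping dialogue dashes."""
    counts = Counter()
    for line in lines:
        text = strip_dialogue_dash(line.rstrip("\n"))
        if not case_sensitive:
            text = text.lower()
        counts.update(ch for ch in text if not ch.isspace())
    return counts


# Hand-made sample; the leading "- " is added here for illustration.
sample = [
    "- Masz ty duszę? Powiedz!",
    "- Tak jest. Od ręki.",
    "To chyba dobra formuła.",
]
counts = char_counts(sample)
print(counts["-"], counts["T"], counts["t"])
```

With `case_sensitive=True` the counts for `T` and `t` stay separate, which matches the case-sensitive approach described above; passing `case_sensitive=False` merges them.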

I have not seen any HTML etc, but since this is "manually annotated", I guess it is "clean" in that regard. There are some ALL CAPS sentences.

Will see if I can extract the text and do some analysis over the weekend. We are currently having rolling blackouts so that messes up plans.

iandoug · 2022-09-30T07:00:15Z

For future reference: Leipzig

https://wortschatz.uni-leipzig.de/en/download/Polish#pol_newscrawl_2011

iandoug · 2022-09-30T07:30:45Z

What's the difference in quotes? Should both be on keyboard?

która znalazła się w zestawieniu "Billboard Magazine".

” 2012, a zespół otrzymał nominację do nagród „Songlines Music Awards” 2012 w kategorii „Best Group”.
” (2012) oraz realizator dźwięku przy filmie „

AKmatiAK · 2022-09-30T08:01:35Z

> Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.

Yes, they mark dialogue, but they are not often used outside of books.

> What's the difference in quotes? Should both be on keyboard?

The bottom quotes are rarely used and shouldn't be on the keyboard (they are now superseded by upper quotes on both sides, and the meaning is the same).

> Is this typical Polish text?
>
> Masz ty duszę? Powiedz!
> Tak jest. Od ręki.
> To chyba dobra formuła.

Lines 2 and 3 are typical; line 1 is correct but reads more like something from a book.

binarybottle · 2022-09-30T14:11:56Z

Thank you for taking a look, @iandoug! I don't know Polish, so I will defer to @AKmatiAK and other Polish speakers/typists. A corpus of only 1 million words is pretty small, but I hope it represents what people type.

iandoug · 2022-10-01T14:48:18Z

I took a look at the linked corpus, not wild about it, seems to contain a lot of dialogue. Will try cleaning up some of it as next step after this.

Instead, I grabbed all the 1M files from the Leipzig Polish corpus. After looking at those, decided to only use the "news" files, the rest is going to be a mess to clean. So that supplies 9 million sentences.

After tweaking my Spanish cleanup program, now have a 688 MB text file to play with. I grabbed some Polish books from Gutenberg ... only a few, most seem to be poetry or dialogue-heavy novels. Will try my usual "extract some text" approach with those to add to the Leipzig file.

Current char distribution looks like this.
Provisional list, may change ...

char-dist-1.txt

polishfreq1.txt

iandoug · 2022-10-01T14:55:05Z

Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)

112 characters?

binarybottle · 2022-10-01T16:15:45Z

@iandoug -- Using news files sounds reasonable, but I wouldn't throw out dialogues -- they are far closer to how people type emails than books are.

AKmatiAK · 2022-10-01T16:26:40Z

I took a further look at the NKJP n-grams and they're heavily bloated with transcriptions of parliament sessions or something like that, so they're pretty useless. News/internet is the way to go. I'll take a look at the Leipzig files.

iandoug · 2022-10-01T16:46:31Z

Sample from "Web" corpus attached.

Will do your "single-case" frequencies and bigrams in due course.

web-sample.txt

The dialogues all like this:

% short sentence 1.
% short sentence 2.
% short sentence 3.

where % is the - character. Markdown getting in the way again.

AKmatiAK · 2022-10-01T16:48:57Z

I don't know how to read the n-grams from Leipzig. Are there any instructions for this?

> Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)
>
> 112 characters?

~~We use ISO keyboard, same as in US here.~~ Both ISO and ANSI. Polish characters are on AltGr. 112 characters, not counting space and Enter.

iandoug · 2022-10-01T16:55:07Z

First attempt at bigrams. Am playing with trial layout, I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Also UDHR in Polish as temporary test file. The UN no longer seems to have .txt downloads, just a PDF or a web page.

udhr-polish.txt
bigrams-polish1.csv

csv is tab-separated.

Most common:
ie ni na ow st ze cz rz po ch an ra pr wi zy ro ia za wa ta dz sz od ki en ko ar ej mi li ci zi ac
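For anyone who wants to reproduce a bigram list like the one above from their own cleaned text, here is a minimal Python sketch. It is not the program used in this thread: the function names are hypothetical, it merges case by default, and it assumes bigrams should not span whitespace.

```python
import csv
import io
from collections import Counter


def bigram_counts(text: str, case_sensitive: bool = False) -> Counter:
    """Count adjacent character pairs inside words; pairs never span whitespace."""
    if not case_sensitive:
        text = text.lower()
    counts = Counter()
    for word in text.split():
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


def write_tsv(counts: Counter, out) -> None:
    """Write bigram<TAB>count rows, most frequent first."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for bigram, n in counts.most_common():
        writer.writerow([bigram, n])


# Tiny hand-made input, only to show the shape of the output.
counts = bigram_counts("nie ma nic na niebie")
buf = io.StringIO()
write_tsv(counts, buf)
print(buf.getvalue())
```

For a real corpus, the text would be read from a file and `write_tsv` pointed at an open output file instead of the in-memory buffer.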

AKmatiAK · 2022-10-01T17:12:49Z

> I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout, should it? I checked which layouts the keyboards currently sold in Poland have, and no single standard is followed: both ISO and ANSI are used (contrary to what I wrote earlier, when I thought ISO was the standard here).

iandoug · 2022-10-01T17:44:31Z

> > I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.
>
> honestly I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of layout? I checked what layout currently sold in Poland keyboards have and no standard is followed, both ISO and ANSI are used (not like I previously wrote, I thought that ISO is standard here).

Yeah, I was surprised at what Wikipedia said about that. At the moment I have enough keys on ANSI, though must put Euro somewhere. Spanish and French more tricky because multiple diacritics per vowel. You only have 2 on Z.

Here's first attempt at chained bigrams, since the UDHR character frequency is not very good. But not happy with this file either, has too many digits. Which is a consequence of the "news" input I suppose.

polishmonkeytest.txt

iandoug · 2022-10-01T19:10:57Z

Okay finally got somewhere but it "feels" a mess, probably because I know nothing about Polish. But it will give you something to compare against.

Ignore the layouts with .en. in the name, they are missing the Polish letters so their scores are wrong.

The bottom one is the "Programmer" layout which WP says is the most common.
It might make sense to put some of these letters on their own keys, instead of Q V X since these 3 are not native Polish and thus rare. Or at least ł.

ł ord 197 hex c5 8602003
ż ord 197 hex c5 4490722
ó ord 195 hex c3 4461711
ś ord 197 hex c5 2988511
ć ord 196 hex c4 2271532
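The `ord` and `hex` columns in the dump above appear to be the first byte of each letter's UTF-8 encoding rather than its Unicode code point, which would explain why ł, ż and ś all show 197/c5. A minimal Python sketch to check that reading (the helper name `describe` is made up for illustration):

```python
def describe(ch: str):
    """Return (code point, UTF-8 bytes as hex, first UTF-8 byte) for one character."""
    encoded = ch.encode("utf-8")
    return ord(ch), encoded.hex(" "), encoded[0]


for ch in "łżóść":
    code_point, utf8_hex, first_byte = describe(ch)
    # Each of these letters takes two bytes in UTF-8; only the first byte
    # matches the ord/hex columns in the dump.
    print(ch, f"U+{code_point:04X}", utf8_hex, first_byte)
```

If a frequency tool keys its table on the first byte alone, letters such as ł, ż and ś would collide, so counting on decoded characters is the safer choice.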

iandoug · 2022-10-01T19:35:15Z

Enough for today. Getting better.

iandoug · 2022-10-02T15:38:40Z

Been playing around. Current best version, changes may be too dramatic for easy acceptance.

Can compare performance against default Programmer version at bottom of list. Ignore .en. layouts.

iandoug · 2022-10-02T15:39:41Z

Think I need the diacritic S letters on separate key, which means switching to ISO form factor.

iandoug · 2022-10-02T15:42:01Z

Hand balance is 58:42, but can't find spot on right for popular letters on left ...
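A hand-balance figure like the one above is the share of keystrokes falling on each hand. A rough Python sketch of that calculation, with made-up frequencies and key groups (not the real layout or corpus data, and not how KLA computes it internally):

```python
def hand_balance(left_keys: str, right_keys: str, freq: dict):
    """Return (left %, right %) of keystrokes for the given key groups."""
    left = sum(freq.get(ch, 0) for ch in left_keys)
    right = sum(freq.get(ch, 0) for ch in right_keys)
    total = left + right
    if total == 0:
        return 0.0, 0.0
    return 100.0 * left / total, 100.0 * right / total


# Toy numbers for illustration only.
freq = {"a": 9, "i": 8, "e": 8, "o": 8, "n": 6, "z": 6, "r": 5}
left, right = hand_balance("aeio", "nzr", freq)
print(f"{left:.0f}:{right:.0f}")
```

Moving one frequent letter from `left_keys` to `right_keys` and re-running shows how far a single swap shifts the ratio.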

AKmatiAK · 2022-10-02T17:06:28Z

This is my current layout, which I have been building for about a month by intuition, applying fixes based on what I thought should be changed, so it might be useful to some extent in designing engram-pl. I know it lacks some keys, because I changed it frequently. In my subjective opinion, the cie trigram is very frequent and should be placed on the keyboard (but I may be wrong). Also, mixing different letters on one key is not a very good idea IMO; it might be faster but it is unintuitive. Only ź should be placed on another letter. Placing ł on i instead of L is also reasonable, because I somehow found it easy to remember.

BTW: what I like a lot about ISO is the far better thumb access to AltGr and one more letter on the home row. I couldn't achieve that on ANSI, and because of that I stuck with my old ISO one.

Of course I have Caps and Ctrl swapped ;)

iandoug · 2022-10-02T20:41:54Z

Mmm so of course you would use a form factor that is not in KLA ..... neither ANSI nor ISO :-)

sz is a common bigram so should not be on same finger.

Q V X are not in your alphabet so it makes no sense to waste whole keys on them. They are only there because of QWERTY.

I made an ISO version, realised I had the spacebar on the wrong thumb, so had to basically mirror the layout to fix it.

Hand balance is nearly perfect now. ANSI version slightly better, but ISO puts the space bars further away and there's nothing I can do about that. Other metrics are better.

I may have used the wrong input file to create the chained bigrams, so redid it.

polishmonkeytest2.txt

iandoug · 2022-10-02T20:43:24Z

The Q X V can be put in better places ... first get the Polish to work :-)

binarybottle · 2022-10-03T01:33:39Z

@iandoug -- Thank you for hitting this hard over the weekend! I am slammed this week but hope to take a look at what you're doing next weekend.

iandoug · 2022-10-03T06:54:14Z

Was not intending to but once you start fiddling with layouts ... like a drug :-)

Also have other stuff to do this week, will try to improve corpus when I have time.

AKmatiAK · 2022-10-03T07:38:02Z

> sz is a common bigram so should not be on same finger.

My right-hand finger position is ATZS + AltGr.
Edit: the left hand is JCIE + space. This way I use my pinky only for Ctrl (on Caps) and my hand position is straighter, and the right pinky is for Enter.

Also, why are ó and ł in such strange positions? Are they optimized for typing distance too?

iandoug · 2022-10-03T08:26:34Z

Is this correct? I added the pipe character "|" back.

iandoug · 2022-10-03T08:32:44Z

> Also why ó and ł are on such strange positions? They are typing distance optimized too?

Your accented characters seem to be almost treated as separate letters rather than "stressed" versions of the version without the diacritic. Certainly judging by the frequency of some of them.

Those letters are where they need to be so that the layout scores well. Your layout scores better than the default, but could do a lot better. Will send screenshot if above layout is correct.

iandoug · 2022-10-03T16:16:06Z

I joined the few books I had together and cleaned up the unwanted characters. The file is 1,116,925 bytes.

The character frequency came out as
iaeoznsrcwymtdkpł,ujl.bęgąhżśó-ć!ńPWAO;NfTZD?:"IźSRKC_JMBG*L10EU[]Ł824'F5HŻ736v)Ś9VY(xXq=/ÓQĘĄŃŹĆ{~&^`

while the frequency for the Leipzig "news" files is

aieonzrwstyckdpmujlł.bg,ęhążóś⮠PćfWS-0"KńMN1TA2ZBDORJCIG:LE53U4)(ź9F?687H!VvŚŁ/ŻxX'Y%;q+Q&ĘŹ@`ĆÓĄ*>~][$Ń<_=€#|^}{

The book's order for the most frequent characters is different, probably a consequence of using the main character's name a lot. I normally just take short extracts of books to avoid this, but don't have enough to do that (besides having somehow lost the program I wrote to do that).

So don't think I will include these texts. Will see what I can get out of the "official" corpus posted above.

The problem with that corpus is that it is intended for "parts of speech" analysis, not "what do people type on keyboards" like we need.

AKmatiAK · 2022-10-03T20:17:05Z

> Is this correct? I added the pipe character "|" back.

> Will send screenshot if above layout is correct.

The only difference is that the minus sign is on a different key, in the place where the middle dot was, but this doesn't matter. My layout also lacks greater/equal. This shouldn't make a difference to the score.

Ah, I forgot: there isn't a "!" near N.

You can put greater/equal on the slashes. The only reason I didn't do this is that I want to make these keys modifiers in the future, and I didn't want to memorize those characters there.

AKmatiAK · 2022-10-03T20:23:30Z

Here's another source of letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072

iandoug · 2022-10-03T20:38:40Z

> Here's another letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072

That's probably both cases merged ... the order of the most frequent is same as mine up to around c / y. I will make unicase list, also bigrams, for Arno. Thanks.... gives me confidence in my corpus.

iandoug · 2022-10-03T20:44:38Z

> ah I forgot, there isn't "!" near N

There were two !, I removed the one on the letter keys and left the one on 1 as "standard".
I had trouble with - and _ , the font on your keyboard is not so clear. Where should minus be? I did not see middle dot ...

I need to write a checker program to check layouts for all needed characters and no duplicates. Will add < and > to yours.
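A checker of the kind mentioned above could look roughly like this in Python. It is a sketch under assumptions: the layout is represented as one string per key listing every character that key produces across its layers, which is a made-up representation and not KLA's JSON format, and the required-character set below is only a small example.

```python
from collections import Counter


def check_layout(keys, required):
    """Return (missing, duplicated) characters for a layout.

    keys: iterable of strings, one per key, with all characters on that key
          (hypothetical representation, not KLA's JSON).
    required: iterable of characters the layout must contain.
    """
    seen = Counter(ch for key in keys for ch in key)
    missing = sorted(set(required) - set(seen))
    duplicated = sorted(ch for ch, n in seen.items() if n > 1)
    return missing, duplicated


# Example data made up for illustration.
required = "aąbcćdeęłńóśźż!<>-_|"
keys = ["aą", "b", "cć", "d", "eę", "lł", "nń", "oó", "sś", "zżź", "1!", ".!", "-_"]
missing, duplicated = check_layout(keys, required)
print("missing:", missing)
print("duplicated:", duplicated)
```

In this toy layout the check reports `<`, `>` and `|` as missing and `!` as duplicated, the same kinds of problems discussed in the comments above.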

AKmatiAK · 2022-10-04T08:30:51Z

How can I make a layout for KLA? You can send it to me and I will modify it.

"!" on the dot key is better; you can delete it from 1, it's not needed.
To make it more straightforward, since my layout lacks _ too, you can simply swap - and _ in the version of my layout you created in KLA.

iandoug · 2022-10-04T08:51:29Z

I uploaded a playground. Your layout is the second one (click Configure at top), please fix, then export the json and send to me to replace.

https://klanext.keyboard-design.com/pl/

Thanks :-)

o-x-e-y · 2022-10-05T02:05:28Z

Hey I've made layouts for a lot of languages in the past, and coincidentally I was actually thinking today about making something for Polish! My analyzer can be found here, it's written in Rust and comes with a useful REPL to interact with it.

I'm using corpora from Leipzig Wortschatz, also mentioned by Ian earlier. I know these are not fully representative of casual texting and everyday typing, but having compared some news(crawls) and similar between different corpora for English I'm pretty confident they're very close to being representative in any case. A lot of word usage between news articles and websites ends up being the same as more casual usage of the language.

For corpus processing, I transpose everything to lowercase including punct, meaning _ becomes -, " becomes ', etc and toss out numbers and their corresponding punctuation. I also transpose some variations of certain punctuation, mostly different quotation marks, to their ascii version. In this step I also tag on an accent key (denoted with *), with the following functionality:

*a -> ą
*o -> ó
*z -> ź
*s -> ś
*c -> ć
*n -> ń

Seems pretty self-explanatory. For the eventual layout, you can implement these with a dead key. You might notice ł, ę and ż are missing however and you would be right, as those get their own dedicated key on the keyboard, courtesy of them being a lot more common than q, v and x. For punctuation I use ., , and ', which gives us the following 30 keys to use for keyboard layout generation:

a b c d e f g h i j k l m n o p ł r s t u ę w ż y z ' , . *
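The preprocessing described above can be approximated in Python as follows. This is a sketch, not the analyzer's actual code (which is written in Rust): the names are hypothetical, only the two shifted pairs named above are folded, the list of quote variants is an assumption, and it drops digits only, without the punctuation that belongs to them.

```python
ACCENT_KEY = "*"
# Letters typed as accent key + base letter, per the mapping listed above.
DEAD_KEY = {"ą": "a", "ó": "o", "ź": "z", "ś": "s", "ć": "c", "ń": "n"}
# Shifted punctuation folded onto its unshifted key (only the pairs named above).
FOLD = {"_": "-", '"': "'"}
# Quote variants mapped to ASCII; this particular set is an assumption.
QUOTES = {"„": "'", "”": "'", "“": "'", "’": "'"}


def preprocess(text: str) -> str:
    """Lowercase, drop digits, fold punctuation, and expand dead-key letters."""
    out = []
    for ch in text.lower():
        if ch.isdigit():
            continue  # numbers are tossed out
        ch = QUOTES.get(ch, ch)
        ch = FOLD.get(ch, ch)
        if ch in DEAD_KEY:
            out.append(ACCENT_KEY + DEAD_KEY[ch])
        else:
            out.append(ch)  # ł, ę and ż pass through: they have their own keys
    return "".join(out)


print(preprocess("Źródło „Świat” 2012"))
```

After this step every ą, ó, ź, ś, ć and ń costs two keystrokes, so the bigram statistics fed to the generator already include the accent key.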

This does denote one of the limitations of my analyzer, in that it can only optimize for the main 3x10 keys and nothing around it. In this case that is fine however, since there aren't any keys left out as it stands.

From there, I can run `generate 2500` in my analyzer, which does all the work for us! Polish seems to be a weird language in that it has a lot of keys between 3% and 1% frequency, rather than having some high-usage keys and then usage falling off more quickly. This meant that creating nice pinky columns got quite hard. A solution to this could be to add ę to the accent key, for example, and remove its dedicated key, but then you have to hit more keys in the end, which doesn't seem super ideal. It might be worth it though and is probably worth exploring.

Some of the layouts I found were:

ł t s k j  f . e u '
r c n w m  , z a o i
l d b p g  ż * ę y h
Sfb:  1.116%
Dsfb: 7.935%
Finger Speed: 5.798
    [0.382, 0.543, 0.806, 1.507, 0.578, 0.602, 0.734, 0.646]
Scissors: 0.291%

Inrolls: 23.094%
Outrolls: 23.163%
Total Rolls: 46.258%
Onehands: 1.086%

Alternates: 35.394%
Alternates (sfs): 9.812%
Total Alternates: 45.206%

Redirects: 3.277%
Bad Redirects: 0.182%
Total Redirects: 3.459%

Bad Sfbs: 0.541%,
Sft: 0.011%

This is the rcnw variant, which, besides the relatively wonky left pinky, seems pretty amazing. It has both high rolls and high alternation with very low redirects, but its finger speed is relatively high. As far as I've seen, though, it appears to be very difficult to suppress that much further.
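For readers unfamiliar with the Sfb metric in the stats above: a same-finger bigram is a pair of different keys typed consecutively by one finger. A simplified Python sketch of how such a percentage can be computed for a 3x10 layout follows. It assumes a standard column-to-finger assignment (the analyzer's angle-mod fingering is not modeled), and the bigram counts are toy values, not corpus data.

```python
from collections import Counter

# Finger per column of a 3x10 grid: 0-3 = left pinky..index, 4-7 = right index..pinky.
# Standard touch-typing assignment; an assumption, not the analyzer's own model.
COLUMN_FINGER = [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]


def finger_map(rows):
    """Map each character of a 3x10 layout to the finger that types it."""
    return {ch: COLUMN_FINGER[col] for row in rows for col, ch in enumerate(row.split())}


def sfb_percent(rows, bigrams: Counter) -> float:
    """Share of bigram occurrences typed by one finger on two different keys."""
    fingers = finger_map(rows)
    total = same = 0
    for (a, b), n in bigrams.items():
        if a not in fingers or b not in fingers:
            continue  # skip bigrams with characters outside the layout
        total += n
        if a != b and fingers[a] == fingers[b]:
            same += n
    return 100.0 * same / total if total else 0.0


rcnw = [
    "ł t s k j  f . e u '",
    "r c n w m  , z a o i",
    "l d b p g  ż * ę y h",
]
# Toy counts for illustration only.
toy = Counter({"ie": 50, "ni": 30, "cz": 15, "sn": 5})
print(f"{sfb_percent(rcnw, toy):.1f}%")
```

Feeding it real bigram counts will not reproduce the analyzer's figures exactly, since the fingering model and the corpus preprocessing differ.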

ż t r w p  f . e y '
s c n k m  l z a o i
g d ł b j  , * ę u h
Sfb:  1.025%
Dsfb: 8.153%
Finger Speed: 5.631
    [0.261, 0.543, 0.433, 1.581, 0.831, 0.602, 0.734, 0.646]
Scissors: 0.541%

Inrolls: 26.256%
Outrolls: 21.175%
Total Rolls: 47.431%
Onehands: 1.155%

Alternates: 33.333%
Alternates (sfs): 9.484%
Total Alternates: 42.817%

Redirects: 4.611%
Bad Redirects: 0.244%
Total Redirects: 4.855%

Bad Sfbs: 0.477%,
Sft: 0.012%

This is the scnk variant, which has slightly higher scissors but fewer SFBs, and should be another sound option. It does have g on the bottom row, however, which is the case because gd occurs around 0.06% of the time. You could probably move g to the top row and be completely fine, though.

Any thoughts? I might play around with it tomorrow. By the way, my keyboard layout playground has Polish too now, so you can play around with these (or any other layouts posted here) over there as well. Good luck y'all!

AKmatiAK · 2022-10-14T22:34:29Z

Hi. I used your bigram data and layouts and improved my layout based on them, while trying to change it as little as possible so that I don't have to learn a new one from scratch again :P I also corrected the finger assignments for a few keys so the layout is represented more accurately.
fwyr nowy.txt

AKmatiAK · 2022-10-14T23:12:38Z

@o-x-e-y is this analyzer only for ortho keyboards?

btw it lacks ó and ź

o-x-e-y · 2022-10-15T00:19:47Z

The analyzer currently only supports 3x10, but the heatmap it uses is made for rowstag so it does optimize for that (angle mod specifically). Also ó and ź are there, that's what the accent key is for

AKmatiAK · 2022-10-15T09:03:12Z

Pretty nice. Maybe I'll use one of your layouts. I checked mine and it's just worse, so I'm going to start the pain of learning again :P I may also check the results of your layouts in KLA so we can compare them to Ian's and find which is best.

iandoug · 2022-10-15T09:41:29Z

Ian needs to redo ... the problem is that KLA does not support "magic diacritic keys" like Oxey's analyzer.... only AltGr style.

AKmatiAK · 2022-10-16T21:16:31Z

scnk in KLA scores similarly to ian10, so it looks like your layouts are close to ideal.
scnk

binarybottle · 2022-10-29T13:06:29Z

Just stepping back into this exchange after some time away (Halloween costume complete!). Is there still an interest in running the Engram protocol on excerpts from the Polish corpus to optimize Engram for Polish?

AKmatiAK · 2022-10-30T20:03:42Z

Ian and oxey optimized it close to the limit, so if it would take a lot of effort, it's not necessary.

binarybottle · 2022-10-31T02:45:01Z

I am wary of standard optimization criteria when it comes to evaluating comfortable rather than efficient typing, but if you are happy with it, that's great!

binarybottle changed the title ~~How to create engram for other languages?~~ How to create engram for Polish? Sep 30, 2022

AKmatiAK mentioned this issue Jul 17, 2023

questions AKmatiAK/fwyr-layout#1
