How to create engram for Polish? #46
Please take a look at the optimized layout for Spanish (https://github.com/binarybottle/engram-es) that we created. If there is very accurate and representative 1-gram and bigram frequency data for the Polish language (including symbols), then we could apply a modified version of the code to generate an Engram layout optimized for Polish. |
Hi. I contacted the creators of the Polish corpus and got data up to 5-grams. It's available here: n-grams pl data. Do I need to process it further, or is it enough? Also, here is a list of available resources that might be useful: link |
This corpus looks pretty official! I like that it has a broad variety of book and news sources. Too bad it doesn't include spoken transcripts or social media sources. Anyway, I would be happy to help with this but it will be a couple of months before I can get to it -- buried with projects right now. |
OK, thank you for the help ;) |
@iandoug -- Given your experience helping to clean up the Spanish corpus, do you have any concerns about the proposed Polish corpus?: http://zil.ipipan.waw.pl/NKJPNGrams |
Hi Arno,
For keyboard layout use, I prefer to strip texts not normally typed on a computer keyboard (like spoken transcripts or tweets) because they will skew the character frequencies and n-grams.
"Each unigram is maximum continuous chunk of non-whitespace lower-case characters." That is the normal way of doing it. Ian of course is not normal and does it case-sensitive ... :-) Because typing Th is different from typing th.
It looks like they only have the "1-million-word subcorpus" available to download. Is this typical Polish text? Masz ty duszę? Powiedz! Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.
I have not seen any HTML etc., but since this is "manually annotated", I guess it is "clean" in that regard. There are some ALL CAPS sentences.
Will see if I can extract the text and do some analysis over the weekend. We are currently having rolling blackouts, so that messes up plans. |
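The cleanup idea in the comment above (drop leading dialogue dashes, then count characters case-sensitively) can be sketched roughly as follows. This is only an illustration, not Ian's actual cleanup program: the function names, the set of dash characters, and the choice to ignore whitespace are all assumptions.

```python
from collections import Counter

# Assumption: dialogue lines may open with a hyphen-minus, en dash, or em dash.
DASHES = "-\u2013\u2014"


def strip_dialogue_dash(line: str) -> str:
    """Drop one leading dash (and the spaces after it) that marks a dialogue line."""
    stripped = line.lstrip()
    if stripped and stripped[0] in DASHES:
        return stripped[1:].lstrip()
    return line


def char_counts(lines, case_sensitive=True):
    """Count non-whitespace characters after removing leading dialogue dashes."""
    counts = Counter()
    for line in lines:
        text = strip_dialogue_dash(line.rstrip("\n"))
        if not case_sensitive:
            text = text.lower()
        counts.update(ch for ch in text if not ch.isspace())
    return counts
```

Dashes inside a line (as in a hyphenated word) are left alone, so only the dialogue marker stops inflating the dash frequency. Passing `case_sensitive=False` gives the merged-case counts that the corpus description quotes as the usual convention.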
For future reference: Leipzig https://wortschatz.uni-leipzig.de/en/download/Polish#pol_newscrawl_2011 |
What's the difference between the quote styles? Should both be on the keyboard? która znalazła się w zestawieniu "Billboard Magazine". ” 2012, a zespół otrzymał nominację do nagród „Songlines Music Awards” 2012 w kategorii „Best Group”. |
Yes, they're for dialogue, but they're rarely used for anything other than books.
Bottom quotes are rarely used and shouldn't be on the keyboard (they have been superseded by paired upper quotes, and the meaning is the same).
2 and 3 are typical; 1 is correct but mostly found in books. |
I took a look at the linked corpus; not wild about it, it seems to contain a lot of dialogue. Will try cleaning up some of it as the next step after this. Instead, I grabbed all the 1M files from the Leipzig Polish corpus. After looking at those, I decided to use only the "news" files; the rest is going to be a mess to clean. So that supplies 9 million sentences. After tweaking my Spanish cleanup program, I now have a 688 MB text file to play with. I grabbed some Polish books from Gutenberg ... only a few, most seem to be poetry or dialogue-heavy novels. Will try my usual "extract some text" approach with those to add to the Leipzig file. Current char distribution looks like this. |
Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-) 112 characters? |
@iandoug -- Using news files sounds reasonable, but I wouldn't throw out dialogues -- they are far closer to how people type emails than books are. |
I took a further look at the NKJP n-grams and they're heavily bloated with transcripts of parliament sessions or something like that, so they're pretty useless. News/internet is the way to go. I'll take a look at the Leipzig files. |
Sample from "Web" corpus attached. Will do your "single-case" frequencies and bigrams in due course. The dialogues all look like this: % short sentence 1. where % is the - character. Markdown getting in the way again. |
I don't know how to read the n-grams from the Leipzig files. Are there any instructions for this?
|
First attempt at bigrams. Am playing with a trial layout; I see from Wikipedia that people actually prefer the ANSI boards over ISO, so am using that. Also using the UDHR in Polish as a temporary test file. The UN no longer seems to have .txt downloads, just PDF on a web page. udhr-polish.txt The csv is tab-separated. Most common: |
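For readers who want to reproduce a tab-separated bigram table like the one mentioned above, here is a minimal sketch. It assumes bigrams are counted inside whitespace-separated words only (no bigrams spanning a space) and keeps case as typed; the function names and the two-column output are illustrative, not the format of the attached file.

```python
import csv
import io
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count adjacent character pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


def to_tsv(counts: Counter) -> str:
    """Render counts as tab-separated 'bigram<TAB>count' rows, most common first."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for bigram, n in counts.most_common():
        writer.writerow([bigram, n])
    return buf.getvalue()
```

Calling `to_tsv(bigram_counts(open("udhr-polish.txt", encoding="utf-8").read()))` would produce such a table for a local copy of the test file, assuming it is saved as UTF-8.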
Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout, should it? I checked which layouts keyboards currently sold in Poland use, and no standard is followed; both ISO and ANSI are used (contrary to what I wrote earlier, when I thought ISO was the standard here). |
Yeah, I was surprised at what Wikipedia said about that. At the moment I have enough keys on ANSI, though must put Euro somewhere. Spanish and French more tricky because multiple diacritics per vowel. You only have 2 on Z. Here's first attempt at chained bigrams, since the UDHR character frequency is not very good. But not happy with this file either, has too many digits. Which is a consequence of the "news" input I suppose. |
Okay finally got somewhere but it "feels" a mess, probably because I know nothing about Polish. But it will give you something to compare against. Ignore the layouts with .en. in the name, they are missing the Polish letters so their scores are wrong. The bottom one is the "Programmer" layout which WP says is the most common. ł ord 197 hex c5 8602003 |
Think I need the diacritic S letters on separate key, which means switching to ISO form factor. |
Hand balance is 58:42, but can't find spot on right for popular letters on left ... |
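A hand-balance figure like the 58:42 above can be computed from character frequencies and a key-to-hand assignment. The sketch below is an assumption about how such a number might be derived, not KLA's actual scoring code; characters assigned to neither hand (for example the space bar) are simply ignored.

```python
def hand_balance(freqs, left_keys, right_keys):
    """Return (left %, right %) of the typing load for the given hand assignment."""
    left = sum(n for ch, n in freqs.items() if ch in left_keys)
    right = sum(n for ch, n in freqs.items() if ch in right_keys)
    total = left + right
    if total == 0:
        raise ValueError("no characters are assigned to either hand")
    return 100 * left / total, 100 * right / total
```

For example, with frequencies `{"a": 58, "n": 42}`, `a` on the left hand and `n` on the right, the result is `(58.0, 42.0)`.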
The Q X V can be put in better places ... first get the Polish to work :-) |
@iandoug -- Thank you for hitting this hard over the weekend! I am slammed this week but hope to take a look at what you're doing next weekend. |
Was not intending to but once you start fiddling with layouts ... like a drug :-) Also have other stuff to do this week, will try to improve the corpus when I have time. |
My right-hand finger position is ATZS + AltGr. Also, why are ó and ł in such strange positions? Are they optimized for typing distance too? |
Your accented characters seem to be almost treated as separate letters rather than "stressed" versions of the version without the diacritic. Certainly judging by the frequency of some of them. Those letters are where they need to be so that the layout scores well. Your layout scores better than the default, but could do a lot better. Will send screenshot if above layout is correct. |
I joined the few books I had together and cleaned up the unwanted characters. The file is 1,116,925 bytes. The character frequency came out as while the frequency for the Leipzig "news" files is aieonzrwstyckdpmujlł.bg,ęhążóś⮠PćfWS-0"KńMN1TA2ZBDORJCIG:LE53U4)(ź9F?687H!VvŚŁ/ŻxX'Y%;q+Q&ĘŹ@`ĆÓĄ*>~][$Ń<_=€#|^}{ The book's order for the most frequent characters is different, probably a consequence of using the main character's name a lot. I normally just take short extracts of books to avoid this, but don't have enough to do that (besides having somehow lost the program I wrote to do that). So don't think I will include these texts. Will see what I can get out of the "official" corpus posted above. The problem with that corpus is that it is intended for "parts of speech" analysis, not "what do people type on keyboards" like we need. |
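A frequency-ordered string like the one quoted above, and a quick check of how far two corpora agree on the order, could be produced along these lines. This is a sketch under assumptions: whitespace is ignored here (the quoted string appears to include a line-break symbol, which this version would drop), ties are broken by code point, and both function names are made up for illustration.

```python
from collections import Counter


def frequency_order(text: str) -> str:
    """Return the distinct non-whitespace characters, most frequent first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    # Sort by descending count, then by character so ties are deterministic.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def shared_prefix_length(order_a: str, order_b: str) -> int:
    """Count how many leading positions two frequency orderings have in common."""
    length = 0
    for a, b in zip(order_a, order_b):
        if a != b:
            break
        length += 1
    return length
```

Comparing the ordering from the books with the ordering from the "news" files this way gives a single number for where the two rankings first diverge.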
The only difference is that the minus sign is on a different key, in the place where the middle dot was, but this doesn't matter. My layout also has a problem: it lacks greater/equal. This shouldn't make a difference to the score. Ah, I forgot, there isn't a "!" near N. You can put greater/equal on the slashes; the only reason I didn't do this is that I want to turn these keys into modifiers in the future and didn't want to memorize those characters there. |
Here's another set of letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072 |
That's probably both cases merged ... the order of the most frequent is same as mine up to around c / y. I will make unicase list, also bigrams, for Arno. Thanks.... gives me confidence in my corpus. |
There were two !, I removed the one on the letter keys and left the one on 1 as "standard". I need to write a checker program to check layouts for all needed characters and no duplicates. Will add < and > to yours. |
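The checker described above (all needed characters present, no duplicates) might look something like this. It is a hypothetical sketch: it takes a flat list of the characters assigned across all keys and layers rather than reading KLA's JSON export, whose structure is not assumed here, and the required set shown is just the 32 lowercase letters of the Polish alphabet plus "!".

```python
from collections import Counter

# The 32 lowercase letters of the Polish alphabet; q, v, x are not part of it.
POLISH_LOWER = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"


def check_layout(assigned, required):
    """Return (missing, duplicated) characters for a flat list of key assignments."""
    counts = Counter(assigned)
    missing = sorted(ch for ch in set(required) if counts[ch] == 0)
    duplicated = sorted(ch for ch, n in counts.items() if n > 1)
    return missing, duplicated
```

A layout passes when both returned lists are empty; the required set would need extending with uppercase letters, digits, and symbols for a real check.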
How can I make a layout for KLA? You can send it to me and I will modify it. ! on the dot key is better; you can delete it from 1, it's not needed. |
I uploaded a playground. Your layout is the second one (click Configure at top), please fix, then export the json and send to me to replace. https://klanext.keyboard-design.com/pl/ Thanks :-) |
Hey, I've made layouts for a lot of languages in the past, and coincidentally I was actually thinking today about making something for Polish! My analyzer can be found here; it's written in Rust and comes with a useful REPL to interact with it. I'm using corpora from Leipzig Wortschatz, also mentioned by Ian earlier. I know these are not fully representative of casual texting and everyday typing, but having compared some news (crawls) and similar sources between different corpora for English, I'm pretty confident they're very close to being representative in any case. A lot of word usage in news articles and websites ends up being the same as in more casual usage of the language. For corpus processing, I transpose everything to lowercase including punct, meaning
Seems pretty self-explanatory. For the eventual layout, you can implement these with a dead key. You might notice
This does denote one of the limitations of my analyzer, in that it can only optimize for the main 3x10 keys and nothing around it. In this case that is fine however, since there aren't any keys left out as it stands. From there, I can run Some of the layouts I found were:
This
This Any thoughts? I might play around with it tomorrow. By the way, my keyboard layout playground has Polish too now, so you can play around with these (or any other layouts posted here) over there as well. Good luck y'all! |
Hi. I used your bigram data and layouts and improved my layout based on them, while trying to change it as little as possible so that I don't have to learn a new one from scratch again :P I also corrected the finger assignments for a few keys to represent the layout more accurately. |
@o-x-e-y is this analyzer only for ortho keyboards? btw it lacks ó and ź |
The analyzer currently only supports 3x10, but the heatmap it uses is made for rowstag so it does optimize for that (angle mod specifically). Also |
Pretty nice. Maybe I'll use one of your layouts? I checked mine and it's just worse, so I'm going to start the pain of learning again :P I may also check the results of your layouts in KLA so we can compare them to Ian's and find which is best. |
Ian needs to redo ... the problem is that KLA does not support "magic diacritic keys" like Oxey's analyzer.... only AltGr style. |
scnk in KLA scores similar to ian10, so it looks like your layouts are close to ideal. |
Just stepping back into this exchange after some time away (Halloween costume complete!). Is there still an interest in running the Engram protocol on excerpts from the Polish corpus to optimize Engram for Polish? |
Ian and oxey optimized it close to the limit, so if it would take a lot of effort, it's not necessary. |
I am wary of standard optimization criteria when it comes to evaluating comfortable rather than efficient typing, but if you are happy with it, that's great! |
Is it hard to get an Engram layout for a language like Polish? I created my layout based on Engram, but maybe it can be further optimized.