inconsistencies in Armenian dictionary #9
Hello,
On the MFA page for Armenian, it seems the dictionary is based on Armenian transliteration instead of transcription.
If you want, I can re-transcribe your dictionary file using a mix of Wiktionary + my native judgments.
Comments
Yeah, that would be great! @echodroff and @emilyahn created that dictionary using the XPF system. I know they were updating some lexicons a while back for VoxCommunis, but it looks like Armenian hasn't been updated. WikiPron has scraped Wiktionary for Eastern Armenian and Western Armenian, so that might be an easier starting point if the Wiktionary pronunciations are a good basis. If you go this route, I'd be happy to host that as an MFA phoneset dictionary and train a corresponding model (or host one that you train).
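For reference, WikiPron's scraped files are two-column TSVs (word, then space-separated IPA segments), which is already very close to MFA's whitespace-delimited lexicon format. A minimal conversion sketch, with hypothetical file names (the actual WikiPron file names for the two dialects would need checking):

```python
# Minimal sketch: convert a WikiPron two-column TSV (word<TAB>space-separated
# IPA segments) into an MFA-style "word<TAB>phone phone ..." lexicon.
# File names below are hypothetical placeholders.
def wikipron_to_mfa(tsv_path: str, dict_path: str) -> None:
    with open(tsv_path, encoding="utf-8") as src, \
         open(dict_path, "w", encoding="utf-8") as dst:
        for line in src:
            word, pron = line.rstrip("\n").split("\t", 1)
            # WikiPron already space-separates segments, so each row
            # maps straight onto one MFA dictionary entry.
            dst.write(f"{word}\t{pron}\n")

wikipron_to_mfa("hye_broad.tsv", "hye_mfa.dict")
wikipron_to_mfa("hyw_broad.tsv", "hyw_mfa.dict")
```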
The Wiktionary data is pretty reliable (I cleaned it up in 2021). However, the two dialects have radically different transcriptions for the same word: basically, any voiced plosive in Eastern is voiceless aspirated in Western, while any voiceless unaspirated plosive in Eastern is voiced in Western. So using separate models (a hye one and a hyw one) may be wise. A complication, though, is that the Vox recordings pooled hye and hyw speakers together. Perhaps I can ask the maker of the recordings (whom I know) whether I can break up their corpus into the two dialects? I can also provide audio archives of both hye and hyw speech if the Vox corpus isn't big enough. I'm new to MFA (still doing the tutorials). When you say "that route", do you mean just taking all the Wiktionary words as the pronunciation dictionary and then you guys re-run the models? I'm happy to help in any way, even if it's just correcting the existing pronunciation dictionary that you guys have.
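For concreteness, the correspondence described above can be written down as a simple segment map. This is only a toy sketch: the segment inventory is illustrative rather than a complete phonology of either dialect, and including the affricates is an assumption that they pattern with the plain stops.

```python
# Toy sketch of the Eastern/Western correspondence described above:
# Eastern voiced plosives correspond to Western voiceless aspirated ones,
# and Eastern voiceless unaspirated plosives correspond to Western voiced.
# Affricates are included on the assumption that they pattern with stops.
EAST_TO_WEST = {
    "b": "pʰ", "d": "tʰ", "ɡ": "kʰ", "dz": "tsʰ", "dʒ": "tʃʰ",
    "p": "b",  "t": "d",  "k": "ɡ",  "ts": "dz",  "tʃ": "dʒ",
}

def east_to_west(phones):
    """Map an Eastern phone sequence to its Western counterpart,
    leaving segments outside the correspondence unchanged."""
    return [EAST_TO_WEST.get(p, p) for p in phones]

print(east_to_west(["b", "a", "t"]))  # -> ['pʰ', 'a', 'd']
```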
Yeah, so what I did for all the MFA dictionaries/acoustic models was:
So I would say, as long as the speaker information is somewhere in the corpus, it should be possible to generate the speaker-dictionary mapping (that's what I've done with Common Voice and other corpora), and then have a fallback that contains all variants for speakers that are not specified. In terms of data, Common Voice has 2 hours, and it's the only corpus with Armenian that I can find on OpenSLR and open-speech-corpora, so getting as much data from other sources that you know of would make for a much better acoustic model. (I'll also try to expand this walkthrough into an actual docs page as a concrete example of end-to-end training.)
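As a sketch of the speaker-dictionary mapping idea: assuming a metadata TSV with hypothetical `speaker` and `dialect` columns (the real Common Voice field names differ and would need checking), and assuming the aligner accepts a yaml file mapping speaker names to dictionary paths with a `default` fallback, the mapping could be generated like this:

```python
# Sketch: build a per-speaker dictionary mapping from corpus metadata.
# Column names, file names, and the yaml layout are assumptions here:
#   default: <fallback dict containing all variants>
#   <speaker>: <dialect-specific dict>
import csv

DICTS = {"hye": "hye_mfa.dict", "hyw": "hyw_mfa.dict"}  # hypothetical paths

def write_speaker_mapping(metadata_tsv: str, out_yaml: str) -> None:
    mapping = {}
    with open(metadata_tsv, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("dialect") in DICTS:
                mapping[row["speaker"]] = DICTS[row["dialect"]]
    with open(out_yaml, "w", encoding="utf-8") as out:
        # Fallback with both dialects' variants for unlabeled speakers.
        out.write("default: hy_combined_mfa.dict\n")
        for speaker, path in sorted(mapping.items()):
            out.write(f"{speaker}: {path}\n")

write_speaker_mapping("speakers.tsv", "speaker_dictionaries.yaml")
```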
Thank you for flagging this, and yes, we were planning on updating that model. (Michael, if you're already on this, let us know!) Ultimately our goal is to have good G2P for the given audio recording for downstream phonetic analysis. If Eastern and Western Armenian are that different, and it's possible to split the Common Voice dataset into the two dialects, that would be great. It looks like participants did not report their accent, at least in the Common Voice v7 release, but perhaps that's something the Common Voice folks could sort out post hoc or for the future.
Also, reading your original comment more closely, it looks like we will struggle to add the schwas with XPF alone. (The affricate issue was a missing line in our Python script.) I think the WikiPron route might be preferable at this point; the bigger problem now is figuring out which dialect to use for each recording given the lack of metadata. I wonder if we could build a dialect classifier from this information about stop voicing, or run a first pass that allows both pronunciations in the lexicon.
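One cheap version of the "first pass with both pronunciations" idea is to union the two dialect lexicons so every word carries both its hye and hyw variants and the aligner chooses per token. A sketch, again with hypothetical file names:

```python
# Sketch: merge the hye and hyw lexicons so each word lists every variant,
# for a first alignment pass that allows both pronunciations.
def load_lexicon(path):
    lex = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, pron = line.rstrip("\n").split("\t", 1)
            lex.setdefault(word, set()).add(pron)
    return lex

merged = {}
for path in ("hye_mfa.dict", "hyw_mfa.dict"):  # hypothetical file names
    for word, prons in load_lexicon(path).items():
        merged.setdefault(word, set()).update(prons)

with open("hy_combined_mfa.dict", "w", encoding="utf-8") as out:
    for word in sorted(merged):
        for pron in sorted(merged[word]):
            out.write(f"{word}\t{pron}\n")
```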
Schwa: yeah... knowing the schwa requires a mix of phonological and morphological info. It's a pain to predict.
Classifier: the maker of the Vox corpus does have some guidelines for doing dialect splits. I can provide a list of 'rules' that distinguish the dialects, like the above voicing difference and others.
Metadata: for Vox, because it's only 2 hours, I could potentially just listen to the recordings and provide the metadata myself on whether each sentence is hye or hyw. I remember I provided audio recordings for it, but I don't know how the insides work (like where I can listen to each recording and provide metadata). I just emailed the Vox maker about this now.
PS: I emailed Michael before you first commented, offering some lists of potential audio corpora to use. I'm mostly just unsure what the minimum corpus-annotation requirements for MFA are. Like, can it be orthography-less, transcription-less, etc.?