inconsistencies in Armenian dictionary #9
Hello,
On the MFA page for Armenian, it seems the dictionary is based on Armenian transliteration instead of transcription.
If you want, I can re-transcribe your dictionary file using a mix of Wiktionary + my native judgments.
Comments
Yeah, that would be great! @echodroff and @emilyahn created that dictionary using the XPF system. I know they were updating some lexicons a while back for VoxCommunis, but it looks like Armenian hasn't been updated. WikiPron has scraped Wiktionary for Eastern Armenian and Western Armenian, so that might be an easier starting point if the Wiktionary pronunciations are a good basis. If you go this route, I'd be happy to host that as an MFA phoneset dictionary and train a corresponding model (or host one that you train).
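For reference, WikiPron's scraped files are two-column TSVs (word, then space-separated IPA segments), which is already very close to MFA's whitespace-delimited lexicon format. A minimal conversion sketch, with hypothetical file names (the actual WikiPron file names for the two dialects would need checking):

```python
# Minimal sketch: convert a WikiPron two-column TSV (word<TAB>space-separated
# IPA segments) into an MFA-style "word<TAB>phone phone ..." lexicon.
# File names below are hypothetical placeholders.
def wikipron_to_mfa(tsv_path: str, dict_path: str) -> None:
    with open(tsv_path, encoding="utf-8") as src, \
         open(dict_path, "w", encoding="utf-8") as dst:
        for line in src:
            word, pron = line.rstrip("\n").split("\t", 1)
            # WikiPron already space-separates segments, so each row
            # maps straight onto one MFA dictionary entry.
            dst.write(f"{word}\t{pron}\n")

wikipron_to_mfa("hye_broad.tsv", "hye_mfa.dict")
wikipron_to_mfa("hyw_broad.tsv", "hyw_mfa.dict")
```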
The Wiktionary data is pretty reliable (I cleaned it up in 2021). However, the two dialects have radically different transcriptions for the same word: basically, any voiced plosive in Eastern is voiceless aspirated in Western, while any voiceless unaspirated plosive in Eastern is voiced in Western. So using separate models (a hye one and a hyw one) may be wise. A complication, though, is that the Vox recordings pooled hye and hyw speakers together. Perhaps I can ask the maker of the recordings (whom I know) whether I can break up their corpus into the two dialects? I can also provide audio archives of both hye and hyw speech if the Vox corpus isn't big enough. I'm new to MFA (still doing the tutorials). When you say "that route", do you mean just taking all the Wiktionary words as the pronunciation dictionary and then you guys re-run the models? I'm happy to help in any way, even if it's just correcting the existing pronunciation dictionary that you guys have.
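For concreteness, the correspondence described above can be written down as a simple segment map. This is only a toy sketch: the segment inventory is illustrative rather than a complete phonology of either dialect, and including the affricates is an assumption that they pattern with the plain stops.

```python
# Toy sketch of the Eastern/Western correspondence described above:
# Eastern voiced plosives correspond to Western voiceless aspirated ones,
# and Eastern voiceless unaspirated plosives correspond to Western voiced.
# Affricates are included on the assumption that they pattern with stops.
EAST_TO_WEST = {
    "b": "pʰ", "d": "tʰ", "ɡ": "kʰ", "dz": "tsʰ", "dʒ": "tʃʰ",
    "p": "b",  "t": "d",  "k": "ɡ",  "ts": "dz",  "tʃ": "dʒ",
}

def east_to_west(phones):
    """Map an Eastern phone sequence to its Western counterpart,
    leaving segments outside the correspondence unchanged."""
    return [EAST_TO_WEST.get(p, p) for p in phones]

print(east_to_west(["b", "a", "t"]))  # -> ['pʰ', 'a', 'd']
```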
Yeah, so what I did for all the MFA dictionaries/acoustic models was:
So I would say, as long as the speaker information is somewhere in the corpus, it should be possible to generate the speaker-dictionary mapping (that's what I've done with Common Voice and other corpora), and then have a fallback that contains all variants for speakers that are not specified. In terms of data, Common Voice has 2 hours, and it's the only corpus with Armenian that I can find on OpenSLR and open-speech-corpora, so getting as much data from other sources that you know of would make for a much better acoustic model. (I'll also try to expand this walkthrough into an actual docs page as a concrete example of end-to-end training.)
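As a sketch of the speaker-dictionary mapping idea: assuming a metadata TSV with hypothetical `speaker` and `dialect` columns (the real Common Voice field names differ and would need checking), and assuming the aligner accepts a yaml file mapping speaker names to dictionary paths with a `default` fallback, the mapping could be generated like this:

```python
# Sketch: build a per-speaker dictionary mapping from corpus metadata.
# Column names, file names, and the yaml layout are assumptions here:
#   default: <fallback dict containing all variants>
#   <speaker>: <dialect-specific dict>
import csv

DICTS = {"hye": "hye_mfa.dict", "hyw": "hyw_mfa.dict"}  # hypothetical paths

def write_speaker_mapping(metadata_tsv: str, out_yaml: str) -> None:
    mapping = {}
    with open(metadata_tsv, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("dialect") in DICTS:
                mapping[row["speaker"]] = DICTS[row["dialect"]]
    with open(out_yaml, "w", encoding="utf-8") as out:
        # Fallback with both dialects' variants for unlabeled speakers.
        out.write("default: hy_combined_mfa.dict\n")
        for speaker, path in sorted(mapping.items()):
            out.write(f"{speaker}: {path}\n")

write_speaker_mapping("speakers.tsv", "speaker_dictionaries.yaml")
```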
Thank you for flagging this, and yes, we were planning on updating that model. (Michael, if you're already on this, let us know!) Ultimately our goal is to have good G2P for the given audio recording for downstream phonetic analysis. If Eastern and Western Armenian are that different, and it's possible to split the Common Voice dataset into the two dialects, that would be great. It looks like participants did not report their accent, at least in the Common Voice v7 release, but perhaps that's something the Common Voice folks could sort out post hoc or for the future.
Also, reading your original comment more closely, it looks like we will struggle to add the schwas with XPF alone. (The affricate issue was a missing line in our Python script.) I think the WikiPron route might be preferable at this point; the bigger problem now is figuring out which dialect to use for each recording given the lack of metadata. I wonder if we could build a dialect classifier from this information about stop voicing, or run a first pass that allows both pronunciations in the lexicon.
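One cheap version of the "first pass with both pronunciations" idea is to union the two dialect lexicons so every word carries both its hye and hyw variants and the aligner chooses per token. A sketch, again with hypothetical file names:

```python
# Sketch: merge the hye and hyw lexicons so each word lists every variant,
# for a first alignment pass that allows both pronunciations.
def load_lexicon(path):
    lex = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, pron = line.rstrip("\n").split("\t", 1)
            lex.setdefault(word, set()).add(pron)
    return lex

merged = {}
for path in ("hye_mfa.dict", "hyw_mfa.dict"):  # hypothetical file names
    for word, prons in load_lexicon(path).items():
        merged.setdefault(word, set()).update(prons)

with open("hy_combined_mfa.dict", "w", encoding="utf-8") as out:
    for word in sorted(merged):
        for pron in sorted(merged[word]):
            out.write(f"{word}\t{pron}\n")
```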
Schwa: yeah... knowing the schwa requires a mix of phonological and morphological info. It's a pain to predict.
Classifier: the maker of the Vox corpus does have some guidelines for doing dialect splits. I can provide a list of 'rules' that distinguish the dialects, like the above voicing difference and others.
Metadata: for Vox, because it's only 2 hours, I could potentially just listen to the recordings and provide the metadata myself on whether each sentence is hye or hyw. I remember I provided audio recordings for it, but I don't know how the insides work (like where I can listen to each recording and provide metadata). I just emailed the Vox maker about this now.
PS: I emailed Michael before you first commented, offering some lists of potential audio corpora to use. I'm mostly just unsure what the minimum corpus-annotation requirements for MFA are. Like, can it be orthography-less, transcription-less, etc.?