Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists

Requirements

The implementation was developed in a conda environment running the latest release of Python 3.8. After installing Anaconda or Miniconda, create and activate the environment:

conda create -n env python=3.8
conda activate env

Most of the required packages can be installed from the requirements.txt file with pip:

pip install -r requirements.txt

Unzip the clts zip folder from the OSF project, or download CLTS version 2.2.0:

wget https://github.com/cldf-clts/clts/archive/refs/tags/v2.2.0.zip
unzip v2.2.0.zip

Furthermore, the Lexibank edition of the NorthEuraLex dataset (NorthEuraLex version 0.9) has to be installed. A copy with some adjustments to the Manchu data is also provided (these changes are also available when downloading the git tree rather than the release version). To obtain the release from GitHub, where the Lexibank repository tags it as v4.0:

wget https://github.com/lexibank/northeuralex/archive/refs/tags/v4.0.zip
unzip v4.0.zip
cd northeuralex-4.0
pip install .

Once the dataset is installed, configure the path to the CLTS data: run cldfbench catconfig, then enter the absolute path to the CLTS directory in the config file at /home/$USER/.config/cldf/catalog.ini:

[clones]
clts = /path/to/clts
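If you prefer to set this entry from a script rather than by hand, the following is a minimal sketch using only Python's standard configparser module. The function name set_clts_path is hypothetical and not part of this package or of cldfbench; the sketch assumes the config file is plain INI with a [clones] section, as shown above.

```python
import configparser
from pathlib import Path


def set_clts_path(config_file, clts_dir):
    """Store the absolute path of clts_dir under [clones] in an INI file.

    Hypothetical helper for illustration; existing entries are kept.
    """
    config_file = Path(config_file)
    config_file.parent.mkdir(parents=True, exist_ok=True)

    # Disable interpolation so that a '%' in a path is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_file)  # a missing file is skipped without error

    if not parser.has_section("clones"):
        parser.add_section("clones")
    parser.set("clones", "clts", str(Path(clts_dir).resolve()))

    with config_file.open("w") as handle:
        parser.write(handle)
    return parser.get("clones", "clts")


# Example call (paths are placeholders, adjust to your system):
# set_clts_path(Path.home() / ".config" / "cldf" / "catalog.ini", "/path/to/clts")
```

Writing the resolved path keeps the entry absolute even if the function is called with a relative directory.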

Then, cd back to the location of this package and install it:

cd path/to/repo
pip install .

And you're set!

Training

The models can be trained by running the notebook train_nelex.ipynb. The models, along with the data used for training and testing, are saved automatically in notebooks/out/. Alternatively, you can use the pretrained models by downloading the nelex_unique zip folder and extracting it into the out folder. Pretrained models are provided only for NELEX10.
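Extracting the pretrained models can also be scripted. The sketch below uses Python's standard zipfile module; the function name extract_pretrained and the archive file name nelex_unique.zip are assumptions for illustration, so substitute the actual name of the downloaded file.

```python
import zipfile
from pathlib import Path


def extract_pretrained(zip_path, out_dir):
    """Extract a zip archive into out_dir and return the archive's member names.

    Hypothetical helper; out_dir is created if it does not exist yet.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())


# Example call (the archive name is an assumption):
# extract_pretrained("nelex_unique.zip", "notebooks/out")
```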

Analysis

Each notebook performs one analysis, in most cases for a single language. The results are output as LaTeX tables and plots (used with minimal changes in the thesis document itself). The table below maps each notebook to its type of analysis:

| Notebook | Experiment | Description |
| --- | --- | --- |
| all.ipynb | Masking | Compares mean surprisal in the vowel-only and consonant-only conditions for all languages in NorthEuraLex |
| nelex10.ipynb | Masking | Evaluates surprisal in the masking experiments for the languages in NELEX10 |
| finnish.ipynb | Harmony | Feature surprisal for the Finnish ±BACK feature |
| hungarian.ipynb | Harmony | Feature surprisal for the Hungarian ±BACK feature |
| turkish.ipynb | Harmony | Feature surprisal for the Turkish ±BACK and ±ROUND features |
| manchu.ipynb | Harmony | Feature surprisal for the Manchu ±BACK feature |
| khalkha_mongolian.ipynb | Harmony | Feature surprisal for the Khalkha Mongolian ±ATR and ±ROUND features |
| non_vh_langs.ipynb | Harmony | Feature surprisal for languages without vowel harmony, for the ±BACK and ±ROUND features |
| surprisal_reduction.ipynb | | Plots the difference between the harmonic and disharmonic distributions for all feature-language combinations in NELEX10 |
| vh_vs_non_vh.ipynb | | Plots mean differences for the 5 vowel harmony languages for ±BACK and ±ROUND, by feature |
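For readers unfamiliar with the quantity the notebooks report, surprisal is the negative log probability of an event. The sketch below illustrates the standard definition in bits and a mean difference between two sets of probabilities. It is not taken from the notebooks: the function names are hypothetical, and the direction of the difference (disharmonic minus harmonic) is an assumption.

```python
import math


def surprisal(probability):
    """Surprisal in bits: -log2(p) for a probability in (0, 1]."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def mean_surprisal(probabilities):
    """Average surprisal over a non-empty sequence of probabilities."""
    values = [surprisal(p) for p in probabilities]
    return sum(values) / len(values)


def surprisal_difference(harmonic_probs, disharmonic_probs):
    """Mean disharmonic surprisal minus mean harmonic surprisal.

    A positive value means the disharmonic cases are, on average,
    less expected than the harmonic ones (assumed sign convention).
    """
    return mean_surprisal(disharmonic_probs) - mean_surprisal(harmonic_probs)
```

With toy values, an event of probability 0.5 carries 1 bit of surprisal and one of probability 0.25 carries 2 bits, so their difference under this convention is 1 bit.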