Data
The CSLU Foreign Accented English dataset is located in /home/sravana/data/cslu_fae_corpus on tempest. See the index.html file there for details (note that some of the folders are intentionally empty).
PLP features with delta, delta-delta, and delta-delta-delta coefficients are extracted in HTK binary format into the plp directory, and parsed into a text format readable by numpy.loadtxt in the plptxt directory (note the name change from npytxt).
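For reference, the conversion from HTK binary to text can be sketched roughly as below. This is not the script that produced plptxt; it is a minimal sketch that assumes the files are uncompressed, big-endian HTK parameter files (the standard 12-byte header followed by 4-byte floats). The function names `read_htk` and `htk_to_text` are hypothetical.

```python
import struct

import numpy as np

HTK_COMPRESSED = 0o2000  # the _C qualifier bit in the HTK parmKind field


def read_htk(path):
    """Read an uncompressed HTK parameter file into a (frames, dims) array."""
    with open(path, "rb") as f:
        # Header: nSamples (int32), sampPeriod (int32, 100 ns units),
        # sampSize (int16, bytes per frame), parmKind (int16), big-endian.
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihH", f.read(12)
        )
        if parm_kind & HTK_COMPRESSED:
            raise ValueError("compressed HTK files are not handled in this sketch")
        dim = samp_size // 4  # each coefficient is a 4-byte float
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    # Convert from big-endian to the native float32 layout.
    return data.reshape(n_samples, dim).astype(np.float32)


def htk_to_text(htk_path, txt_path):
    """Write the features as whitespace-separated text for numpy.loadtxt."""
    np.savetxt(txt_path, read_htk(htk_path))
```

A file written this way can be read back with `numpy.loadtxt(txt_path)`, giving one row per frame.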
Since the text format files are too large, I've also saved them in .npz (numpy compressed format), readable by numpy.load, in the npz directory. Compression is about 20%. numpy.load('AR.npz'), for example, will return a dictionary-like object mapping the filenames in the AR directory to their numpy arrays.
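A minimal sketch of loading one language's archive into a plain dictionary follows. The helper name `load_language` is hypothetical, and the sketch assumes the archive keys are the utterance filenames as described above.

```python
import numpy as np


def load_language(npz_path):
    """Return {utterance name: feature array} for one language archive."""
    # numpy.load on an .npz file returns a lazy NpzFile; indexing it by key
    # reads that array from disk. The context manager closes the file.
    with np.load(npz_path) as archive:
        return {name: archive[name] for name in archive.files}
```

For example, `features = load_language('AR.npz')` followed by `features[name].shape` gives the (frames, dimensions) shape of one utterance. Building the dictionary reads every array into memory, so for large archives it may be preferable to keep the NpzFile open and index it on demand.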
Gender information for each language is in the gender directory.
traintestsplit in the repository contains a fixed split of training and testing data. Let's use this for consistency. Cross-validation is too slow in the experimental stage.
An undergraduate student at Swarthmore apparently crowdsourced transcriptions for part of the corpus for this thesis! See http://www.swarthmore.edu/sites/default/files/assets/documents/linguistics/2009_AkasakaRyo.pdf Unfortunately, it seems that the coverage is very low. Transcriptions have been uploaded to tempest.
- Nationwide Speech Project Corpus (American dialects). http://www.ling.ohio-state.edu/~cclopper/nsp/
- CU Accents Dataset (corpus used by Angkititrakul and Hansen, 2006). Can't find it online.