Sravana Reddy edited this page Nov 14, 2015 · 9 revisions

CSLU FAE

The CSLU Foreign Accented English dataset is located in

/home/sravana/data/cslu_fae_corpus 

on tempest. See the index.html file for details (note that some of the folders are intentionally empty).

PLP features with first-, second-, and third-order deltas (PLP+delta+deltadelta+deltadeltadelta) are extracted in HTK binary format into the plp directory, and parsed into a text format readable by numpy.loadtxt in the plptxt directory (note the name change from npytxt).

Since the text-format files are very large, I've also saved them as .npz files (NumPy's compressed archive format) in the npz directory, readable by numpy.load. Compression is about 20%.
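For reference, here is a minimal sketch of how a directory of text-format feature files could be packed into one compressed archive. The function name pack_directory and the example paths are hypothetical, and this is not necessarily how the existing archives were built; it assumes each text file holds one plain numeric matrix.

```python
import os
import numpy as np

def pack_directory(txt_dir, out_path):
    """Load every text feature file in txt_dir and save them all to one .npz archive.

    pack_directory is a hypothetical helper, not part of the repository.
    """
    arrays = {}
    for name in sorted(os.listdir(txt_dir)):
        path = os.path.join(txt_dir, name)
        if os.path.isfile(path):
            # Each file is assumed to be a plain whitespace-separated numeric matrix.
            arrays[name] = np.loadtxt(path)
    # Archive keys are the original filenames.
    np.savez_compressed(out_path, **arrays)
    return sorted(arrays)
```

Something like pack_directory('plptxt/AR', 'npz/AR.npz') would then produce one archive per language directory, assuming the text files are grouped that way.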

numpy.load('AR.npz'), for example, returns a dictionary-like object (an NpzFile) mapping the filenames in the AR directory to their numpy arrays.
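A short sketch of reading one archive into an ordinary dict follows. The helper name load_features is hypothetical. The object returned by numpy.load reads arrays lazily on access, so copying them into a dict lets the file be closed right away.

```python
import numpy as np

def load_features(npz_path):
    """Return a plain dict mapping each stored filename to its feature array.

    load_features is a hypothetical helper, not part of the repository.
    """
    with np.load(npz_path) as archive:
        # archive.files lists the keys stored in the .npz file.
        return {name: archive[name] for name in archive.files}
```

For example, features = load_features('AR.npz') followed by features[some_filename].shape gives the dimensions of one utterance's feature matrix, where some_filename is any key in the archive.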

Gender information for each language is in the gender directory.

traintestsplit in the repository contains a fixed split of training and testing data. Let's use this for consistency; k-fold cross-validation is too slow at the experimental stage.

Transcriptions

An undergraduate student at Swarthmore apparently crowdsourced transcriptions for part of the corpus for this thesis: http://www.swarthmore.edu/sites/default/files/assets/documents/linguistics/2009_AkasakaRyo.pdf Unfortunately, the coverage seems to be very low. The transcriptions have been uploaded to tempest.

Other corpora to investigate

  1. Nationwide Speech Project Corpus (American dialects). http://www.ling.ohio-state.edu/~cclopper/nsp/
  2. CU Accents Dataset (corpus used by Angkititrakul and Hansen, 2006). Can't find it online.
