Where to obtain datasets for training? #62
Hi, Europarl can be downloaded from here: http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz

The TED dataset was preprocessed by the authors of http://www.lrec-conf.org/proceedings/lrec2016/pdf/103_Paper.pdf, and the resulting dataset is shared at: https://drive.google.com/file/d/0B13Cc1a7ebTuMElFWGlYcUlVZ0k/view
Thanks. However, how do you use the converter.py script on those archives? Each archive contains multiple files; for example, the LREC archive contains dev2012, test2011, test2011asr, and train2012. I'm not sure what the difference is between test2011 and test2011asr. The readme only says the latter is "for ASR output", which isn't very informative. Do I need to convert all of these files? And how do I combine this with the Europarl data? There appears to be only one file, europarl-v7.en, and it is in a very different format from the LREC files: it contains full sentences, whereas the LREC files appear to contain pairs of tokens.
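For what it's worth, the format difference above can be sketched in a few lines. Assuming the LREC-style files pair each token with the punctuation mark that follows it (with a placeholder like "O" when there is none), a full-sentence Europarl line could be turned into such pairs roughly as follows. The pair format and the helper name are illustrative guesses, not the repo's actual converter.py logic:

```python
import re

def sentence_to_pairs(sentence):
    """Convert a full sentence into (token, punctuation) pairs.

    Hypothetical format: each lowercased word is paired with the
    punctuation mark that follows it, or "O" if none follows.
    """
    pairs = []
    # Match a word plus an optional trailing punctuation mark.
    for match in re.finditer(r"([\w'-]+)([.,;:?!]?)", sentence):
        word, punct = match.group(1), match.group(2)
        pairs.append((word.lower(), punct if punct else "O"))
    return pairs

print(sentence_to_pairs("Hello, world."))
# [('hello', ','), ('world', '.')]
```

Something along these lines would at least let you bring the full-sentence Europarl file into the same shape as the token-pair files before feeding everything to the same pipeline.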
Never mind, I went through the scripts in ./examples and figured out how to preprocess the raw datasets. I put the train/dev/test files for both the TED and Europarl corpora in the same directory so that data.py would include them all. Is that copacetic? I'm now training a model using the recommended …
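For anyone else doing this, one way to set up that shared directory (a sketch only; merge_corpora, the file names, and the single-file-per-split layout are my assumptions, not necessarily what data.py expects) is to concatenate the matching split from each corpus into one file per split:

```python
import shutil
from pathlib import Path

def merge_corpora(sources, dest_dir, split="train"):
    """Concatenate the same split from several corpora into one file.

    `sources` lists the files for one split (e.g. the TED train2012
    file and a converted Europarl train file); the names and layout
    here are illustrative, not the repo's actual convention.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    out_path = dest_dir / f"{split}.txt"
    with out_path.open("w", encoding="utf-8") as out:
        for src in sources:
            with open(src, encoding="utf-8") as f:
                shutil.copyfileobj(f, out)  # append this corpus verbatim
    return out_path
```

Run once per split (train/dev/test) and point the training script at `dest_dir`, assuming it simply reads every file it finds there.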
In your README, you say you trained your model on the TED and Europarl datasets. Where did you obtain these? I can't find any public download links for anything matching those names.
I'd like to train my own model, using those as a starting point, but these datasets don't seem to exist anywhere.