Brendan Tomoschuk and Jarrett Lovelett's entry to the Duolingo Shared Task on Second Language Acquisition Modeling. See the conference paper in Documents/. Given basic user data and a limited feature set, the goal is to predict the probability of translation errors at the token level.
Clone this repository and then download the provided datasets from here. Save data to Data/, preserving the following directory structure (keep existing filenames):
/DuolingoSharedTask
    /Data
        /data_en_es
        /data_es_en
        /data_fr_en
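Before running the notebooks, it can help to confirm that the data landed in the expected place. The snippet below is a minimal sketch that checks for the three sub-directories listed above; the function name `missing_data_dirs` is hypothetical and not part of this repository, and it checks only the directories, not the files inside them.

```python
from pathlib import Path

# Sub-directory names taken from the layout above.
EXPECTED_DIRS = ("data_en_es", "data_es_en", "data_fr_en")


def missing_data_dirs(repo_root):
    """Return the expected Data/ sub-directories that do not exist yet."""
    data_dir = Path(repo_root) / "Data"
    return [name for name in EXPECTED_DIRS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_data_dirs(".")
    if missing:
        print("Missing under Data/:", ", ".join(missing))
    else:
        print("Data/ layout looks complete.")
```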
makeDataFrame.ipynb: a notebook that reads in the data, processes it, generates new features, and saves the datasets using pickle (one for each target language).
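The "one pickle per language" step can be sketched as follows. This is an illustration only, not the notebook's code: the helper names, the `<track>.pkl` file naming, and the toy rows are all assumptions, and a plain list of dicts stands in for whatever data structure the notebook actually saves.

```python
import pickle
from pathlib import Path


def save_per_language(datasets, out_dir):
    """Pickle each processed dataset to its own file; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for track, data in datasets.items():
        path = out_dir / f"{track}.pkl"  # hypothetical file naming
        with open(path, "wb") as fh:
            pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
        paths[track] = path
    return paths


def load_language(path):
    """Read one pickled dataset back from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Toy stand-in rows; the real features come from the notebook.
    toy = {
        "en_es": [{"token": "the", "label": 0}],
        "fr_en": [{"token": "le", "label": 1}],
    }
    written = save_per_language(toy, "pickled_demo")
    print(load_language(written["en_es"]))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.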
buildModel-forest.ipynb: a notebook that reads in the pickled data, builds a separate random forest classifier for each language, and generates predictions for the test set. It also has options to predict the dev set instead of the test set (training on the training data only), or to add the dev set to the training data.
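The training options described above can be sketched for a single language as follows. This is a rough illustration, not the notebook's code: it assumes scikit-learn's `RandomForestClassifier` (the README does not name the library), and the function name, the `mode` values, the hyperparameters, and the synthetic features are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_and_predict(train, dev, X_test, mode="test", seed=0):
    """Fit a random forest and return P(label == 1) for the chosen target set.

    mode="dev":       train on train, predict dev
    mode="test":      train on train, predict test
    mode="train+dev": train on train and dev combined, predict test
    """
    X_train, y_train = train
    X_dev, y_dev = dev
    if mode == "dev":
        X_fit, y_fit, X_eval = X_train, y_train, X_dev
    elif mode == "test":
        X_fit, y_fit, X_eval = X_train, y_train, X_test
    elif mode == "train+dev":
        X_fit = np.vstack([X_train, X_dev])
        y_fit = np.concatenate([y_train, y_dev])
        X_eval = X_test
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_fit, y_fit)
    # Pick the predict_proba column that corresponds to class 1 (an error).
    error_col = list(clf.classes_).index(1)
    return clf.predict_proba(X_eval)[:, error_col]


if __name__ == "__main__":
    # Synthetic features and labels, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
    probs = fit_and_predict((X[:200], y[:200]), (X[200:250], y[200:250]),
                            X[250:], mode="train+dev")
    print(probs[:5])
```

To mirror the per-language setup, call the function once for each language's dataset rather than pooling the languages.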
See Data/starter_code for baseline.py, a baseline model provided by the shared task organizers (it reads in the raw data directly and does not use makeDataFrame.ipynb), and eval.py, a script that evaluates the predictions generated by the baseline model and/or buildModel-forest.ipynb and reports several metrics.
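For a quick sanity check on token-level predictions without the starter code, the sketch below computes three common binary-classification metrics in pure Python. It does not reproduce eval.py: the choice of AUROC, F1 and average log loss, the 0.5 threshold, and the function names are assumptions made for illustration. Labels are taken to be 1 for an error and 0 otherwise.

```python
import math


def auroc(labels, probs):
    """Probability that a random positive outranks a random negative (ties count 0.5).

    Pairwise O(n^2) version, fine for small checks only.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, probs, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def avg_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.35, 0.8]
    print(auroc(labels, probs), f1_score(labels, probs), avg_log_loss(labels, probs))
```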