Create script to export tagged corpus #4

turicas · 2013-04-12T11:14:35Z

After processing all uploaded documents, we need to create a script that queries MongoDB for all tagged Wikipedia articles and exports them to a given format (one that can be read by NLTK).

turicas · 2013-05-03T14:34:20Z

After studying the possibilities and talking with Steven Bird about this topic, the best approach seems to be:

Create a customized CorpusReader class backed by a local database (SQLite or DBM, because both are available in the Python standard library), since there are too many documents to just leave them on the filesystem (an example of a customized class using MongoDB);
Export all the data that is in MongoDB to this new data structure (if possible, compressing the contents of the file);
Create a pull request on the nltk project adding the new CorpusReader;
Create a pull request on the nltk-data project adding the new pt_wikipedia corpus.
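
To make the first two steps concrete, here is a minimal sketch of what a SQLite-backed store with compressed contents could look like. It is only an illustration under assumptions: the class name `SQLiteTaggedCorpus`, the `documents` table, and the JSON-plus-zlib encoding are all hypothetical, and the class does not subclass NLTK's `CorpusReader`; it just mimics the `fileids()` / `tagged_words()` style of access. The real export would read the documents from MongoDB instead of taking them as arguments.

```python
import json
import sqlite3
import zlib


class SQLiteTaggedCorpus:
    """Hypothetical SQLite-backed store of tagged documents (not NLTK API)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "fileid TEXT PRIMARY KEY, tagged BLOB NOT NULL)"
        )

    def add(self, fileid, tagged_words):
        # Serialize the (word, tag) pairs as JSON and compress them with zlib.
        payload = zlib.compress(json.dumps(tagged_words).encode("utf-8"))
        # The PRIMARY KEY makes a duplicate fileid raise sqlite3.IntegrityError.
        with self.conn:
            self.conn.execute(
                "INSERT INTO documents (fileid, tagged) VALUES (?, ?)",
                (fileid, payload),
            )

    def fileids(self):
        rows = self.conn.execute("SELECT fileid FROM documents ORDER BY fileid")
        return [row[0] for row in rows]

    def tagged_words(self, fileid):
        row = self.conn.execute(
            "SELECT tagged FROM documents WHERE fileid = ?", (fileid,)
        ).fetchone()
        if row is None:
            raise KeyError(fileid)
        # Decompress and decode, restoring the pairs as tuples.
        pairs = json.loads(zlib.decompress(row[0]).decode("utf-8"))
        return [tuple(pair) for pair in pairs]
```

Keying the table on the document identifier means a duplicated article fails loudly at export time instead of ending up twice in the corpus, and compressing each document separately keeps single-document reads cheap.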

fccoelho · 2013-05-03T15:44:40Z

Looks good, are you going to settle on SQLite or DBM?

turicas · 2013-05-06T16:30:50Z

SQLite for now, since DBM lacks some useful features of key-value stores.

turicas · 2013-05-08T09:23:36Z

Done in 6424741, but it still needs to run on the server (fab export) with no errors in the corpus (no duplicates, all pipelines run, etc.). Depends on #8, #10 and #11 before it can be re-run.

ghost assigned turicas May 3, 2013
