Create script to export tagged corpus #4

Open
turicas opened this issue Apr 12, 2013 · 4 comments

turicas commented Apr 12, 2013

After processing all uploaded documents, we need to create a script that queries MongoDB for all tagged Wikipedia articles and exports them to a format that NLTK can read.
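
A minimal sketch of what such an export could look like, not the project's actual script. It assumes each document is a dict with a `title` and a list of `(token, tag)` pairs under `tagged`; both field names are hypothetical, since the real MongoDB schema is not shown in this issue. The output is whitespace-separated word/TAG text, which is the format NLTK's TaggedCorpusReader reads by default. Any iterable of dicts works, so a pymongo cursor could probably be passed in directly.

```python
import os
import re


def to_tagged_line(tagged_tokens, sep="/"):
    """Render [(token, tag), ...] as 'token/TAG token/TAG ...'."""
    return " ".join(f"{token}{sep}{tag}" for token, tag in tagged_tokens)


def export_tagged_corpus(documents, output_dir):
    """Write one word/TAG text file per document and return the paths.

    `documents` is any iterable of dicts. The 'title' and 'tagged' keys
    are assumptions made for this sketch, not the project's real schema.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for index, doc in enumerate(documents):
        # Build a filesystem-safe file name from the article title.
        safe_title = re.sub(r"[^\w-]+", "_", doc["title"]).strip("_")
        path = os.path.join(output_dir, f"{index:06d}_{safe_title}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(to_tagged_line(doc["tagged"]) + "\n")
        paths.append(path)
    return paths
```

This writes one file per article, which is exactly what becomes unwieldy with a large number of documents, and tags must not contain the separator character.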

@ghost ghost assigned turicas May 3, 2013

turicas commented May 3, 2013

After studying the possibilities and talking with Steven Bird about this topic, the best approach seems to be:

  • Create a customized CorpusReader class backed by a local database (SQLite or DBM, because both are available in the Python standard library), since there are too many documents to simply leave them on the filesystem (an example of a customized class using MongoDB);
  • Export all the data stored in MongoDB to this new data structure (if possible, compressing the contents of the file);
  • Create a pull request on the nltk project adding the new CorpusReader;
  • Create a pull request on the nltk-data project adding the new pt_wikipedia corpus.
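
The first two steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the class that was eventually written: `SQLiteTaggedCorpus`, the `documents` table, and its columns are hypothetical names, and the class does not subclass NLTK's CorpusReader. The method names `fileids` and `tagged_words` are chosen only to mirror the ones NLTK corpus readers expose. Each document is stored as zlib-compressed JSON to keep the file small.

```python
import json
import sqlite3
import zlib


class SQLiteTaggedCorpus:
    """Hypothetical SQLite-backed store; not NLTK's CorpusReader API."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "fileid TEXT PRIMARY KEY, tagged BLOB NOT NULL)"
        )

    def add(self, fileid, tagged_tokens):
        """Store [(token, tag), ...] as zlib-compressed JSON."""
        payload = zlib.compress(json.dumps(tagged_tokens).encode("utf-8"))
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO documents (fileid, tagged) VALUES (?, ?)",
                (fileid, payload),
            )

    def fileids(self):
        """Return all stored document identifiers, sorted."""
        rows = self._conn.execute("SELECT fileid FROM documents ORDER BY fileid")
        return [row[0] for row in rows]

    def tagged_words(self, fileid):
        """Return the (token, tag) pairs stored for one document."""
        row = self._conn.execute(
            "SELECT tagged FROM documents WHERE fileid = ?", (fileid,)
        ).fetchone()
        if row is None:
            raise KeyError(fileid)
        return [tuple(pair) for pair in json.loads(zlib.decompress(row[0]))]

    def close(self):
        self._conn.close()
```

Because `fileid` is the primary key, inserting the same document twice raises `sqlite3.IntegrityError`, which gives a cheap guard against duplicates in the exported corpus.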

fccoelho commented May 3, 2013

Looks good. Are you going to settle on SQLite or DBM?

turicas commented May 6, 2013

SQLite for now, since DBM lacks some great features of key-value stores.

turicas commented May 8, 2013

Done in 6424741, but it still needs to run on the server (fab export) with no errors in the corpus (no duplicates, all pipelines run, etc.). Depending on #8, #10 and #11 to re-run.
