After processing all uploaded documents, we need to create a script that queries MongoDB for all tagged Wikipedia articles and then exports them to a format that NLTK can read.
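A minimal sketch of what that export script could look like, assuming a local MongoDB instance with an `articles` collection whose documents carry a `tagged` field holding a list of sentences, each a list of (token, tag) pairs; the database, collection, and field names below are placeholders and would need to be adapted to the real pipeline schema:

```python
import os

from pymongo import MongoClient


def export_tagged_articles(db_name="wikipedia", out_dir="pt_wikipedia"):
    """Dump every tagged article as word/TAG text files.

    db_name, the collection name and the document fields are
    placeholders for whatever the real pipeline uses.
    """
    os.makedirs(out_dir, exist_ok=True)
    client = MongoClient()  # assumes a local mongod on the default port
    collection = client[db_name]["articles"]

    for i, doc in enumerate(collection.find({"tagged": {"$exists": True}})):
        lines = []
        for sentence in doc["tagged"]:
            # one sentence per line, tokens encoded as token/TAG
            lines.append(" ".join("%s/%s" % (token, tag)
                                  for token, tag in sentence))
        path = os.path.join(out_dir, "%06d.txt" % i)
        with open(path, "w", encoding="utf-8") as fp:
            fp.write("\n".join(lines))

    client.close()


if __name__ == "__main__":
    export_tagged_articles()
```

A directory of word/TAG files like this can already be loaded with NLTK's stock `TaggedCorpusReader(out_dir, r".*\.txt")`, which is one candidate for the "format that NLTK can read".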
After studying the possibilities and talking with Steven Bird about this topic, the best approach seems to be:
Create a customized CorpusReader class backed by a local database (SQLite or DBM, since both are available in the Python standard library), because there are too many documents to simply leave them in the filesystem (there is an example of a customized class using MongoDB); a sketch of a SQLite-backed reader follows this list;
Export all the data that is in MongoDB to this new data structure (if possible, compress the contents of this file);
Create a pull request on the nltk project adding the new CorpusReader;
Create a pull request on the nltk-data project adding the new pt_wikipedia corpus.
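To make the first item more concrete, here is a minimal sketch of a SQLite-backed reader, assuming the export step produced a single `pt_wikipedia.db` file containing a `sentences(article_id, sent)` table where each row stores one sentence as `token/TAG token/TAG ...`; the file name and schema are assumptions for illustration, not the final design:

```python
import os
import sqlite3

from nltk.corpus.reader.api import CorpusReader


class SQLiteTaggedCorpusReader(CorpusReader):
    """Reads tagged sentences out of a single SQLite database file."""

    def __init__(self, root, db_file="pt_wikipedia.db", sep="/"):
        # root is assumed to be a plain directory path containing db_file
        CorpusReader.__init__(self, root, [db_file])
        self._sep = sep
        self._db_path = os.path.join(root, db_file)

    def _rows(self):
        # yield each stored sentence string, one at a time
        conn = sqlite3.connect(self._db_path)
        try:
            for (sent,) in conn.execute("SELECT sent FROM sentences"):
                yield sent
        finally:
            conn.close()

    def tagged_sents(self):
        return [[tuple(tok.rsplit(self._sep, 1)) for tok in sent.split()]
                for sent in self._rows()]

    def sents(self):
        return [[tok.rsplit(self._sep, 1)[0] for tok in sent.split()]
                for sent in self._rows()]

    def tagged_words(self):
        return [pair for sent in self.tagged_sents() for pair in sent]

    def words(self):
        return [word for sent in self.sents() for word in sent]
```

Keeping the whole corpus in one database file would leave the nltk-data package as a single artifact that is easy to compress, which matches the second item above.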
Done on 6424741, but it still needs to be run on the server (fab export) with no errors in the corpus (no duplicates, all pipelines run, etc.). Re-running depends on #8, #10 and #11.