Create an NLTK tagger #7
I've created a script to calculate the average in-memory size of the part-of-speech data (a list of tuples); it takes ~9 min to run over 202k documents. For the 202k+ already tagged documents, the size would be something like 8.45 GB (an in-memory Python list of tuples with all the sentences of these documents). The average per document so far is 43.79 KiB, so we'll need approximately 31.56 GB of memory to store all 755,680 part-of-speech tagged documents.
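The measurement script itself isn't included in the issue, so the snippet below is only a rough sketch of how such a figure could be produced with `sys.getsizeof` (the `tagged_doc_size` helper is hypothetical), together with the arithmetic behind the 31.56 GB estimate:

```python
# Hypothetical sketch of a per-document memory estimate; the actual
# measurement script referenced in the comment is not shown here.
import sys

def tagged_doc_size(tagged_sentences):
    """Approximate in-memory size of one document's POS data:
    a list of sentences, each a list of (token, tag) tuples."""
    total = sys.getsizeof(tagged_sentences)
    for sentence in tagged_sentences:
        total += sys.getsizeof(sentence)
        for token, tag in sentence:
            total += sys.getsizeof((token, tag))
            total += sys.getsizeof(token) + sys.getsizeof(tag)
    return total

# The corpus-wide estimate quoted above:
# 755,680 documents * 43.79 KiB/document ~= 31.56 GiB.
avg_kib = 43.79
total_gib = 755_680 * avg_kib / (1024 * 1024)
print(f"Estimated total: {total_gib:.2f} GiB")
```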
Can't this be done in batches, to keep the memory usage predictable?

(Flávio Codeço Coelho, replying by email)
@fccoelho, we've already discussed this topic, and one of the project's assumptions is that we'll do it in a single pass, using all the memory we have available. A key assumption is that we're not going to use an incremental trainer (we can probably spend some time on that in the future, but it's not the goal now).
After parsing all documents we need to train an NLTK tagger with the POS data of the entire corpus (755k+ documents).
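The issue doesn't say which NLTK tagger will be used; a minimal sketch, assuming a standard unigram/bigram backoff chain trained on tagged sentences collected from the corpus (the `tagged_sentences` sample below is illustrative only), could look like this:

```python
# Minimal sketch: train an NLTK backoff tagger on (token, tag) data.
# The specific tagger chain and the sample data are assumptions, not
# the project's actual choice.
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# tagged_sentences: list of sentences, each a list of (token, tag) tuples,
# gathered from all part-of-speech tagged documents.
tagged_sentences = [
    [("O", "ART"), ("processo", "N"), ("foi", "V"), ("julgado", "PCP")],
]

default = DefaultTagger("N")                       # fallback tag for unknown tokens
unigram = UnigramTagger(tagged_sentences, backoff=default)
bigram = BigramTagger(tagged_sentences, backoff=unigram)

print(bigram.tag(["O", "processo"]))
```

The trained tagger can then be pickled once and reused, which matches the single-pass, train-once approach described above.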