
Create an NLTK tagger #7

Open
turicas opened this issue Apr 19, 2013 · 3 comments
turicas commented Apr 19, 2013

After parsing all documents, we need to train an NLTK tagger with the POS data of the entire corpus (755k+ documents).
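For reference, one common way to train a tagger in NLTK is a backoff chain of n-gram taggers. A minimal sketch, where the tagged sentences and the tagset are illustrative placeholders rather than the project's actual corpus:

```python
# Sketch: train an NLTK n-gram tagger on POS-tagged sentences.
# `tagged_sents` stands in for the corpus data (lists of (word, tag) tuples).
import nltk

tagged_sents = [
    [("O", "ART"), ("gato", "N"), ("dorme", "V")],
    [("A", "ART"), ("casa", "N"), ("caiu", "V")],
]

# Back off from bigram to unigram to a default tag for unseen words.
default = nltk.DefaultTagger("N")
unigram = nltk.UnigramTagger(tagged_sents, backoff=default)
bigram = nltk.BigramTagger(tagged_sents, backoff=unigram)

print(bigram.tag(["O", "gato", "dorme"]))
```

Note that these trainers consume all the tagged sentences at construction time, which is why the memory estimates below matter.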

@ghost ghost assigned turicas Apr 19, 2013

turicas commented Apr 19, 2013

I've created a script to calculate the average in-memory size of the part-of-speech data (a list of tuples) -- it takes ~9 min to run over 202k documents (fab calculate_pos_size).

For the 202k+ already-tagged documents, the size would be something like 8.45 GB (an in-memory Python list of tuples holding all sentences of these documents).

The average per document so far is 43.79 KiB, so we'll need approximately 31.56 GB of memory to store all 755,680 part-of-speech tagged documents.
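The measurement could be done along these lines (a sketch, not the actual fab calculate_pos_size script; the sample document and names are illustrative):

```python
# Rough per-document memory estimate for POS data (list of tuples),
# in the spirit of the size-calculation script mentioned above.
import sys

def deep_size(obj):
    """Recursively sum sys.getsizeof over nested lists/tuples."""
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_size(item) for item in obj)
    return size

# One tagged document: a list of sentences, each a list of (word, tag) tuples.
doc = [[("O", "ART"), ("gato", "N"), ("dorme", "V")]]
per_doc = deep_size(doc)  # bytes for this one document

# Extrapolate: average bytes per document times total documents.
total_docs = 755_680
estimate_gib = per_doc * total_docs / 2**30
print(f"{per_doc} bytes/doc -> ~{estimate_gib:.2f} GiB for {total_docs} docs")
```

sys.getsizeof alone only counts the container, not its elements, hence the recursive sum; real strings are also interned/shared, so this overestimates somewhat.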

fccoelho (Member) commented Apr 19, 2013
Can't this be done in batches, to keep the memory usage predictable?
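Streaming the documents in fixed-size batches would look something like the sketch below (a hypothetical helper, not project code); the catch is that NLTK's standard n-gram taggers are trained in a single pass over all the data, so batching the input only helps if the trainer itself can consume batches incrementally:

```python
# Hypothetical batching helper: yield documents in fixed-size chunks so
# only one batch needs to be materialized in memory at a time.
from itertools import islice

def batches(iterable, size):
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Example: 10 "documents" in batches of 4.
for batch in batches(range(10), 4):
    print(batch)
```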



turicas commented Apr 20, 2013

@fccoelho, we've already discussed this topic, and one assumption of the project is that we do this a single time, using all the memory we have available. Another key assumption is that we're not going to use an incremental trainer (we can probably spend some time on that in the future, but it's not the goal now).
