
Create an NLTK tagger #7

Open
turicas opened this issue Apr 19, 2013 · 3 comments
turicas commented Apr 19, 2013

After parsing all documents, we need to train an NLTK tagger with the POS data of the entire corpus (755k+ documents).
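For reference, one common way to train a tagger in NLTK is a backoff chain of n-gram taggers. A minimal sketch, where the tagged sentences and the tagset are illustrative placeholders rather than the project's actual corpus:

```python
# Sketch: train an NLTK n-gram tagger on POS-tagged sentences.
# `tagged_sents` stands in for the corpus data (lists of (word, tag) tuples).
import nltk

tagged_sents = [
    [("O", "ART"), ("gato", "N"), ("dorme", "V")],
    [("A", "ART"), ("casa", "N"), ("caiu", "V")],
]

# Back off from bigram to unigram to a default tag for unseen words.
default = nltk.DefaultTagger("N")
unigram = nltk.UnigramTagger(tagged_sents, backoff=default)
bigram = nltk.BigramTagger(tagged_sents, backoff=unigram)

print(bigram.tag(["O", "gato", "dorme"]))
```

Note that these trainers consume all the tagged sentences at construction time, which is why the memory estimates below matter.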

@ghost ghost assigned turicas Apr 19, 2013

turicas commented Apr 19, 2013

I've created a script to calculate the average in-memory size of the part-of-speech data (a list of tuples) -- it takes ~9 min to run over 202k documents (fab calculate_pos_size).

For the 202k+ already-tagged documents, the size would be something like 8.45 GB (an in-memory Python list of tuples holding all sentences of these documents).

The average per document so far is 43.79 KiB, so we'll need approximately 31.56 GB of memory to store all 755,680 part-of-speech tagged documents.
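The measurement could be done along these lines (a sketch, not the actual fab calculate_pos_size script; the sample document and names are illustrative):

```python
# Rough per-document memory estimate for POS data (list of tuples),
# in the spirit of the size-calculation script mentioned above.
import sys

def deep_size(obj):
    """Recursively sum sys.getsizeof over nested lists/tuples."""
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_size(item) for item in obj)
    return size

# One tagged document: a list of sentences, each a list of (word, tag) tuples.
doc = [[("O", "ART"), ("gato", "N"), ("dorme", "V")]]
per_doc = deep_size(doc)  # bytes for this one document

# Extrapolate: average bytes per document times total documents.
total_docs = 755_680
estimate_gib = per_doc * total_docs / 2**30
print(f"{per_doc} bytes/doc -> ~{estimate_gib:.2f} GiB for {total_docs} docs")
```

sys.getsizeof alone only counts the container, not its elements, hence the recursive sum; real strings are also interned/shared, so this overestimates somewhat.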

fccoelho (Member) commented Apr 19, 2013
Can't this be done in batches, to keep the memory usage predictable?
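Streaming the documents in fixed-size batches would look something like the sketch below (a hypothetical helper, not project code); the catch is that NLTK's standard n-gram taggers are trained in a single pass over all the data, so batching the input only helps if the trainer itself can consume batches incrementally:

```python
# Hypothetical batching helper: yield documents in fixed-size chunks so
# only one batch needs to be materialized in memory at a time.
from itertools import islice

def batches(iterable, size):
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Example: 10 "documents" in batches of 4.
for batch in batches(range(10), 4):
    print(batch)
```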



turicas commented Apr 20, 2013

@fccoelho, we've already discussed this topic, and one assumption of the project is that we do this a single time, using all the memory we have available. Another key assumption is that we're not going to use an incremental trainer (we can probably spend some time on that in the future, but it's not the goal now).
