Comparing models to NLTK, it seems like it would be better to pickle sentence tokenizer models as type nltk.tokenize.punkt.PunktSentenceTokenizer objects as opposed to their current type of nltk.tokenize.punkt.PunktTrainer objects. Cf. language-specific files here: https://github.com/nltk/nltk_data/tree/gh-pages/packages/tokenizers
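A minimal sketch of the proposed change (assuming NLTK is installed; the training text and output filename are just illustrative): train with PunktTrainer as before, but wrap the learned parameters in a PunktSentenceTokenizer before pickling, so consumers get a ready-to-use tokenizer object.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Stand-in training text; a real model would be trained on a large corpus.
train_text = (
    "Gallia est omnis divisa in partes tres. "
    "Quarum unam incolunt Belgae."
)

trainer = PunktTrainer()
trainer.train(train_text)

# PunktSentenceTokenizer accepts a PunktParameters object in place of raw
# training text, so we can hand it the trainer's learned parameters directly.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Pickle the tokenizer (not the trainer), matching the NLTK-style files.
with open("latin_punkt.pickle", "wb") as f:
    pickle.dump(tokenizer, f)
```

On the consuming side, a single `pickle.load()` then yields an object with `.tokenize(text)` available immediately, with no need to reconstruct a tokenizer from trainer parameters.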
I think the 'trainer'-style pickle files should be deprecated and phased out; in the short term, new code can refer to the 'tokenizer'-style pickle files and then be refactored once the former are officially removed.
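For the transition period, one option is a small loader that accepts both pickle styles, so code written against the new files keeps working with old ones until they are removed. A sketch (the helper name `load_sentence_tokenizer` is hypothetical, not existing CLTK API):

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def load_sentence_tokenizer(path):
    """Return a PunktSentenceTokenizer from either pickle style.

    Handles both the deprecated 'trainer'-style pickles (PunktTrainer)
    and the new 'tokenizer'-style pickles (PunktSentenceTokenizer).
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if isinstance(obj, PunktTrainer):
        # Old trainer-style pickle: wrap its learned parameters.
        return PunktSentenceTokenizer(obj.get_params())
    return obj  # already a tokenizer-style pickle
```

Once the trainer-style files are gone, the `isinstance` branch can be dropped and the helper reduces to a bare `pickle.load()`.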
Thoughts?
There might be an argument—again for a productive kind of parallelism in data structure with NLTK—to place this file in a directory ...tokenizers/punkt/latin.py (as opposed to .../tokenizers/sentence/latin.py).
be better to pickle sentence tokenizer models as type nltk.tokenize.punkt.PunktSentenceTokenizer objects as opposed to their current type of nltk.tokenize.punkt.PunktTrainer
Sounds fine to me. When I first wrote that, was I misunderstanding the NLTK API, or has their API evolved since then? I could look it up, but it looks like you have the answer at your fingertips.
a productive kind of parallelism in data structure with NLTK
I am with you in general; however, the name punkt has always rubbed me the wrong way. In NLP we split two things, words and sentences -- that distinction is intuitive IMHO.
I've added an example of such a file here: https://github.com/cltk/latin_models_cltk/blob/master/tokenizers/sentence/latin_punkt.pickle