Format for sentence tokenizers? #6

Open
diyclassics opened this issue Nov 2, 2018 · 2 comments

@diyclassics
Contributor

Comparing models to NLTK, it seems like it would be better to pickle sentence tokenizer models as type nltk.tokenize.punkt.PunktSentenceTokenizer objects as opposed to their current type of nltk.tokenize.punkt.PunktTrainer objects. Cf. language-specific files here: https://github.com/nltk/nltk_data/tree/gh-pages/packages/tokenizers

I've added an example of such a file here: https://github.com/cltk/latin_models_cltk/blob/master/tokenizers/sentence/latin_punkt.pickle

I think the 'trainer'-style pickle files should be deprecated and phased out; new code can refer to the 'tokenizer'-style pickle files in the short term and be refactored once the 'trainer'-style files are officially removed.
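To make the difference concrete, here is a minimal sketch of the two pickle styles. The training text and variable names are made up for illustration, and the pickles are kept in memory rather than written to the repository paths; a real model would be trained on a much larger corpus.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical toy training text (assumption: stands in for a real corpus).
training_text = (
    "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae. "
    "Aliam Aquitani incolunt. Tertiam incolunt Galli. "
    "Hi omnes lingua institutis legibus inter se differunt."
)

trainer = PunktTrainer()
trainer.train(training_text, finalize=True)

# 'Trainer'-style pickle: the file holds a PunktTrainer, so every consumer
# has to build a tokenizer from the learned parameters after loading.
trainer_blob = pickle.dumps(trainer)
loaded_trainer = pickle.loads(trainer_blob)
tokenizer_from_trainer = PunktSentenceTokenizer(loaded_trainer.get_params())

# 'Tokenizer'-style pickle: the file holds a ready-to-use
# PunktSentenceTokenizer, mirroring the objects NLTK ships per language.
tokenizer_blob = pickle.dumps(PunktSentenceTokenizer(trainer.get_params()))
restored = pickle.loads(tokenizer_blob)

sentences = restored.tokenize("Arma virumque cano. Troiae qui primus ab oris.")
print(type(restored).__name__, sentences)
```

With the 'tokenizer'-style file, calling code only needs `pickle.load` followed by `.tokenize(...)`; the extra `get_params()` step disappears.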

Thoughts?

@diyclassics
Contributor Author

There might be an argument—again for a productive kind of parallelism in data structure with NLTK—to place this file in a directory ...tokenizers/punkt/latin.py (as opposed to .../tokenizers/sentence/latin.py).

@kylepjohnson
Member

> be better to pickle sentence tokenizer models as type nltk.tokenize.punkt.PunktSentenceTokenizer objects as opposed to their current type of nltk.tokenize.punkt.PunktTrainer

Sounds fine to me. When I first wrote that, was I misunderstanding the NLTK API? Or has their API evolved since then? I could look it up, but it looks like you have the answer at your fingertips.

> a productive kind of parallelism in data structure with NLTK

I am with you in general; however, the name punkt has always rubbed me the wrong way. In NLP we split two things, words and sentences, so naming things by those two is intuitive IMHO.
