Implement building ngrams storage via python #19
Comments
Hi, could you provide a better description of what exactly you need Tongrams to do?
Currently I plan to use it as an independent compressed storage format for bigrams, and maybe to try adding support for it into …. I am not very familiar with this lib yet, and currently I expect that only … will have uses for my use case.
If the libraries you want to use store their output in a (rather standard) …
Most of them use a format similar to the one in the README, but slightly different in its whitespace.
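Since the formats differ mainly in whitespace, a parser can smooth that over by splitting on any whitespace run. A minimal sketch (this is an illustration, not Tongrams' actual API; it assumes each line is whitespace-separated words followed by an integer count):

```python
# Hypothetical helper: parse one "n-gram per line" counts record while
# tolerating tabs vs. spaces and repeated separators between fields.

def parse_ngram_line(line):
    """Return (words_tuple, count) from a whitespace-separated line."""
    parts = line.split()  # str.split() with no argument collapses any whitespace run
    if len(parts) < 2:
        raise ValueError(f"malformed n-gram line: {line!r}")
    *words, count = parts
    return tuple(words), int(count)

print(parse_ngram_line("new   york\t42"))  # (('new', 'york'), 42)
```

Using `str.split()` with no argument makes tab-separated and space-separated variants parse identically, which is usually enough to absorb the per-library differences described above.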
Do you mean using the CLI tools? I meant using the Python API. For my use case it would be possible to pre-serialize the dataset and then consume it (that is likely to be the default use case), though it is not the best way to deal with it. I think of the lib as middleware: I have abstract classes for storing n-grams in some "internal" format, and backends have methods converting models between their own internal formats and the abstraction layer's "internal" format. Without any …
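The middleware shape described above can be sketched roughly as follows. All class and method names here are hypothetical illustrations of the design, not the author's actual API; the "internal" format is assumed to be a plain dict mapping n-gram tuples to counts:

```python
# Sketch of an abstraction layer: one abstract store, one backend per
# library-specific on-disk format. Names are illustrative only.
from abc import ABC, abstractmethod


class NGramStore(ABC):
    """Abstract n-gram storage; 'internal' format is {ngram_tuple: count}."""

    @abstractmethod
    def load(self, path):
        """Read the backend's native format into the internal dict."""

    @abstractmethod
    def dump(self, ngrams, path):
        """Write the internal dict out in the backend's native format."""


class PlainTextStore(NGramStore):
    """Backend for the common 'one n-gram per line' text format."""

    def load(self, path):
        ngrams = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                *words, count = line.split()
                ngrams[tuple(words)] = int(count)
        return ngrams

    def dump(self, ngrams, path):
        with open(path, "w", encoding="utf-8") as f:
            for words, count in ngrams.items():
                f.write(" ".join(words) + f" {count}\n")
```

With this shape, converting between two libraries' formats is `dst.dump(src.load(path_in), path_out)`; a Tongrams backend would slot in as another `NGramStore` subclass once a Python build/query API is available.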
Hi. I have written an abstraction layer around multiple libraries doing word splitting (londonisacapitalofgreatbritain must become london is a capital of great britain). All the libs rely on preprocessed n-gram dicts: some on unigrams, some additionally on bigrams. All of them store these very inefficiently, as a text file with one n-gram per line. For bigrams this already causes duplication. My middleware provides a unified interface to them and also converts their n-gram formats to each other.
I'd like to support your lib's format for n-gram storage too, but that would require some way to convert other formats into your format and back.
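The duplication mentioned above is easy to see: in a flat "one bigram per line" file, every bigram sharing a first word repeats that word, while a trie-like store (the kind of structure Tongrams compresses) keeps each shared prefix once. A small illustrative sketch, not Tongrams' actual layout:

```python
# Illustration of first-word duplication in flat bigram files versus a
# grouped, trie-like nesting that shares the common prefix.

bigrams = {
    ("new", "york"): 42,
    ("new", "jersey"): 17,
    ("new", "deal"): 5,
}

# Flat text layout repeats "new" on every line:
flat = "\n".join(f"{a} {b} {c}" for (a, b), c in bigrams.items())

# Grouped layout stores each first word once:
grouped = {}
for (a, b), c in bigrams.items():
    grouped.setdefault(a, {})[b] = c

print(grouped)  # {'new': {'york': 42, 'jersey': 17, 'deal': 5}}
```

A converter into a compressed format therefore only needs the flat counts as input; the prefix sharing falls out of the target structure, which is why a read/write API from Python would be enough for this use case.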