Initial release #22
The reason I added the serialized representation in the first place was that building the Aho-Corasick data structures from scratch takes several seconds!
I.e., this is a cost we don't want to pay when running Blackbird tests (or when other people depend on this).
One option could be to improve the aho-corasick serialization, since we have our own fork anyway. But I didn't check how much data it actually has. It might just be tricky to get it small, since we have 100k and 200k tokens in it, respectively, and you probably need to store a couple of numbers for every character in them. So making it small might be challenging...
I looked at the serialization code of the daachorse crate and it doesn't look like we can save a ton there. Not great...
This may not be as bad as it could be. I added a print statement to the initialization and it looks like this happens only once for all tests. That was in this crate though, so in Blackbird it might happen once per crate (if the tokenizers are used directly or indirectly). I'll try compression and see if that gives us anything.
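For illustration, a minimal sketch of why the initialization shows up only once per test run: a process-wide lazily initialized value is built on first access and then shared. `Tokenizer`, `build_tokenizer`, and `tokenizer` are hypothetical stand-ins, not this crate's API. Since Cargo compiles each crate's tests into a separate test binary, a cost like this would be paid once per binary that touches the tokenizer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how often the expensive constructor actually runs.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a tokenizer that is slow to build.
struct Tokenizer {
    vocab_size: usize,
}

// Hypothetical constructor; a real one would build the automaton here.
fn build_tokenizer() -> Tokenizer {
    INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    Tokenizer { vocab_size: 100_000 }
}

// Returns the shared instance, building it on first use only.
fn tokenizer() -> &'static Tokenizer {
    static INSTANCE: OnceLock<Tokenizer> = OnceLock::new();
    INSTANCE.get_or_init(build_tokenizer)
}

fn main() {
    let first = tokenizer();
    let second = tokenizer();
    // Both calls return the same instance; the constructor ran once.
    assert!(std::ptr::eq(first, second));
    println!(
        "vocab size {}, initialized {} time(s)",
        first.vocab_size,
        INIT_COUNT.load(Ordering::SeqCst)
    );
}
```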
Just compressing the serialized BPE reduces the size, but not enough to get us under the 10MB limit:
I wondered if we could use |
Co-authored-by: Timothy Clem <[email protected]>
Generate serialized data in build script
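A rough sketch of what generating serialized data in a build script can look like, under stated assumptions: the length-prefixed format, the three sample tokens, the file name `tokens.bin`, and the helpers `serialize_tokens` and `write_serialized` are all hypothetical placeholders, not what this commit does. The only Cargo-specific parts are the `OUT_DIR` environment variable and the `cargo:rerun-if-changed` directive.

```rust
// build.rs (sketch)
use std::env;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical placeholder for the real (slow) serialization step:
// each token is written as a little-endian u32 length plus its bytes.
fn serialize_tokens(tokens: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for token in tokens {
        out.extend_from_slice(&(token.len() as u32).to_le_bytes());
        out.extend_from_slice(token);
    }
    out
}

// Writes the serialized data into `out_dir` and returns the file path.
fn write_serialized(out_dir: &Path, file_name: &str) -> io::Result<PathBuf> {
    let tokens: [&[u8]; 3] = [b"a", b"b", b"ab"]; // made-up sample vocabulary
    let path = out_dir.join(file_name);
    fs::write(&path, serialize_tokens(&tokens))?;
    Ok(path)
}

fn main() -> io::Result<()> {
    // Cargo sets OUT_DIR for build scripts; the temp-dir fallback only
    // exists so this sketch also runs as a standalone program.
    let out_dir = env::var_os("OUT_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(env::temp_dir);
    write_serialized(&out_dir, "tokens.bin")?;
    // Re-run the script only when the script itself changes.
    println!("cargo:rerun-if-changed=build.rs");
    Ok(())
}
```

The library side could then embed the generated file with `include_bytes!(concat!(env!("OUT_DIR"), "/tokens.bin"))`, so the data is produced at build time instead of being checked in.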
@@ -0,0 +1,42 @@
# OpenAI Byte Pair Encoders

Fast tokenizers for OpenAI token sets based on the [bpe](https://crates.io/crates/bpe) crate.
There should be a warning that this crate does NOT replicate the regex "word splitting" used by OpenAI.
Therefore, results will differ!
I added the warning and also a test that shows an example of the issue.
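To make the difference concrete, here is a toy sketch (not the test from this PR and not the crate's algorithm): a naive BPE with a made-up two-entry merge table, run once on the whole text and once on pieces from a simplified splitter that keeps a leading space attached to the following word. The merge table, `bpe`, and `split_before_spaces` are assumptions for illustration only; OpenAI's actual regex is more involved.

```rust
// Naive BPE: repeatedly merge the adjacent pair with the lowest rank,
// where a pair's rank is its index in `merges`.
fn bpe(piece: &str, merges: &[(&str, &str)]) -> Vec<String> {
    let mut parts: Vec<String> = piece.chars().map(|c| c.to_string()).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..parts.len().saturating_sub(1) {
            let rank = merges
                .iter()
                .position(|(l, r)| *l == parts[i] && *r == parts[i + 1]);
            if let Some(rank) = rank {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let right = parts.remove(i + 1);
                parts[i].push_str(&right);
            }
            None => return parts,
        }
    }
}

// Simplified stand-in for regex word splitting: start a new piece
// before every space, so the space stays with the following word.
fn split_before_spaces(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if c == ' ' && i > start {
            pieces.push(&text[start..i]);
            start = i;
        }
    }
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}

fn main() {
    // Made-up merge table: ("a", " ") has a better rank than (" ", "b").
    let merges = [("a", " "), (" ", "b")];

    // Without splitting, the best-ranked merge crosses the word boundary.
    let whole = bpe("a b", &merges);
    // With splitting, merges can only happen inside "a" and " b".
    let split: Vec<String> = split_before_spaces("a b")
        .into_iter()
        .flat_map(|p| bpe(p, &merges))
        .collect();

    assert_eq!(whole, ["a ", "b"]);
    assert_eq!(split, ["a", " b"]);
    println!("whole text: {:?}, pre-split: {:?}", whole, split);
}
```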
The `bpe` crate name was released, so I released an initial version of the crate to claim the name. Things changed:
I decided to go ahead with the release to make sure we got the name. I imagine we do another release soon with additional changes or polishing we think are necessary.