Polish benchmark and include some figures in README #20
crates/bpe/README.md (outdated diff context: `![counting runtime comparison](./benches/result/counting-o200k.svg)`, just before `### Encoding results`)
Maybe add a worst-case example for tiktoken?
(Some string without whitespace.)
I tried that. It was the same as the encoding benchmark, but all inputs were taken from a random ASCII string without whitespace. The factor increased a bit (close to 6x), but the curves look fairly similar to the encoding results.
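Something along these lines would do it (a minimal sketch assuming the `rand` crate; the helper name is made up here):

```rust
use rand::Rng;

/// Build a random printable-ASCII string without any whitespace,
/// as a rough worst case for tiktoken's pre-tokenization regex.
fn random_ascii_without_whitespace(len: usize) -> String {
    let mut rng = rand::thread_rng();
    (0..len)
        .map(|_| rng.gen_range(b'!'..=b'~') as char) // '!'..'~' is printable ASCII, no space
        .collect()
}
```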
Mmm. What I tested some time ago was a string that was the concatenation of all Unicode characters. That input never finished with the tiktoken lib... I think the regex simply returned a super-large sub-chunk on which the quadratic encoder was then run, which obviously takes forever...
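For illustration, that kind of input can be built with a few lines of std-only Rust (just the idea, not necessarily the exact string from back then):

```rust
/// Concatenate every Unicode scalar value into a single string.
/// `char::from_u32` returns `None` for surrogate code points, so those are skipped.
fn all_unicode_chars() -> String {
    (0..=char::MAX as u32).filter_map(char::from_u32).collect()
}
```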
Maybe the ASCII is too simple then. I found a way to sample Unicode. I'll try that and see if it makes a difference.
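Something like this should work for the sampling (again assuming the `rand` crate, whose standard `char` distribution skips surrogates):

```rust
use rand::Rng;

/// Sample `len` random Unicode scalar values.
fn random_unicode(len: usize) -> String {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| rng.gen::<char>()).collect()
}
```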
I could not replicate what you're describing using random Unicode strings. I'll leave this for now and maybe get back to it if we want to highlight it in the blog post.
Well, I didn't use random Unicode strings... 🤷
crates/bpe/README.md (outdated diff context: `![encoding runtime comparison](./benches/result/encoding-o200k.svg)`, just before `### Incremental encoding results`)
Nit: maybe merge this with the previous section?
The important point here is not so much the incremental encoding, but the reverse encoding aspect, I think.
I guess it requires a bit of explanation...
Oh... did I check that it returns the same result for "random" input?
E.g. when the input is all whitespace, the reverse encoder must move the longest merged token correctly to the front; see the sketch below.
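Roughly this kind of property check is what I have in mind; the two closure parameters are placeholders for whatever full and incremental encoding entry points we expose, not the crate's actual API:

```rust
/// Compare a full encode of an all-whitespace input against an incremental
/// (e.g. prepending/reverse) encode of the same input. If the incremental
/// encoder doesn't move the longest merged whitespace token to the front,
/// the two token sequences will differ.
fn check_incremental_matches_full(
    full_encode: impl Fn(&[u8]) -> Vec<u32>,
    incremental_encode: impl Fn(&[u8]) -> Vec<u32>,
) {
    let input = " ".repeat(1000);
    assert_eq!(full_encode(input.as_bytes()), incremental_encode(input.as_bytes()));
}
```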
Not sure what you mean? The appending encoder doesn't do reverse encoding, afaik. We also have the prepending one, although that one isn't included in the benchmark right now.
Depending on what we will focus on in the blog post, we might need to add more numbers (like worst-case inputs for tiktoken).
Polishes the benchmark and adds some of the result figures to the README.
I considered adding the HuggingFace tokenizers to the benchmark as well, but they don't have cl100k and/or o200k readily available. I could figure out how to build a tokenizer from the tiktoken tokens; that would require computing some merge lists, if I understand it correctly. But I'm not sure it's worth the effort. E.g. this suggests their tokenizer is quite a bit slower than tiktoken anyway.
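If we ever do go down that road, the merge-list reconstruction could look roughly like this sketch (the function and the split heuristic are made up here, not an existing converter, and a real conversion needs more care around the byte-level alphabet and rank ties):

```rust
use std::collections::HashMap;

/// Rough sketch: derive BPE merges from a rank-ordered tiktoken vocabulary by
/// splitting each multi-byte token into two lower-ranked tokens. `vocab` maps
/// token bytes to their tiktoken rank.
fn reconstruct_merges(vocab: &HashMap<Vec<u8>, u32>) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut tokens: Vec<(&Vec<u8>, u32)> = vocab.iter().map(|(t, &r)| (t, r)).collect();
    tokens.sort_by_key(|&(_, rank)| rank);
    let mut merges = Vec::new();
    for (token, rank) in tokens {
        if token.len() < 2 {
            continue; // single bytes form the base alphabet, not merges
        }
        // Pick a split whose halves both exist in the vocab with lower rank,
        // preferring the split with the smallest maximum half-rank.
        let best = (1..token.len())
            .filter_map(|i| match (vocab.get(&token[..i]), vocab.get(&token[i..])) {
                (Some(&l), Some(&r)) if l < rank && r < rank => Some((l.max(r), i)),
                _ => None,
            })
            .min();
        if let Some((_, i)) = best {
            merges.push((token[..i].to_vec(), token[i..].to_vec()));
        }
    }
    merges
}
```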
Rendered README.