
Polish benchmark and include some figures in README #20

Merged
hendrikvanantwerpen merged 15 commits into main from update-benchmark on Oct 2, 2024

Conversation

hendrikvanantwerpen (Contributor) commented on Oct 1, 2024

Polishes the benchmark and adds some of the result figures to the README.

I considered adding the HuggingFace tokenizers to the benchmark as well, but they don't have cl100k or o200k readily available. I could figure out how to build a tokenizer from the tiktoken tokens; that would require computing the merge lists, if I understand it correctly. But I'm not sure it's worth the effort, e.g. this suggests their tokenizer is quite a bit slower than tiktoken anyway.
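
A rough idea of how those merge lists could be recovered from the ranked token list, as a hedged sketch (this is my reading of how BPE ranks relate to merges, not a verified conversion routine; `derive_merges` is a made-up name):

```rust
use std::collections::HashMap;

/// Sketch: recover BPE merges from tokens sorted by rank, assuming every
/// multi-byte token is the concatenation of two earlier-ranked tokens.
fn derive_merges(tokens: &[Vec<u8>]) -> Vec<(Vec<u8>, Vec<u8>)> {
    let rank: HashMap<&[u8], usize> = tokens
        .iter()
        .enumerate()
        .map(|(i, t)| (t.as_slice(), i))
        .collect();
    let mut merges = Vec::new();
    for (i, token) in tokens.iter().enumerate() {
        if token.len() < 2 {
            continue; // base tokens have no merge
        }
        // Find a split where both halves are tokens with a lower rank.
        for split in 1..token.len() {
            let (left, right) = token.split_at(split);
            if rank.get(left).map_or(false, |&r| r < i)
                && rank.get(right).map_or(false, |&r| r < i)
            {
                merges.push((left.to_vec(), right.to_vec()));
                break;
            }
        }
    }
    merges
}
```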

Rendered README.

hendrikvanantwerpen self-assigned this on Oct 1, 2024
2 resolved review threads on crates/bpe/README.md (outdated)
![counting runtime comparison](./benches/result/counting-o200k.svg)

### Encoding results
aneubeck (Collaborator) commented:

maybe add a worst-case example for tiktoken?
(some string without whitespace)

hendrikvanantwerpen (Author) commented:

I tried that. It was the same as the encoding benchmark, but with all inputs taken from a random ASCII string without whitespace. The factor increased a bit (close to 6x), but the curves look fairly similar to the encoding results.
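
For reproducibility, generating that kind of input is a one-liner with the `rand` crate; a minimal sketch (assuming rand 0.8; `Alphanumeric` emits only ASCII letters and digits, so no whitespace):

```rust
use rand::{distributions::Alphanumeric, Rng};

/// Random ASCII string with no whitespace, so the pre-tokenization
/// regex cannot split the input into small chunks.
fn random_ascii_no_whitespace(len: usize) -> String {
    rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(len)
        .map(char::from)
        .collect()
}
```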

aneubeck (Collaborator) commented:

Mmm. What I tested some time ago was a string that was the concatenation of all Unicode characters. That input never finished with the tiktoken lib... I think the regex simply returned one huge sub-chunk on which the quadratic encoder was then run. That obviously takes forever...
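
If it helps to reproduce, that input can be built by iterating over all Unicode scalar values; a small sketch (the function name is made up):

```rust
/// Concatenate every Unicode scalar value into a single string.
/// `char::from_u32` rejects surrogates, so they are skipped.
fn all_unicode_string() -> String {
    (0u32..=0x10FFFF).filter_map(char::from_u32).collect()
}
```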

hendrikvanantwerpen (Author) commented:

Maybe the ASCII input is too simple then. I found a way to sample Unicode; I'll try that and see if it makes a difference.
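
In case it's useful: `rand` can sample `char` uniformly over all Unicode scalar values out of the box (assuming rand 0.8, where the `Standard` distribution covers all valid `char`s); a minimal sketch:

```rust
use rand::Rng;

/// Sample `len` chars uniformly from all valid Unicode scalar values.
fn random_unicode(len: usize) -> String {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| rng.gen::<char>()).collect()
}
```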

hendrikvanantwerpen (Author) commented:

I could not replicate what you're describing using random Unicode strings. I'll leave this for now and maybe get back to it if we want to highlight this in the blog post.

aneubeck (Collaborator) commented:

Well, I didn't use random Unicode strings... 🤷

![encoding runtime comparison](./benches/result/encoding-o200k.svg)

### Incremental encoding results
aneubeck (Collaborator) commented:

Nit: maybe merge this with the previous section?

The important point here is not so much the incremental encoding, but the reverse encoding aspect, I think. I guess it requires a bit of explanation...

Oh... did I check that it returns the same result for "random" input? E.g. when the input is all whitespace, the reverse encoder must move the longest merged token correctly to the front.

hendrikvanantwerpen (Author) commented:

Not sure what you mean? The appending encoder doesn't do reverse encoding, afaik. We also have the prepending one, although that one isn't included in the benchmark right now.

16 resolved review threads on crates/bpe/README.md (outdated)
aneubeck (Collaborator) left a review comment:

Depending on what we will focus on in the blog post, we might need to add more numbers (like worst-case inputs for tiktoken)

hendrikvanantwerpen merged commit a94431a into main on Oct 2, 2024
3 checks passed
hendrikvanantwerpen deleted the update-benchmark branch on October 2, 2024 at 13:26