Polish benchmark and include some figures in README #20
Maybe add a worst-case example for tiktoken?
(some string without whitespace)
I tried that. It was the same as the encoding benchmark, but all inputs were taken from a random ASCII string without whitespace. The factor increased a bit (to close to 6x), but the curves look fairly similar to the encoding results.
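For reference, a sketch of how such an input could be generated (assuming the `rand` crate; this is not the exact benchmark code):

```rust
use rand::Rng;

/// Generate a random string of printable ASCII characters with no whitespace.
fn random_ascii_no_whitespace(len: usize) -> String {
    let mut rng = rand::thread_rng();
    (0..len)
        // printable ASCII excluding space: '!' (0x21) ..= '~' (0x7E)
        .map(|_| rng.gen_range(0x21u8..=0x7Eu8) as char)
        .collect()
}
```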
Mmm. What I tested some time ago was a string that was the concatenation of all Unicode characters. That input never finished with the tiktoken lib... I think the regex simply returned one huge sub-chunk, on which the quadratic encoder then ran. That obviously takes forever...
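A sketch of that pathological input, assuming it is simply the concatenation of every Unicode scalar value:

```rust
/// Concatenate every Unicode scalar value into one huge string.
/// The pre-tokenization regex finds very few split points in it, so a
/// quadratic encoder ends up running on one enormous chunk.
fn all_unicode_scalars() -> String {
    (0..=char::MAX as u32)
        .filter_map(char::from_u32) // skips the surrogate range
        .collect()
}
```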
Maybe the ASCII input is too simple, then. I found a way to sample Unicode; I'll try that and see if it makes a difference.
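One possible way to sample such strings (again a sketch using the `rand` crate):

```rust
use rand::Rng;

/// Sample `len` uniformly random Unicode scalar values, rejecting surrogates.
fn random_unicode(len: usize) -> String {
    let mut rng = rand::thread_rng();
    (0..len)
        .map(|_| loop {
            if let Some(c) = char::from_u32(rng.gen_range(0..=char::MAX as u32)) {
                break c;
            }
        })
        .collect()
}
```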
I could not replicate what you're describing using random Unicode strings. I'll leave this for now and maybe get back to it if we want to highlight it in the blog post.
Well, I didn't use random Unicode strings... 🤷
Nit: maybe merge this with the previous section?
The important point here is not so much the incremental encoding but the reverse-encoding aspect, I think.
It probably needs a bit of explanation...
Oh... did I check that it returns the same result for "random" input?
E.g. when the input is all whitespace, the reverse encoder must move the longest merged token correctly to the front.
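A minimal sketch of such a check; the two encoder entry points are passed in as closures here because the crate's exact API isn't spelled out in this thread:

```rust
/// Assert that two encoders produce the same token sequence for `input`.
fn check_same_result<F, G>(full: F, other: G, input: &[u8])
where
    F: Fn(&[u8]) -> Vec<u32>,
    G: Fn(&[u8]) -> Vec<u32>,
{
    assert_eq!(full(input), other(input), "encoders diverged on {:?}", input);
}
```

The all-whitespace case above would then be something like `check_same_result(full, appending, " ".repeat(1000).as_bytes())` with the real encoders plugged in.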
Not sure what you mean? The appending encoder doesn't do reverse encoding afaik. We also have the prepending one, although that one isn't included in the benchmark right now.