Polish benchmark and include some figures in README #20

Merged
merged 15 commits into from Oct 2, 2024

Changes from 8 commits

4 changes: 2 additions & 2 deletions crates/bpe/Cargo.toml
@@ -8,8 +8,8 @@ crate-type = ["lib", "staticlib"]
bench = false

[[bench]]
name = "counting"
path = "benches/counting.rs"
name = "performance"
path = "benches/performance.rs"
harness = false

[features]
43 changes: 41 additions & 2 deletions crates/bpe/README.md
@@ -4,6 +4,7 @@ The main purpose of this library is to provide fast and correct token counting f
As a by-product, it can also be used to efficiently encode those chunks if desired.

For chunking, the following operations are of interest:

1) Split text after exactly n tokens at a character boundary.
1) Count tokens for sub-ranges of a text.
1) Incrementally count tokens while appending text to a chunk.
@@ -29,6 +30,7 @@ This library presents novel algorithms to compute BPE encodings which address th
## Prior Art

There are mainly three strategies for BPE encoding.

1) Trivial solution. Search brute-force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production. (A rough sketch of this approach follows the list.)
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough, but it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible.
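
For illustration only, a minimal sketch of the trivial strategy (not this crate's implementation) could look like the code below. The `ranks` map, assumed here to map an adjacent token pair to the id of the merged token, and the rule of always merging the pair with the smallest merged id are assumptions made for the sake of the example:

```rust
use std::collections::HashMap;

/// Naive O(n^2) BPE encoding: repeatedly merge the best-ranked adjacent pair.
/// `ranks` is assumed to map a pair of token ids to the id of the merged token,
/// with smaller ids standing for more frequent (earlier) merges.
fn encode_naive(bytes: &[u8], ranks: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    // Start with one token per input byte.
    let mut tokens: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    loop {
        // Linear scan over all adjacent pairs for the best possible merge.
        let best = tokens
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| ranks.get(&(pair[0], pair[1])).map(|&id| (id, i)))
            .min();
        match best {
            Some((merged_id, i)) => {
                // Replace the pair with the merged token and close the gap.
                tokens[i] = merged_id;
                tokens.remove(i + 1);
            }
            None => return tokens, // no mergeable pair left
        }
    }
}
```

Every merge rescans the whole token sequence, which is where the quadratic worst case comes from.
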
@@ -89,13 +91,13 @@ If BPE wants to make a different merge decision when it sees the full input, the

If `e_0..e_i` is a valid encoding sequence and `e_i e_j` is a valid encoding tuple, then `e_0..e_i e_j` is also a valid encoding sequence.


## Novel Algorithm

At first glance, it seems impossible to achieve `O(n)` complexity while preserving the encoding output of the original BPE algorithm, since the original BPE algorithm needs to first scan the full input before it can make any encoding decision.
For instance, the sequence `abab` would be encoded as `ab ab` when the dictionary contains the tokens `a b ab ba bc abc babc ababc` ordered by frequency. But appending a single character to get `ababc` results in a quite different tokenization: `ab a bc`. So without looking ahead it seems impossible to properly tokenize the text.

The solution is to track the encodings of ALL text prefixes. For our example `ababc` we would get:

- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
@@ -136,6 +138,7 @@ Once that happens the reencoding will be different and the algorithm can stop.
The actual implementation essentially needs at most 14 lookups in the most complex cases to determine whether two tokens are compatible or not.

Putting all these pieces together leads to the following algorithmic sketch:

```rust
let last_tokens = vec![];
for pos in 0..text.len() {
    // ...
```
@@ -166,6 +169,7 @@ The main observation is that often the greedy heuristic picks already the correc
In the cases where it doesn't, the algorithm has to somehow backtrack to the next tokenization until it converges to the correct solution.

Our backtracking implementation solves the enumeration problem as follows (a rough sketch follows the list):

1) If the current tokenization sequence is valid, then append the longest matching token to the right.
2) Otherwise, replace the rightmost token with the next longest prefix token.
3) If there is no such token, then remove that token and go back to step 2.
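
The sketch below illustrates this enumeration loop with the token lookup and the BPE-specific validity check abstracted behind closures. The helper names `longest_match`, `next_shorter_match`, and `is_valid` are illustrative assumptions and not part of this crate's API:

```rust
/// Illustrative backtracking enumeration over token-length sequences.
/// `longest_match(rest)` returns the length of the longest token matching a prefix
/// of `rest`, `next_shorter_match(rest, len)` the next shorter matching length, and
/// `is_valid` (which would capture the dictionary) checks the sequence chosen so far.
fn backtracking_encode(
    text: &[u8],
    longest_match: impl Fn(&[u8]) -> Option<usize>,
    next_shorter_match: impl Fn(&[u8], usize) -> Option<usize>,
    is_valid: impl Fn(&[usize]) -> bool,
) -> Option<Vec<usize>> {
    let mut lens: Vec<usize> = vec![]; // token lengths chosen so far, left to right
    let mut pos = 0;
    loop {
        if is_valid(lens.as_slice()) {
            if pos == text.len() {
                return Some(lens); // full, valid tokenization found
            }
            // Step 1: append the longest matching token on the right.
            if let Some(len) = longest_match(&text[pos..]) {
                lens.push(len);
                pos += len;
                continue;
            }
        }
        // Steps 2 and 3: shorten the rightmost token, or drop it and keep backtracking.
        loop {
            let len = lens.pop()?; // nothing left to shorten: no tokenization exists
            pos -= len;
            if let Some(shorter) = next_shorter_match(&text[pos..], len) {
                lens.push(shorter);
                pos += shorter;
                break;
            }
        }
    }
}
```

In the actual implementation, the validity check corresponds to the token-compatibility test described above, which is what keeps the backtracking cheap when the greedy choice is usually already correct.
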
@@ -193,4 +197,39 @@ We compared our implementations with the tiktoken implementation on a MacBook Pr
As can be seen, our Backtracking implementation beats the tiktoken Rust implementation by ~4x.
And even the fully dynamic programming solution is faster, with a more consistent runtime.
The tuned heap implementation is still quite competitive with tiktoken (especially for smaller inputs).
If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach is the clear winner.

### Counting results

Results for counting o200k tokens for random 10000 byte slices. The setup time of the interval encoder is comparable to backtracking. After setup, counting tokens for slices of the original data takes approximately constant time.

![counting runtime comparison](./benches/result/counting-o200k.svg)

### Encoding results

Collaborator:

maybe add a worst case example for tiktoken?
(some string without whitespaces)

Contributor Author:

I tried that. This was the same as the encoding benchmark but all inputs were taken from a random ascii string without whitespace. The factor increased a bit (close to 6x) but the curves seem fairly similar to the encoding results.

Collaborator:

mmm. What I tested some time ago was a string which was the concatenation of all unicode characters. That input never finished with the tiktoken lib... I think the regex simply returned a super large sub-chunk on which the quadratic encoder was then running. This obviously will take forever...

Contributor Author:

Maybe the ascii is too simple then. I found a way to sample unicode. I'll try that and see if it makes a difference.

Contributor Author:

I could not replicate what you're describing using random Unicode strings. I'll leave this for now and maybe get back to it if we want to highlight this on the blog post.

Collaborator:

well, I didn't use random unicode strings... 🤷


Results for encoding o200k tokens for 1000 random bytes. The backtracking encoder consistently outperforms tiktoken by a constant factor.

![encoding runtime comparison](./benches/result/encoding-o200k.svg)

### Incremental encoding results

Collaborator:

Nit: maybe merge with the previous section?

The important point here is not so much the incremental encoding, but the reverse encoding aspect I think.
I guess it requires a bit of explanation...

Oh... did I check that it returns the same result for "random" input?
E.g. when we have all whitespace, then the reverse encoder must move the longest merged token correctly to the front.

Contributor Author:

Not sure what you mean? The appending encoder doesn't do reverse encoding afaik. We also have the prepending one, although that one's not included in the benchmark now.


Results for incrementally encoding o200k tokens by appending 10000 random bytes. The appending encoder is slower by a constant factor, but overall has a performance curve similar to the backtracking encoder encoding all data at once.

![appending runtime comparison](./benches/result/appending-o200k.svg)

### Running the benchmarks

Run the benchmarks as follows (requires [cargo-criterion](https://crates.io/crates/cargo-criterion) to be installed):

```sh
cargo criterion
```

(Using `cargo bench` ignores the settings in `criterion.toml`!)
Open the full report, which should be located in `target/criterion/reports/index.html`.

Update the figures in this repo as follows (requires `rsvg-convert` from `librsvg` to be installed):

```sh
script/copy-benchmark-results
```
139 changes: 0 additions & 139 deletions crates/bpe/benches/counting.rs

This file was deleted.
