From b51951b3bb7d6ca4ee950e4b2852c7f9217b98d1 Mon Sep 17 00:00:00 2001
From: Hendrik van Antwerpen
Date: Thu, 10 Oct 2024 19:05:03 +0200
Subject: [PATCH] Update text about input splitting

---
 crates/bpe/README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/crates/bpe/README.md b/crates/bpe/README.md
index 5358bb0..e1bb45a 100644
--- a/crates/bpe/README.md
+++ b/crates/bpe/README.md
@@ -30,11 +30,12 @@ The comparison with the Rust tiktoken implementation is more subtle, because pre
 
 ## Prior Art
 
-There are mostly three strategies for BPE encoding.
+There are essentially two strategies for BPE encoding.
 
 1) Trivial solution. Search brute force for the most frequent pair in the encoded text according the dictionary and replace those occurrences. This has a `O(n^2)` complexity and is therefore not very appealing in production.
 2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity reduces now to `O(n log n)`.
-3) Split the input into sections of a maximum size first and then process each section individually. This shrinks in theory the complexity to `O(n)` if the section size is small enough. But it will in general produce now different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
+
+Note that many tokenizers split the input into sections and then process each section individually. In theory, this shrinks the complexity to `O(n)` if the section size is small enough, but it will in general produce different results. In order to produce the "correct" encoding, one would have to choose split points at token boundaries, which is in general impossible without having encoded the text already. Input splitting is therefore not a viable strategy for improving encoding performance.
 
 We have implemented a fast heap based solution as baseline. It uses a bitfield to mark token boundaries. This is more memory efficient than using linked lists or other approaches and should also be faster.
 
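For illustration, the heap-plus-bitfield approach described in the patched README text could be sketched in Rust roughly as follows. This is a simplified sketch, not the crate's actual implementation: the names `Bitfield`, `Ranks`, and `encode` are hypothetical, stale heap entries are discarded lazily on pop, and the linear boundary scans in `next`/`prev` would have to be replaced with real bit operations (and the merge lookup with a proper table) to reach the `O(n log n)` bound.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Token-boundary bitfield: bit `i` set means a token starts at byte offset `i`.
struct Bitfield(Vec<u64>);

impl Bitfield {
    fn new(n: usize) -> Self {
        Bitfield(vec![0; n / 64 + 1])
    }
    fn set(&mut self, i: usize) {
        self.0[i / 64] |= 1 << (i % 64);
    }
    fn clear(&mut self, i: usize) {
        self.0[i / 64] &= !(1 << (i % 64));
    }
    fn get(&self, i: usize) -> bool {
        self.0[i / 64] >> (i % 64) & 1 == 1
    }
    /// The next token boundary strictly after `i`, or `n` if there is none.
    /// (Linear scan for clarity; a real implementation would use bit tricks.)
    fn next(&self, i: usize, n: usize) -> usize {
        (i + 1..n).find(|&j| self.get(j)).unwrap_or(n)
    }
    /// The token boundary strictly before `i`, if any.
    fn prev(&self, i: usize) -> Option<usize> {
        (0..i).rev().find(|&j| self.get(j))
    }
}

/// Hypothetical merge table: merged token bytes -> rank (lower merges first).
type Ranks = HashMap<Vec<u8>, u32>;

/// Encode `input` by repeatedly applying the lowest-ranked merge, drawn from
/// a min-heap of candidate pairs. Entries invalidated by earlier merges are
/// detected and skipped when popped.
fn encode<'a>(input: &'a [u8], ranks: &Ranks) -> Vec<&'a [u8]> {
    let n = input.len();
    let mut bounds = Bitfield::new(n);
    (0..n).for_each(|i| bounds.set(i)); // every byte starts as its own token

    // Heap entries: (rank, left token start, right token start, right token end).
    let mut heap = BinaryHeap::new();
    let push = |heap: &mut BinaryHeap<Reverse<(u32, usize, usize, usize)>>,
                bounds: &Bitfield,
                left: usize| {
        let mid = bounds.next(left, n);
        if mid < n {
            let end = bounds.next(mid, n);
            if let Some(&rank) = ranks.get(&input[left..end]) {
                heap.push(Reverse((rank, left, mid, end)));
            }
        }
    };
    for i in 0..n {
        push(&mut heap, &bounds, i);
    }

    while let Some(Reverse((_, left, mid, end))) = heap.pop() {
        // Skip stale entries: both tokens must still be exactly intact.
        if !bounds.get(left) || bounds.next(left, n) != mid || bounds.next(mid, n) != end {
            continue;
        }
        bounds.clear(mid); // merge = clear the boundary between the two tokens
        // The merged token may form new mergeable pairs with its neighbors.
        push(&mut heap, &bounds, left);
        if let Some(p) = bounds.prev(left) {
            push(&mut heap, &bounds, p);
        }
    }

    // Read the final tokenization back off the bitfield.
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < n {
        let j = bounds.next(i, n);
        tokens.push(&input[i..j]);
        i = j;
    }
    tokens
}
```

Note how a merge is just clearing one bit, which is what makes the bitfield representation attractive: token boundaries for the whole input fit in `n` bits, compared to pointer-sized links per position in a linked-list representation.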