Go bindings for OpenAI's BPE tokenizer implemented in Rust. This project provides high-performance tokenization capabilities for Go applications, offering significant speed improvements over pure Go implementations.
This project wraps GitHub's Rust BPE implementation, which is part of their rust-gems
collection. The implementation is known for its speed, correctness, and novel algorithms for Byte Pair Encoding.
Our Rust-based implementation shows significant performance improvements compared to the pure Go implementation (tiktoken-go):
Key findings:
- 4-6x faster than tiktoken-go across all text types
- CL100k model shows ~393-540% improvement
- O200k model shows even better results with ~483-626% improvement
- Consistent performance gains for both short and long texts
- Particularly efficient with Unicode text, showing a 622% improvement for O200k
The package includes pre-compiled libraries for:
- Linux (x86_64, arm64)
- macOS (x86_64, arm64)
- Windows (x86_64)
go get github.com/edit4i/gh-bpe-openai-go
For now, the GitHub Action is not working properly, so you may need to build from source. This will be updated as soon as possible.
The appropriate library for your platform is included in the package and will be automatically used during compilation.
If you need to build for a different platform or want to optimize for your specific architecture:
- Install Rust toolchain (https://rustup.rs/)
- Clone the repository with submodules:
git clone --recursive https://github.com/edit4i/gh-bpe-openai-go
cd gh-bpe-openai-go
- Build the Rust library:
cd rust
cargo build --release
- Copy the built library to the appropriate location:
# Linux
cp target/release/libbpe_openai_ffi.so ../lib/linux_amd64/
# macOS
cp target/release/libbpe_openai_ffi.dylib ../lib/darwin_amd64/
# Windows
cp target/release/bpe_openai_ffi.dll ../lib/windows_amd64/
Here's a simple example of how to use the tokenizer:
package main

import (
	"fmt"
	"log"

	bpe "github.com/edit4i/gh-bpe-openai-go"
)

func main() {
	tokenizer, err := bpe.NewCL100kTokenizer()
	if err != nil {
		log.Fatal(err)
	}

	text := "Hello, world!"
	tokens, err := tokenizer.Encode(text)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Tokens: %v\n", tokens)
}
Check out the examples directory for more usage examples.
CL100k
- Used by GPT-4, GPT-3.5-turbo, and the text-embedding-3-* models
O200k
- Used by newer models such as GPT-4o
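Picking the right tokenizer for a given model can be expressed as a small lookup. This is an illustrative sketch, not part of this package's API: encodingFor is a hypothetical helper name, and the mapping follows OpenAI's published tiktoken encoding tables.

```go
package main

import "fmt"

// encodingFor is a hypothetical helper (not part of this package's API)
// mapping an OpenAI model name to the BPE encoding it uses.
func encodingFor(model string) string {
	switch model {
	case "gpt-4", "gpt-3.5-turbo",
		"text-embedding-3-small", "text-embedding-3-large":
		return "cl100k_base" // CL100k family
	case "gpt-4o", "gpt-4o-mini":
		return "o200k_base" // O200k family
	default:
		return "" // unknown model
	}
}

func main() {
	fmt.Println(encodingFor("gpt-4"))  // cl100k_base
	fmt.Println(encodingFor("gpt-4o")) // o200k_base
}
```

In application code, the returned encoding name would decide which constructor to call (e.g. NewCL100kTokenizer for "cl100k_base").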
The project is organized as follows:
- bpe-openai-ffi/ - Rust FFI layer that provides C bindings
- rust-gems/ - Submodule containing GitHub's Rust BPE implementation
- benchmark/ - Performance benchmarks and visualization
- examples/ - Usage examples
- Root files:
  - bpe.go - Main Go bindings
  - bpe_test.go - Test suite
  - cbindgen.toml - C bindings configuration
- Go 1.21+
- Rust (latest stable)
- cbindgen (for generating C headers)
# Clone with submodules
git clone --recursive [email protected]:edit4i/gh-bpe-openai-go.git
cd gh-bpe-openai-go
# If you already cloned without --recursive:
git submodule update --init --recursive
# Build
make build
make test
# Run the benchmarks
go test -bench=.
# Generate benchmark visualization (requires Python)
make performance # will generate "benchmark.txt"
mv benchmark.txt benchmark/benchmark.txt
cd benchmark
pip install -r requirements.txt
python plot_benchmark.py
MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.
When updating the Rust implementation, please rebuild and update the pre-compiled binaries for all supported platforms:
- Build for each platform using appropriate cross-compilation tools
- Place the compiled libraries in the correct lib/ subdirectories:
  - Linux: lib/linux_amd64/libbpe_openai_ffi.so, lib/linux_arm64/libbpe_openai_ffi.so
  - macOS: lib/darwin_amd64/libbpe_openai_ffi.dylib, lib/darwin_arm64/libbpe_openai_ffi.dylib
  - Windows: lib/windows_amd64/bpe_openai_ffi.dll
- Test the package on each platform to ensure the binaries work correctly
- Update the version number in both Rust and Go code
For cross-compilation, you can use:
- Docker for Linux targets
- osxcross for macOS targets
- MinGW-w64 for Windows targets
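As a sketch of what those commands might look like (the target triples are standard Rust targets; the exact set this project builds is an assumption until the scripts land):

```shell
# Linux arm64 (cross-linker assumed available, e.g. inside a Docker image)
rustup target add aarch64-unknown-linux-gnu
cargo build --release --target aarch64-unknown-linux-gnu

# Windows x86_64 via MinGW-w64
rustup target add x86_64-pc-windows-gnu
cargo build --release --target x86_64-pc-windows-gnu

# macOS targets generally require Apple's SDK (e.g. through osxcross):
# cargo build --release --target aarch64-apple-darwin
```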
Example cross-compilation commands will be provided in the scripts/ directory.
This project wouldn't be possible without:
- GitHub's Rust Gems - The underlying Rust implementation
- tiktoken-go - For benchmark comparisons and inspiration