Note: This project is archived. Please refer to tok, which is a reimplementation.
This is an experimental project binding the Hugging Face tokenizers Rust library to R using the extendr project. Do not use it for anything meaningful yet.
This repository uses the helloextendr template.
Before you can install this package, you need to install a working Rust toolchain. We recommend using rustup.
On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:
rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
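Before installing, it can help to confirm that the Rust tools are visible on your PATH. The following is a minimal sketch, not part of this package or of rustup: the check_tool helper is a hypothetical name introduced here for illustration, and it only reports whether a command can be found.

```shell
# Hypothetical helper (not provided by rustup or hftokenizers):
# print whether the given command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# The toolchain pieces this package's build is assumed to rely on.
check_tool rustc
check_tool cargo
check_tool rustup
```

If all three report "found", the toolchain is at least installed. On Windows, running rustup target list --installed should additionally show the two targets added above.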
Once Rust is working, you can install this package via:
remotes::install_github("mlverse/hftokenizers")
Here’s a quick demo of what you can do with hftokenizers:
library(hftokenizers)
# Fetch a small text corpus to train on
download.file(
  "https://raw.githubusercontent.com/mlverse/hftokenizers/main/tests/testthat/assets/small.txt",
  "small.txt"
)
# Build a BPE tokenizer, train it on the corpus, then encode a string
tokenizer$
  new(models_bpe$new())$
  train(normalizePath("small.txt"))$
  encode(c("hello world"))$
  ids
#> [1] 57 427 93 275 61 53