tokenizer

Scala

This subproject (the one in this tokenizer directory) should work on the four supported platforms without further setup, such as installing Rust. A build file is included for sbt, and IntelliJ is able to import it. The pre-compiled library files in the resources directory will be used:

apple-librust_tokenizer.jnilib is for Macs with Apple processors
intel-librust_tokenizer.jnilib is for Macs with Intel processors
librust_tokenizer.so is built for Linux with Intel processors
rust_tokenizer.dll works under Windows for Intel processors
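
As a rough illustration of how one of the four files above could be chosen at runtime, the sketch below maps the JVM's os.name and os.arch system properties to a file name. The object NativeLibraryName and its select method are hypothetical names invented for this example, and the string matching is an assumption; the library's actual loading code may work differently.

```scala
object NativeLibraryName {
  // Hypothetical helper, not part of the library: picks one of the four
  // pre-compiled file names listed in this README, or None if unsupported.
  def select(osName: String, osArch: String): Option[String] = {
    val os = osName.toLowerCase
    val arch = osArch.toLowerCase
    // Assumption: Apple processors report "aarch64" (or an "arm" variant).
    val isArm = arch == "aarch64" || arch.startsWith("arm")

    if (os.contains("mac"))
      Some(if (isArm) "apple-librust_tokenizer.jnilib" else "intel-librust_tokenizer.jnilib")
    else if (os.contains("linux") && !isArm) Some("librust_tokenizer.so")
    else if (os.contains("windows") && !isArm) Some("rust_tokenizer.dll")
    else None
  }

  def main(args: Array[String]): Unit = {
    val name = select(System.getProperty("os.name"), System.getProperty("os.arch"))
    println(name.getOrElse("unsupported platform"))
  }
}
```

Running main prints the file name that matches the current machine, or "unsupported platform" for any combination outside the four listed above.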

Additionally, the tokenizer subdirectory includes a collection of pretrained Hugging Face tokenizers in serialized form. Some of the tokenizers are available from the Hugging Face website, but including them here allows the library to function without a network connection. Other tokenizers are not directly available from the website. These have been downloaded in their Python representation, converted to Rust format, and added to the resources. In this way the library is also independent of Python at runtime. The currently included tokenizers are:

bert-base-cased
distilbert-base-cased
roberta-base
xlm-roberta-base
google/bert_uncased_L-4_H-512_A-8
google/electra-small-discriminator
microsoft/deberta-v3-base

For instructions on adding more, see the encoder README.

When a named tokenizer is requested in the Scala code, the library searches three places for a serialized version of it, in this order:

In a file on the local hard drive. This option can be used for unpublished tokenizers, perhaps for testing or privacy reasons. In this case the tokenizer name should be a file name, normally one ending with tokenizer.json rather than just a directory name. An example is ../my_fancy_tokenizer/tokenizer.json.
In a Java resource accessible on the classpath. The resource name will be generated from the tokenizer name based on this template: /org/clulab/scala_transformers/tokenizer/<tokenizer_name>/tokenizer.json.
At the Hugging Face website. Hugging Face code in the library and at the site will convert the tokenizer name to a URL looking something like this: https://huggingface.co/<tokenizer_name>/blob/main/tokenizer.json. Once the tokenizer has been found there, it is cached on your local hard drive and used for subsequent requests that aren't satisfied by the two options above.
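
The search order above can be sketched roughly as follows. This is only an illustration under stated assumptions: TokenizerLookup, its Source types, and the locate method are hypothetical names invented here, not the library's API, and the third step merely reports that the name would be handed to Hugging Face rather than performing a download. Only the resource template is taken from the text above.

```scala
import java.nio.file.{Files, Paths}

object TokenizerLookup {
  // Hypothetical result types naming where a tokenizer would be read from.
  sealed trait Source
  case class LocalFile(path: String) extends Source
  case class ClasspathResource(name: String) extends Source
  case class HuggingFace(tokenizerName: String) extends Source

  // Resource template as given in this README.
  def resourceName(tokenizerName: String): String =
    s"/org/clulab/scala_transformers/tokenizer/$tokenizerName/tokenizer.json"

  // Checks the three places in order and returns the first that applies.
  def locate(tokenizerName: String): Source = {
    if (Files.isRegularFile(Paths.get(tokenizerName)))
      LocalFile(tokenizerName) // 1. a file on the local hard drive
    else if (getClass.getResource(resourceName(tokenizerName)) != null)
      ClasspathResource(resourceName(tokenizerName)) // 2. a resource on the classpath
    else
      HuggingFace(tokenizerName) // 3. fall back to the Hugging Face website (or its cache)
  }
}
```

For example, TokenizerLookup.locate("../my_fancy_tokenizer/tokenizer.json") yields a LocalFile only if that file exists; otherwise the name falls through to the classpath check and then to Hugging Face.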

Rust

To rebuild the libraries that provide the JNI interface to Hugging Face tokenizers, install Rust and then run cargo build in the rust subdirectory. For a release, use cargo build --release instead. Copy the resulting library (the .so, .dll, or .dylib file) from the target directory to the resources directory. The Mac library files need to be renamed to distinguish between processors, using the apple- and intel- prefixes of the files in the resources directory, and the .dylib extension should be changed to .jnilib.
