scala-transformers

Scala interfaces to newly trained Hugging Face/ONNX transformers and existing tokenizers

The libraries and models produced by this project are incorporated into processors and generally don't need direct attention unless their functionality is being modified. Here are some details about how it all works.

encoder

To incorporate the encoder subproject as a Scala library dependency, either to access an existing model or because you've trained a new one with the Python code there, you'll need to add something like this to your build.sbt file:

libraryDependencies += "org.clulab" %% "scala-transformers-encoder" % "0.4.0"

New models should generally be published to the CLU Lab's artifactory server so that they can be treated as library dependencies, although they can also be accessed as local files. Two models have been generated and published so far; they can be incorporated into a Scala project with

resolvers += "clulab" at "https://artifactory.clulab.org/artifactory/sbt-release"

// Pick one or more.
libraryDependencies += "org.clulab" % "deberta-onnx-model"  % "0.0.3"
libraryDependencies += "org.clulab" % "roberta-onnx-model"  % "0.0.2"

The models refer to tokenizers, which also need to be added to the build according to the instructions in the next section. Note that the model artifacts are declared with a single % rather than %%: they are plain resource artifacts and are not cross-built against a Scala version.

Please see the encoder README for information about how to generate models and how to download and package Hugging Face tokenizers for use in the tokenizer subproject.

tokenizer

To use the tokenizer subproject as a Scala library dependency, you'll need to add something like this to your build.sbt file:

libraryDependencies += "org.clulab" %% "scala-transformers-tokenizer" % "0.4.0"
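
Putting the pieces from both sections together, a minimal build.sbt that pulls in the encoder, the tokenizer, and one published model might look like the sketch below. The version numbers are the ones quoted above and may have been superseded by newer releases:

```scala
// Resolver for the CLU Lab artifactory, where the model artifacts are published.
resolvers += "clulab" at "https://artifactory.clulab.org/artifactory/sbt-release"

// The encoder and tokenizer subprojects (cross-built for the Scala version, hence %%).
libraryDependencies += "org.clulab" %% "scala-transformers-encoder"   % "0.4.0"
libraryDependencies += "org.clulab" %% "scala-transformers-tokenizer" % "0.4.0"

// One of the published models (a plain resource artifact, hence a single %).
libraryDependencies += "org.clulab" % "deberta-onnx-model" % "0.0.3"
```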

See the tokenizer README for information about which tokenizers have already been packaged and how they are accessed.