
.NET: User Story: PyTorch & HuggingFace Custom Models Migration Story #9793

Open
tjwald opened this issue Nov 22, 2024 · 3 comments
Labels
.NET Issue or Pull requests regarding .NET code python Pull requests for the Python Semantic Kernel triage

Comments

@tjwald

tjwald commented Nov 22, 2024


name: PyTorch & HuggingFace Custom Models Migration Story
about: Making migration to dotnet easier for projects whose models were trained with the HuggingFace transformers library and PyTorch.


We created a POC using the new AI building blocks in dotnet 9, and want to point out pain points, opportunities to improve performance, and ways to enable easier migrations from python.

Background

My team is trying to cut costs in our production environment, and a third of our cost comes from custom ML servers that we have created.
Each ML model is wrapped in a FastAPI server. The model itself is called using the transformers library created by HuggingFace.
The models are trained and created by our research team, and we are responsible for making them run fast and cost less.

We need to host our own models due to algorithmic complexities surrounding the call to the model itself - for example, repeated calls to the model during the same user request, data-locality optimizations when combining several models for the same request, and more.

To reduce costs and improve performance we migrated to ONNX (while still using python) and saw an improvement, but we still weren't able to fully utilize the GPU, and we feel we have reached the limit of how many concurrent requests our python server can handle.
This forces us to spin up multiple pods for the same service to deal with the load.

As soon as dotnet 9 came out with the new AI infrastructure and building blocks, I created a POC of our simplest model with the new libraries and was able to prove that moving to C# and dotnet can increase our GPU utilization and throughput and reduce our latency.

This was difficult.
There was no clear migration guide for this scenario, which was surprising given the importance of HuggingFace transformers for AI usage.
The POC required me to re-implement many things provided by the transformers library and to 'fight' with the ONNX <-> Tokenizers interop in dotnet.

Additional Context

We are a python backend team. I have some background in C# and dotnet, but convincing management to migrate to dotnet is difficult, especially given the complexity of the code required to write an efficient ML-processing server in C#.
I spent a month migrating all of our models to ONNX and onto a new architecture to improve performance; this only got us to 24K requests per minute. Using the C# POC I created, I was able to get to 200K requests per minute with substantially lower latency.

Request

Start a project to provide documentation, tooling, and library features that make migrating custom HuggingFace models simple and the end result performant.
Even where some of the features I request already exist, they aren't documented well enough for this migration to be easy.

I love dotnet and would love more applications and coding shops to use it.

Value To the Ecosystem of Dotnet

If dotnet wants more users to adopt it for AI applications, it needs to supply easy-to-use, performant migration paths from the largest AI ecosystem - HuggingFace transformers - especially for custom models and tokenizers.
This will enable R&D teams to take ML Researcher models and get them to production on a more efficient solution.


The following contains most of the suggestions / issues we encountered in our POC.

Tokenizers Enhancements

Using Custom Tokenizer Options

In the HuggingFace library, loading a custom tokenizer is as simple as:

tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer')

In the Microsoft.ML.Tokenizers library this is more complex, which makes the migration harder for two reasons:

  • There is no simple factory method that loads the correct tokenizer, with its custom options, from disk.
  • There is no migration guide from a HuggingFace tokenizer to the dotnet equivalent - I needed to research the specific tokenizer used, the specific way to load the needed resources, and the specific mapping of configs and files from the HuggingFace tokenizer formats to the dotnet format.

Ideally there would be a factory that loads the resources from disk and returns a fully functional tokenizer, together with a simple migration guide or an extension package that eases migration from HuggingFace.
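For comparison, here is a minimal sketch of what loading a WordPiece/BERT-style tokenizer from local files looks like today with Microsoft.ML.Tokenizers (assuming the model actually uses WordPiece; the file path is illustrative and the exact Create overloads may vary by package version):

using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// You already have to know which tokenizer family the HuggingFace model used
// (WordPiece / BPE / SentencePiece / ...) and which files to extract from its repo;
// there is no AutoTokenizer-style factory that figures this out from the saved config.
using Stream vocabStream = File.OpenRead("/path/to/tokenizer/vocab.txt");
Tokenizer tokenizer = BertTokenizer.Create(vocabStream);

IReadOnlyList<int> ids = tokenizer.EncodeToIds("some sentence");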

Token Id Type

We should be able to specify the type of the output token ids - for example long instead of int - since we had to cast each int id to a long because that is what the model takes as input.
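A minimal illustration of that conversion (assuming `tokenizer` is the Microsoft.ML.Tokenizers tokenizer loaded in the sketch above):

// The tokenizer returns int ids, but the ONNX model input is int64, so every call pays
// for a widening copy into a separate buffer.
IReadOnlyList<int> ids = tokenizer.EncodeToIds("some sentence");
long[] inputIds = new long[ids.Count];
for (int i = 0; i < ids.Count; i++)
{
    inputIds[i] = ids[i];
}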

Batch Tokenization

We optimized our models to use a lot of batch processing - both pre-batched and dynamic batching.
To support this, I had to write a wrapper for the Microsoft.ML.Tokenizers Tokenizer class that performed this batch tokenization.

The current interface of the Tokenizer requires me to allocate an array for each tokenization call.
I then allocate another array for the batch to hold all of these per-sentence arrays, and finally copy everything into a 2-dimensional array so the model can process it.
That is a lot of allocation and copying that could be avoided by supporting batching natively.
In addition, an overload that accepts a caller-supplied output buffer would help reduce allocations and increase performance by letting us pool these tensors.

This shows that batch tokenization should be a feature of the tokenizer rather than hand-written by the user, and that with minimal changes to the signature it could be made more performant (a sketch of the current hand-written wrapper follows).
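Roughly, the hand-written wrapper looks like this (the maxTokens parameter and zero-padding policy are illustrative, not part of any library API):

using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// One array allocation per sentence from EncodeToIds, plus a copy into the rectangular
// buffer the model expects. A batched Encode overload writing into a caller-supplied
// buffer would remove both steps.
static long[,] TokenizeBatch(Tokenizer tokenizer, IReadOnlyList<string> sentences, int maxTokens)
{
    long[,] batch = new long[sentences.Count, maxTokens];
    for (int row = 0; row < sentences.Count; row++)
    {
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(sentences[row]); // allocates per sentence
        int count = Math.Min(ids.Count, maxTokens);
        for (int col = 0; col < count; col++)
        {
            batch[row, col] = ids[col]; // widen to long and copy; remaining cells stay 0 as padding
        }
    }
    return batch;
}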

Context Tokenization

In the HuggingFace library, using a tokenizer you can tokenize a sentence with a given context like so:

tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer')

context = 'some context'
sentence = 'some sentence'

tokens = tokenizer(context, sentence)

This is also supported in the batch form.

Migrating this from HuggingFace to dotnet requires understanding the underpinnings of this tokenization method, and re-implementing it (sketched below) complicates the project enough to make the transition not "worth it" on the maintenance side.
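To illustrate, a hand-ported pair tokenization ends up looking something like this sketch (assuming `tokenizer` is the one loaded earlier; the special-token ids are hypothetical placeholders, and whether EncodeToIds already inserts them, or how token-type ids are produced, depends on the concrete tokenizer - exactly the underpinning HuggingFace hides):

using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

const long ClsId = 101; // hypothetical [CLS] id - must be looked up per tokenizer
const long SepId = 102; // hypothetical [SEP] id - must be looked up per tokenizer

IReadOnlyList<int> contextIds = tokenizer.EncodeToIds("some context");
IReadOnlyList<int> sentenceIds = tokenizer.EncodeToIds("some sentence");

// Stitch the pair together by hand: [CLS] context [SEP] sentence [SEP]
var pairIds = new List<long> { ClsId };
foreach (int id in contextIds) pairIds.Add(id);
pairIds.Add(SepId);
foreach (int id in sentenceIds) pairIds.Add(id);
pairIds.Add(SepId);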

Tensors

The need for a tensor type

We have some models that use 2-dimensional tensors as input, and some that use 3-dimensional tensors.
All of our models return 2d tensors, where the first dimension is the batch size, and the second is the actual result for each item in the batch.

Getting this to work with arrays / Memory2D from the CommunityToolkit.HighPerformance package helped, but it is cumbersome to use, and there is no Memory3D or MemoryND.

In python we have numpy.ndarray, which lets the user specify the shape of the tensor and change that shape as needed.
For example, we can batch tokenize 20 sentences when the model needs a 5x4x512 tensor, representing a batch size of 5 with 4 sentences per item and up to 512 tokens per sentence.

For example:

import numpy as np

sentences = [...] # 20 sentences

tokenized_batch: dict[str, np.ndarray] = tokenizer(sentences, return_tensors='np', padding=True)

model_input = {input_name: np.reshape(tensor, (5, 4, -1)) for input_name, tensor in tokenized_batch.items()}

# call model

We should be able to create a view of the underlying data with a new shape, without allocations. This is not possible with arrays, and higher dimensionality isn't easy using Memory<T>.

I am also aware that there are dense and sparse tensors, but we only use dense tensors, so I can't give any input there; it should still be considered as part of the design of a tensor type.
This tensor type should also be compatible with, and easy and efficient to use for, connecting the tokenizer output to the model.
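For reference, the closest no-copy view we could get in the 2D case uses CommunityToolkit.HighPerformance (the shape numbers are illustrative); nothing comparable exists for the 3D reshape above:

using CommunityToolkit.HighPerformance;

// A flat buffer of 20 * 512 token ids can be viewed as 20x512 without copying...
long[] flatIds = new long[20 * 512];
Memory2D<long> view2d = new Memory2D<long>(flatIds, 20, 512);

// ...but there is no Memory3D, so the 5x4x512 view the model needs has to be built by
// hand (or the data copied) - exactly what a reshape-able tensor type would avoid.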

Tensor primitives

All of our models apply SoftMax to the model output before using it.
To do this for a batch I used Memory2D for the model output, then had to loop over each row in the result and call TensorPrimitives.SoftMax to get the result.
I am sure there is a more efficient way to do this that is also simple to use. If there were a tensor type, calling SoftMax on the tensor should run the equivalent of SoftMax on each "row" of the last dimension (or take a parameter specifying which dimension to use).
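The per-row loop currently looks roughly like this sketch (assuming the model output is already wrapped in a Memory2D<float> with shape [batchSize, numLabels]):

using System;
using System.Numerics.Tensors;
using CommunityToolkit.HighPerformance;

// SoftMax each row individually and take its arg-max, because there is no overload that
// applies SoftMax along the last dimension of the whole batch in one call.
static int[] BatchArgMax(Memory2D<float> modelOutput)
{
    int[] choices = new int[modelOutput.Height];
    float[] probabilities = new float[modelOutput.Width];
    for (int row = 0; row < modelOutput.Height; row++)
    {
        TensorPrimitives.SoftMax<float>(modelOutput.Span.GetRowSpan(row), probabilities);
        choices[row] = TensorPrimitives.IndexOfMax<float>(probabilities);
    }
    return choices;
}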

Putting this all Together

In the transformers library there are simple-to-use pipelines that enable users to solve a specific task.
For example, the TextClassificationPipeline lets users tokenize text and then classify it against a set of given labels.
The pipeline takes a batch of sentences, runs the tokenization, runs the model on the tokens, and then returns the label for each sentence along with the logits for each input.

There is no simple to use equivalent pipeline in dotnet.

To make it worse, the ONNX Runtime library uses a custom Tensor type and OrtValue objects that aren't easily created and are very confusing to get right, and a RunAsync method that isn't thread safe as far as I can tell.

I wrote my own pipeline for one of the tasks we need (its rough shape is sketched below), but having to do so makes the transition from python to dotnet very hard, and also very error prone.
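For illustration, every type and member name in the following sketch is hypothetical, not an existing dotnet API - it is the wrapper each team currently has to build for itself:

using System.Collections.Generic;

// Tokenize a batch, run the ONNX session, soft-max the logits, and return a label plus
// score per sentence - the dotnet equivalent of transformers' TextClassificationPipeline.
public sealed record Classification(string Label, float Score);

public interface ITextClassificationPipeline
{
    IReadOnlyList<Classification> Classify(IReadOnlyList<string> sentences);
}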

Bonus

Since we can't use a solution like the NVIDIA Triton Inference Server or other hosted solutions for AI models, we had to write our own inference orchestration to manage batching and parallel processing of requests within a certain time window. This is also very difficult to manage and would be better handled by a dedicated solution (for example, we don't monitor memory usage to see if we can fit more models on the same GPU at the same time).
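A sketch of the kind of micro-batching loop this forces every team to hand-roll (the 5 ms window and 32-item batch size are illustrative, and runBatchAsync stands in for the tokenizer + model call):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Gather requests for up to 5 ms (or until 32 items arrive), then run them through the
// model as a single batch. A dedicated hosting solution could own this, plus GPU memory
// monitoring, instead of every service re-implementing it.
static async Task BatchLoopAsync(
    ChannelReader<string> requests,
    Func<IReadOnlyList<string>, Task> runBatchAsync,
    CancellationToken ct)
{
    var batch = new List<string>(capacity: 32);
    while (await requests.WaitToReadAsync(ct))
    {
        Task window = Task.Delay(TimeSpan.FromMilliseconds(5), ct);
        while (batch.Count < 32 && !window.IsCompleted)
        {
            if (requests.TryRead(out string? item))
            {
                batch.Add(item);
            }
            else
            {
                // Nothing buffered right now: wait for either more input or the window to close.
                await Task.WhenAny(window, requests.WaitToReadAsync(ct).AsTask());
            }
        }
        await runBatchAsync(batch);
        batch.Clear();
    }
}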

@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code python Pull requests for the Python Semantic Kernel triage labels Nov 22, 2024
@github-actions github-actions bot changed the title User Story: PyTorch & HuggingFace Custom Models Migration Story .Net: User Story: PyTorch & HuggingFace Custom Models Migration Story Nov 22, 2024
@github-actions github-actions bot changed the title User Story: PyTorch & HuggingFace Custom Models Migration Story Python: User Story: PyTorch & HuggingFace Custom Models Migration Story Nov 22, 2024
@stephentoub stephentoub changed the title Python: User Story: PyTorch & HuggingFace Custom Models Migration Story .NET: User Story: PyTorch & HuggingFace Custom Models Migration Story Nov 22, 2024
@luisquintanilla
Member

Hi @tjwald,

Thanks for this writeup and feedback.

I'm curious, when you mentioned "The need for a tensor type", did you use Tensor<T>?

https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotnet-9/overview#tensort

@tjwald
Author

tjwald commented Nov 23, 2024

@luisquintanilla I didn't see that the type existed - there was a tensor type provided in the ONNX package that wasn't easy to use.
I will take a look at it, and let you know how it worked out.

Also - I have to add that dotnet 9 is the first release in which I could actually implement our ML model, and it is a lot more performant (10X!!) than our python implementation!
So I am very happy with the way that dotnet is going with AI. This user story is about making it easier and more performant :)

@tjwald
Author

tjwald commented Nov 23, 2024

I have now tried to use the Tensor Type provided in System.Numerics.Tensors and I wasn't able to adapt my POC to use it.

These were the issues I ran into:

  • ONNX doesn't support it natively - and I couldn't get around this by getting a Memory<T> view of the Tensor<T>.
    • In the end I used UnsafeAccessor to simulate this access, but due to the reasons below it wasn't helpful to do so.
  • There is no Tensor(ReadOnly)Memory<T> equivalent of (ReadOnly)Memory<T>, meaning I couldn't pass slices of tensors to functions that need only a portion of one. This is useful for batching - taking the tokenized Tensor<T> and slicing batches out of it.
  • I was trying to convert a Span<ReadOnlyTensorSpan<T>> to OrtValue[], but this resulted in a compilation error because a ref struct like ReadOnlyTensorSpan<T> can't be used as a span element type. Passing a ReadOnlyTensorSpan<T>[] doesn't compile either, since ReadOnlyTensorSpan<T> can't be stored in an array.
  • Trying to get around that limitation, I tried to convert each ReadOnlyTensorSpan<T> to an OrtValue on its own, but I couldn't do so without extra copies (no access to a Memory<T>...).
  • On the output side, it was harder to convert the tensor to a result:
private (int[], float[]) BatchChoices(ReadOnlyTensorSpan<float> modelOutput)
{
    int batchSize = (int)modelOutput.Lengths[0];
    int[] choices = new int[batchSize];
    float[] scores = new float[batchSize];
    Span<float> probabilities = stackalloc float[(int)modelOutput.Lengths[1]];
    // We can't assign a stackalloc directly to a TensorSpan<T>...
    TensorSpan<float> probabilitiesTensor = new TensorSpan<float>(probabilities);
    for (int i = 0; i < batchSize; i++)
    {
        Tensor.SoftMax(modelOutput[(i..i+1), ..], probabilitiesTensor);
        choices[i] = TensorPrimitives.IndexOfMax(probabilities);
        scores[i] = TensorPrimitives.Max(probabilities);
    }
    return (choices, scores);
}

I expected to be able to SoftMax the tensor so that each row is soft-maxed on its own, and to have IndexOfMax / Max applied to each row separately, returning a Span<int> / Span<float> of the indices / scores for each row, like so:

private (int[], float[]) BatchChoices(ReadOnlyTensorSpan<float> modelOutput)
{
    int batchSize = (int)modelOutput.Lengths[0];
    int[] choices = new int[batchSize];
    float[] scores = new float[batchSize];
    Span<float> probabilities = stackalloc float[(int)modelOutput.FlattenedLength];  // this cast is ugly - can we get rid of it?

    TensorSpan<float> probabilitiesTensor = new TensorSpan<float>(probabilities);

    // Desired API: apply each operation along a caller-specified dimension (here, the last one).
    Tensor.SoftMax(modelOutput, probabilitiesTensor, Dimension: ^1);
    Tensor.IndexOfMax(probabilitiesTensor, choices, Dimension: ^1);
    Tensor.Max(probabilitiesTensor, scores, Dimension: ^1);

    return (choices, scores);
}
