add document for GenAI #7170

Merged · 1 commit merged on Jun 13, 2024
1 change: 1 addition & 0 deletions docs/README.md
@@ -17,6 +17,7 @@ Project Docs
- [ML.NET Roadmap](https://github.com/dotnet/machinelearning/blob/main/README.md)
- [ML.NET Cookbook](code/MlNetCookBook.md)
- [ML.NET API Reference Documentation](https://docs.microsoft.com/dotnet/api/?view=ml-dotnet)
- [GenAI Design Document](gen-ai/README.md)

Building from Source
--------------------
14 changes: 14 additions & 0 deletions docs/gen-ai/Benchmark && Evaluation.md
@@ -0,0 +1,14 @@
It's critical to evaluate the performance of the GenAI model once it's available. The evaluation and benchmarking work is two-fold:
- evaluation on various eval datasets: this makes sure our implementation is correct and the model behaves as expected compared to the Python-implemented model.
- benchmarking of inference speed: this makes sure the model can be used in real-time applications.

This document covers how to evaluate the model on various eval datasets.

## How we evaluate the model
To get results that are as comparable as possible with other LLMs, we evaluate the model in the same way as the [Open LLM leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which uses [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) as the evaluation framework.

For the details of which evaluation datasets are used, please refer to the [Open LLM leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

Because `lm-evaluation-harness` is written in Python, there is no way to use it directly from .NET. Therefore we use the following steps as a workaround (a sketch follows the list):
- in C#, start an OpenAI-compatible chat completion server backed by the model we want to evaluate.
- in Python, run `lm-evaluation-harness` against that server using its OpenAI mode.
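
Below is a rough sketch of the C# half of this workaround, assuming ASP.NET Core minimal APIs and the `Generate` extension method described in [CausalLMPipeline](./CausalLMPipeline.md). The request/response shapes are trimmed to the minimum and the tokenizer/model loading is elided, so treat it as illustrative rather than the actual benchmark harness.

```C#
// Sketch only: a bare-bones OpenAI-compatible chat completion endpoint backed by a
// CausalLMPipeline. Real requests from lm-evaluation-harness carry more fields
// (e.g. temperature, max_tokens), which are ignored here for brevity.
LLama2Tokenizer tokenizer;  // assume the tokenizer and model are already loaded,
Phi3ForCausalLM model;      // as in the Usage example in CausalLMPipeline.md
var pipeline = CausalLMPipeline.Create(tokenizer, model);

var app = WebApplication.CreateBuilder(args).Build();

app.MapPost("/v1/chat/completions", (ChatRequest request) =>
{
    // Flatten the chat messages into a single prompt string.
    var prompt = string.Join("\n", request.Messages.Select(m => $"{m.Role}: {m.Content}"));
    var completion = pipeline.Generate(prompt, maxLen: 1024, device: "cuda");

    // Return the minimal subset of the OpenAI chat completion response schema.
    return Results.Json(new
    {
        id = Guid.NewGuid().ToString(),
        @object = "chat.completion",
        model = request.Model,
        choices = new[]
        {
            new { index = 0, finish_reason = "stop", message = new { role = "assistant", content = completion } }
        }
    });
});

app.Run("http://localhost:5000");

record ChatMessage(string Role, string Content);
record ChatRequest(string Model, ChatMessage[] Messages);
```

The Python side then points `lm-evaluation-harness` at `http://localhost:5000/v1` through its OpenAI-compatible model interface and runs the leaderboard tasks as usual.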
118 changes: 118 additions & 0 deletions docs/gen-ai/CausalLMPipeline.md
@@ -0,0 +1,118 @@
# What is a causal language model pipeline?

The causal language model pipeline is a utility class that wraps a tokenizer and a causal language model and provides a unified interface over various decoding methods for generating text. The pipeline is designed to be easy to use and requires only a few lines of code to generate text.

In Microsoft.ML.GenAI, we will provide a base `CausalLMPipeline` class plus a typed `CausalLMPipeline<TTokenizer, TCausalLM>` class which specifies the type parameters for the tokenizer and the causal language model. The typed class makes it easier to develop consuming methods for Semantic Kernel; see [here](./Usage.md#consume-model-from-semantic-kernel) for more details.
# Contract
```C#
public abstract class CausalLMPipeline
{
    public virtual (
        Tensor, // output token ids [batch_size, sequence_length]
        Tensor  // output logits [batch_size, sequence_length, vocab_size]
    ) Generate(
        Tensor inputIds,      // input token ids [batch_size, sequence_length]
        Tensor attentionMask, // attention mask [batch_size, sequence_length]
        float temperature = 0.7f,
        float topP = 0.9f,
        int maxLen = 128,
        int[][]? stopTokenSequence = null,
        bool echo = false);   // echo the input token ids in the output token ids
}

public class CausalLMPipeline<TTokenizer, TCausalLM> : CausalLMPipeline
    where TTokenizer : ITokenizer
    where TCausalLM : nn.Module<CausalLMModelInput, CausalLMModelOutput>
{
    public CausalLMPipeline<LLama2Tokenizer, Phi3ForCausalLM> Create(LLama2Tokenizer tokenizer, Phi3ForCausalLM model);
}
```

**Review thread on the `Tensor` types used by `Generate`:**

> **Member:** Is this the TorchSharp tensor type? Can / should it become the System.Numerics.Tensors one?
>
> **@LittleLittleCloud (Contributor, Jun 11, 2024):** Yes, it is the TorchSharp tensor type. If System.Numerics.Tensors can work with libtorch, then maybe yes.
>
> We're some ways from having Tensor integrated with TorchSharp; it is still experimental and subject to change.
>
> And it would be the TorchSharp shim implementing ITensor, not Tensor from SNT.
>
> **Member:** Presumably it would be `ITensor<T>`, such that a `Tensor<T>` could be passed in? The implementation could type-test whether it is a TorchSharp implementation and act accordingly.
>
> Yes, in the fullness of time.

**Review thread on the `Create` method:**

> **Member:** Presumably this is intended to instead be a static method on the base class?
>
> **@LittleLittleCloud (Contributor):** Hmm, on second thought, maybe not. `CausalLMPipeline` and `CausalLMPipeline<TTokenizer, TCausalLM>` will live in GenAI.Core, while `Phi3ForCausalLM` lives in GenAI.Phi3. In that case, this `Create` method no longer needs to exist; using a constructor that takes `TTokenizer` and `TCausalLM` would be easier to understand.

# Usage
```C#
LLama2Tokenizer tokenizer;
Phi3ForCausalLM model;

var pipeline = CausalLMPipeline.Create(tokenizer, model);
var prompt = "Once upon a time";
// top-p sampling
var output = pipeline.Generate(
    prompt: prompt,
    maxLen: 100,
    temperature: 0.7f,
    topP: 0.9f,
    stopSequences: null,
    device: "cuda",
    bos: true,   // add bos token to the prompt
    eos: false,  // do not add eos token to the prompt
    echo: true   // echo the prompt in the generated text
);
```

# Sampling methods
The `CausalLMPipeline` provides a unified interface over various decoding methods for generating text. This saves us from having to implement each decoding method separately for every model.

## Sampling
```C#
public virtual (
    Tensor, // output token ids [batch_size, sequence_length]
    Tensor  // output logits [batch_size, sequence_length, vocab_size]
) Generate(
    Tensor inputIds,      // input token ids [batch_size, sequence_length]
    Tensor attentionMask, // attention mask [batch_size, sequence_length]
    float temperature = 0.7f,
    float topP = 0.9f,
    int maxLen = 128,
    int[][]? stopTokenSequence = null,
    bool echo = false);   // echo the input token ids in the output token ids
```
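
To make the `temperature` and `topP` parameters concrete, here is a minimal, framework-free sketch of temperature scaling plus top-p (nucleus) sampling over a single logit vector. The real pipeline performs the equivalent operations on TorchSharp tensors in batch, so this is purely illustrative.

```C#
// Sample one token id from a logit vector using temperature + top-p (nucleus) sampling.
static int SampleTopP(float[] logits, float temperature = 0.7f, float topP = 0.9f, Random? rng = null)
{
    rng ??= Random.Shared;

    // 1. Temperature scaling followed by a numerically stable softmax.
    var scaled = logits.Select(l => l / temperature).ToArray();
    var max = scaled.Max();
    var exp = scaled.Select(s => Math.Exp(s - max)).ToArray();
    var sum = exp.Sum();
    var probs = exp.Select(e => e / sum).ToArray();

    // 2. Keep the smallest set of highest-probability tokens whose cumulative
    //    probability reaches topP (the "nucleus").
    var nucleus = new List<int>();
    var cumulative = 0.0;
    foreach (var id in Enumerable.Range(0, probs.Length).OrderByDescending(i => probs[i]))
    {
        nucleus.Add(id);
        cumulative += probs[id];
        if (cumulative >= topP) break;
    }

    // 3. Renormalize over the nucleus and draw one token id.
    var draw = rng.NextDouble() * cumulative;
    var acc = 0.0;
    foreach (var id in nucleus)
    {
        acc += probs[id];
        if (draw <= acc) return id;
    }

    return nucleus[^1];
}
```

Lowering `temperature` sharpens the distribution before the cutoff, and lowering `topP` shrinks the set of candidate tokens.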

> [!NOTE]
> Greedy search and beam search are not implemented in the pipeline yet; they will be added in the future.

## Greedy Search
```C#
public (
    Tensor, // output token ids [batch_size, sequence_length]
    Tensor  // output logits [batch_size, sequence_length, vocab_size]
) GreedySearch(
    Tensor inputIds,      // input token ids [batch_size, sequence_length]
    Tensor attentionMask, // attention mask [batch_size, sequence_length]
    int maxLen = 128,
    int[][]? stopTokenSequence = null,
    bool echo = false);   // echo the input token ids in the output token ids
```
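
The note above applies here too; as an illustration of the intended behavior, the sketch below decodes greedily with a plain arg-max loop. `nextTokenLogits` is a hypothetical callback standing in for a forward pass of the underlying causal LM.

```C#
// Greedy decoding: always take the most likely next token until eos or maxLen.
static List<int> GreedyDecode(
    Func<IReadOnlyList<int>, float[]> nextTokenLogits, // hypothetical stand-in for the model forward pass
    IEnumerable<int> promptIds,
    int eosTokenId,
    int maxLen = 128)
{
    var tokens = new List<int>(promptIds);
    for (var i = 0; i < maxLen; i++)
    {
        var logits = nextTokenLogits(tokens);
        var next = Array.IndexOf(logits, logits.Max()); // arg-max over the vocabulary
        tokens.Add(next);
        if (next == eosTokenId) break;                  // stop once the model emits eos
    }

    return tokens;
}
```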

## Beam Search
```C#
public (
    Tensor, // output token ids [batch_size, sequence_length]
    Tensor  // output logits [batch_size, sequence_length, vocab_size]
) BeamSearch(
    Tensor inputIds,      // input token ids [batch_size, sequence_length]
    Tensor attentionMask, // attention mask [batch_size, sequence_length]
    int maxLen = 128,
    int[][]? stopTokenSequence = null,
    int beamSize = 5,
    float lengthPenalty = 1.0f,
    bool echo = false);   // echo the input token ids in the output token ids
```
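
For illustration, the sketch below shows the bookkeeping behind `beamSize` and `lengthPenalty`, using the same hypothetical callback idea as the greedy sketch above. The length normalization shown (score divided by generated length raised to `lengthPenalty`) is one common convention, not necessarily the one the pipeline will adopt.

```C#
// Beam search: keep the `beamSize` best partial sequences at every step.
static List<int> BeamSearchDecode(
    Func<IReadOnlyList<int>, double[]> nextTokenLogProbs, // hypothetical model forward pass returning log-probs
    IReadOnlyList<int> promptIds,
    int eosTokenId,
    int maxLen = 128,
    int beamSize = 5,
    double lengthPenalty = 1.0)
{
    var beams = new List<(List<int> Tokens, double Score, bool Done)>
    {
        (new List<int>(promptIds), 0.0, false)
    };

    for (var step = 0; step < maxLen && beams.Any(b => !b.Done); step++)
    {
        var candidates = new List<(List<int> Tokens, double Score, bool Done)>();
        foreach (var beam in beams)
        {
            if (beam.Done) { candidates.Add(beam); continue; }

            var logProbs = nextTokenLogProbs(beam.Tokens);
            // Expand each live beam with its `beamSize` best continuations.
            foreach (var tokenId in Enumerable.Range(0, logProbs.Length)
                                              .OrderByDescending(i => logProbs[i])
                                              .Take(beamSize))
            {
                var tokens = new List<int>(beam.Tokens) { tokenId };
                candidates.Add((tokens, beam.Score + logProbs[tokenId], tokenId == eosTokenId));
            }
        }

        // Keep the overall best `beamSize` candidates, normalized by generated length.
        beams = candidates
            .OrderByDescending(c => c.Score / Math.Pow(c.Tokens.Count - promptIds.Count, lengthPenalty))
            .Take(beamSize)
            .ToList();
    }

    return beams.First().Tokens;
}
```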

## The extension method for `CausalLMPipeline`

The `Generate` extension method provides an even easier way to generate text, without having to construct the input tensors yourself. It takes a prompt string and other optional parameters and returns the generated text.

```C#
public static string Generate(
    this CausalLMPipeline pipeline,
    string prompt,
    int maxLen = 128,
    float temperature = 0.7f,
    float topP = 0.9f,
    string[]? stopSequences = null,
    string device = "cpu",
    bool bos = true,
    bool eos = false,
    bool echo = false)
```
75 changes: 75 additions & 0 deletions docs/gen-ai/CausalLanguageModel.md
@@ -0,0 +1,75 @@
# What is a Causal Language Model?

A causal language model is a type of language model that predicts the next token in a sequence of tokens. The model generates text one token at a time, with each token conditioned on the tokens that came before it. This type of model is useful for generating text, such as in chatbots, machine translation, and text summarization. [see more](https://huggingface.co/docs/transformers/tasks/language_modeling)


# The Causal Language Model Contract
In the remaining sections, we will describe the contract for a causal language model.

## `CausalLMModelInput`
```C#
public class CausalLMModelInput
{
    // [batch_size, sequence_length]
    public Tensor input_ids { get; set; }

    // optional: [batch_size, sequence_length]
    public Tensor? attention_mask { get; set; }

    // optional: [batch_size, sequence_length]
    public Tensor? position_ids { get; set; }

    // optional: kv cache for attention layers
    public IKVCache? kv_cache { get; set; }

    // optional: [batch_size, sequence_length, hidden_size]
    // if provided, the model will use these embeddings instead of computing them from input_ids
    public Tensor? inputs_embeds { get; set; }

    // whether to use the kv cache when calculating attention
    public bool use_cache { get; set; }

    // whether to return attentions in the model output
    public bool output_attentions { get; set; }

    // whether to return hidden states in the model output,
    // e.g. for calculating loss
    public bool output_hidden_states { get; set; }
}
```

## `CausalLMModelOutput`
```C#
public class CausalLMModelOutput
{
    // [batch_size, sequence_length, vocab_size]
    // the predicted logits for each token in the input sequence
    public Tensor logits { get; set; }

    // optional: [batch_size, sequence_length, hidden_size]
    public Tensor last_hidden_state { get; set; }

    // optional: all hidden states
    public Tensor[]? hidden_states { get; set; }

    // optional: all attentions
    public Tensor[]? attentions { get; set; }

    // optional: kv cache for attention layers
    public IKVCache? cache { get; set; }
}
```

Once both `CausalLMModelInput` and `CausalLMModelOutput` are defined, a causal language model can be implemented as follows (using Phi-3 as an example):

```C#
public class Phi3ForCausalLM : nn.Module<CausalLMModelInput, CausalLMModelOutput>
```
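
For illustration, a decoding loop that consumes this contract would run one forward step roughly as sketched below. The member names follow the contract above, while the object-initializer construction and the `forward` call are assumptions about the eventual implementation rather than a fixed API.

```C#
Phi3ForCausalLM model;   // loaded elsewhere
Tensor inputIds;         // [batch_size, sequence_length]
Tensor attentionMask;    // [batch_size, sequence_length]

var input = new CausalLMModelInput
{
    input_ids = inputIds,
    attention_mask = attentionMask,
    use_cache = true,    // keep the kv cache so the next decoding step is incremental
};

CausalLMModelOutput output = model.forward(input);

// output.logits has shape [batch_size, sequence_length, vocab_size]; a decoding loop
// samples (or arg-maxes) the next token from the logits at the last sequence position,
// appends it to input_ids, and repeats until eos or the length limit is reached.
```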


# Which language models have been implemented using this contract in this repo?
- `Phi3ForCausalLM`
- `Phi2ForCausalLM`

# Which language models have been implemented using this pattern, but not exactly the same contract class, in the other repo?
- `LLaMAForCausalLM` (for both llama2 and llama3)
11 changes: 11 additions & 0 deletions docs/gen-ai/DynamicLoading.md
@@ -0,0 +1,11 @@
Dynamic loading is a technique for running inference with a very large model on a machine with limited GPU memory. The idea is to load only part of the model into GPU memory and run inference on that part; once it is done, the part is released from GPU memory and the next part is loaded. This process is repeated until the whole model has been processed.

The technique is available in both llama.cpp and [Hugging Face accelerate](https://huggingface.co/blog/accelerate-large-models). The GenAI model package should also support it; a conceptual sketch follows.
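
`ILayerBlock` below is a hypothetical abstraction over a chunk of model layers, introduced here only for illustration; a real implementation would move TorchSharp modules between devices and decide which blocks to keep pinned on the GPU.

```C#
using System.Collections.Generic;
using static TorchSharp.torch;

// Hypothetical abstraction over a chunk of model layers.
public interface ILayerBlock
{
    bool PinnedToGpu { get; }            // blocks pinned to GPU are never evicted
    void MoveTo(string device);          // e.g. "cuda" or "cpu"
    Tensor Forward(Tensor hiddenState);
}

public static class DynamicLoadingSketch
{
    public static Tensor Run(IReadOnlyList<ILayerBlock> blocks, Tensor hiddenState)
    {
        foreach (var block in blocks)
        {
            if (!block.PinnedToGpu)
                block.MoveTo("cuda");    // stream this block's weights to GPU on demand

            hiddenState = block.Forward(hiddenState);

            if (!block.PinnedToGpu)
                block.MoveTo("cpu");     // release GPU memory before loading the next block
        }

        return hiddenState;
    }
}
```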

## Update on 2024/05/30
An experiment with partial loading was done in PR #10. The main take-aways are:
- partial loading can achieve an acceleration of 1.03X to over 30X, even without fully loading the model to GPU.
- the main bottleneck is still memory traffic between CPU and GPU.
- larger blocks should get higher priority when deciding which blocks to 'pin' to GPU memory.

The results can be found in [this report](DynamicLoadingReport.md).
64 changes: 64 additions & 0 deletions docs/gen-ai/DynamicLoadingReport.md
@@ -0,0 +1,64 @@
## Conclusion

- The main bottleneck of auto inference (dynamic loading) is the overhead of CPU-GPU data transfer.
- The larger the layer size, the more acceleration we can get from the GPU, so we should try to put larger layers on the GPU.

## Hardware: i9-14900K, 64 GB memory, RTX 4090
### Sequential Layer

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|----------------|------------|------------|-----------------------|-----------------------|------------------------|--------------|-------------------|
| CPU | 512 | 4MB | 2GB | - | - | 939.8 | 1.0 | 0% |
| Auto | 512 | 4MB | 2GB | 0 | 512 | 490 | 1.9 | 0% |
| Auto | 512 | 4MB | 2GB | 253 | 259 | 272 | 3.5 | 49.4% |
| Auto | 512 | 4MB | 2GB | 512 | 0 | 32 | 29.4 | 100% |
| GPU | 512 | 4MB | 2GB | - | - | 32.4 | 29.0 | 100% |

### Sequential Layer, Deeper Model

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|----------------|------------|------------|-----------------------|-----------------------|------------------------|--------------|-------------------|
| CPU | 1024 | 4MB | 4GB | - | - | 1839.8 | 1.0 | 0% |
| Auto | 1024 | 4MB | 4GB | 0 | 1024 | 954 | 1.9 | 0% |
| Auto | 1024 | 4MB | 4GB | 252 | 772 | 787 | 2.3 | 24.6% |
| Auto | 1024 | 4MB | 4GB | 508 | 516 | 530 | 3.5 | 49.6% |
| Auto | 1024 | 4MB | 4GB | 764 | 260 | 312.5 | 5.9 | 74.6% |
| Auto | 1024 | 4MB | 4GB | 1020 | 4 | 69.7 | 26.9 | 99.6% |
| GPU | 1024 | 4MB | 4GB | - | - | 65.9 | 27.9 | 100% |

### Sequential Layer, Larger Layer (16MB)

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|----------------|------------|------------|-----------------------|-----------------------|------------------------|--------------|-------------------|
| CPU | 256 | 16MB | 4GB | - | - | 864 | 1.0 | 0% |
| Auto | 256 | 16MB | 4GB | 0 | 256 | 844.7 | 1.02 | 0% |
| Auto | 256 | 16MB | 4GB | 60 | 196 | 669.9 | 1.3 | 23.4% |
| Auto | 256 | 16MB | 4GB | 124 | 132 | 494.2 | 1.7 | 48.4% |
| Auto | 256 | 16MB | 4GB | 188 | 68 | 372.7 | 2.3 | 73.4% |
| Auto | 256 | 16MB | 4GB | 252 | 4 | 152.5 | 5.7 | 98.4% |
| GPU | 256 | 16MB | 4GB | - | - | 119 | 7.3 | 100% |

### Sequential Layer, Even Larger Layer (64MB)

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|----------------|------------|------------|-----------------------|-----------------------|------------------------|--------------|-------------------|
| CPU | 64 | 64MB | 4GB | - | - | 8501 | 1.0 | 0% |
| Auto | 64 | 64MB | 4GB | 0 | 64 | 898 | 9.5 | 0% |
| Auto | 64 | 64MB | 4GB | 12 | 52 | 755.2 | 11.3 | 18.8% |
| Auto | 64 | 64MB | 4GB | 28 | 36 | 598 | 14.2 | 43.8% |
| Auto | 64 | 64MB | 4GB | 44 | 20 | 419.7 | 20.2 | 68.8% |
| Auto | 64 | 64MB | 4GB | 60 | 4 | 263.7 | 32.3 | 93.8% |
| Auto | 64 | 64MB | 4GB | 64 | 0 | 70.54 | 121 | 100% |
| GPU | 64 | 64MB | 4GB | - | - | 69.8 | 121.7 | 100% |

## Hardware: Xeon W-2133, 32 GB memory, GTX 1066
| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|----------------|------------|------------|-----------------------|-----------------------|------------------------|--------------|-------------------|
| CPU | 64 | 64MB | 4GB | - | - | 17419 | 1.0 | 0% |
| Auto | 64 | 64MB | 4GB | 0 | 64 | 3783.4 | 4.6 | 0% |
| Auto | 64 | 64MB | 4GB | 12 | 52 | 3415 | 5.1 | 18.8% |
| Auto | 64 | 64MB | 4GB | 28 | 36 | 3004 | 5.79 | 43.8% |
| Auto | 64 | 64MB | 4GB | 44 | 20 | 2536 | 6.86 | 68.8% |
| Auto | 64 | 64MB | 4GB | 60 | 4 | 2101 | 8.29 | 93.8% |
| Auto | 64 | 64MB | 4GB | 64 | 0 | 1163 | 14.97 | 100% |
| GPU | 64 | 64MB | 4GB | - | - | 1213 | 14.3 | 100% |
7 changes: 7 additions & 0 deletions docs/gen-ai/Package Structure.md
@@ -0,0 +1,7 @@
The GenAI project will be a collection of popular open source AI models. It will be organized in the following structure:

- Microsoft.ML.GenAI.Core: the core library for the GenAI project. It contains the fundamental contracts and classes like `CausalLanguageModel` and `CausalLMPipeline`.
- Microsoft.ML.GenAI.{ModelName}: the implementation of a specific model, including the model configuration, the causal LM implementation (like `Phi3ForCausalLM`) and a tokenizer implementation if any. In the first stage, we plan to provide the following models:
  - Microsoft.ML.GenAI.Phi: the implementation of the Phi series of models
  - Microsoft.ML.GenAI.LLaMA: the implementation of the LLaMA series of models
  - Microsoft.ML.GenAI.StableDiffusion: the implementation of the Stable Diffusion model
16 changes: 16 additions & 0 deletions docs/gen-ai/README.md
@@ -0,0 +1,16 @@
This folder contains the design docs for the GenAI Model package.

### Basic
- [Package Structure](./Package%20Structure.md): the structure of the GenAI Model package
- [Usage](./Usage.md): how to use the models from the GenAI Model package
- [Benchmark && Evaluation](./Benchmark%20&&%20Evaluation.md): how to evaluate the models from the GenAI Model package

### Contracts && API
- [CausalLMPipeline](./CausalLMPipeline.md)
- [CausalLMModelInput and CausalLMModelOutput](./CausalLanguageModel.md)
- [Tokenizer](./Tokenizer.md)

### Need further investigation
- [Dynamic loading](./DynamicLoading.md): load only part of the model to GPU when GPU memory is limited. We explore the results with and without dynamic loading in [this report](./DynamicLoadingReport.md).
- Improve loading speed: I notice that model loading from disk to memory is slower in TorchSharp than it is in Hugging Face; we need to investigate the reason and improve the loading speed.
- Quantization: quantize the model to reduce the model size and improve the inference speed.
6 changes: 6 additions & 0 deletions docs/gen-ai/Tokenizer.md
@@ -0,0 +1,6 @@
# What is a tokenizer?

A tokenizer is a class that splits a string into tokens and encodes them into numerical (integer) values.

# The Tokenizer Contract
We can simply use the tokenizers from the `Microsoft.ML.Tokenizers` package; a minimal sketch of what the pipeline relies on is shown below.
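
For illustration, the tokenizer surface that `CausalLMPipeline` relies on could be as small as the hypothetical interface below; the exact names and members are assumptions for this sketch, not the final API of `Microsoft.ML.Tokenizers`.

```C#
// Hypothetical minimal tokenizer contract used by the pipeline (illustrative only).
public interface ITokenizer
{
    int BosId { get; }   // beginning-of-sequence token id
    int EosId { get; }   // end-of-sequence token id

    // Encode a string into token ids, optionally surrounding it with bos/eos tokens.
    int[] Encode(string text, bool addBos = true, bool addEos = false);

    // Decode token ids back into a string.
    string Decode(int[] tokenIds);
}
```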