add document for GenAI #7170

@@ -0,0 +1,14 @@

It's critical to evaluate the performance of the GenAI model once it's available. The evaluation and benchmarking are two-fold:

- evaluation on various eval datasets: this is to make sure our implementation is correct and the model works as expected compared to the Python-implemented model.
- benchmark on inference speed: this is to make sure the model can be used in real-time applications.

This document covers how to evaluate the model on various eval datasets.

## How we evaluate the model

To get results that are most comparable with other LLMs, we evaluate the model in the same way as the [Open LLM leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which uses [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) as the evaluation framework.

For details on which evaluation datasets are used, please refer to the [Open LLM leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

Because `lm-evaluation-harness` is written in Python, there is no way to use it directly from .NET. Therefore we use the following steps as a workaround (see the sketch after this list):

- In C#, start an OpenAI-compatible chat completion server that serves the model we want to evaluate.
- In Python, use `lm-evaluation-harness` to evaluate the model through its OpenAI mode.
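Below is a minimal sketch of the C# side of this workaround. It assumes an ASP.NET Core minimal-API project and the string-level `Generate` extension method described in the CausalLMPipeline design doc; the request and response shapes follow the OpenAI chat completion convention but are heavily simplified, and the prompt construction is illustrative only.

```C#
// Sketch only: serve a CausalLMPipeline behind an OpenAI-style
// /v1/chat/completions endpoint so lm-evaluation-harness can call it.
using System.Linq;
using System.Text.Json.Serialization;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

LLama2Tokenizer tokenizer;   // assumed to be loaded, as in the Usage example of CausalLMPipeline.md
Phi3ForCausalLM model;       // assumed to be loaded
var pipeline = CausalLMPipeline.Create(tokenizer, model);

var app = WebApplication.CreateBuilder(args).Build();

app.MapPost("/v1/chat/completions", (ChatRequest request) =>
{
    // Naively flatten the chat messages into a single prompt;
    // a real server would apply the model's chat template instead.
    var prompt = string.Join("\n", request.Messages.Select(m => $"{m.Role}: {m.Content}"));
    var completion = pipeline.Generate(prompt, maxLen: 512);

    return Results.Json(new
    {
        choices = new[] { new { message = new { role = "assistant", content = completion } } },
    });
});

app.Run();

record ChatMessage(
    [property: JsonPropertyName("role")] string Role,
    [property: JsonPropertyName("content")] string Content);

record ChatRequest(
    [property: JsonPropertyName("messages")] ChatMessage[] Messages);
```

On the Python side, `lm-evaluation-harness` can then be pointed at this local endpoint through its OpenAI-compatible backend.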
@@ -0,0 +1,118 @@

# What is a causal language model pipeline?

The causal language model pipeline is a utility class that wraps a tokenizer and a causal language model and provides a uniform interface over various decoding methods to generate text. The pipeline is designed to be easy to use and requires only a few lines of code to generate text.

In Microsoft.ML.GenAI, we will provide a base (non-generic) `CausalLMPipeline` class plus a typed `CausalLMPipeline<TTokenizer, TCausalLM>` class which specifies the type parameters for the tokenizer and the causal language model. The typed class makes it easier to develop consuming methods for Semantic Kernel; see [here](./Usage.md#consume-model-from-semantic-kernel) for more details.
# Contract

```C#
public abstract class CausalLMPipeline
{
    public virtual (
        Tensor, // output token ids [batch_size, sequence_length]
        Tensor  // output logits [batch_size, sequence_length, vocab_size]
    ) Generate(
        Tensor inputIds,          // input token ids [batch_size, sequence_length]
        Tensor attentionMask,     // attention mask [batch_size, sequence_length]
        float temperature = 0.7f,
        float topP = 0.9f,
        int maxLen = 128,
        int[][]? stopTokenSequence = null,
        bool echo = false);       // echo the input token ids in the output token ids
}

public class CausalLMPipeline<TTokenizer, TCausalLM> : CausalLMPipeline
    where TTokenizer : ITokenizer
    where TCausalLM : nn.Module<CausalLMModelInput, CausalLMModelOutput>
{
    public CausalLMPipeline<LLama2Tokenizer, Phi3ForCausalLM> Create(LLama2Tokenizer tokenizer, Phi3ForCausalLM model);
}
```

Review discussion on `Create`:

> Presumably this is intended to instead be a static method on the base class?
>
> Hmmm, on second thought, maybe not.
# Usage

```C#
LLama2Tokenizer tokenizer;
Phi3ForCausalLM model;

var pipeline = CausalLMPipeline.Create(tokenizer, model);
var prompt = "Once upon a time";
// top-p sampling
var output = pipeline.Generate(
    prompt: prompt,
    maxLen: 100,
    temperature: 0.7f,
    topP: 0.9f,
    stopSequences: null,
    device: "cuda",
    bos: true,   // add bos token to the prompt
    eos: false,  // do not add eos token to the prompt
    echo: true   // echo the prompt in the generated text
);
```
# Sampling methods

The `CausalLMPipeline` provides a uniform interface over various decoding methods to generate text. This saves the effort of implementing different decoding methods for each model.

## Sampling

```C#
public virtual (
    Tensor, // output token ids [batch_size, sequence_length]
    Tensor  // output logits [batch_size, sequence_length, vocab_size]
) Generate(
    Tensor inputIds,          // input token ids [batch_size, sequence_length]
    Tensor attentionMask,     // attention mask [batch_size, sequence_length]
    float temperature = 0.7f,
    float topP = 0.9f,
    int maxLen = 128,
    int[][]? stopTokenSequence = null,
    bool echo = false);       // echo the input token ids in the output token ids
```
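For intuition, here is a minimal sketch of what the `temperature` and `topP` parameters do during one sampling step, written against TorchSharp tensors. It is illustrative only, not the pipeline's actual implementation, and the helper name `SampleNextToken` is made up for this example.

```C#
// Illustrative sketch of temperature + top-p (nucleus) sampling for one decoding step.
using TorchSharp;
using static TorchSharp.torch;

public static class SamplingSketch
{
    // logits: [batch_size, vocab_size] for the last position; returns [batch_size, 1] token ids.
    public static Tensor SampleNextToken(Tensor logits, float temperature = 0.7f, float topP = 0.9f)
    {
        // Temperature rescales the logits before the softmax: lower values sharpen the distribution.
        var probs = (logits / temperature).softmax(-1);

        // Sort probabilities in descending order and keep only the smallest prefix (the "nucleus")
        // whose cumulative probability reaches topP; everything else gets probability 0.
        var (sortedProbs, sortedIndices) = torch.sort(probs, -1, true);
        var outsideNucleus = sortedProbs.cumsum(-1) - sortedProbs > topP;
        sortedProbs = sortedProbs.masked_fill(outsideNucleus, 0.0);

        // Sample within the sorted space (multinomial accepts unnormalized weights),
        // then map the sampled positions back to vocabulary ids.
        var sampled = torch.multinomial(sortedProbs, 1);
        return sortedIndices.gather(-1, sampled);
    }
}
```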
> [!NOTE]
> Greedy search and beam search are not implemented in the pipeline yet. They will be added in the future.
## Greedy Search

```C#
public (
    Tensor, // output token ids [batch_size, sequence_length]
    Tensor  // output logits [batch_size, sequence_length, vocab_size]
) GreedySearch(
    Tensor inputIds,          // input token ids [batch_size, sequence_length]
    Tensor attentionMask,     // attention mask [batch_size, sequence_length]
    int maxLen = 128,
    int[][]? stopTokenSequence = null,
    bool echo = false);       // echo the input token ids in the output token ids
```
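To illustrate what this method will do once implemented, here is a standalone sketch of greedy decoding against the causal LM contract from the CausalLanguageModel design doc. It is not the pipeline's implementation: it ignores the kv cache and stop sequences, and it assumes `CausalLMModelInput` has a parameterless constructor.

```C#
// Illustrative sketch of greedy decoding: at every step, pick the most probable next token.
using TorchSharp;
using static TorchSharp.torch;

public static class GreedySearchSketch
{
    public static Tensor Decode(
        nn.Module<CausalLMModelInput, CausalLMModelOutput> model,
        Tensor inputIds,   // [batch_size, sequence_length]
        int maxLen = 128)
    {
        var ids = inputIds;
        for (var step = 0; step < maxLen; step++)
        {
            // Recomputes the whole sequence each step; a real implementation would use the kv cache.
            var output = model.forward(new CausalLMModelInput { input_ids = ids });

            // Logits of the last position: [batch_size, vocab_size].
            var lastLogits = output.logits.select(1, output.logits.shape[1] - 1);

            // Greedy choice: the arg-max token, shaped [batch_size, 1].
            var nextToken = torch.argmax(lastLogits, -1, true);

            ids = torch.cat(new[] { ids, nextToken }, 1);
            // A real implementation would also stop once a stopTokenSequence is generated.
        }

        return ids;
    }
}
```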
## Beam Search

```C#
public (
    Tensor, // output token ids [batch_size, sequence_length]
    Tensor  // output logits [batch_size, sequence_length, vocab_size]
) BeamSearch(
    Tensor inputIds,          // input token ids [batch_size, sequence_length]
    Tensor attentionMask,     // attention mask [batch_size, sequence_length]
    int maxLen = 128,
    int[][]? stopTokenSequence = null,
    int beamSize = 5,
    float lengthPenalty = 1.0f,
    bool echo = false);       // echo the input token ids in the output token ids
```
## The extension method for `CausalLMPipeline`

The `Generate` extension method provides an even easier way to generate text, without the need to construct the input tensors manually. The method takes a prompt string and other optional parameters and returns the generated text.

```C#
public static string Generate(
    this CausalLMPipeline pipeline,
    string prompt,
    int maxLen = 128,
    float temperature = 0.7f,
    float topP = 0.9f,
    string[]? stopSequences = null,
    string device = "cpu",
    bool bos = true,
    bool eos = false,
    bool echo = false)
```
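For example, assuming `pipeline` was created as in the Usage section above:

```C#
// Generate text directly from a prompt string; the tensor plumbing is handled by the extension method.
var story = pipeline.Generate("Once upon a time", maxLen: 64, device: "cuda");
```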
@@ -0,0 +1,75 @@

# What is a Causal Language Model?

A causal language model is a type of language model that predicts the next token in a sequence of tokens. The model generates text one token at a time, with each token conditioned on the tokens that came before it. This type of model is useful for generating text, such as in chatbots, machine translation, and text summarization. [See more](https://huggingface.co/docs/transformers/tasks/language_modeling).

# The Causal Language Model Contract

In the remaining sections, we will describe the contract for a causal language model.
## `CausalLMModelInput`

```C#
public class CausalLMModelInput
{
    // [batch_size, sequence_length]
    public Tensor input_ids { get; set; }

    // optional: [batch_size, sequence_length]
    public Tensor? attention_mask { get; set; }

    // optional: [batch_size, sequence_length]
    public Tensor? position_ids { get; set; }

    // optional: kv cache for attention layers
    public IKVCache? kv_cache { get; set; }

    // optional: [batch_size, sequence_length, hidden_size]
    // if provided, the model will use these embeddings instead of computing them from input_ids
    public Tensor? inputs_embeds { get; set; }

    // whether to use the kv cache when calculating attention
    public bool use_cache { get; set; }

    // whether to return attentions in the model output
    public bool output_attentions { get; set; }

    // whether to return hidden states in the model output,
    // e.g. for calculating loss
    public bool output_hidden_states { get; set; }
}
```
## `CausalLMModelOutput`

```C#
public class CausalLMModelOutput
{
    // [batch_size, sequence_length, vocab_size]
    // The predicted logits for each token in the input sequence.
    public Tensor logits { get; set; }

    // optional: [batch_size, sequence_length, hidden_size]
    public Tensor last_hidden_state { get; set; }

    // optional: all hidden states
    public Tensor[]? hidden_states { get; set; }

    // optional: all attentions
    public Tensor[]? attentions { get; set; }

    // optional: kv cache for attention layers
    public IKVCache? cache { get; set; }
}
```
Once both `CausalLMModelInput` and `CausalLMModelOutput` are defined, a causal language model can be implemented against them (using Phi-3 as an example):

```C#
public class Phi3ForCausalLM : nn.Module<CausalLMModelInput, CausalLMModelOutput>
```
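To make the contract concrete, here is a toy model that satisfies it. This is a hedged sketch, not the Phi-3 implementation: the `ToyCausalLM` name, the layer layout, and the assumption that `CausalLMModelOutput` has a parameterless constructor are all illustrative.

```C#
using TorchSharp;
using static TorchSharp.torch;

// A minimal nn.Module<CausalLMModelInput, CausalLMModelOutput> implementation.
// A real model would stack attention blocks and honor the kv cache and output flags.
public class ToyCausalLM : nn.Module<CausalLMModelInput, CausalLMModelOutput>
{
    private readonly nn.Module<Tensor, Tensor> embedding;
    private readonly nn.Module<Tensor, Tensor> decoder;   // stand-in for the transformer layers
    private readonly nn.Module<Tensor, Tensor> lmHead;

    public ToyCausalLM(int vocabSize, int hiddenSize) : base(nameof(ToyCausalLM))
    {
        embedding = nn.Embedding(vocabSize, hiddenSize);
        decoder = nn.Linear(hiddenSize, hiddenSize);
        lmHead = nn.Linear(hiddenSize, vocabSize, false);  // no bias on the lm head
        RegisterComponents();
    }

    public override CausalLMModelOutput forward(CausalLMModelInput input)
    {
        // Use the provided embeddings if any, otherwise compute them from input_ids.
        var hidden = input.inputs_embeds ?? embedding.forward(input.input_ids);
        hidden = decoder.forward(hidden);

        // Project the hidden states to the vocabulary: [batch_size, sequence_length, vocab_size].
        var logits = lmHead.forward(hidden);

        return new CausalLMModelOutput { logits = logits, last_hidden_state = hidden };
    }
}
```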
# Which language models have been implemented using this contract in this repo?

- `Phi3ForCausalLM`
- `Phi2ForCausalLM`

# Which language models have been implemented using this pattern, but not exactly the same contract class, in other repos?

- `LLaMAForCausalLM` (for both Llama 2 and Llama 3)
@@ -0,0 +1,11 @@

Dynamic loading is a technique for running inference with a very large model on a machine with limited GPU memory. The idea is to load only part of the model into GPU memory and run inference on that part. Once that step is done, the loaded part is released from GPU memory and the next part is loaded. This process is repeated until the whole model has been processed, as sketched below.
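A conceptual sketch of the idea, assuming TorchSharp modules; this is not the implementation experimented with in PR #10, and the block granularity and scheduling are deliberately simplified.

```C#
using System.Collections.Generic;
using TorchSharp;
using static TorchSharp.torch;

public static class DynamicLoadingSketch
{
    // Keep every block in CPU memory and stream the blocks through the GPU one at a time.
    public static Tensor Forward(IReadOnlyList<nn.Module<Tensor, Tensor>> blocks, Tensor input)
    {
        var hidden = input.cuda();
        foreach (var block in blocks)
        {
            block.cuda();                     // load this block into GPU memory
            hidden = block.forward(hidden);   // run inference on the loaded block
            block.cpu();                      // release GPU memory before loading the next block
        }

        return hidden.cpu();
    }
}
```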
The technique is available in both llama.cpp and [Hugging Face accelerate](https://huggingface.co/blog/accelerate-large-models). The GenAI model package should also support this technique.

## Update on 2024/05/30

An experiment on partial loading was done in PR #10. The main takeaways are:

- Partial loading can achieve speed-ups from 1.03X to over 30X even without fully loading the model to the GPU.
- The main bottleneck is still the memory traffic between CPU and GPU.
- Larger blocks should have higher priority when deciding which blocks to pin to GPU memory.

The results can be found in [this report](DynamicLoadingReport.md).
@@ -0,0 +1,64 @@

## Conclusion

- The main bottleneck of auto inference (dynamic loading) is the overhead of CPU-GPU data transfer.
- The larger the layer size, the more acceleration we can get from the GPU, so we should try to put larger layers on the GPU.

## Hardware: i9-14900K, 64GB memory, RTX 4090
### Sequential Layer

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|---------------|------------|------------|----------------------|----------------------|------------------------|--------------|--------------------|
| CPU    | 512 | 4MB | 2GB | -   | -   | 939.8 | 1.0  | 0%    |
| Auto   | 512 | 4MB | 2GB | 0   | 512 | 490   | 1.9  | 0%    |
| Auto   | 512 | 4MB | 2GB | 253 | 259 | 272   | 3.5  | 49.4% |
| Auto   | 512 | 4MB | 2GB | 512 | 0   | 32    | 29.4 | 100%  |
| GPU    | 512 | 4MB | 2GB | -   | -   | 32.4  | 29.0 | 100%  |

### Sequential Layer, Deeper Model

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|---------------|------------|------------|----------------------|----------------------|------------------------|--------------|--------------------|
| CPU    | 1024 | 4MB | 4GB | -    | -    | 1839.8 | 1.0  | 0%    |
| Auto   | 1024 | 4MB | 4GB | 0    | 1024 | 954    | 1.9  | 0%    |
| Auto   | 1024 | 4MB | 4GB | 252  | 772  | 787    | 2.3  | 24.6% |
| Auto   | 1024 | 4MB | 4GB | 508  | 516  | 530    | 3.5  | 49.6% |
| Auto   | 1024 | 4MB | 4GB | 764  | 260  | 312.5  | 5.9  | 74.6% |
| Auto   | 1024 | 4MB | 4GB | 1020 | 4    | 69.7   | 26.9 | 99.6% |
| GPU    | 1024 | 4MB | 4GB | -    | -    | 65.9   | 27.9 | 100%  |

### Sequential Layer, Larger Layer (16MB)

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|---------------|------------|------------|----------------------|----------------------|------------------------|--------------|--------------------|
| CPU    | 256 | 16MB | 4GB | -   | -   | 864   | 1.0  | 0%    |
| Auto   | 256 | 16MB | 4GB | 0   | 256 | 844.7 | 1.02 | 0%    |
| Auto   | 256 | 16MB | 4GB | 60  | 196 | 669.9 | 1.3  | 23.4% |
| Auto   | 256 | 16MB | 4GB | 124 | 132 | 494.2 | 1.7  | 48.4% |
| Auto   | 256 | 16MB | 4GB | 188 | 68  | 372.7 | 2.3  | 73.4% |
| Auto   | 256 | 16MB | 4GB | 252 | 4   | 152.5 | 5.7  | 98.4% |
| GPU    | 256 | 16MB | 4GB | -   | -   | 119   | 7.3  | 100%  |

### Sequential Layer, Even Larger Layer (64MB)

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|---------------|------------|------------|----------------------|----------------------|------------------------|--------------|--------------------|
| CPU    | 64 | 64MB | 4GB | -  | -  | 8501  | 1.0   | 0%    |
| Auto   | 64 | 64MB | 4GB | 0  | 64 | 898   | 9.5   | 0%    |
| Auto   | 64 | 64MB | 4GB | 12 | 52 | 755.2 | 11.3  | 18.8% |
| Auto   | 64 | 64MB | 4GB | 28 | 36 | 598   | 14.2  | 43.8% |
| Auto   | 64 | 64MB | 4GB | 44 | 20 | 419.7 | 20.2  | 68.8% |
| Auto   | 64 | 64MB | 4GB | 60 | 4  | 263.7 | 32.3  | 93.8% |
| Auto   | 64 | 64MB | 4GB | 64 | 0  | 70.54 | 121   | 100%  |
| GPU    | 64 | 64MB | 4GB | -  | -  | 69.8  | 121.7 | 100%  |

## Hardware: Xeon W-2133, 32GB memory, GTX 1066

| Device | Num of Layers | Layer Size | Model Size | Num of Layers on GPU | Num of Layers on CPU | Average Inference (ms) | Acceleration | % of Layers on GPU |
|--------|---------------|------------|------------|----------------------|----------------------|------------------------|--------------|--------------------|
| CPU    | 64 | 64MB | 4GB | -  | -  | 17419  | 1.0   | 0%    |
| Auto   | 64 | 64MB | 4GB | 0  | 64 | 3783.4 | 4.6   | 0%    |
| Auto   | 64 | 64MB | 4GB | 12 | 52 | 3415   | 5.1   | 18.8% |
| Auto   | 64 | 64MB | 4GB | 28 | 36 | 3004   | 5.79  | 43.8% |
| Auto   | 64 | 64MB | 4GB | 44 | 20 | 2536   | 6.86  | 68.8% |
| Auto   | 64 | 64MB | 4GB | 60 | 4  | 2101   | 8.29  | 93.8% |
| Auto   | 64 | 64MB | 4GB | 64 | 0  | 1163   | 14.97 | 100%  |
| GPU    | 64 | 64MB | 4GB | -  | -  | 1213   | 14.3  | 100%  |
@@ -0,0 +1,7 @@

The GenAI project will be a collection of popular open source AI models. It will be organized in the following structure:

- Microsoft.ML.GenAI.Core: the core library for the GenAI project; it contains the fundamental contracts and classes such as `CausalLanguageModel` and `CausalLMPipeline`.
- Microsoft.ML.GenAI.{ModelName}: the implementation of a specific model, which includes the model configuration, the causal LM model implementation (like `Phi3ForCausalLM`), and the tokenizer implementation, if any. In the first stage, we plan to provide the following models:
  - Microsoft.ML.GenAI.Phi: the implementation of the Phi series of models
  - Microsoft.ML.GenAI.LLaMA: the implementation of the LLaMA series of models
  - Microsoft.ML.GenAI.StableDiffusion: the implementation of the Stable Diffusion model
@@ -0,0 +1,16 @@

This folder contains the design docs for the GenAI model package.

### Basic
- [Package Structure](./Package%20Structure.md): the structure of the GenAI model package
- [Usage](./Usage.md): how to use a model from the GenAI model package
- [Benchmark && Evaluation](./Benchmark%20&&%20Evaluation.md): how to evaluate a model from the GenAI model package

### Contracts && API
- [CausalLMPipeline](./CausalLMPipeline.md)
- [CausalLMModelInput and CausalLMModelOutput](./CausalLanguageModel.md)
- [Tokenizer](./Tokenizer.md)

### Need further investigation
- [Dynamic loading](./DynamicLoading.md): load only part of the model to the GPU when GPU memory is limited. We explore the results with and without dynamic loading in [this report](../DynamicLoadingReport.md).
- Improve loading speed: I notice that the model loading speed from disk to memory is slower in TorchSharp than it is in Hugging Face. We need to investigate the reason and improve the loading speed.
- Quantization: quantize the model to reduce the model size and improve the inference speed.
@@ -0,0 +1,6 @@

# What is a tokenizer?

A tokenizer is a class that splits a string into tokens and encodes them into numerical (integer) values.

# The Tokenizer Contract

We can simply use the tokenizers from the `Microsoft.ML.Tokenizers` package.
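For reference, a small usage example is shown below. The exact type and method names (`LlamaTokenizer.Create`, `EncodeToIds`, `Decode`) reflect my understanding of the current Microsoft.ML.Tokenizers surface and may differ between package versions; the tokenizer model path is illustrative.

```C#
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// Load a LLaMA/SentencePiece-style tokenizer model from disk.
using Stream modelStream = File.OpenRead("tokenizer.model");
Tokenizer tokenizer = LlamaTokenizer.Create(modelStream);

// Encode a string into token ids, then decode the ids back into text.
IReadOnlyList<int> ids = tokenizer.EncodeToIds("Once upon a time");
string text = tokenizer.Decode(ids);
```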
Review discussion on the `Tensor` type used in these contracts:

> Is this the TorchSharp tensor type? Can / should it become the System.Numerics.Tensors one?
>
> Yes. If System.Numerics.Tensors can work with libtorch, then maybe yes.
>
> We're some ways from having Tensor integrated with TorchSharp; it is still experimental and subject to change.
>
> And it would be the TorchSharp shim implementing ITensor, not Tensor from SNT.
>
> Presumably it would be `ITensor<T>` such that a `Tensor<T>` could be passed in? The implementation could type test if it were a TorchSharp implementation and act accordingly, presumably.
>
> Yes, in the fullness of time.