Add token usage tracking #872

Open · wants to merge 3 commits into main

Conversation

marcominerva (Contributor)

Motivation and Context (Why the change? What's the scenario?)

This PR adds a new TokenUsage property to MemoryAnswer to hold information about token usage.

High level description (Approach, Design)

Token usage is calculated in the SearchClient.AskAsync method using the configured tokenizer.

- MemoryAnswer.cs: Imported TokenUsage class and added TokenUsage property.
- SearchClient.cs: Refactored GenerateAnswer method, added prompt creation, token count logic, and updated RenderFactTemplate call.
- TokenUsage.cs: Created TokenUsage class to track token counts (a rough sketch follows below).
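
A rough sketch of the shape this could take, based only on the description above (type and property names are assumptions, not the actual diff):

// Hypothetical sketch based on this PR description; the real classes in the diff may differ.
public class TokenUsage
{
    // Tokens in the rendered RAG prompt, counted with the configured tokenizer
    public int InputTokenCount { get; set; }

    // Tokens in the generated answer, counted with the configured tokenizer
    public int OutputTokenCount { get; set; }
}

public class MemoryAnswer
{
    // ...existing properties omitted...

    // Token usage measured while answering the question
    public TokenUsage? TokenUsage { get; set; }
}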
dluc (Collaborator) commented Oct 30, 2024

I think we should use actual usage reports from the AI services when available, to avoid the risk of returning incorrect information, particularly when token count is important for billing.

I would also design the report to support a list, because multiple AI calls are involved, potentially to different models and different services.

For models where the report data is unavailable, we can fall back to tokenizers, using different keys such as estimated_input_tokens.

Something like:

{
  "requests": [
    {
      "id": "A0E4C1D0-0D1A-4D3B-8D3D-3D0D1A0E4C1D",
      "date": "2024-10-30T12:00:00Z",
      "service": "Azure OpenAI",
      "model": "GPT-4o-mini",
      "usage": {
        "input_tokens": 123,
        "output_tokens": 456,
        "output_reasoning_tokens": 50
      }
    },
    {
      "id": "C1D0B1E4-4D3B-0D1A-8D3D-E4C1D3D0D1A0",
      "date": "2024-10-30T12:00:00Z",
      "service": "Azure OpenAI",
      "model": "text-embedding-ada-002",
      "usage": {
        "prompt_tokens": 123,
        "total_tokens": 123
      }
    },
    {
      "date": "2024-10-30T12:00:00Z",
      "service": "LlamaSharp",
      "model": "llama2",
      "usage": {
        "estimated_input_tokens": 123,
        "estimated_output_tokens": 456
      }
    }
  ]
}
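
For reference, one possible C# shape for a report like this (an illustrative sketch only; class and property names are placeholders, and the per-service usage keys are kept as a flexible map):

using System;
using System.Collections.Generic;
using System.Text.Json.Serialization;

// Hypothetical model for the JSON sample above; names are illustrative only.
public class TokenUsageReport
{
    [JsonPropertyName("requests")]
    public List<TokenUsageRequest> Requests { get; set; } = new();
}

public class TokenUsageRequest
{
    [JsonPropertyName("id")]
    public string? Id { get; set; }

    [JsonPropertyName("date")]
    public DateTimeOffset Date { get; set; }

    [JsonPropertyName("service")]
    public string Service { get; set; } = string.Empty;   // e.g. "Azure OpenAI", "LlamaSharp"

    [JsonPropertyName("model")]
    public string Model { get; set; } = string.Empty;     // e.g. "GPT-4o-mini", "llama2"

    // Keys differ per service: "input_tokens", "prompt_tokens", "estimated_input_tokens", ...
    [JsonPropertyName("usage")]
    public Dictionary<string, int> Usage { get; set; } = new();
}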

marcominerva (Contributor, Author)

I think we should use actual usage reports from the AI services when available

In fact, I initially started with that kind of implementation in my branch https://github.com/marcominerva/kernel-memory/tree/token_usage (see in particular https://github.com/marcominerva/kernel-memory/blob/5a3a77f62a2a22d88fe85ad5efb8426731e9d4a5/extensions/AzureOpenAI/AzureOpenAITextGenerator.cs#L148-L150), but then I found some blocking issues like microsoft/semantic-kernel#9420.

So, I thought that we can start with the "manual" approach, and then progressively update it with the actual usage reports from the different services.

Token usage is one of the features my customers request most, and having a value such as the one proposed in this PR, even if not 100% accurate, is much better than not having a value at all 😄. Moreover, at the moment they use this exact approach when they need to get an idea of token usage, with the problem that the actual prompt is "hidden" and the only place where its size is shown is in the logger:

if (this._log.IsEnabled(LogLevel.Debug))
{
    this._log.LogDebug("Running RAG prompt, size: {0} tokens, requesting max {1} tokens",
        this._textGenerator.CountTokens(prompt),
        this._config.AnswerTokens);
    this._log.LogSensitive("Prompt: {0}", prompt);
}

dluc (Collaborator) commented Oct 31, 2024


I would use the service data when available and include the tokenizer optionally. Using the tokenizer is a performance concern, so it would be nice if we could turn it off.

Something like this:

PR 1

  • applies only to OpenAI generators, text and embeddings
  • define a common class for metrics, avoiding dictionaries (a rough sketch follows after this list)
  • add token metrics to Search and Ask responses, organized as a list of calls
  • each call includes
    • timestamp
    • name of the service e.g. "Azure OpenAI", "Ollama", etc.
    • model type e.g. "TextEmbedding", "Text", etc.
    • name of the model/deployment used
    • "tokens in metric" reported by the service - NULL if not available (key: "service_tokens_in")
    • "tokens out metric" reported by the service - NULL if not available (key: "service_tokens_out")
    • other token metrics reported by the service if available, e.g. reasoning tokens (key: "service_reasoning_tokens")
    • "token in" measured by the tokenizer (key: "tokenizer_tokens_in")
    • "token out" measured by the tokenizer (key: "tokenizer_tokens_out")

PR 2

  • apply to all AI generators: Azure OpenAI, Ollama, LLamaSharp, Anthropic
  • add global option "enable token metrics"
  • add global option "include token metrics measured by tokenizer" - enabled by default

PR 3

  • add the report also to the pipeline object, to track ingestion cost. Support updates; don't lose previous metrics.
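
A rough sketch of the common metrics class outlined in PR 1 (names and types are placeholders, not an agreed design; null means the service did not report that value):

using System;

// Illustrative sketch only; the actual class would be defined in the PR.
public class TokenUsageRecord
{
    public DateTimeOffset Timestamp { get; set; }

    public string ServiceName { get; set; } = string.Empty;   // e.g. "Azure OpenAI", "Ollama"
    public string ModelType { get; set; } = string.Empty;     // e.g. "TextEmbedding", "Text"
    public string ModelName { get; set; } = string.Empty;     // name of the model/deployment used

    // Values reported by the AI service, null when not available
    public int? ServiceTokensIn { get; set; }
    public int? ServiceTokensOut { get; set; }
    public int? ServiceReasoningTokens { get; set; }

    // Values measured locally by the tokenizer (optional, can be switched off)
    public int? TokenizerTokensIn { get; set; }
    public int? TokenizerTokensOut { get; set; }
}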

marcominerva (Contributor, Author)

So do you mean something like this: https://github.com/marcominerva/kernel-memory/tree/token_usage? Check in particular the following implementations:

If so, we can close this PR and open a new one pointing to the token_usage branch (of course, there is still some work to do).
