Add token usage tracking #872

Open · wants to merge 3 commits into main

Conversation

marcominerva (Contributor)

Motivation and Context (Why the change? What's the scenario?)

This PR adds a new TokenUsage property to MemoryAnswer to hold information about token usage.

High level description (Approach, Design)

Token usage is calculated in the SearchClient.AskAsync method using the configured tokenizer.

- MemoryAnswer.cs: Imported TokenUsage class and added TokenUsage property.
- SearchClient.cs: Refactored GenerateAnswer method, added prompt creation, token count logic, and updated RenderFactTemplate call.
- TokenUsage.cs: Created TokenUsage class to track token counts (a rough sketch follows below).
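
A rough sketch of the shape this could take, based only on the description above (type and property names are assumptions, not the actual diff):

// Hypothetical sketch based on this PR description; the real classes in the diff may differ.
public class TokenUsage
{
    // Tokens in the rendered RAG prompt, counted with the configured tokenizer
    public int InputTokenCount { get; set; }

    // Tokens in the generated answer, counted with the configured tokenizer
    public int OutputTokenCount { get; set; }
}

public class MemoryAnswer
{
    // ...existing properties omitted...

    // Token usage measured while answering the question
    public TokenUsage? TokenUsage { get; set; }
}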
dluc (Collaborator) commented Oct 30, 2024

I think we should use actual usage reports from the AI services when available, to avoid the risk of returning incorrect information, particularly when token count is important for billing.

I would also design the report to support a list, because multiple AI calls are involved, potentially to different models and different services.

For models where the report data is unavailable, we can fall back to tokenizers, using different keys such as estimated_input_tokens.

Something like:

{
  "requests": [
    {
      "id": "A0E4C1D0-0D1A-4D3B-8D3D-3D0D1A0E4C1D",
      "date": "2024-10-30T12:00:00Z",
      "service": "Azure OpenAI",
      "model": "GPT-4o-mini",
      "usage": {
        "input_tokens": 123,
        "output_tokens": 456,
        "output_reasoning_tokens": 50
      }
    },
    {
      "id": "C1D0B1E4-4D3B-0D1A-8D3D-E4C1D3D0D1A0",
      "date": "2024-10-30T12:00:00Z",
      "service": "Azure OpenAI",
      "model": "text-embedding-ada-002",
      "usage": {
        "prompt_tokens": 123,
        "total_tokens": 123
      }
    },
    {
      "date": "2024-10-30T12:00:00Z",
      "service": "LlamaSharp",
      "model": "llama2",
      "usage": {
        "estimated_input_tokens": 123,
        "estimated_output_tokens": 456
      }
    }
  ]
}
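
For reference, one possible C# shape for a report like this (an illustrative sketch only; class and property names are placeholders, and the per-service usage keys are kept as a flexible map):

using System;
using System.Collections.Generic;
using System.Text.Json.Serialization;

// Hypothetical model for the JSON sample above; names are illustrative only.
public class TokenUsageReport
{
    [JsonPropertyName("requests")]
    public List<TokenUsageRequest> Requests { get; set; } = new();
}

public class TokenUsageRequest
{
    [JsonPropertyName("id")]
    public string? Id { get; set; }

    [JsonPropertyName("date")]
    public DateTimeOffset Date { get; set; }

    [JsonPropertyName("service")]
    public string Service { get; set; } = string.Empty;   // e.g. "Azure OpenAI", "LlamaSharp"

    [JsonPropertyName("model")]
    public string Model { get; set; } = string.Empty;     // e.g. "GPT-4o-mini", "llama2"

    // Keys differ per service: "input_tokens", "prompt_tokens", "estimated_input_tokens", ...
    [JsonPropertyName("usage")]
    public Dictionary<string, int> Usage { get; set; } = new();
}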

marcominerva (Contributor, Author)

I think we should use actual usage reports from the AI services when available

In fact, I initially started with that kind of implementation in my branch https://github.com/marcominerva/kernel-memory/tree/token_usage (see in particular https://github.com/marcominerva/kernel-memory/blob/5a3a77f62a2a22d88fe85ad5efb8426731e9d4a5/extensions/AzureOpenAI/AzureOpenAITextGenerator.cs#L148-L150), but then I found some blocking issues like microsoft/semantic-kernel#9420.

So, I thought that we can start with the "manual" approach, and then progressively update it with the actual usage reports from the different services.

Token usage is one of the features my customers request most, and having a value such as the one proposed in this PR, even if not 100% accurate, is much better than not having a value at all 😄. Moreover, at the moment they use this exact approach when they need to get an idea of token usage, with the problem that the actual prompt is "hidden" and the only place where its size is shown is in the logger:

if (this._log.IsEnabled(LogLevel.Debug))
{
    this._log.LogDebug("Running RAG prompt, size: {0} tokens, requesting max {1} tokens",
        this._textGenerator.CountTokens(prompt),
        this._config.AnswerTokens);
    this._log.LogSensitive("Prompt: {0}", prompt);
}

dluc (Collaborator) commented Oct 31, 2024


I would use the service data when available and include the tokenizer optionally. Using the tokenizer is a performance concern, so it would be nice if we could turn it off.

Something like this:

PR 1

  • applies only to OpenAI generators, text and embeddings
  • define a common class for metrics, avoiding dictionaries (a rough sketch follows after this list)
  • add token metrics to Search and Ask responses, organized as a list of calls
  • each call includes
    • timestamp
    • name of the service e.g. "Azure OpenAI", "Ollama", etc.
    • model type e.g. "TextEmbedding", "Text", etc.
    • name of the model/deployment used
    • "tokens in metric" reported by the service - NULL if not available (key: "service_tokens_in")
    • "tokens out metric" reported by the service - NULL if not available (key: "service_tokens_out")
    • other token metrics reported by the service if available, e.g. reasoning tokens (key: "service_reasoning_tokens")
    • "token in" measured by the tokenizer (key: "tokenizer_tokens_in")
    • "token out" measured by the tokenizer (key: "tokenizer_tokens_out")

PR 2

  • apply to all AI generators: Azure OpenAI, Ollama, LLamaSharp, Anthropic
  • add global option "enable token metrics"
  • add global option "include token metrics measured by tokenizer" - enabled by default

PR 3

  • add the report also to the pipeline object, to track ingestion cost. Support updates; don't lose previous metrics.
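
A rough sketch of the common metrics class outlined in PR 1 (names and types are placeholders, not an agreed design; null means the service did not report that value):

using System;

// Illustrative sketch only; the actual class would be defined in the PR.
public class TokenUsageRecord
{
    public DateTimeOffset Timestamp { get; set; }

    public string ServiceName { get; set; } = string.Empty;   // e.g. "Azure OpenAI", "Ollama"
    public string ModelType { get; set; } = string.Empty;     // e.g. "TextEmbedding", "Text"
    public string ModelName { get; set; } = string.Empty;     // name of the model/deployment used

    // Values reported by the AI service, null when not available
    public int? ServiceTokensIn { get; set; }
    public int? ServiceTokensOut { get; set; }
    public int? ServiceReasoningTokens { get; set; }

    // Values measured locally by the tokenizer (optional, can be switched off)
    public int? TokenizerTokensIn { get; set; }
    public int? TokenizerTokensOut { get; set; }
}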

marcominerva (Contributor, Author)

So do you mean something like this: https://github.com/marcominerva/kernel-memory/tree/token_usage? Check in particular the following implementations:

If so, we can close this PR and open a new one pointing to the token_usage branch (of course, there is still some work to do).
