OpenTelemetry Metrics: Adds support to collect Operation level metrics (#4682)

## Description

1. Added a new flag, `IsClientMetricsEnabled`, to `CosmosClientTelemetryOptions`
to enable or disable metrics. It is disabled by default. (Inspired by the Java SDK:
https://github.com/Azure/azure-sdk-for-java/blob/5bc07ca75c7c0520c1098b5a6264258b6e043435/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosClientTelemetryConfig.java#L61)
2. If enabled, the SDK collects the operation-level metrics defined in this PR; ref.
open-telemetry/semantic-conventions#1438

ref. Java Metric Documentation,
https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/docs/Metrics.md
ref. Discussion with open telemetry community
open-telemetry/semantic-conventions#1438

This PR has **Contract Changes**.
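
A minimal sketch of how a consumer might observe these metrics once `IsClientMetricsEnabled` is on, using the BCL `MeterListener` instead of a full exporter. The meter name and metric name are the constants introduced in this PR. The locally created `Meter` is only a stand-in for the SDK so the snippet runs without a Cosmos DB account, and the `db.operation.name` tag key is an illustrative assumption, not something this PR confirms.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Stand-in for the SDK's meter: same meter name and metric name as in this PR,
// created locally so the sketch runs without a Cosmos DB account.
const string meterName = "Azure.Cosmos.Client.Operation";
using Meter standInMeter = new Meter(meterName, "1.0.0");
Histogram<double> duration = standInMeter.CreateHistogram<double>(
    "db.client.operation.duration",
    unit: "s",
    description: "Total end-to-end duration of the operation");

List<string> observed = new List<string>();

using MeterListener listener = new MeterListener();

// Subscribe only to instruments published by the operation meter.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == meterName)
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Called synchronously for every recorded double measurement.
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    observed.Add($"{instrument.Name}={value}");
});
listener.Start();

// Simulated measurement; the tag key is an assumption made for illustration.
duration.Record(0.042, new KeyValuePair<string, object?>("db.operation.name", "read_item"));

Console.WriteLine(string.Join(Environment.NewLine, observed));
```

In a real application the stand-in `Meter` would be dropped, and an exporter such as the OpenTelemetry SDK would typically replace the hand-written listener.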

## Perf Testing
I haven't observed any performance impact from this change. For this
feature, any performance issues would likely stem from the exporter
implementation or the aggregation interval. The tests were conducted
using a no-op exporter subscribed to these meters to isolate any
performance impact specifically related to data recording.
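
The measurement idea described above can be sketched as follows. This is not the benchmark used for the PR: it is a rough, assumption-laden illustration that subscribes a no-op `MeterListener` to a stand-in histogram and times `Record` calls with a `Stopwatch`. The iteration count and the stand-in meter are choices made for this sketch.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Stand-in meter and histogram; in the SDK these would be owned by CosmosDbOperationMeter.
using Meter meter = new Meter("Azure.Cosmos.Client.Operation", "1.0.0");
Histogram<double> requestCharge = meter.CreateHistogram<double>(
    "db.client.cosmosdb.operation.request_charge", unit: "# RU");

// No-op subscriber: enables the instrument so Record() does real work,
// but discards every measurement.
using MeterListener noOpListener = new MeterListener();
noOpListener.InstrumentPublished = (instrument, l) =>
{
    if (ReferenceEquals(instrument.Meter, meter))
    {
        l.EnableMeasurementEvents(instrument);
    }
};
noOpListener.SetMeasurementEventCallback<double>((instrument, value, tags, state) => { });
noOpListener.Start();

// Time only the recording path. A single Stopwatch loop is a coarse measurement.
const int iterations = 1_000_000;
Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
    requestCharge.Record(i % 100);
}
stopwatch.Stop();

double nanosecondsPerRecord = stopwatch.Elapsed.TotalMilliseconds * 1_000_000 / iterations;
Console.WriteLine($"~{nanosecondsPerRecord:F1} ns per Record call with a no-op listener");
```

For numbers worth reporting, a benchmarking harness with warm-up and multiple runs would be a better fit than this loop.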


![image](https://github.com/user-attachments/assets/1a0ab16a-7b1b-44fb-a0ff-2eacd87d2d93)

## Type of change
- [] New feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Debdatta Kunda <[email protected]>
3 people authored Nov 24, 2024
1 parent d7169f4 commit 9fa85ee
Showing 29 changed files with 1,377 additions and 164 deletions.
2 changes: 2 additions & 0 deletions Microsoft.Azure.Cosmos/src/CosmosClient.cs
@@ -1438,6 +1438,8 @@ private int DecrementNumberOfActiveClients()
// In case dispose is called multiple times. Check if at least 1 active client is there
if (NumberOfActiveClients > 0)
{
CosmosDbOperationMeter.RemoveInstanceCount(this.Endpoint);

return Interlocked.Decrement(ref NumberOfActiveClients);
}

8 changes: 8 additions & 0 deletions Microsoft.Azure.Cosmos/src/CosmosClientTelemetryOptions.cs
@@ -53,5 +53,13 @@ public class CosmosClientTelemetryOptions
/// but has to beware that customer data may be shown when the latter option is chosen. It's the user's responsibility to sanitize the queries if necessary.
/// </summary>
public QueryTextMode QueryTextMode { get; set; } = QueryTextMode.None;

/// <summary>
/// Indicates whether client-side metrics collection is enabled or disabled.
/// When set to true, the application will capture and report client metrics such as request counts, latencies, errors, and other key performance indicators.
/// If false, no metrics related to the client will be gathered or reported.
/// </summary>
/// <remarks>Metrics data can be published to a monitoring system like Prometheus or Azure Monitor, depending on the configured metrics provider.</remarks>
public bool IsClientMetricsEnabled { get; set; }
}
}
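
A short configuration sketch of opting in to the new flag. It assumes a build of `Microsoft.Azure.Cosmos` that exposes `IsClientMetricsEnabled` publicly as shown in this diff, and it reads the connection string from a hypothetical `COSMOS_CONNECTION_STRING` environment variable chosen for this example.

```csharp
using System;
using Microsoft.Azure.Cosmos;

// Opt in to client metrics; the flag defaults to false (disabled).
CosmosClientOptions clientOptions = new CosmosClientOptions
{
    CosmosClientTelemetryOptions = new CosmosClientTelemetryOptions
    {
        IsClientMetricsEnabled = true
    }
};

// Hypothetical environment variable name, used only for this sketch.
string connectionString = Environment.GetEnvironmentVariable("COSMOS_CONNECTION_STRING")
    ?? throw new InvalidOperationException("Set COSMOS_CONNECTION_STRING before running.");

using CosmosClient client = new CosmosClient(connectionString, clientOptions);
Console.WriteLine($"Client metrics enabled for {client.Endpoint}");
```

Enabling the flag only makes the SDK record measurements. Something still has to subscribe to the `Azure.Cosmos.Client.Operation` meter for the data to go anywhere.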
7 changes: 7 additions & 0 deletions Microsoft.Azure.Cosmos/src/DocumentClient.cs
@@ -956,6 +956,13 @@ internal virtual void Initialize(Uri serviceEndpoint,
// Loading VM Information (non blocking call and initialization won't fail if this call fails)
VmMetadataApiHandler.TryInitialize(this.httpClient);

if (this.cosmosClientTelemetryOptions.IsClientMetricsEnabled)
{
CosmosDbOperationMeter.Initialize();

CosmosDbOperationMeter.AddInstanceCount(this.ServiceEndpoint);
}

// Starting ClientTelemetry Job
this.telemetryToServiceHelper = TelemetryToServiceHelper.CreateAndInitializeClientConfigAndTelemetryJob(this.clientId,
this.ConnectionPolicy,
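
The `AddInstanceCount` / `RemoveInstanceCount` pair in the two hunks above could plausibly be backed by an up-down counter. The body of `CosmosDbOperationMeter` is not part of this excerpt, so the following is a sketch under that assumption; the helper bodies and the `server.address` tag key are hypothetical. `UpDownCounter<T>` needs .NET 7 or `System.Diagnostics.DiagnosticSource` 7.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

using Meter meter = new Meter("Azure.Cosmos.Client.Operation", "1.0.0");
UpDownCounter<int> activeInstances = meter.CreateUpDownCounter<int>(
    "db.client.cosmosdb.active_instance.count",
    unit: "#",
    description: "Number of active SDK client instances.");

// Hypothetical helpers named after the calls in the diff; the tag key is an assumption.
void AddInstanceCount(Uri endpoint) =>
    activeInstances.Add(1, new KeyValuePair<string, object?>("server.address", endpoint.Host));

void RemoveInstanceCount(Uri endpoint) =>
    activeInstances.Add(-1, new KeyValuePair<string, object?>("server.address", endpoint.Host));

// Aggregate the +1 / -1 deltas per host so the running total can be inspected.
Dictionary<string, int> totals = new Dictionary<string, int>();
using MeterListener listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (ReferenceEquals(instrument.Meter, meter))
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<int>((instrument, delta, tags, state) =>
{
    string host = (string)tags[0].Value!;
    totals.TryGetValue(host, out int current);
    totals[host] = current + delta;
});
listener.Start();

Uri endpoint = new Uri("https://example-account.documents.azure.com:443/");
AddInstanceCount(endpoint);    // first client initialized
AddInstanceCount(endpoint);    // second client initialized
RemoveInstanceCount(endpoint); // one client disposed

Console.WriteLine(totals[endpoint.Host]); // prints 1
```

An up-down counter fits here because the value both rises (client initialized) and falls (client disposed), which a monotonic counter cannot express.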
60 changes: 41 additions & 19 deletions Microsoft.Azure.Cosmos/src/Resource/ClientContextCore.cs
@@ -13,6 +13,7 @@ namespace Microsoft.Azure.Cosmos
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using global::Azure;
using Microsoft.Azure.Cosmos.Handlers;
using Microsoft.Azure.Cosmos.Resource.CosmosExceptions;
using Microsoft.Azure.Cosmos.Routing;
@@ -498,22 +499,24 @@ private async Task<TResult> RunWithDiagnosticsHelperAsync<TResult>(
             RequestOptions requestOptions,
             ResourceType? resourceType = null)
         {
+            Func<string> getOperationName = () =>
+            {
+                // If opentelemetry is not enabled then return null operation name, so that no activity is created.
+                if (openTelemetry == null)
+                {
+                    return null;
+                }
+
+                if (resourceType is not null && this.IsBulkOperationSupported(resourceType.Value, operationType))
+                {
+                    return OpenTelemetryConstants.Operations.ExecuteBulkPrefix + openTelemetry.Item1;
+                }
+                return openTelemetry.Item1;
+            };
+
             using (OpenTelemetryCoreRecorder recorder =
                 OpenTelemetryRecorderFactory.CreateRecorder(
-                    getOperationName: () =>
-                    {
-                        // If opentelemetry is not enabled then return null operation name, so that no activity is created.
-                        if (openTelemetry == null)
-                        {
-                            return null;
-                        }
-
-                        if (resourceType is not null && this.IsBulkOperationSupported(resourceType.Value, operationType))
-                        {
-                            return OpenTelemetryConstants.Operations.ExecuteBulkPrefix + openTelemetry.Item1;
-                        }
-                        return openTelemetry.Item1;
-                    },
+                    getOperationName: getOperationName,
                     containerName: containerName,
                     databaseName: databaseName,
                     operationType: operationType,
@@ -525,20 +528,30 @@ private async Task<TResult> RunWithDiagnosticsHelperAsync<TResult>(
                 try
                 {
                     TResult result = await task(trace).ConfigureAwait(false);
-                    if (openTelemetry != null && recorder.IsEnabled)
+                    // Checks if OpenTelemetry is configured for this operation and either Trace or Metrics are enabled by customer
+                    if (openTelemetry != null
+                        && (!this.ClientOptions.CosmosClientTelemetryOptions.DisableDistributedTracing || this.ClientOptions.CosmosClientTelemetryOptions.IsClientMetricsEnabled))
                     {
-                        // Record request response information
+                        // Extracts and records telemetry data from the result of the operation.
                         OpenTelemetryAttributes response = openTelemetry?.Item2(result);
+
+                        // Records the telemetry attributes for Distributed Tracing (if enabled)
                         recorder.Record(response);
-                    }
+
+                        // Records metrics such as request units, latency, and item count for the operation.
+                        CosmosDbOperationMeter.RecordTelemetry(getOperationName: getOperationName,
+                            accountName: this.client.Endpoint,
+                            containerName: containerName,
+                            databaseName: databaseName,
+                            attributes: response);
                     }
                     return result;
                 }
                 catch (OperationCanceledException oe) when (!(oe is CosmosOperationCanceledException))
                 {
                     CosmosOperationCanceledException operationCancelledException = new CosmosOperationCanceledException(oe, trace);
                     recorder.MarkFailed(operationCancelledException);

                     throw operationCancelledException;
                 }
                 catch (ObjectDisposedException objectDisposed) when (!(objectDisposed is CosmosObjectDisposedException))
@@ -563,7 +576,16 @@ private async Task<TResult> RunWithDiagnosticsHelperAsync<TResult>(
                 catch (Exception ex)
                 {
                     recorder.MarkFailed(ex);
-
+                    if (openTelemetry != null && ex is CosmosException cosmosException)
+                    {
+                        // Records telemetry data related to the exception.
+                        CosmosDbOperationMeter.RecordTelemetry(getOperationName: getOperationName,
+                            accountName: this.client.Endpoint,
+                            containerName: containerName,
+                            databaseName: databaseName,
+                            ex: cosmosException);
+                    }
+
                     throw;
                 }
             }
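
The gating condition and the deferred operation-name lookup from the hunks above can be isolated into a small sketch. The boolean parameters stand in for `openTelemetry != null`, `DisableDistributedTracing`, and `IsClientMetricsEnabled`. The `execute_bulk_` prefix is a hypothetical value, since the real `ExecuteBulkPrefix` constant is not shown in this diff.

```csharp
using System;

// Mirrors the condition in the diff: collect response attributes only when the operation
// carries OpenTelemetry info and at least one consumer (tracing or metrics) is active.
static bool ShouldCollectTelemetry(bool hasOpenTelemetry, bool disableDistributedTracing, bool isClientMetricsEnabled)
{
    return hasOpenTelemetry && (!disableDistributedTracing || isClientMetricsEnabled);
}

// Deferred operation-name lookup that tracing and metrics can both reuse.
// The prefix below is hypothetical; the SDK's actual constant is not shown here.
static Func<string?> CreateOperationNameResolver(string? operationName, bool isBulk)
{
    const string hypotheticalBulkPrefix = "execute_bulk_";
    return () => operationName == null
        ? null
        : (isBulk ? hypotheticalBulkPrefix + operationName : operationName);
}

Func<string?> getOperationName = CreateOperationNameResolver("create_item", isBulk: true);

// Tracing disabled, metrics enabled: attributes are still collected.
Console.WriteLine(ShouldCollectTelemetry(true, disableDistributedTracing: true, isClientMetricsEnabled: true));  // prints True
// Tracing disabled, metrics disabled: nothing to collect.
Console.WriteLine(ShouldCollectTelemetry(true, disableDistributedTracing: true, isClientMetricsEnabled: false)); // prints False
Console.WriteLine(getOperationName());
```

Hoisting the lambda into a local, as the diff does, lets the recorder and the meter share one name resolver instead of duplicating the bulk-prefix logic.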
@@ -5,6 +5,7 @@
namespace Microsoft.Azure.Cosmos.Telemetry
{
using System;
using System.Collections.Generic;
using global::Azure.Core;

internal sealed class AppInsightClassicAttributeKeys : IActivityAttributePopulator
@@ -160,5 +161,19 @@ public void PopulateAttributes(DiagnosticScope scope, QueryTextMode? queryTextMo
}
}
}

public KeyValuePair<string, object>[] PopulateOperationMeterDimensions(string operationName, string containerName, string databaseName, Uri accountName, OpenTelemetryAttributes attributes, CosmosException ex)
{
return new KeyValuePair<string, object>[]
{
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.ContainerName, containerName),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.DbName, databaseName),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.ServerAddress, accountName.Host),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.DbOperation, operationName),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.StatusCode, (int)(attributes?.StatusCode ?? ex?.StatusCode)),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.SubStatusCode, attributes?.SubStatusCode ?? ex?.SubStatusCode),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.Region, string.Join(",", attributes.Diagnostics.GetContactedRegions()))
};
}
}
}
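
A sketch of a dimension builder in the spirit of `PopulateOperationMeterDimensions`, written so that every value can come from either the response or the exception. The `?.` on most lines of the diff suggests `attributes` can be null on the failure path, so this variant treats the contacted regions as optional too. The dimension keys and the flattened parameter list are assumptions made for illustration; the real keys live in `AppInsightClassicAttributeKeys` and are not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical keys and a simplified signature: status codes and regions are passed in
// directly instead of being read from OpenTelemetryAttributes or CosmosException.
static KeyValuePair<string, object?>[] BuildDimensions(
    string operationName,
    string containerName,
    string databaseName,
    Uri accountName,
    int? responseStatusCode,
    int? exceptionStatusCode,
    IReadOnlyList<string>? contactedRegions)
{
    return new[]
    {
        new KeyValuePair<string, object?>("container", containerName),
        new KeyValuePair<string, object?>("database", databaseName),
        new KeyValuePair<string, object?>("server.address", accountName.Host),
        new KeyValuePair<string, object?>("operation", operationName),
        // Prefer the response value and fall back to the exception value.
        new KeyValuePair<string, object?>("status_code", responseStatusCode ?? exceptionStatusCode),
        // Regions may be unavailable on the failure path, so guard before joining.
        new KeyValuePair<string, object?>("regions", contactedRegions == null ? null : string.Join(",", contactedRegions))
    };
}

// Failure path: no response attributes, status code taken from the exception.
KeyValuePair<string, object?>[] dims = BuildDimensions(
    "read_item", "orders", "store",
    new Uri("https://example-account.documents.azure.com:443/"),
    responseStatusCode: null,
    exceptionStatusCode: 404,
    contactedRegions: null);

foreach (KeyValuePair<string, object?> dim in dims)
{
    Console.WriteLine($"{dim.Key}: {dim.Value ?? "(none)"}");
}
```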
@@ -0,0 +1,143 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// ------------------------------------------------------------

namespace Microsoft.Azure.Cosmos
{
/// <summary>
/// The CosmosDbClientMetrics class provides constants related to OpenTelemetry metrics for Azure Cosmos DB.
/// These metrics are useful for tracking various aspects of Cosmos DB client operations and are compliant with the OpenTelemetry Semantic Conventions.
/// It defines standardized names, units, descriptions, and histogram buckets for measuring and monitoring performance through OpenTelemetry.
/// </summary>
public sealed class CosmosDbClientMetrics
{
/// <summary>
/// OperationMetrics
/// </summary>
public static class OperationMetrics
{
/// <summary>
/// Name of the operation meter
/// </summary>
public const string MeterName = "Azure.Cosmos.Client.Operation";

/// <summary>
/// Version of the operation meter
/// </summary>
public const string Version = "1.0.0";

/// <summary>
/// Metric Names
/// </summary>
public static class Name
{
/// <summary>
/// Total request units per operation (sum of RUs for all requests needed when processing an operation)
/// </summary>
public const string RequestCharge = "db.client.cosmosdb.operation.request_charge";

/// <summary>
/// Total end-to-end duration of the operation
/// </summary>
public const string Latency = "db.client.operation.duration";

/// <summary>
/// For feed operations (query, readAll, readMany, change feed) and batch operations, this meter captures the actual item count in responses from the service.
/// </summary>
public const string RowCount = "db.client.response.row_count";

/// <summary>
/// Number of active SDK client instances.
/// </summary>
public const string ActiveInstances = "db.client.cosmosdb.active_instance.count";
}

/// <summary>
/// Unit for metrics
/// </summary>
public static class Unit
{
/// <summary>
/// Unit representing a simple count
/// </summary>
public const string Count = "#";

/// <summary>
/// Unit representing time in seconds
/// </summary>
public const string Sec = "s";

/// <summary>
/// Unit representing request units
/// </summary>
public const string RequestUnit = "# RU";

}

/// <summary>
/// Provides descriptions for metrics.
/// </summary>
public static class Description
{
/// <summary>
/// Description for operation duration
/// </summary>
public const string Latency = "Total end-to-end duration of the operation";

/// <summary>
/// Description for total request units per operation
/// </summary>
public const string RequestCharge = "Total request units per operation (sum of RUs for all requested needed when processing an operation)";

/// <summary>
/// Description for the item count metric in responses
/// </summary>
public const string RowCount = "For feed operations (query, readAll, readMany, change feed) batch operations this meter capture the actual item count in responses from the service";

/// <summary>
/// Description for the active SDK client instances metric
/// </summary>
public const string ActiveInstances = "Number of active SDK client instances.";
}
}

/// <summary>
/// Buckets
/// </summary>
public static class HistogramBuckets
{
/// <summary>
/// ExplicitBucketBoundaries for "db.client.cosmosdb.operation.request_charge" Metrics
/// </summary>
/// <remarks>
/// <b>1, 5, 10</b>: Low Usage Levels, These smaller buckets allow for precise tracking of operations that consume a minimal number of Request Units. This is important for lightweight operations, such as basic read requests or small queries, where resource utilization should be optimized. Monitoring these low usage levels can help ensure that the application is not inadvertently using more resources than necessary.<br></br>
/// <b>25, 50</b>: Moderate Usage Levels, These ranges capture more moderate operations, which are typical in many applications. For example, queries that return a reasonable amount of data or perform standard CRUD operations may fall within these limits. Identifying usage patterns in these buckets can help detect efficiency issues in routine operations.<br></br>
/// <b>100, 250</b>: Higher Usage Levels, These boundaries represent operations that may require significant resources, such as complex queries or larger transactions. Monitoring RUs in these ranges can help identify performance bottlenecks or costly queries that might lead to throttling.<br></br>
/// <b>500, 1000</b>: Very High Usage Levels, These buckets capture operations that consume a substantial number of Request Units, which may indicate potentially expensive queries or batch processes. Understanding the frequency and patterns of such high RU usage can be critical in optimizing performance and ensuring the application remains within provisioned throughput limits.
/// </remarks>
public static readonly double[] RequestUnitBuckets = new double[] { 1, 5, 10, 25, 50, 100, 250, 500, 1000 };

/// <summary>
/// ExplicitBucketBoundaries for "db.client.operation.duration" Metrics.
/// </summary>
/// <remarks>
/// <b>0.001, 0.005, 0.010</b> seconds: Higher Precision at Low-Millisecond Levels (1 ms to 10 ms), For high-performance workloads, especially when dealing with microservices or low-latency queries. <br></br>
/// <b>0.050, 0.100, 0.200</b> seconds: Granularity for Standard Web Applications, These values allow detailed tracking for latencies between 50ms and 200ms, which are common in web applications. Fine-grained buckets in this range help in identifying performance issues before they grow critical, while covering the typical response times expected in Cosmos DB.<br></br>
/// <b>0.500, 1.000</b> seconds: Wider Range for Intermediate Latencies, Operations that take longer, in the range of 500ms to 1 second, are still important for performance monitoring. By capturing these values, you maintain awareness of potential bottlenecks or slower requests that may need optimization.<br></br>
/// <b>2.000, 5.000</b> seconds: Capturing Outliers and Slow Queries, It’s important to track outliers that might go beyond 1 second. Having buckets for 2 and 5 seconds enables identification of rare, long-running operations that may require further investigation.
/// </remarks>
public static readonly double[] RequestLatencyBuckets = new double[] { 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 };

/// <summary>
/// ExplicitBucketBoundaries for "db.client.response.row_count" Metrics
/// </summary>
/// <remarks>
/// <b>1, 10, 50, 100</b>: Small Response Sizes, These buckets are useful for capturing scenarios where only a small number of items are returned. Such small queries are common in real-time or interactive applications where performance and quick responses are critical. They also help in tracking the efficiency of operations that should return minimal data, minimizing resource usage.<br></br>
/// <b>250, 500, 1000</b>: Moderate Response Sizes, These values represent typical workloads where moderate amounts of data are returned in each query. This is useful for applications that need to return more information, such as data analytics or reporting systems. Tracking these ranges helps identify whether the system is optimized for these relatively larger data sets and if they lead to any performance degradation.<br></br>
/// <b>2000, 5000</b>: Larger Response Sizes, These boundaries capture scenarios where the query returns large datasets, often used in batch processing or in-depth analytical queries. These larger page sizes can potentially increase memory or CPU usage and may lead to longer query execution times, making it important to track performance in these ranges.<br></br>
/// <b>10000</b>: Very Large Response Sizes (Outliers), This boundary is included to capture rare, very large response sizes. Such queries can put significant strain on system resources, including memory, CPU, and network bandwidth, and can often lead to performance issues such as high latency or even network drops.
/// </remarks>
public static readonly double[] RowCountBuckets = new double[] { 1, 10, 50, 100, 250, 500, 1000, 2000, 5000, 10000 };
}
}
}
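
To make the bucket arrays concrete, here is a sketch of how explicit bucket boundaries partition measurements. It assumes the OpenTelemetry convention that each boundary is an inclusive upper bound, with one extra overflow bucket above the last boundary. The `BucketIndex` helper and the sample latencies are written for this example and are not part of the SDK.

```csharp
using System;

// Boundaries copied from RequestLatencyBuckets in this PR (seconds).
double[] latencyBoundaries = { 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 };

// Returns the bucket a value falls into: bucket i covers (boundaries[i-1], boundaries[i]],
// and index boundaries.Length is the overflow bucket above the last boundary.
static int BucketIndex(double[] boundaries, double value)
{
    int index = 0;
    while (index < boundaries.Length && value > boundaries[index])
    {
        index++;
    }
    return index;
}

// One counter per bucket, plus the overflow bucket.
long[] counts = new long[latencyBoundaries.Length + 1];
foreach (double seconds in new[] { 0.0007, 0.003, 0.120, 0.200, 7.5 })
{
    counts[BucketIndex(latencyBoundaries, seconds)]++;
}

Console.WriteLine(string.Join(",", counts)); // prints 1,1,0,0,0,2,0,0,0,0,1
```

With the OpenTelemetry .NET SDK, arrays like these would normally be supplied as explicit histogram boundaries when configuring the meter provider, so the bucketing itself would not need to be hand-written.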