
Add SDK span telemetry metrics #1631

Open · wants to merge 15 commits into main

Conversation

@JonasKunz commented Nov 29, 2024

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations; it seems that only the Java SDK currently has health metrics implemented.
This PR takes some inspiration from those metrics and is intended to improve upon and therefore supersede them.

I'd like to start with just span-related metrics to keep this PR and its discussions simpler, and then follow up with similar PRs for logs and metrics based on the outcome of the discussions here.

Prior work

This PR can be seen as a follow-up to the closed OTEP 259.

We have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach unifying the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
This PR can therefore be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

  • There have been concerns about using the same metrics for both: how do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an OTel SDK used within the collector, e.g. for tracing the collector itself?
  • Though many concepts in the collector and the SDK share the same name, they are not the same thing (to my knowledge; I'm not a collector expert): for example, processors in the collector are designed to form pipelines that potentially mutate the data as it passes through. In contrast, SDK span processors don't form pipelines (at least none visible to the SDK; those would be hidden custom implementations). Instead, SDK span processors are merely observers with multiple callbacks for the span lifecycle. So sharing a metric would feel like "shoehorning" things together, even though they are not the same concepts.
  • Separating collector and SDK metrics makes their evolution, and reaching agreement on them, a lot easier: with separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined purely in terms of the SDK spec. If we combined both into shared metrics, those would always have to be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
  • I have a hard time finding benefits of sharing metrics between the SDK and the collector: the main one would of course be easier dashboarding and analysis. However, I think having to look at two sets of metrics is a fine tradeoff, considering the difficulties with unification listed above and shown by the history of OTEP 259.

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

  • Gauge queueSize, value is the current size of the queue
    • Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
    • This metric currently causes collisions if two BatchSpanProcessor instances are used
  • Counter processedSpans, value is the number of spans submitted to the Processor
    • Attribute spanProcessorType=BatchSpanProcessor
    • Attribute dropped (boolean): data points with dropped=true count the spans which could not be processed due to a full queue

The SDK also implements essentially the same metrics for the BatchLogRecordProcessor, with span replaced by log everywhere (a rough sketch of the resulting data points is shown below).
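
To make the shape of these existing metrics concrete, here is a purely illustrative sketch of the resulting data points (the values are made up; names and attributes are as described above):

# Illustrative data points for the existing Java SDK batch span processor metrics
queueSize:                                                    # gauge: current number of spans in the queue
  - attributes: {spanProcessorType: BatchSpanProcessor}
    value: 17
processedSpans:                                               # counter: spans submitted to the processor
  - attributes: {spanProcessorType: BatchSpanProcessor, dropped: false}
    value: 120040                                             # spans accepted into the queue
  - attributes: {spanProcessorType: BatchSpanProcessor, dropped: true}
    value: 3                                                  # spans dropped because the queue was full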

Exporter metrics

Exporter metrics are the same for spans, metrics, and logs; they are distinguished by a type attribute.
The metric names also depend on a "name" and a "transport" defined by the exporter. For OTLP those are:

  • exporterName=otlp
  • transport is one of grpc, http (= protobuf) or http-json

The transport is used only in the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport> (e.g. io.opentelemetry.exporters.otlp-grpc for OTLP over gRPC).

Based on that, the following metrics are exposed:

Merge requirement checklist

    instrument: counter
    unit: "1"

  - id: metric.telemetry.sdk.trace.spans.sampled
Author:

I'm not really happy with this metric. I'd rather have a sampled boolean attribute on the metric.telemetry.sdk.trace.spans.ended metric.

However, do we have a way of defining "local" attributes which are not part of the global namespace, such as a simple sampled attribute? I couldn't find any example of this. Or do we have to go through the process of defining a globally unique attribute name for it?
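
For illustration, here is a rough sketch of that alternative, assuming such a boolean attribute could be defined; the attribute id and the wording are hypothetical and not part of this PR:

  - id: metric.telemetry.sdk.trace.spans.ended
    type: metric
    metric_name: telemetry.sdk.trace.spans.ended
    stability: experimental
    brief: "The number of spans ended by the SDK."
    instrument: counter
    unit: "1"
    attributes:
      - ref: telemetry.sdk.span.sampled   # hypothetical boolean attribute: true if the span was sampled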

@JonasKunz marked this pull request as ready for review November 29, 2024 10:40
@JonasKunz requested review from a team as code owners November 29, 2024 10:40
    instrument: counter
    unit: "1"

  - id: metric.telemetry.sdk.trace.processor.queue_size
Member:

Wouldn't the name with span instead of trace be more intuitive here?
metric.telemetry.sdk.span.processor.queue_size

Since in the description you refer to spans and the span processor.

... same for other related metrics

Member:

Also, the name ...queue_size might be misleading here. Maybe (in alignment with the other related comments) something like this instead?

telemetry.sdk.span.processor.spans_queued

Author:

> Also, the name ...queue_size might be misleading here. Maybe (in alignment with the other related comments) something like this instead?

I feel like spans_queued sounds more like a counter of the number of spans which made it into the queue. In contrast, queue_size represents the number of spans in the queue at a given moment in time.
But I don't have a strong opinion here.

@lmolkova (Contributor) commented Dec 3, 2024:

Related #1580

  - id: telemetry.sdk.processor.type
    type:
      members:
        - id: batch_span
@lmolkova (Contributor) commented Dec 3, 2024:

Any reason to have span in the id and value?
Since telemetry.sdk.processor.type with batch and simple (plus any custom value) would work for any processor, while the metric name would contain the signal name.

Author:

I was assuming that attributes should have a clear, unambiguous definition even when considered outside of their current use cases. Using just batching or simple for telemetry.sdk.processor.type would be ambiguous in isolation: it could refer to a log or a span processor.

For example, if we later decide to add a telemetry.sdk.processor.cpu_time metric to quantify the overhead, batch or simple would be ambiguous values for telemetry.sdk.processor.type.

So if I'm wrong here and we can say that this attribute is only ever used in contexts where the type of signal (span, log, metric) is known, I'd propose going even further and combining telemetry.sdk.processor.type and telemetry.sdk.exporter.type into a single telemetry.sdk.component.type definition (a rough sketch is below).
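
For illustration, a rough sketch of what such a combined attribute could look like; the ids and member values are hypothetical and only meant to show the idea:

  - id: telemetry.sdk.component.type
    type:
      members:
        - id: batch_span_processor
          value: "batch_span_processor"
          stability: experimental
        - id: simple_span_processor
          value: "simple_span_processor"
          stability: experimental
        - id: otlp_grpc_span_exporter
          value: "otlp_grpc_span_exporter"
          stability: experimental
    stability: experimental
    brief: >
      A name identifying the type of an OpenTelemetry SDK component (processor or exporter),
      unambiguous even without knowing which signal it handles.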

    type: string
    stability: experimental
    brief: >
      A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance.
@lmolkova (Contributor) commented Dec 3, 2024:

Is this attribute necessary? It does not map to anything standard in the OTel SDK, and we could capture the same level of detail by putting the component name (e.g. the class/type name) in the instrumentation scope name, or by having one attribute for the component name plus telemetry.sdk.exporter|processor.type.

E.g. telemetry.sdk.component.name would contain a fully qualified name of the processor or exporter such as io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter.

Author:

It is required to ensure the uniqueness of individual time series for gauges and updown-counters:

For example telemetry.sdk.span.processor.queue_capacity: the SDK explicitly allows you to set up any number of span processors, e.g. two BatchSpanProcessors, each exporting to a different backend. In that case queue_capacity would break, because there is no attribute distinguishing the two timeseries (one per processor instance). Adding telemetry.sdk.component.id fixes this (see the sketch below).

I agree that in 98% of cases there is just one processor instance per type, but I still want the metrics to remain functional in the other 2%.

And yes, there is currently no concept for explicitly naming components in the SDK, and I don't expect one to be added soon. With the definition here I wanted to be as unprescriptive as possible and leave the door open for adding such a naming mechanism in the future.
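
For illustration, a rough sketch (attribute values are made up) of how the two otherwise-colliding timeseries stay distinct once telemetry.sdk.component.id is present:

# Two BatchSpanProcessor instances, each exporting to a different backend
telemetry.sdk.span.processor.queue_capacity:
  - attributes: {telemetry.sdk.processor.type: batch_span, telemetry.sdk.component.id: batch_span_processor/0}
    value: 2048
  - attributes: {telemetry.sdk.processor.type: batch_span, telemetry.sdk.component.id: batch_span_processor/1}
    value: 512
# Without telemetry.sdk.component.id both data points would carry the identical
# attribute set and collapse into a single, meaningless timeseries.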

    brief: >
      A name identifying the type of the OpenTelemetry SDK processor.
    examples: ["batch-span", "MyCustomProcessor"]
  - id: telemetry.sdk.exporter.type
Contributor:

Do we need to capture the exporter type, assuming we capture the fully qualified component name? https://github.com/open-telemetry/semantic-conventions/pull/1631/files#r1866837831

      A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance.
    note: |
      The SDK MAY allow users to provide an id for the component instances. If no id is provided by the user,
      the SDK SHOULD automatically assign an id. Because this attribute is used in metrics, the SDK MUST ensure a low cardinality in that case.
@lmolkova (Contributor) commented Dec 3, 2024:

> Because this attribute is used in metrics, the SDK MUST ensure a low cardinality in that case.

How would the SDK ensure it?

If we want to capture an index of the processing/exporting pipeline, perhaps we can add an index as a separate attribute and ask SDKs to set it?

Also, how would the SDK deal with cases like

sdk -> custom_composite_processor1 ----> another_processor1 ---> exporter1
                                   ----> processor2 ---> exporter2

It'd only know about custom_composite_processor1; the rest would be opaque.

I think it's worth documenting that components are responsible for capturing their telemetry and disambiguating it across different instances of the same component.

@JonasKunz (Author) commented Dec 3, 2024:

> If we want to capture an index of the processing/exporting pipeline, perhaps we can add an index as a separate attribute and ask SDKs to set it?

I'd say that is a non-goal for this attribute: due to composition/wrapping, it would be impossible for the SDK to externally assign indices to the processors. We just want to avoid the collisions explained in my answer to this comment.

I think my wording here is misleading:

> I think it's worth documenting that components are responsible for capturing their telemetry and disambiguating it across different instances of the same component.

That's what I actually meant: I didn't mean to suggest that the SDK itself is responsible; rather, the component implementations must assign unique, low-cardinality IDs to their instances.

I'll try to reword it.
EDIT: reworded in 2c60a71

      - ref: telemetry.sdk.exporter.type
      - ref: telemetry.sdk.component.id

  - id: metric.telemetry.sdk.span.exporter.spans_failed
Contributor:

We don't usually define separate metrics for failed/successful; we define a single one and use the error.type attribute as a marker that something has failed.

@JonasKunz (Author) commented Dec 3, 2024:

I'll rename the metric to spans_processed to cover both successful and failed processing, distinguishable based on the presence of error.type (see the sketch below).

EDIT: Implemented in 1388eac
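
For illustration, a rough sketch of the resulting data points; the exporter type and error values are made up, and the actual definition lives in the referenced commit:

telemetry.sdk.span.exporter.spans_processed:
  - attributes: {telemetry.sdk.exporter.type: otlp_grpc_span_exporter}                        # successful exports
    value: 150233
  - attributes: {telemetry.sdk.exporter.type: otlp_grpc_span_exporter, error.type: timeout}   # failed exports
    value: 12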


  - id: metric.telemetry.sdk.span.sampled_count
    type: metric
    metric_name: telemetry.sdk.span.sampled_count
Contributor:

Can sampled be a flag on the ended-spans metric? Why is a new metric necessary?

Author:

See this comment; I'd be glad to add a simple attribute instead.

    instrument: updowncounter
    unit: "1"
    attributes:
      - ref: telemetry.sdk.processor.type
Contributor:

Only the batch processor has a queue; perhaps we can define a batch-processor-specific metric instead?

Author:

I was thinking of allowing custom processors to also use this metric, if they make use of queuing. But I don't have a strong opinion here.

    type: metric
    metric_name: telemetry.sdk.span.processor.spans_submitted
    stability: experimental
    brief: "The number of spans submitted for processing to this span processor"
Contributor:

What does submitted mean? Does calling on_start count as a submission?

@JonasKunz (Author) commented Dec 3, 2024:

I was thinking of "submitted" being defined as receiving a call to the first callback where the processor does something relevant with the span. We should definitely specify that this means on_end for the batching and simple span processors (see the sketch below).

EDIT: Attempted fix in 2f491df
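
For illustration, a sketch of how that clarification could be worded in the metric definition; the note text is a paraphrase, not the exact wording of the referenced commit:

  - id: metric.telemetry.sdk.span.processor.spans_submitted
    type: metric
    metric_name: telemetry.sdk.span.processor.spans_submitted
    stability: experimental
    brief: "The number of spans submitted for processing to this span processor"
    instrument: counter
    unit: "1"
    note: |
      A span counts as submitted once the processor receives the first callback in which it acts
      on the span. For the batching and simple span processors this is the on_end callback.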

@@ -0,0 +1,91 @@
groups:
  - id: metric.telemetry.sdk.span.ended_count
Contributor:

What about started and/or in-flight spans? I think it's useful to know at least one of them in addition to the number of ended spans.

Author:

I was thinking about started spans, but considered that less useful because it wouldn't allow computing the in-flight spans:

To my knowledge, the absolute value of a counter is irrelevant and not really queryable in most metric backends: you are only able to compute the increase of a counter between two points in time.
So you can't compute the in-flight spans from increase(started) - increase(ended). The same applies when using DELTA temporality.

However, I had not considered adding an in-flight updown-counter directly. That would also allow computing the started spans via increase(ended) + last_value(inflight). So I'm definitely in favor of adding this metric.

However, I'm not sure whether inflight is the best name here, as to me inflight intuitively sounds like "being sent over the wire" when looking at the full telemetry system from the outside. Maybe something like active or not_ended? I'm happy with another suggestion or even inflight; I just want to ensure that we are truly happy with the name here (a rough sketch of such a metric is below).
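
For illustration, a rough sketch of such an updown-counter, using a placeholder name until the naming question above is settled:

  - id: metric.telemetry.sdk.span.active_count     # placeholder name; could become inflight, active, or not_ended
    type: metric
    metric_name: telemetry.sdk.span.active_count
    stability: experimental
    brief: "The number of spans which have been started but not yet ended."
    instrument: updowncounter
    unit: "1"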
