
Add SDK span telemetry metrics #1631

Open · wants to merge 15 commits into main

Conversation

@JonasKunz commented Nov 29, 2024

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations; it seems that only the Java SDK currently has health metrics implemented.
This PR takes some inspiration from those metrics and is intended to improve upon and therefore supersede them.

I'd like to start with just span-related metrics to keep this PR and its discussions simpler, and then follow up with similar PRs for logs and metrics based on the outcome of the discussions here.

Prior work

This PR can be seen as a follow-up to the closed OTEP 259.

We have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach unifying the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
This PR can therefore be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

  • There have been concerns about using the same metrics for both: how do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an OTel SDK used within the collector, e.g. for tracing the collector itself?
  • Though many concepts in the collector and the SDK share the same name, they are not the same thing (to my knowledge; I'm not a collector expert): for example, processors in the collector are designed to form pipelines that potentially mutate the data as it passes through. In contrast, SDK span processors don't form pipelines (at least none visible to the SDK; those would be hidden custom implementations). Instead, SDK span processors are merely observers with multiple callbacks for the span lifecycle. So sharing a metric would feel like "shoehorning" things together, even though they are not the same concepts.
  • Separating collector and SDK metrics makes their evolution, and reaching agreement on them, a lot easier: with separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined purely in terms of the SDK spec. If we combined both into shared metrics, those would always have to be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
  • I have a hard time finding benefits of sharing metrics between the SDK and the collector: the main one would of course be easier dashboarding and analysis. However, I think having to look at two sets of metrics is a fine tradeoff, considering the difficulties with unification listed above and shown by the history of OTEP 259.

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

  • Gauge queueSize, value is the current size of the queue
    • Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
    • This metric currently causes collisions if two BatchSpanProcessor instances are used
  • Counter processedSpans, value is the number of spans submitted to the Processor
    • Attribute spanProcessorType=BatchSpanProcessor
    • Attribute dropped (boolean): data points with dropped=true count the spans which could not be processed due to a full queue

The SDK also implements essentially the same metrics for the BatchLogRecordProcessor, with span replaced by log everywhere (a rough sketch of the resulting data points is shown below).
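
To make the shape of these existing metrics concrete, here is a purely illustrative sketch of the resulting data points (the values are made up; names and attributes are as described above):

# Illustrative data points for the existing Java SDK batch span processor metrics
queueSize:                                                    # gauge: current number of spans in the queue
  - attributes: {spanProcessorType: BatchSpanProcessor}
    value: 17
processedSpans:                                               # counter: spans submitted to the processor
  - attributes: {spanProcessorType: BatchSpanProcessor, dropped: false}
    value: 120040                                             # spans accepted into the queue
  - attributes: {spanProcessorType: BatchSpanProcessor, dropped: true}
    value: 3                                                  # spans dropped because the queue was full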

Exporter metrics

Exporter metrics are the same for spans, metrics, and logs; they are distinguished by a type attribute.
The metric names also depend on a "name" and a "transport" defined by the exporter. For OTLP those are:

  • exporterName=otlp
  • transport is one of grpc, http (= protobuf) or http-json

The transport is used only in the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport> (e.g. io.opentelemetry.exporters.otlp-grpc for OTLP over gRPC).

Based on that, the following metrics are exposed:

Merge requirement checklist

    instrument: counter
    unit: "1"

  - id: metric.telemetry.sdk.trace.spans.sampled
Author:

I'm not really happy with this metric. I'd rather have a sampled boolean attribute on the metric.telemetry.sdk.trace.spans.ended metric.

However, do we have a way of defining "local" attributes which are not part of the global namespace, such as a simple sampled attribute? I couldn't find any example of this. Or do we have to go through the process of defining a globally unique attribute name for it?
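
For illustration, here is a rough sketch of that alternative, assuming such a boolean attribute could be defined; the attribute id and the wording are hypothetical and not part of this PR:

  - id: metric.telemetry.sdk.trace.spans.ended
    type: metric
    metric_name: telemetry.sdk.trace.spans.ended
    stability: experimental
    brief: "The number of spans ended by the SDK."
    instrument: counter
    unit: "1"
    attributes:
      - ref: telemetry.sdk.span.sampled   # hypothetical boolean attribute: true if the span was sampled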

@JonasKunz marked this pull request as ready for review November 29, 2024 10:40
@JonasKunz requested review from a team as code owners November 29, 2024 10:40
    instrument: counter
    unit: "1"

  - id: metric.telemetry.sdk.trace.processor.queue_size
Member:

Wouldn't the name with span instead of trace be more intuitive here?
metric.telemetry.sdk.span.processor.queue_size

Since in the description you refer to spans and the span processor.

... same for other related metrics

Member:

Also, the name ...queue_size might be misleading here. Maybe (in alignment with the other related comments) something like this instead?

telemetry.sdk.span.processor.spans_queued

Author:

> Also, the name ...queue_size might be misleading here. Maybe (in alignment with the other related comments) something like this instead?

I feel like spans_queued sounds more like a counter of the number of spans which made it into the queue. In contrast, queue_size represents the number of spans in the queue at a given moment in time.
But I don't have a strong opinion here.

@lmolkova (Contributor) commented Dec 3, 2024:

Related #1580

  - id: telemetry.sdk.processor.type
    type:
      members:
        - id: batch_span
@lmolkova (Contributor) commented Dec 3, 2024:

Any reason to have span in the id and value?
Since telemetry.sdk.processor.type with batch and simple (plus any custom value) would work for any processor, while the metric name would contain the signal name.

Author:

I was assuming that attributes should have a clear, unambiguous definition even when considered outside of their current use cases. Using just batching or simple for telemetry.sdk.processor.type would be ambiguous in isolation: it could refer to a log or a span processor.

For example, if we later decide to add a telemetry.sdk.processor.cpu_time metric to quantify the overhead, batch or simple would be ambiguous values for telemetry.sdk.processor.type.

So if I'm wrong here and we can say that this attribute is only ever used in contexts where the type of signal (span, log, metric) is known, I'd propose going even further and combining telemetry.sdk.processor.type and telemetry.sdk.exporter.type into a single telemetry.sdk.component.type definition (a rough sketch is below).
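
For illustration, a rough sketch of what such a combined attribute could look like; the ids and member values are hypothetical and only meant to show the idea:

  - id: telemetry.sdk.component.type
    type:
      members:
        - id: batch_span_processor
          value: "batch_span_processor"
          stability: experimental
        - id: simple_span_processor
          value: "simple_span_processor"
          stability: experimental
        - id: otlp_grpc_span_exporter
          value: "otlp_grpc_span_exporter"
          stability: experimental
    stability: experimental
    brief: >
      A name identifying the type of an OpenTelemetry SDK component (processor or exporter),
      unambiguous even without knowing which signal it handles.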

    type: string
    stability: experimental
    brief: >
      A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance.
@lmolkova (Contributor) commented Dec 3, 2024:

Is this attribute necessary? It does not map to anything standard in the OTel SDK, and we could capture the same level of detail by putting the component name (e.g. the class/type name) in the instrumentation scope name, or by having one attribute for the component name plus telemetry.sdk.exporter|processor.type.

E.g. telemetry.sdk.component.name would contain a fully qualified name of the processor or exporter such as io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter.

Author:

It is required to ensure the uniqueness of individual time series for gauges and updown-counters:

For example telemetry.sdk.span.processor.queue_capacity: the SDK explicitly allows you to set up any number of span processors, e.g. two BatchSpanProcessors, each exporting to a different backend. In that case queue_capacity would break, because there is no attribute distinguishing the two timeseries (one per processor instance). Adding telemetry.sdk.component.id fixes this (see the sketch below).

I agree that in 98% of cases there is just one processor instance per type, but I still want the metrics to remain functional in the other 2%.

And yes, there is currently no concept for explicitly naming components in the SDK, and I don't expect one to be added soon. With the definition here I wanted to be as unprescriptive as possible and leave the door open for adding such a naming mechanism in the future.
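
For illustration, a rough sketch (attribute values are made up) of how the two otherwise-colliding timeseries stay distinct once telemetry.sdk.component.id is present:

# Two BatchSpanProcessor instances, each exporting to a different backend
telemetry.sdk.span.processor.queue_capacity:
  - attributes: {telemetry.sdk.processor.type: batch_span, telemetry.sdk.component.id: batch_span_processor/0}
    value: 2048
  - attributes: {telemetry.sdk.processor.type: batch_span, telemetry.sdk.component.id: batch_span_processor/1}
    value: 512
# Without telemetry.sdk.component.id both data points would carry the identical
# attribute set and collapse into a single, meaningless timeseries.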

    brief: >
      A name identifying the type of the OpenTelemetry SDK processor.
    examples: ["batch-span", "MyCustomProcessor"]
  - id: telemetry.sdk.exporter.type
Contributor:

Do we need to capture the exporter type, assuming we capture the fully qualified component name? https://github.com/open-telemetry/semantic-conventions/pull/1631/files#r1866837831

      A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance.
    note: |
      The SDK MAY allow users to provide an id for the component instances. If no id is provided by the user,
      the SDK SHOULD automatically assign an id. Because this attribute is used in metrics, the SDK MUST ensure a low cardinality in that case.
@lmolkova (Contributor) commented Dec 3, 2024:

> Because this attribute is used in metrics, the SDK MUST ensure a low cardinality in that case.

How would the SDK ensure it?

If we want to capture an index of the processing/exporting pipeline, perhaps we can add an index as a separate attribute and ask SDKs to set it?

Also, how would the SDK deal with cases like

sdk -> custom_composite_processor1 ----> another_processor1 ---> exporter1
                                   ----> processor2 ---> exporter2

It'd only know about custom_composite_processor1; the rest would be opaque.

I think it's worth documenting that components are responsible for capturing their telemetry and disambiguating it across different instances of the same component.

@JonasKunz (Author) commented Dec 3, 2024:

> If we want to capture an index of the processing/exporting pipeline, perhaps we can add an index as a separate attribute and ask SDKs to set it?

I'd say that is a non-goal for this attribute: due to composition/wrapping, it would be impossible for the SDK to externally assign indices to the processors. We just want to avoid the collisions explained in my answer to this comment.

I think my wording here is misleading:

> I think it's worth documenting that components are responsible for capturing their telemetry and disambiguating it across different instances of the same component.

That's what I actually meant: I didn't mean to suggest that the SDK itself is responsible; rather, the component implementations must assign unique, low-cardinality IDs to their instances.

I'll try to reword it.
EDIT: reworded in 2c60a71

      - ref: telemetry.sdk.exporter.type
      - ref: telemetry.sdk.component.id

  - id: metric.telemetry.sdk.span.exporter.spans_failed
Contributor:

We don't usually define separate metrics for failed/successful; we define a single one and use the error.type attribute as a marker that something has failed.

@JonasKunz (Author) commented Dec 3, 2024:

I'll rename the metric to spans_processed to cover both successful and failed processing, distinguishable based on the presence of error.type (see the sketch below).

EDIT: Implemented in 1388eac
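
For illustration, a rough sketch of the resulting data points; the exporter type and error values are made up, and the actual definition lives in the referenced commit:

telemetry.sdk.span.exporter.spans_processed:
  - attributes: {telemetry.sdk.exporter.type: otlp_grpc_span_exporter}                        # successful exports
    value: 150233
  - attributes: {telemetry.sdk.exporter.type: otlp_grpc_span_exporter, error.type: timeout}   # failed exports
    value: 12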


  - id: metric.telemetry.sdk.span.sampled_count
    type: metric
    metric_name: telemetry.sdk.span.sampled_count
Contributor:

Can sampled be a flag on the ended-spans metric? Why is a new metric necessary?

Author:

See this comment; I'd be glad to add a simple attribute instead.

    instrument: updowncounter
    unit: "1"
    attributes:
      - ref: telemetry.sdk.processor.type
Contributor:

Only the batch processor has a queue; perhaps we can define a batch-processor-specific metric instead?

Author:

I was thinking of allowing custom processors to also use this metric, if they make use of queuing. But I don't have a strong opinion here.

    type: metric
    metric_name: telemetry.sdk.span.processor.spans_submitted
    stability: experimental
    brief: "The number of spans submitted for processing to this span processor"
Contributor:

What does submitted mean? Does calling on_start count as a submission?

@JonasKunz (Author) commented Dec 3, 2024:

I was thinking of "submitted" being defined as receiving a call to the first callback where the processor does something relevant with the span. We should definitely specify that this means on_end for the batching and simple span processors (see the sketch below).

EDIT: Attempted fix in 2f491df
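
For illustration, a sketch of how that clarification could be worded in the metric definition; the note text is a paraphrase, not the exact wording of the referenced commit:

  - id: metric.telemetry.sdk.span.processor.spans_submitted
    type: metric
    metric_name: telemetry.sdk.span.processor.spans_submitted
    stability: experimental
    brief: "The number of spans submitted for processing to this span processor"
    instrument: counter
    unit: "1"
    note: |
      A span counts as submitted once the processor receives the first callback in which it acts
      on the span. For the batching and simple span processors this is the on_end callback.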

@@ -0,0 +1,91 @@
groups:
  - id: metric.telemetry.sdk.span.ended_count
Contributor:

What about started and/or in-flight spans? I think it's useful to know at least one of them in addition to the number of ended spans.

Author:

I was thinking about started spans, but considered that less useful because it wouldn't allow computing the in-flight spans:

To my knowledge, the absolute value of a counter is irrelevant and not really queryable in most metric backends: you are only able to compute the increase of a counter between two points in time.
So you can't compute the in-flight spans from increase(started) - increase(ended). The same applies when using DELTA temporality.

However, I had not considered adding an in-flight updown-counter directly. That would also allow computing the started spans via increase(ended) + last_value(inflight). So I'm definitely in favor of adding this metric.

However, I'm not sure whether inflight is the best name here, as to me inflight intuitively sounds like "being sent over the wire" when looking at the full telemetry system from the outside. Maybe something like active or not_ended? I'm happy with another suggestion or even inflight; I just want to ensure that we are truly happy with the name here (a rough sketch of such a metric is below).
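
For illustration, a rough sketch of such an updown-counter, using a placeholder name until the naming question above is settled:

  - id: metric.telemetry.sdk.span.active_count     # placeholder name; could become inflight, active, or not_ended
    type: metric
    metric_name: telemetry.sdk.span.active_count
    stability: experimental
    brief: "The number of spans which have been started but not yet ended."
    instrument: updowncounter
    unit: "1"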
