-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SDK span telemetry metrics #1631
base: main
Are you sure you want to change the base?
Conversation
04f924f
to
8bbea82
Compare
model/telemetry/metrics.yaml
Outdated
instrument: counter | ||
unit: "1" | ||
|
||
- id: metric.telemetry.sdk.trace.spans.sampled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really happy with this metric. I'd rather have a sampled
boolean attribute on the metric.telemetry.sdk.trace.spans.ended
metric.
However, do we have a way of defining "local" attributes which are not part of the global namespace, such as a simple sampled
attribute? I couldn't find any example for this. Or do we have to go through the process of defining a globally unique attribute name for this?
model/telemetry/metrics.yaml
Outdated
instrument: counter | ||
unit: "1" | ||
|
||
- id: metric.telemetry.sdk.trace.processor.queue_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the name with span
instead of trace
be more intuitive here?:
metric.telemetry.sdk.span.processor.queue_size
Since, in the description you refer to spans and the span processor.
... same for other related metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, the name ...queue_size
might be misleading here. May (and in alignment with the other related comments) something like this instead?:
telemetry.sdk.span.processor.spans_queued
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, the name ...queue_size might be misleading here. May (and in alignment with the other related comments) something like this instead?:
I feel like spans_queued
sounds more like a counter of the number of spans which made it into the queue. In contrast, queue_size
represents the number of spans in the queue at a given moment in time.
But not really a strong opinion here.
Related #1580 |
- id: telemetry.sdk.processor.type | ||
type: | ||
members: | ||
- id: batch_span |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to have span
in the id and value?
Since telemetry.sdk.processor.type
with batch
and simple
(plus any custom value) would work for any processor while the metric name would contain the signal name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was assuming that attributes should have a clear, unambiguous definition even when considering them outside of their current use cases: Using just batching
or simple
for telemetry.sdk.processor.type
in isolation would be ambiguous: It could be a log or a span processor.
So for example if we later decide to add a telemetry.sdk.processor.cpu_time
metric to quantify the overhead, batch
or simple
would be ambiguous for telemetry.sdk.processor.type
.
So if I'm wrong here and we can basically say that this attribute needs to be used in contexts where the type of signal (span, log, metric) is known, I'd propose to go even further and combine telemetry.sdk.processor.type
with telemetry.sdk.exporter.type
to a single telemetry.sdk.component.type
definition.
type: string | ||
stability: experimental | ||
brief: > | ||
A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this attribute necessary? It does not map to anything standard in the otel SDK and we can capture the same level of details if we captured the component name (e.g. class/type name) in the instrumentation scope name or had one attribute for component name and telemetry.sdk.exporter|processor.type
.
E.g. telemetry.sdk.component.name
would contain a fully qualified name of the processor or exporter such as io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is required to ensure the uniqueness of individual time series for gauges and updown-counters:
For example telemetry.sdk.span.processor.queue_capacity
: The SDK explicitly allows you to setup any amount of span processors you like: For example, you can set up two BatchSpanProcessor
s, each exporting to a different backend. In that case queue_capacity
would break, because there is no distinguishing attribute for the two timeseries (one for each processor instance). This is fixed by adding the telemetry.sdk.component.id
.
I agree that 98% case is having just one processor instance per type, but I still want the metrics to remain functional in the 2% case.
And yes, there is currently no concept for naming components explicitly in the SDK and I don't expect any to be there soon. With the definition here I just wanted to be the least prescriptive and leave the door open for adding such a naming mechanism in the future.
brief: > | ||
A name identifying the type of the OpenTelemetry SDK processor. | ||
examples: ["batch-span", "MyCustomProcessor"] | ||
- id: telemetry.sdk.exporter.type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to capture the exporter type assuming we capture the fully qualified component name? https://github.com/open-telemetry/semantic-conventions/pull/1631/files#r1866837831
model/telemetry/registry.yaml
Outdated
A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance. | ||
note: | | ||
The SDK MAY allow users to provide an id for the component instances. If no id is provided by the user, | ||
the SDK SHOULD automatically assign an id. Because this attribute is used in metrics, the SDK MUST ensure a low cardinality in that case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this attribute is used in metrics, the SDK MUST ensure a low cardinality in that case.
How would SDK ensure it?
If we want to capture an index of the processing/exporting pipeline, perhaps we can add an index as a separate attribute and ask SDKs to set it?
Also how would SDK deal with cases like
sdk -> custom_composite_processor1 ----> another_processor1 ---> exporter1
----> processor2 ---> exporter2
it'd only know about custom_composite_processor1
and the rest is opaque.
I think it's worth documenting that components are responsible for capturing their telemetry and disambiguating it across different instances of the same component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to capture an index of the processing/exporting pipeline, perhaps we can add an index as a separate attribute and ask SDKs to set it?
I'd say that is a non-goal for this attribute: Due to composition / wrapping, it would be impossible for the SDK to externally assign the indices to the processors. We just want to avoid the collisions explained in my answer to this comment.
I think my wording here is misleading:
I think it's worth documenting that components are responsible for capturing their telemetry and disambiguating it across different instances of the same component.
That's what I actually meant: I didn't want to sell that the SDK itself is responsible, but the component implementations must assign some unique, low-cardinality IDs to their instances.
I'll try to reword it.
EDIT: reworded in 2c60a71
model/telemetry/metrics.yaml
Outdated
- ref: telemetry.sdk.exporter.type | ||
- ref: telemetry.sdk.component.id | ||
|
||
- id: metric.telemetry.sdk.span.exporter.spans_failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't usually define separate metrics for failed/successful - we define a single one and use error.type
attribute as a marker that something has failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rename the metric to spans_processed
to include both successful and failed processings with distinguishable based on the presence of error.type
.
EDIT: Implemented in 1388eac
|
||
- id: metric.telemetry.sdk.span.sampled_count | ||
type: metric | ||
metric_name: telemetry.sdk.span.sampled_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can sampled be a flag on the ended span metric? why a new metric is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this comment, I'd be glad to add a simple attribute instead
instrument: updowncounter | ||
unit: "1" | ||
attributes: | ||
- ref: telemetry.sdk.processor.type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only batch processor has a queue, perhaps we can define a batch-processor specific metric instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of allowing custom processors to also use this metric, if they make use of queuing. But I don't have a strong opinion here.
model/telemetry/metrics.yaml
Outdated
type: metric | ||
metric_name: telemetry.sdk.span.processor.spans_submitted | ||
stability: experimental | ||
brief: "The number of spans submitted for processing to this span processor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does submitted means? does calling on_start
count as submission?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of "submitted" being defining by receiving a call to the first callback where the processor does something relevant with the span. We should definitely specify that this means on_end
for the batching and simple span processors.
EDIT: Attempted fix in 2f491df
@@ -0,0 +1,91 @@ | |||
groups: | |||
- id: metric.telemetry.sdk.span.ended_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about started and/or inflight spans? I think it's useful to know at least one of them in addition to number of started spans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about started spans but considered it less useful because it wouldn't allow the computation of the inflight spans:
Two my knowledge the absolute value for counters is irrelevant and not really queryable for most metric backends: You only are able to compute the increase of a counter between two points in time.
And you can't compute the inflight spans using increase(ended)-increase(started)
. The same also applies to when using DELTA
temporality.
However, I did not consider adding an inflight
updown-counter directly. That would also allow the computation of started spans via increase(ended)+last_value(inflight)
. So I'm definitiely in favor of adding this metric.
However, I'm not sure whether inflight
is the best name here, as for me inflight intuitively sounds like "being sent over the wire" when looking at the full telemetry system from the outside. Maybe something like active
or not_ended
? I'm happy with other suggestion or even inflight
, just want to ensure that we are truly happy with the name here
Changes
With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.
We checked the SDK implementations, it seems like only the Java SDK currently has some health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.
I'd like to start out with just span related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and traces based on the discussion results on this PR.
Prior work
This PR can be seen as a follow up to the closed OTEP 259:
So we kind of have gone full circle: The discussion started with just SDK metrics (only for exporters), going to an approach to unify the metrics across SDK-exporters and collector, which then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).
In my opinion, it is a good thing to separate the collector and SDK self-metrics:
Existing Metrics in Java SDK
For reference, here is what the existing health metrics currently look like in the Java SDK:
Batch Span Processor metrics
queueSize
, value is the current size of the queuespanProcessorType
=BatchSpanProcessor
(there was a formerExecutorServiceSpanProcessor
which has been removed)BatchSpanProcessor
instances are usedprocessedSpans
, value is the number of spans submitted to the ProcessorspanProcessorType
=BatchSpanProcessor
dropped
(boolean
),true
for the number of spans which could not be processed due to a full queueThe SDK also implements pretty much the same metrics for the
BatchLogRecordProcessor
justspan
replaced everywhere withlog
Exporter metrics
Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a
type
attribute.Also the metric names are dependent on a "name" and "transport" defined by the exporter. For OTLP those are:
exporterName
=otlp
transport
is one ofgrpc
,http
(= protobuf) orhttp-json
The transport is used just for the instrumentation scope name:
io.opentelemetry.exporters.<exporterName>-<transport>
Based on that, the following metrics are exposed:
Counter
<exporterName>.exporter.seen
: The number of records (spans, metrics or logs) submitted to the exportertype
: one ofspan
,metric
orlog
Counter
<exporterName>.exporter.exported
: The number of records (spans, metrics or logs) actually exported (or failed)type
: one ofspan
,metric
orlog
success
(boolean):false
for exporter failuresMerge requirement checklist
[chore]