feat: client side metrics handlers #924
base: client_side_metrics_data_model
Conversation
This reverts commit 874a9a8.
# process each metric from OTel format into Cloud Monitoring format
for resource_metric in metrics_data.resource_metrics:
    for scope_metric in resource_metric.scope_metrics:
        for metric in scope_metric.metrics:
Is there a way to filter only bigtable related metrics? To avoid people from publishing irrelevant metrics.
This exporter is attached to a private MeterProvider, so there shouldn't be any other metrics showing up here
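For context, a minimal sketch of the pattern being described, assuming the standard opentelemetry-sdk API (the exporter here is a stand-in for the Cloud Monitoring exporter in this PR):

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# The exporter is attached to a private MeterProvider rather than the global
# one, so only instruments created from this provider's meters reach it.
exporter = ConsoleMetricExporter()  # stand-in for the Cloud Monitoring exporter
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60_000)
private_provider = MeterProvider(metric_readers=[reader])

# Only metrics recorded through this meter flow to the exporter above;
# anything recorded on the application's global MeterProvider never shows up here.
meter = private_provider.get_meter("bigtable.googleapis.com")
```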
)
self.retry_count = meter.create_counter(
    name="retry_count",
    description="A count of additional RPCs sent after the initial attempt. Under normal circumstances, this will be 1.",
Under normal circumstances, this value is empty. https://cloud.google.com/bigtable/docs/client-side-metrics-descriptions#retry-count
# grab meter for this module
meter = meter_provider.get_meter("bigtable.googleapis.com")
# create instruments
self.operation_latencies = meter.create_histogram(
let's make sure the metric descriptions are identical to our public doc https://cloud.google.com/bigtable/docs/client-side-metrics-descriptions
Fixed, although I didn't include this part of the client blocking latencies description, since it doesn't apply: "For versions 2.21.0 and later, this metric also includes the latencies of requests queued on gRPC channels."
# fixed labels sent with each metric update
self.shared_labels = {
    "client_name": f"python-bigtable/{bigtable_version}",
    "client_uid": client_uid or str(uuid4()),
Is it possible to detect the hostname where the client is run similar to java? https://github.com/googleapis/java-bigtable/blob/main/google-cloud-bigtable-stats/src/main/java/com/google/cloud/bigtable/stats/BigtableStackdriverExportUtils.java#L158-L173 Or that's not possible with python?
It should be possible to do something like that, I'll take a look
done
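A rough sketch of how the client UID could fold in the hostname and process id, loosely mirroring the Java approach linked above (the helper name and exact string format here are assumptions, not necessarily what landed in this PR):

```python
import os
import socket
import uuid


def _default_client_uid() -> str:
    # Mirror the Java client's "<prefix>-<uuid>@<pid>@<hostname>" style id;
    # fall back to localhost if the hostname can't be resolved.
    try:
        hostname = socket.gethostname() or "localhost"
    except OSError:
        hostname = "localhost"
    return f"python-{uuid.uuid4()}@{os.getpid()}@{hostname}"
```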
"resource_table": table_id, | ||
} | ||
if app_profile_id: | ||
self.shared_labels["app_profile"] = app_profile_id |
else: we should tag app_profile with "default"
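A sketch of that suggestion, written as a drop-in replacement for the `if app_profile_id:` branch in the diff above (label name taken from the diff, "default" from the comment):

```python
# Always set the label, tagging the default app profile explicitly instead
# of omitting it when no profile id was provided.
self.shared_labels["app_profile"] = app_profile_id or "default"
```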
try:
    status = str(op.final_status.value[0])
except (IndexError, TypeError):
    status = "2"  # unknown
we actually export status string instead of the numeric value, so OK, DEADLINE_EXCEEDED, etc.
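A sketch of exporting the status name instead of the numeric value, assuming `final_status` is a `grpc.StatusCode` (the helper name is illustrative):

```python
import grpc


def _status_label(final_status) -> str:
    # grpc.StatusCode members expose .name ("OK", "DEADLINE_EXCEEDED", ...);
    # fall back to UNKNOWN for anything unexpected.
    if isinstance(final_status, grpc.StatusCode):
        return final_status.name
    return grpc.StatusCode.UNKNOWN.name
```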
self.otel.operation_latencies.record(
    op.duration, {"streaming": is_streaming, **labels}
)
self.otel.retry_count.add(len(op.completed_attempts) - 1, labels)
We leave retry_count empty if there are no retries, so we only export this if len(op.completed_attempts) - 1 > 0.
Ok, so don't even send anything for this metric if there are no errors? And I assume that applies to connectivity_error_count too then?
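A sketch of the guarded export being discussed, written as a continuation of the diff snippet above (whether the same guard should also apply to connectivity_error_count is the open question):

```python
# Only record retries when at least one extra attempt actually happened,
# so the metric stays empty under normal circumstances.
retries = len(op.completed_attempts) - 1
if retries > 0:
    self.otel.retry_count.add(retries, labels)
```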
labels = {
    "method": op.op_type.value,
    "status": status,
    "resource_zone": op.zone,
Do we also need to fallback to default zone and cluster here?
We do that when building the CompletedOperationMetric, before calling on_operation_complete.
The idea is that ActiveOperationMetric has some possibly empty fields, but when it's finalized into a CompletedOperationMetric, all the defaults are applied and it becomes more type-strict.
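A rough sketch of that finalization pattern (field names and default values here are illustrative placeholders, not the actual data model from the earlier PR):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveOperationMetric:
    # In-flight record: zone/cluster may not be known yet.
    duration: float = 0.0
    zone: Optional[str] = None
    cluster: Optional[str] = None

    def finalize(self) -> "CompletedOperationMetric":
        # Apply defaults exactly once, so downstream handlers always see
        # fully populated, type-strict values.
        return CompletedOperationMetric(
            duration=self.duration,
            zone=self.zone or "global",
            cluster=self.cluster or "unspecified",
        )


@dataclass
class CompletedOperationMetric:
    duration: float
    zone: str
    cluster: str
```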
- export_interval: The interval (in seconds) at which to export metrics to Cloud Monitoring.
"""

def __init__(self, *args, project_id: str, export_interval=60, **kwargs):
I think the minimum interval we can publish metrics to cloud monitoring is 60 seconds. And I don't think we want to update with larger intervals. So let's remove this option?
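A sketch of what dropping the option could look like, assuming the reader in use is OpenTelemetry's PeriodicExportingMetricReader (the helper name and exporter argument are stand-ins):

```python
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Per the comment above, 60 seconds is the minimum publish interval for
# Cloud Monitoring, so pin it instead of exposing it as a constructor option.
EXPORT_INTERVAL_SECONDS = 60


def _build_reader(exporter) -> PeriodicExportingMetricReader:
    return PeriodicExportingMetricReader(
        exporter, export_interval_millis=EXPORT_INTERVAL_SECONDS * 1000
    )
```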
[
    0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16,
    20, 25, 30, 40, 50, 65, 80, 100, 130, 160, 200, 250, 300, 400,
    500, 650, 800, 1000, 2000, 5000, 10000, 20000, 50000, 100000,
We updated the buckets in java:
ImmutableList.of(
0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0, 16.0, 20.0, 25.0, 30.0, 40.0,
50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0, 300.0, 400.0, 500.0, 650.0,
800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0, 100000.0, 200000.0,
400000.0, 800000.0, 1600000.0, 3200000.0));
I think 100k is too small
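If the buckets are updated to match Java, a sketch of how the boundaries could be applied through an OpenTelemetry View (assuming opentelemetry-sdk's explicit-bucket histogram aggregation; the instrument name comes from the diff above):

```python
from opentelemetry.sdk.metrics.view import (
    ExplicitBucketHistogramAggregation,
    View,
)

# Same boundaries as the updated Java list, extending well past 100k ms.
MILLIS_AGGREGATION = ExplicitBucketHistogramAggregation(
    boundaries=[
        0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0, 16.0, 20.0, 25.0,
        30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0,
        300.0, 400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0,
        20000.0, 50000.0, 100000.0, 200000.0, 400000.0, 800000.0, 1600000.0,
        3200000.0,
    ]
)

# Route the latency histogram through the aggregation by instrument name.
LATENCY_VIEW = View(
    instrument_name="operation_latencies", aggregation=MILLIS_AGGREGATION
)
```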
)
# use private meter provider to store instruments and views
meter_provider = MeterProvider(metric_readers=[gcp_reader], views=VIEW_LIST)
otel = _OpenTelemetryInstruments(meter_provider=meter_provider)
how would a customer provide their own otel instrumentation? (this could be in a follow up PR)
In java we let customers override it in the settings and pass the otel instance down. see the description in googleapis/java-bigtable#1796
The client holds a controller that manages a set of handlers. Users can add their own OpenTelemetryHandler to send metrics to a different MeterProvider if they want
I'll have to write up some documentation for this at some point
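An illustrative sketch of that controller/handler shape (class and method names besides on_operation_complete are placeholders, not the final API in this PR):

```python
class MetricsHandler:
    """Placeholder base: handlers receive completed operation metrics."""

    def on_operation_complete(self, op) -> None:
        ...


class MetricsController:
    """Placeholder controller that fans completed operations out to handlers."""

    def __init__(self, handlers=None):
        self.handlers = list(handlers or [])

    def add_handler(self, handler: MetricsHandler) -> None:
        # Users could register e.g. an OpenTelemetry-backed handler pointed
        # at their own MeterProvider alongside the built-in GCP handler.
        self.handlers.append(handler)

    def on_operation_complete(self, op) -> None:
        for handler in self.handlers:
            handler.on_operation_complete(op)
```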
if data_point.attributes:
    project_id = data_point.attributes.get("resource_project")
    if not isinstance(project_id, str):
        # we expect string for project_id field
Maybe log a warning? Maybe something like: "malformatted resource_project x. Skip publishing"
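A sketch of the suggested warning, assuming the module uses the standard logging library (the helper name is illustrative):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def _project_id_for(data_point) -> Optional[str]:
    # Pull the project id off the data point; warn and skip the point if it
    # is missing or not a string, rather than failing the whole export.
    project_id = (data_point.attributes or {}).get("resource_project")
    if not isinstance(project_id, str):
        logger.warning(
            "malformed resource_project %r; skipping data point", project_id
        )
        return None
    return project_id
```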
def __init__(self, project_id: str):
    super().__init__()
    self.client = MetricServiceClient()
do we need to configure retry attempt number for create_service_time_series method? what's the default? In java we are not retrying the method, I think because republishing could lead to errors.
Good point. I just looked at the gapic code, and it looks like this RPC defaults to no retries.
We can add them if we want, but since we export every 60 seconds, maybe we should add failed points back to the queue for the next batch? (This may actually be the default for OpenTelemetry too; I'd have to look into it. Right now we just return a MetricExportResult.FAILURE, and I'm not sure what happens to the failed metrics.)
Would republishing lead to errors here too?
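For reference, a sketch of making the write explicitly non-retryable with the generated google-cloud-monitoring client, so a failed batch surfaces as an export failure instead of being republished (the wrapper function and timeout value are assumptions):

```python
from google.cloud.monitoring_v3 import MetricServiceClient


def write_batch(client: MetricServiceClient, project_name: str, time_series) -> bool:
    """Write one batch of points; report failure instead of retrying."""
    try:
        # retry=None disables client-side retries in the generated client:
        # republishing the same points can be rejected by Cloud Monitoring,
        # so surface the failure and let the next export carry new data.
        client.create_service_time_series(
            name=project_name,
            time_series=list(time_series),
            retry=None,
            timeout=30.0,
        )
        return True
    except Exception:
        return False
```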
self.project_name = self.client.common_project_path(project_id)

def export(
    self, metrics_data: MetricsData, timeout_millis: float = 10_000, **kwargs
Why not just pass in seconds? :) So we don't need to convert it back to seconds on line 145 later :)
This is part of the superclass method definition unfortunately. We inherit from opentelemetry.sdk.metrics.export.MetricExporter
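For reference, the inherited hook looks roughly like this, with the unit conversion happening inside the override (a simplified sketch against opentelemetry-sdk's MetricExporter interface; the class name and body are placeholders):

```python
from opentelemetry.sdk.metrics.export import (
    MetricExporter,
    MetricExportResult,
    MetricsData,
)


class _SketchExporter(MetricExporter):
    """Simplified sketch; the real exporter converts to Cloud Monitoring types."""

    def export(
        self, metrics_data: MetricsData, timeout_millis: float = 10_000, **kwargs
    ) -> MetricExportResult:
        # The superclass fixes the timeout unit as milliseconds, so convert
        # once here before handing it to the RPC layer.
        timeout_seconds = timeout_millis / 1000
        ...
        return MetricExportResult.SUCCESS

    def force_flush(self, timeout_millis: float = 10_000) -> bool:
        return True

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        pass
```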
This PR builds off of #923 to add handlers to the client-side metrics system, which can subscribe to the metrics stream and export the results into different collection systems.
Follow-up PR:
We add three handlers to the system:
- GoogleCloudMetricsHandler: sends metrics to a private OpenTelemetry meter, and then periodically exports them to GCP. Built on top of OpenTelemetryMetricsHandler.
- OpenTelemetryMetricsHandler: sends metrics to the root MeterProvider, so the user can access the exported metrics for their own systems. This will be off by default, but can be added alongside GoogleCloudMetricsHandler if needed.
- StdoutMetricsHandler: can print metrics to stdout as they arrive. Mostly for debugging (we can remove this if you don't think it's useful).
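A rough sketch of what the debugging handler described above could look like (attribute names are taken from the diff snippets earlier in this conversation; the exact output format is an assumption):

```python
class StdoutMetricsHandler:
    """Illustrative only: prints completed operation metrics as they arrive."""

    def on_operation_complete(self, op) -> None:
        # `op` is a completed operation record; print a compact summary.
        print(
            f"[metrics] {op.op_type.value}: {op.duration}ms "
            f"({len(op.completed_attempts)} attempt(s))"
        )
```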