feat: client side metrics handlers #924
base: client_side_metrics_data_model
Conversation
This reverts commit 874a9a8.
# process each metric from OTel format into Cloud Monitoring format
for resource_metric in metrics_data.resource_metrics:
    for scope_metric in resource_metric.scope_metrics:
        for metric in scope_metric.metrics:
Is there a way to filter only bigtable related metrics? To avoid people from publishing irrelevant metrics.
This exporter is attached to a private MeterProvider, so there shouldn't be any other metrics showing up here
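For context, a minimal sketch of the pattern being described, assuming the standard opentelemetry-sdk API (the exporter here is a stand-in for the Cloud Monitoring exporter in this PR):

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# The exporter is attached to a private MeterProvider rather than the global
# one, so only instruments created from this provider's meters reach it.
exporter = ConsoleMetricExporter()  # stand-in for the Cloud Monitoring exporter
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60_000)
private_provider = MeterProvider(metric_readers=[reader])

# Only metrics recorded through this meter flow to the exporter above;
# anything recorded on the application's global MeterProvider never shows up here.
meter = private_provider.get_meter("bigtable.googleapis.com")
```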
)
self.retry_count = meter.create_counter(
    name="retry_count",
    description="A count of additional RPCs sent after the initial attempt. Under normal circumstances, this will be 1.",
Under normal circumstances, this value is empty. https://cloud.google.com/bigtable/docs/client-side-metrics-descriptions#retry-count
# grab meter for this module
meter = meter_provider.get_meter("bigtable.googleapis.com")
# create instruments
self.operation_latencies = meter.create_histogram(
let's make sure the metric descriptions are identical to our public doc https://cloud.google.com/bigtable/docs/client-side-metrics-descriptions
Fixed, although I didn't include this part of the client blocking latencies description, since it doesn't apply: "For versions 2.21.0 and later, this metric also includes the latencies of requests queued on gRPC channels."
# fixed labels sent with each metric update
self.shared_labels = {
    "client_name": f"python-bigtable/{bigtable_version}",
    "client_uid": client_uid or str(uuid4()),
Is it possible to detect the hostname where the client is run similar to java? https://github.com/googleapis/java-bigtable/blob/main/google-cloud-bigtable-stats/src/main/java/com/google/cloud/bigtable/stats/BigtableStackdriverExportUtils.java#L158-L173 Or that's not possible with python?
It should be possible to do something like that, I'll take a look
done
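A rough sketch of how the client UID could fold in the hostname and process id, loosely mirroring the Java approach linked above (the helper name and exact string format here are assumptions, not necessarily what landed in this PR):

```python
import os
import socket
import uuid


def _default_client_uid() -> str:
    # Mirror the Java client's "<prefix>-<uuid>@<pid>@<hostname>" style id;
    # fall back to localhost if the hostname can't be resolved.
    try:
        hostname = socket.gethostname() or "localhost"
    except OSError:
        hostname = "localhost"
    return f"python-{uuid.uuid4()}@{os.getpid()}@{hostname}"
```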
"resource_table": table_id, | ||
} | ||
if app_profile_id: | ||
self.shared_labels["app_profile"] = app_profile_id |
else: we should tag app_profile with "default"
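A sketch of that suggestion, written as a drop-in replacement for the `if app_profile_id:` branch in the diff above (label name taken from the diff, "default" from the comment):

```python
# Always set the label, tagging the default app profile explicitly instead
# of omitting it when no profile id was provided.
self.shared_labels["app_profile"] = app_profile_id or "default"
```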
try:
    status = str(op.final_status.value[0])
except (IndexError, TypeError):
    status = "2"  # unknown
we actually export status string instead of the numeric value, so OK, DEADLINE_EXCEEDED, etc.
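A sketch of exporting the status name instead of the numeric value, assuming `final_status` is a `grpc.StatusCode` (the helper name is illustrative):

```python
import grpc


def _status_label(final_status) -> str:
    # grpc.StatusCode members expose .name ("OK", "DEADLINE_EXCEEDED", ...);
    # fall back to UNKNOWN for anything unexpected.
    if isinstance(final_status, grpc.StatusCode):
        return final_status.name
    return grpc.StatusCode.UNKNOWN.name
```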
self.otel.operation_latencies.record(
    op.duration, {"streaming": is_streaming, **labels}
)
self.otel.retry_count.add(len(op.completed_attempts) - 1, labels)
We leave retry_count empty if there are no retries, so we only export this if len(op.completed_attempts) - 1 > 0.
Ok, so don't even send anything for this metric if there are no errors? And I assume that applies to connectivity_error_count too then?
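A sketch of the guarded export being discussed, written as a continuation of the diff snippet above (whether the same guard should also apply to connectivity_error_count is the open question):

```python
# Only record retries when at least one extra attempt actually happened,
# so the metric stays empty under normal circumstances.
retries = len(op.completed_attempts) - 1
if retries > 0:
    self.otel.retry_count.add(retries, labels)
```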
labels = {
    "method": op.op_type.value,
    "status": status,
    "resource_zone": op.zone,
Do we also need to fallback to default zone and cluster here?
We do that when building the CompletedOperationMetric, before calling on_operation_complete.
The idea is that ActiveOperationMetric has some possibly empty fields, but when it's finalized into a CompletedOperationMetric, all the defaults are applied and it becomes more type-strict.
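A rough sketch of that finalization pattern (field names and default values here are illustrative placeholders, not the actual data model from the earlier PR):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveOperationMetric:
    # In-flight record: zone/cluster may not be known yet.
    duration: float = 0.0
    zone: Optional[str] = None
    cluster: Optional[str] = None

    def finalize(self) -> "CompletedOperationMetric":
        # Apply defaults exactly once, so downstream handlers always see
        # fully populated, type-strict values.
        return CompletedOperationMetric(
            duration=self.duration,
            zone=self.zone or "global",
            cluster=self.cluster or "unspecified",
        )


@dataclass
class CompletedOperationMetric:
    duration: float
    zone: str
    cluster: str
```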
- export_interval: The interval (in seconds) at which to export metrics to Cloud Monitoring.
"""

def __init__(self, *args, project_id: str, export_interval=60, **kwargs):
I think the minimum interval we can publish metrics to cloud monitoring is 60 seconds. And I don't think we want to update with larger intervals. So let's remove this option?
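A sketch of what dropping the option could look like, assuming the reader in use is OpenTelemetry's PeriodicExportingMetricReader (the helper name and exporter argument are stand-ins):

```python
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Per the comment above, 60 seconds is the minimum publish interval for
# Cloud Monitoring, so pin it instead of exposing it as a constructor option.
EXPORT_INTERVAL_SECONDS = 60


def _build_reader(exporter) -> PeriodicExportingMetricReader:
    return PeriodicExportingMetricReader(
        exporter, export_interval_millis=EXPORT_INTERVAL_SECONDS * 1000
    )
```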
[
    0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16,
    20, 25, 30, 40, 50, 65, 80, 100, 130, 160, 200, 250, 300, 400,
    500, 650, 800, 1000, 2000, 5000, 10000, 20000, 50000, 100000,
We updated the buckets in java:
ImmutableList.of(
0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0, 16.0, 20.0, 25.0, 30.0, 40.0,
50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0, 300.0, 400.0, 500.0, 650.0,
800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0, 100000.0, 200000.0,
400000.0, 800000.0, 1600000.0, 3200000.0));
I think 100k is too small
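If the buckets are updated to match Java, a sketch of how the boundaries could be applied through an OpenTelemetry View (assuming opentelemetry-sdk's explicit-bucket histogram aggregation; the instrument name comes from the diff above):

```python
from opentelemetry.sdk.metrics.view import (
    ExplicitBucketHistogramAggregation,
    View,
)

# Same boundaries as the updated Java list, extending well past 100k ms.
MILLIS_AGGREGATION = ExplicitBucketHistogramAggregation(
    boundaries=[
        0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0, 16.0, 20.0, 25.0,
        30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0,
        300.0, 400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0,
        20000.0, 50000.0, 100000.0, 200000.0, 400000.0, 800000.0, 1600000.0,
        3200000.0,
    ]
)

# Route the latency histogram through the aggregation by instrument name.
LATENCY_VIEW = View(
    instrument_name="operation_latencies", aggregation=MILLIS_AGGREGATION
)
```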
)
# use private meter provider to store instruments and views
meter_provider = MeterProvider(metric_readers=[gcp_reader], views=VIEW_LIST)
otel = _OpenTelemetryInstruments(meter_provider=meter_provider)
how would a customer provide their own otel instrumentation? (this could be in a follow up PR)
In java we let customers override it in the settings and pass the otel instance down. see the description in googleapis/java-bigtable#1796
The client holds a controller that manages a set of handlers. Users can add their own OpenTelemetryHandler to send metrics to a different MeterProvider if they want
I'll have to write up some documentation for this at some point
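An illustrative sketch of that controller/handler shape (class and method names besides on_operation_complete are placeholders, not the final API in this PR):

```python
class MetricsHandler:
    """Placeholder base: handlers receive completed operation metrics."""

    def on_operation_complete(self, op) -> None:
        ...


class MetricsController:
    """Placeholder controller that fans completed operations out to handlers."""

    def __init__(self, handlers=None):
        self.handlers = list(handlers or [])

    def add_handler(self, handler: MetricsHandler) -> None:
        # Users could register e.g. an OpenTelemetry-backed handler pointed
        # at their own MeterProvider alongside the built-in GCP handler.
        self.handlers.append(handler)

    def on_operation_complete(self, op) -> None:
        for handler in self.handlers:
            handler.on_operation_complete(op)
```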
if data_point.attributes:
    project_id = data_point.attributes.get("resource_project")
    if not isinstance(project_id, str):
        # we expect string for project_id field
Maybe log a warning? Maybe something like: "malformatted resource_project x. Skip publishing"
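A sketch of the suggested warning, assuming the module uses the standard logging library (the helper name is illustrative):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def _project_id_for(data_point) -> Optional[str]:
    # Pull the project id off the data point; warn and skip the point if it
    # is missing or not a string, rather than failing the whole export.
    project_id = (data_point.attributes or {}).get("resource_project")
    if not isinstance(project_id, str):
        logger.warning(
            "malformed resource_project %r; skipping data point", project_id
        )
        return None
    return project_id
```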
def __init__(self, project_id: str):
    super().__init__()
    self.client = MetricServiceClient()
do we need to configure retry attempt number for create_service_time_series method? what's the default? In java we are not retrying the method, I think because republishing could lead to errors.
Good point. I just looked at the gapic code, and it looks like this RPC defaults to no retries.
We can add them if we want, but since we export every 60 seconds, maybe we should add failed points back to the queue for the next batch? (This may actually be the default for OpenTelemetry too; I'd have to look into it. Right now we just return a MetricExportResult.FAILURE, and I'm not sure what happens to the failed metrics.)
Would republishing lead to errors here too?
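For reference, a sketch of making the write explicitly non-retryable with the generated google-cloud-monitoring client, so a failed batch surfaces as an export failure instead of being republished (the wrapper function and timeout value are assumptions):

```python
from google.cloud.monitoring_v3 import MetricServiceClient


def write_batch(client: MetricServiceClient, project_name: str, time_series) -> bool:
    """Write one batch of points; report failure instead of retrying."""
    try:
        # retry=None disables client-side retries in the generated client:
        # republishing the same points can be rejected by Cloud Monitoring,
        # so surface the failure and let the next export carry new data.
        client.create_service_time_series(
            name=project_name,
            time_series=list(time_series),
            retry=None,
            timeout=30.0,
        )
        return True
    except Exception:
        return False
```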
self.project_name = self.client.common_project_path(project_id)

def export(
    self, metrics_data: MetricsData, timeout_millis: float = 10_000, **kwargs
Why not just pass in seconds? :) So we don't need to convert it back to seconds on line 145 later :)
This is part of the superclass method definition unfortunately. We inherit from opentelemetry.sdk.metrics.export.MetricExporter
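For reference, the inherited hook looks roughly like this, with the unit conversion happening inside the override (a simplified sketch against opentelemetry-sdk's MetricExporter interface; the class name and body are placeholders):

```python
from opentelemetry.sdk.metrics.export import (
    MetricExporter,
    MetricExportResult,
    MetricsData,
)


class _SketchExporter(MetricExporter):
    """Simplified sketch; the real exporter converts to Cloud Monitoring types."""

    def export(
        self, metrics_data: MetricsData, timeout_millis: float = 10_000, **kwargs
    ) -> MetricExportResult:
        # The superclass fixes the timeout unit as milliseconds, so convert
        # once here before handing it to the RPC layer.
        timeout_seconds = timeout_millis / 1000
        ...
        return MetricExportResult.SUCCESS

    def force_flush(self, timeout_millis: float = 10_000) -> bool:
        return True

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        pass
```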
This PR builds off of #923 to add handlers to the client-side metrics system, which can subscribe to the metrics stream and export the results into different collection systems.
Follow-up PR:
We add three handlers to the system:
- GoogleCloudMetricsHandler: sends metrics to a private OpenTelemetry meter, and then periodically exports them to GCP. Built on top of OpenTelemetryMetricsHandler.
- OpenTelemetryMetricsHandler: sends metrics to the root MeterProvider, so the user can access the exported metrics for their own systems. This will be off by default, but can be added alongside GoogleCloudMetricsHandler if needed.
- StdoutMetricsHandler: can print metrics to stdout as they arrive. Mostly for debugging (we can remove this if you don't think it's useful).
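A rough sketch of what the debugging handler described above could look like (attribute names are taken from the diff snippets earlier in this conversation; the exact output format is an assumption):

```python
class StdoutMetricsHandler:
    """Illustrative only: prints completed operation metrics as they arrive."""

    def on_operation_complete(self, op) -> None:
        # `op` is a completed operation record; print a compact summary.
        print(
            f"[metrics] {op.op_type.value}: {op.duration}ms "
            f"({len(op.completed_attempts)} attempt(s))"
        )
```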