metrics-generator idle and no traces are generated using Tempo 2.5.0 #3855

Open
nc-gcz opened this issue Jul 10, 2024 · 21 comments

nc-gcz commented Jul 10, 2024

Describe the bug
Good morning. We are running Grafana Tempo 2.5.0 using the Bitnami Grafana Tempo distributed Helm chart, version 3.6.3. The following components are deployed: Compactor, Distributor, Memcached, Metrics Generator, Querier, Query Frontend, Vulture, and OpenTelemetry Collector (version 0.83.0).

After upgrading from Grafana Tempo 2.1.1 to 2.5.0, we have an issue with the Metrics Generator. The incompatible vParquet block version was mitigated according to the documentation: https://grafana.com/docs/tempo/latest/setup/upgrade/#upgrade-to-tempo-25.

The issue we are experiencing is that the metrics-generator seems to be idle after start-up:

level=info ts=2024-07-10T09:42:46.762479515Z caller=main.go:225 msg="initialising OpenTracing tracer"
level=info ts=2024-07-10T09:42:46.857883541Z caller=main.go:118 msg="Starting Tempo" version="(version=2.5.0, branch=HEAD, revision=46dad34)"
level=info ts=2024-07-10T09:42:46.860437514Z caller=server.go:240 msg="server listening on addresses" http=[::]:3200 grpc=[::]:9095
level=info ts=2024-07-10T09:42:46.862597894Z caller=memberlist_client.go:435 msg="Using memberlist cluster label and node name" cluster_label= node=grafana-tempo-metrics-generator-ff7856d95-g5zqz-5609c299
level=info ts=2024-07-10T09:42:46.8632617Z caller=module_service.go:82 msg=starting module=internal-server
level=info ts=2024-07-10T09:42:46.863420623Z caller=module_service.go:82 msg=starting module=server
level=info ts=2024-07-10T09:42:46.863484388Z caller=module_service.go:82 msg=starting module=overrides
level=info ts=2024-07-10T09:42:46.863533522Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=info ts=2024-07-10T09:42:46.863997352Z caller=memberlist_client.go:541 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2024-07-10T09:42:46.864106683Z caller=module_service.go:82 msg=starting module=metrics-generator
level=info ts=2024-07-10T09:42:46.964122903Z caller=memberlist_client.go:561 msg="memberlist fast-join finished" joined_nodes=5 elapsed_time=100.127674ms
level=info ts=2024-07-10T09:42:46.964195952Z caller=memberlist_client.go:573 phase=startup msg="joining memberlist cluster" join_members=grafana-tempo-gossip-ring
level=info ts=2024-07-10T09:42:46.964243354Z caller=basic_lifecycler.go:297 msg="instance not found in the ring" instance=grafana-tempo-metrics-generator-ff7856d95-g5zqz ring=metrics-generator
level=info ts=2024-07-10T09:42:46.964631778Z caller=app.go:214 msg="Tempo started"
level=info ts=2024-07-10T09:42:46.981742416Z caller=memberlist_client.go:580 phase=startup msg="joining memberlist cluster succeeded" reached_nodes=5 elapsed_time=17.536478ms

This is the log from a metrics-generator instance that had been started around 60 minutes earlier.

The relevant configuration is as follows:

multitenancy_enabled: false
usage_report:
  reporting_enabled: false
compactor:
  compaction:
    block_retention: 48h
  ring:
    kvstore:
      store: memberlist
    wait_stability_min_duration: 30s
    wait_stability_max_duration: 2m30s
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:55681
        grpc:
          endpoint: 0.0.0.0:4317
querier:
  frontend_worker:
    frontend_address: grafana-tempo-query-frontend:9095
ingester:
  lifecycler:
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    tokens_file_path: /bitnami/grafana-tempo/data/tokens.json
metrics_generator:
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /bitnami/grafana-tempo/data/wal
    remote_write: 
      - name: prometheus-0
        send_exemplars: true
        url: http://<REDACTED>:9090/api/v1/write
      - name: prometheus-1
        send_exemplars: true
        url: http://<REDACTED>:9090/api/v1/write
  registry:
    external_labels:
      source: tempo
  processor:
    service_graphs:
      max_items: 10000
    span_metrics:
      enable_target_info: true
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - grafana-tempo-gossip-ring
overrides:
  per_tenant_override_config: /bitnami/grafana-tempo/conf/overrides.yaml
  defaults:
    metrics_generator:
      processors: ['service-graphs', 'span-metrics']
server:
  http_listen_port: 3200
storage:
  trace:
    backend: s3
    block:
      version: vParquet3
    blocklist_poll: 5m
    s3:
      bucket: <REDACTED>
      endpoint: <REDACTED>
      tls_insecure_skip_verify: true
      access_key: <REDACTED>
      secret_key: <REDACTED>
      bucket_lookup_type: 2
    local:
      path: /bitnami/grafana-tempo/data/traces
    wal:
      path: /bitnami/grafana-tempo/data/wal
cache:
  caches:
  - roles:
    - parquet-footer
    memcached: 
      consistent_hash: true
      host: grafana-tempo-memcached
      service: memcache
      timeout: 500ms

Yet no service graphs or span metrics are visible in Grafana. These metrics were correctly persisted in Prometheus via remote write on Tempo 2.1.1 but broke after the upgrade to 2.5.0. Also, the internal metrics listed on the documentation page https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/ (the first three or four) return empty results in Prometheus - it seems they are not reported at all.

I think I've followed the documentation to enable the processors in metrics_generator, yet it's idling. I was expecting to see something like:

level=info ts=2024-06-26T15:16:56.021309322Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=46122
level=info ts=2024-06-26T15:17:11.061573927Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=46122
level=info ts=2024-06-26T15:17:26.024328797Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=46122
level=info ts=2024-06-26T15:17:41.024960204Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=46122

But I cannot see anything after the cluster join: nothing about connecting to the Prometheus remote-write endpoint, and no errors about failing to connect either.

Interestingly enough, when running Grafana Tempo from ./examples/docker-compose/distributed, I was able to see the metrics out of the box.

To Reproduce:
Steps to reproduce the behavior:

  1. Start Tempo 2.5.0 using the Bitnami distributed chart, version 3.6.3.
  2. Configure the Metrics Generator according to the documentation.
  3. Perform operations (write traces).

Expected behavior:
The Metrics Generator is not idle and the generated metrics are visible in Prometheus.

Environment:

  • Infrastructure: Kubernetes cluster, Helm deployment
  • Deployment tool: Helm chart from Bitnami, version: 3.6.3
  • Grafana version: v11.1.0 (5b85c4c2fc)

kirankumar-grootan commented Jul 10, 2024

Same here, but I'm using the official Grafana tempo-distributed Helm chart:

cache:
      caches:
      - memcached:
          consistent_hash: true
          host: 'test-tempo-memcached'
          service: memcached-client
          timeout: 500ms
        roles:
        - parquet-footer
        - bloom
        - frontend-search
    compactor:
      compaction:
        block_retention: 72h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 107374182400
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 10
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 1000
      ring:
        kvstore:
          store: memberlist
    distributor:
      log_received_spans:
        enabled: true
        filter_by_status_error: true
        include_all_attributes: true
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
              max_recv_msg_size_mib: 4194304
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+test-tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          enable_client_server_prefix: true
          enable_messaging_system_latency_histogram: true
          enable_virtual_node_label: true
          histogram_buckets: []
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets: []
      registry:
        collection_interval: 15s
        external_labels:
          source: tempo
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/generator/wal
        remote_write:
        - headers:
            X-Scope-OrgID: test
          send_exemplars: true
          url: http://observability-mimir-distributor:8080/api/v1/push
        remote_write_add_org_id_header: false
        remote_write_flush_deadline: 1m
      traces_storage:
        path: /var/tempo/generator/traces
    multitenancy_enabled: true
    overrides:
      per_tenant_override_config: /runtime-config/overrides.yaml
    querier:
      frontend_worker:
        frontend_address: test-tempo-query-frontend-discovery:9095
        grpc_client_config:
          max_send_msg_size: 4194304
      max_concurrent_queries: 20
      search:
        external_backend: null
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
        prefer_self: 10
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_outstanding_per_tenant: 2000
      max_retries: 2
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
      trace_by_id:
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 4194304
      grpc_server_max_send_msg_size: 4194304
      http_listen_port: 3100
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        backend: s3
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        pool:
          max_workers: 400
          queue_depth: 20000
        s3:
          access_key: 123
          bucket: tempo
          endpoint: observability-minio:9000
          forcepathstyle: true
          insecure: false
          secret_key: 123
          tls_insecure_skip_verify: true
        wal:
          path: /var/tempo/wal
    usage_report:
      reporting_enabled: false


kirankumar-grootan commented Jul 10, 2024

I'm not sure if this is a limitation of distributed mode. When I tried it with Docker Compose, it worked because the metrics generator and ingester read and write data from the same local path. However, in a Kubernetes environment under distributed mode, all components run as separate containers with separate volumes. I'm not sure if the metrics generator and ingester share the same path.

Someone from the Grafana team could enlighten us on this.


nc-gcz commented Jul 11, 2024

In our set-up, the ingester indeed has a persistent volume claim:

spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-grafana-tempo-ingester-0

But it doesn't seem to be mounted into the metrics-generator.

Another thing we spotted when running the tests was the following behaviour:

level=info ts=2024-07-10T14:50:55.619034542Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=44556
level=info ts=2024-07-10T14:50:55.62145804Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=45901
level=info ts=2024-07-10T14:51:10.515918323Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-10T14:51:25.5159173Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0

At some point during the tests (which in our case generate traffic across 30+ microservices), it basically stopped generating metrics. After I noticed this the next morning and restarted the metrics-generator, its log ended after the following lines:

level=info ts=2024-07-11T12:42:06.162281707Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2024-07-11T12:42:06.162851639Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2024-07-11T12:42:06.173897462Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=4 elapsed_time=11.048845ms
level=info ts=2024-07-11T12:42:06.173946198Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=test-grafana-tempo-gossip-ring
level=info ts=2024-07-11T12:42:06.17426984Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-6cc769bdd7-zh89p ring=metrics-generator
level=info ts=2024-07-11T12:42:06.174649825Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-11T12:42:06.185328903Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=11.382705ms

Nothing is printed about even attempting to connect to the Prometheus instance, which in our case is completely healthy: not restarting, working as expected, and always reachable at the same URL.

@joe-elliott (Member)

Hey all, can you review these steps:

https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/

This will help us identify where things are being dropped.

However, in a Kubernetes environment under distributed mode, all components run as separate containers with separate volumes. I'm not sure if the metrics generator and ingester share the same path.

Generators and ingesters do not need shared access to block storage. They are fine to run independently.


nc-gcz commented Jul 12, 2024

Hello @joe-elliott, thank you for reaching out.

As part of the troubleshooting, our Operations Team reverted from Grafana Tempo 2.5.0 to 2.1.1. We thought the behaviour we saw on 2.5.0 would be fixed by downgrading to 2.1.1 (the older version we had been running), but it doesn't seem to help - we see the same behaviour on both versions.

I went through the matching troubleshooting guide from here: https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/

Testing the metrics from the similar issue #2514:

  • sum(increase(traces_service_graph_request_total[1m])) - empty result from last 12 hours, no graph visible
  • sum(increase(tempo_distributor_ingester_appends_total[1m])) - value of around 300-350 in last 12 hours

(screenshot attached)

  • sum(increase(tempo_distributor_metrics_generator_pushes_total[1m])) - empty result from last 12 hours, no graph visible
  • sum(increase(tempo_distributor_metrics_generator_pushes_failures_total[1m])) - empty result from last 12 hours, no graph visible
  • rate(tempo_metrics_generator_registry_collections_failed_total[1m]) - empty result from last 12 hours, no graph visible

Metrics like prometheus_remote_storage_samples_total, prometheus_remote_storage_bytes_total, or prometheus_remote_storage_samples_failed_total report empty results in Prometheus for the last 24 hours.

Troubleshooting guide

  • sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])) - empty result from last 12 hours, no graph visible
  • sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])) - as above,
  • sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason) - as above,
  • sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])) - as above,
  • All of: prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total - as above,
  • sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])) - as above,
  • sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])) - as above,
  • sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])) - as above.

It seems that the metrics-generator is not sending any metrics at all, yet it reports itself as healthy.

It used to work right after a reinstall of the Helm chart; after some time and tests it stopped working, reporting active_series=0 for a prolonged period, and after a restart it did not pick up again or send any metrics to Prometheus. So it's not that it never works - it works for some time and then it stops :(

I've attached an example response body from the metrics-generator's metrics endpoint; it is missing all metrics related to service graphs, span metrics, and the internal metrics listed above (which are empty):

metrics-generator-example-metrics-endpoint-response-body.txt

Prometheus reports it as scraping correctly:

(screenshot attached)


nc-gcz commented Jul 12, 2024

Also, we've redeployed Grafana Tempo again - a clean install of the Helm chart running Tempo 2.1.1.

What we can see now is that the active_series= log lines are present and, most importantly, the following log line appears:

level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant

When the metrics-generator is idle, its log ends before this line, so the last log lines are:

level=info ts=2024-07-12T10:05:50.298578151Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-589965b6b5-qdrvt ring=metrics-generator
level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms

Whereas for a working metrics-generator it looks like this:

level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms
level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant
ts=2024-07-12T10:05:50.481543325Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-0
ts=2024-07-12T10:05:50.481582922Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.481763202Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-1
ts=2024-07-12T10:05:50.481772492Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.483207769Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-0
ts=2024-07-12T10:05:50.483562004Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-1
level=info ts=2024-07-12T10:05:50.48629064Z caller=db.go:336 tenant=single-tenant component=wal msg="replaying WAL, this may take a while" dir=/bitnami/grafana-tempo/data/wal/single-tenant/wal
level=info ts=2024-07-12T10:05:50.486461227Z caller=db.go:387 tenant=single-tenant component=wal msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2024-07-12T10:06:05.514274932Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6074
ts=2024-07-12T10:06:05.551793586Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.068244585s
ts=2024-07-12T10:06:05.552795859Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.069655298s

On an idle metrics-generator, the /bitnami/grafana-tempo/data/wal path is empty. But when it starts correctly, it has data:

(screenshot attached)

/bitnami/grafana-tempo/data is an emptyDir in the Kubernetes deployment, as installed from the Bitnami Helm chart.
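For contrast with the ingester's persistent volume claim above, an emptyDir-backed data directory in a pod spec typically looks like the sketch below. This is a generic Kubernetes illustration, not an excerpt from the Bitnami chart, and the container and volume names are assumed:

spec:
  volumes:
    - name: data
      emptyDir: {}   # contents are lost whenever the pod is recreated
  containers:
    - name: metrics-generator
      volumeMounts:
        - name: data
          mountPath: /bitnami/grafana-tempo/data   # WAL and trace data live under this path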


nc-gcz commented Jul 12, 2024

It stopped again:

level=info ts=2024-07-12T10:43:33.390479119Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:43:48.393775049Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:03.390660555Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:18.379634162Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-07-12T10:44:18.379708084Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-12T10:44:33.379406636Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0

Since we did have some metrics for a short period of time, I'd like to run the metrics from the troubleshooting guide again:

Troubleshooting guide

  • sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])):

Empty

  • sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])):

(screenshot attached)

  • sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason):

(screenshot attached)

  • sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])):

Empty

  • All of: prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total:

(screenshot attached)

  • sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])):

(screenshot attached)

  • sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])):

(screenshot attached)

  • sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])):

(screenshot attached)


nc-gcz commented Jul 12, 2024

By checking out the 2.1.1 code locally and looking at modules/generator/storage/instance.go, we can see that at some point:

// New creates a metrics WAL that remote writes its data.
func New(cfg *Config, tenant string, reg prometheus.Registerer, logger log.Logger) (Storage, error) {
	logger = log.With(logger, "tenant", tenant)
	reg = prometheus.WrapRegistererWith(prometheus.Labels{"tenant": tenant}, reg)

	walDir := filepath.Join(cfg.Path, tenant)

	level.Info(logger).Log("msg", "creating WAL", "dir", walDir) // This is never called

The "creating WAL" log line is never emitted - we cannot see it in the logs.

Execution can return before the New method is called, at two places in modules/generator/generator.go:

func (g *Generator) getOrCreateInstance(instanceID string) (*instance, error) {

here:

	inst, ok := g.getInstanceByID(instanceID)
	if ok {
		return inst, nil
	} // <<< HERE

	g.instancesMtx.Lock()
	defer g.instancesMtx.Unlock()

	inst, ok = g.instances[instanceID] // <<< OR HERE
	if ok {
		return inst, nil
	}

        // This is never reached apparently?
	inst, err := g.createInstance(instanceID)
	if err != nil {
		return nil, err
	}
	g.instances[instanceID] = inst
	return inst, nil
}

Why it's not being called, I am not able to tell :( Any help is appreciated here. Thank you.

@joe-elliott (Member)

I would definitely not recommend upgrading from 2.1.x to 2.5.x. We aggressively rotate through block formats, and it's quite possible that blocks created by 2.1 are not supported by 2.5. Also, if you're doing fresh installs, can you just install 2.5?

But I'm not sure if that is causing your issue or not. You're saying that on a fresh install of 2.1.1 it worked for a while and then quit? Those graphs suggest the distributors just stopped sending data to the generators. Can you check:

  • if the generators are accessible over the network from the distributors
  • ring health: check /metrics-generator/ring on the distributors

Why it's not being called - I am not able to tell :( Any help needed here. Thank you.

I believe it's not being called because if you don't send data to the generator, it will never create an instance; it creates a per-tenant instance on demand. The missing log line suggests it's just not receiving any data, which is consistent with your other graphs.


kirankumar-grootan commented Jul 24, 2024

In my Tempo distributor metrics I found these:


# HELP tempo_distributor_ingester_clients The current number of ingester clients.
# TYPE tempo_distributor_ingester_clients gauge
tempo_distributor_ingester_clients 1

# HELP tempo_distributor_metrics_generator_clients The current number of metrics-generator clients.
# TYPE tempo_distributor_metrics_generator_clients gauge
tempo_distributor_metrics_generator_clients 0

It seems the metrics-generator client is not configured.


d-mankowski-synerise commented Jul 29, 2024

@nc-gcz have you tried enabling metrics generator by setting values in global_overrides?

global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed#activate-metrics-generator


gdidok commented Aug 15, 2024

I faced the same issue.
@d-mankowski-synerise, your suggestion helped me.


Cenness commented Sep 12, 2024

I have the same issue with a mostly fresh installation of tempo-distributed 2.6.0 - Tempo had been running for a day or two without the generator.
Processors are enabled, but the generator pod still has an empty WAL dir and doesn't push metrics.
The metrics from the troubleshooting guide are in the same state as posted by nc-gcz: no metrics with the tempo_metrics_generator prefix at all, and zeros for the others.

Here are the relevant ConfigMap parts:

metrics_generator:
  metrics_ingestion_time_range_slack: 30s
  processor:
    service_graphs:
      dimensions: []
      histogram_buckets:
      - 0.1
      - 0.2
      - 0.4
      - 0.8
      - 1.6
      - 3.2
      - 6.4
      - 12.8
      max_items: 10000
      wait: 10s
      workers: 10
    span_metrics:
      dimensions: []
      histogram_buckets:
      - 0.002
      - 0.004
      - 0.008
      - 0.016
      - 0.032
      - 0.064
      - 0.128
      - 0.256
      - 0.512
      - 1.02
      - 2.05
      - 4.1
    local_blocks:
      flush_to_storage: true
      filter_server_spans: false
  registry:
    collection_interval: 15s
    external_labels: {}
    stale_duration: 15m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write_add_org_id_header: true
    remote_write_flush_deadline: 1m
    remote_write:
      - url: http://vminsert.monitoring.svc:8480/insert/0/prometheus/api/v1/write
        remote_timeout: 30s
        name: victoria-metrics
        follow_redirects: true
        send_exemplars: true
        queue_config:
          max_samples_per_send: 40000
          max_shards: 30
        metadata_config:
          send: true
          send_interval: 1m
          max_samples_per_send: 40000
  traces_storage:
    path: /var/tempo/traces
...
overrides:
  per_tenant_override_config: /runtime-config/overrides.yaml
  defaults:
    ingestion:
      max_traces_per_user: 0
    metrics_generator:
      processors:
      - span-metrics
      - service-graphs
      - local-blocks

The per-tenant override file is empty. I've tried putting the values from overrides into the per-tenant config, but that changed nothing.

Removed debug logs from comment for less visual spam.

Edit: Reinstalled in a fully fresh namespace and bucket, and the generator works there.
Edit 2: Fixed by a full pod restart. It turns out you need to reload every other component with the new config; it doesn't matter that the generator has it - everything else must pick up the new config too.

@disfluxly

I just ran into this issue after upgrading to 2.6.0 (but I believe it's been broken for me since 2.5.0), and this is how I resolved it:

  1. We are still using vParquet3, even though vParquet4 is now the default. vParquet4 is set as the default in multiple spots in the configuration, including metrics_generator.traces_storage.version. This does not appear to be a documented configuration option, but it is visible when hitting the /status/config endpoint. I changed it to vParquet3 to match my Ingester storage version. (Note: if you're using local blocks, the Parquet version is also set under metrics_generator.processor.local_blocks.block.version.)
  2. We had a per_tenant_override_config set. The docs around overrides, in my opinion, are quite confusing and could use a rewrite, and the terminology is somewhat contradictory to how overrides are handled on other platforms. The overrides set in per_tenant_override_config do not merge with the defaults (which is how other platforms would handle it). Thus, if you enabled the metrics_generator processors under defaults but set a per-tenant override for other settings without re-including the processors, the processors would be disabled (see the sketch after this list).
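To make point 2 concrete, here is a minimal, hypothetical sketch of a per-tenant overrides file (the file referenced by per_tenant_override_config), assuming the legacy overrides format and the single-tenant tenant ID that appears in the logs above. Because a per-tenant block replaces the defaults rather than merging with them, the processors list has to be repeated next to whatever other per-tenant settings motivated the override:

overrides:
  "single-tenant":
    # Re-declare the processors here; they are NOT inherited from the defaults.
    metrics_generator_processors:
      - service-graphs
      - span-metrics
    # ...the other per-tenant settings that motivated the override go here...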

@MrMegaMango

@nc-gcz have you tried enabling metrics generator by setting values in global_overrides?

global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed#activate-metrics-generator

I tried this global_overrides but still got no metric data from {} | rate().

@MrMegaMango

It turns out you need to explicitly set this (see the sketch below for where it nests):

            traces_storage:
              path: /var/tempo/traces
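For context, here is a minimal sketch of where that setting nests in the rendered Tempo config, mirroring the other configs pasted in this thread (the paths are examples, not prescriptions):

metrics_generator:
  storage:
    path: /var/tempo/generator/wal   # WAL used for remote-writing generated metrics
  traces_storage:
    path: /var/tempo/traces          # local trace storage used by the local-blocks processor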

@xogoodnow (Contributor)

To enable the metrics generator, the chart says to pass the names of the processors to the overrides.metrics_generator_processors value, like [service-graphs, span-metrics], but this does not work.

The following block must be added instead:

global_overrides:
  per_tenant_override_config: /runtime-config/overrides.yaml

overrides:
  legacyOverrides:
    metrics_generator_processors:
      - service-graphs
      - span-metrics

This does not work (even though it is what the chart presents):

overrides:
    metrics_generator_processors:
      - service-graphs
      - span-metrics

The following works in the single-binary version:

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]

I opened this PR to see if I missed something or whether this is correct :)
grafana/helm-charts#3463

@xogoodnow (Contributor)

Also, the single-binary style of configuration works here as well:

global_overrides:
  per_tenant_override_config: /runtime-config/overrides.yaml
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]


edward738 commented Dec 5, 2024

In my case with tempo-distributed (Helm chart 2.6.0), after two days of debugging I found a solution. I put this block in values.yaml and the metrics-generator started sending metrics:

global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics
  per_tenant_override_config: /runtime-config/overrides.yaml


overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

@knylander-grafana (Contributor)

Hi there! I noticed that we had a bunch of issues open for the overrides settings for the tempo-distributed Helm chart. We have two updates that may address the issue that you're having:

This doc may help if you're using the Helm chart.


rdattilo commented Jan 23, 2025

edward738 wrote:

In my case with tempo-distributed (Helm chart 2.6.0), after two days of debugging I found a solution. I put this block in values.yaml and the metrics-generator started sending metrics:

global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics
  per_tenant_override_config: /runtime-config/overrides.yaml


overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Never in a million years would I have thought about using global_overrides in tandem with overrides. This did it for me. My metrics magically came back to life.
