Prometheus-operator stackdriver sidecar sharding events #233

dgdevops · 2020-04-30T09:31:28Z

I am using ServiceMonitor k8s resources to add targets to Prometheus.
The sidecar keeps delivering metrics to Stackdriver until I add a ServiceMonitor to my k8s cluster that adds 220 targets to my Prometheus. Once those targets come up, ALL metrics in Stackdriver stop at the same time and no new metric values appear. According to the sidecar container logs, shard calculation is taking place:

level=debug ts=2020-04-30T08:51:20.975Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=9.107276804519778e-05 upperBound=1.1
level=debug ts=2020-04-30T08:51:35.975Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=0.028438730446968884 samplesOut=0.035548413058711106 samplesOutDuration=27897.643824423412 timePerSample=784778.8810810816 sizeRate=70059.18401954918 offsetRate=260863.64812517414 desiredShards=7.020667105478262e-05

This keeps going for hours and hours but the metrics do not return to Stackdriver.
Could you please help in understanding the sharding?
Additionally, how could I speed up the process?

Thanks


jmacd · 2021-02-01T16:42:23Z

I strongly suspect this is caused by particular data points that trigger an unrecoverable error which looks recoverable. It takes some kind of never-succeeding request to explain, but when that happens the sidecar logic absolutely can fall into a permanent retry loop and block the WAL reader. This is documented in the downstream repository:

lightstep/opentelemetry-prometheus-sidecar#88

It is also partly mitigated there:

https://github.com/lightstep/opentelemetry-prometheus-sidecar/pulls/87

jmacd · 2021-02-01T16:43:01Z

This is the function that never returns:

// sendSamples to the remote storage with backoff for recoverable errors.
func (s *shardCollection) sendSamplesWithBackoff(client StorageClient, samples []*monitoring_pb.TimeSeries) {
	backoff := s.qm.cfg.MinBackoff
	for {
		begin := time.Now()
		err := client.Store(&monitoring_pb.CreateTimeSeriesRequest{TimeSeries: samples})

		sentBatchDuration.WithLabelValues(s.qm.queueName).Observe(time.Since(begin).Seconds())
		if err == nil {
			succeededSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
			return
		}

		if _, ok := err.(recoverableError); !ok {
			level.Warn(s.qm.logger).Log("msg", "Unrecoverable error sending samples to remote storage", "err", err)
			break
		}
		time.Sleep(time.Duration(backoff))
		backoff = backoff * 2
		if backoff > s.qm.cfg.MaxBackoff {
			backoff = s.qm.cfg.MaxBackoff
		}
	}

	failedSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
}

varun-krishna · 2021-02-09T07:28:13Z

I see the same behaviour, with the same messages from the Stackdriver sidecar:

level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=0.00173154100250915 samplesOut=0.00173154100250915 samplesOutDuration=5557.854004867412 timePerSample=3.2097732579324483e+06 sizeRate=4890.771316463715 offsetRate=2.134860677902194 desiredShards=0.019098805764792753

level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.019098805764792753 upperBound=1.1
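
As a rough aid for reading these lines, the sketch below pulls the numeric fields out of a `calculateDesiredShards` log line and checks one relation that holds in the logs quoted in this thread: `timePerSample` equals `samplesOutDuration / samplesOut`. The helper name `parseFields` is hypothetical, and the sketch makes no claim about how the sidecar derives `desiredShards` itself. With `samplesIn` and `samplesOut` this close to zero, the logs suggest that almost nothing is flowing through the queue, which would fit the blocked-sender explanation above, though that reading is an inference rather than something the logs prove.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFields extracts the numeric key=value pairs from a logfmt-style line.
// Fields whose value is not a number (level, ts, caller, msg, ...) are skipped.
func parseFields(line string) map[string]float64 {
	fields := make(map[string]float64)
	for _, tok := range strings.Fields(line) {
		key, val, ok := strings.Cut(tok, "=")
		if !ok {
			continue
		}
		if f, err := strconv.ParseFloat(val, 64); err == nil {
			fields[key] = f
		}
	}
	return fields
}

func main() {
	// Log line copied from the comment above.
	line := "level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:306 " +
		"component=queue_manager msg=QueueManager.calculateDesiredShards " +
		"samplesIn=0.00173154100250915 samplesOut=0.00173154100250915 " +
		"samplesOutDuration=5557.854004867412 timePerSample=3.2097732579324483e+06 " +
		"sizeRate=4890.771316463715 offsetRate=2.134860677902194 " +
		"desiredShards=0.019098805764792753"

	f := parseFields(line)
	derived := f["samplesOutDuration"] / f["samplesOut"]
	fmt.Printf("logged timePerSample:  %.6g\n", f["timePerSample"])
	fmt.Printf("derived timePerSample: %.6g\n", derived)
	fmt.Printf("samplesIn=%.6g samplesOut=%.6g desiredShards=%.6g\n",
		f["samplesIn"], f["samplesOut"], f["desiredShards"])
}
```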

jmacd mentioned this issue Feb 1, 2021

Indefinitely-blocked sidecar problems #270

Closed
