prometheus.operator.podmonitors does not discover all targets consistently #5839

Closed
raphaelfan opened this issue Nov 22, 2023 · 1 comment · Fixed by #5862
Labels: bug, frozen-due-to-age

Comments

@raphaelfan

What's wrong?

We are encountering an issue with Grafana Agent v0.37.2 in flow mode using prometheus.operator.podmonitors. Clustering is enabled for both the agent and the component.

One of the PodMonitor CRs discovers a varying number of pods over time. We run this CR in multiple clusters. It discovers all targets, then after roughly 12 hours it discovers only some of them for about 12 hours, after which it discovers all targets again, and the pattern repeats. We are sure the number of pods stayed the same during that period.

Note that this only happens with one CR; all the other scrape jobs are fine.

One interesting detail is that we have another CR whose name is a prefix of the problematic one's: the working one is called “demo/v1-otelcollector” and the non-working one is called “demo/v1-otelcollector-tempo”. I wonder if this naming arrangement is hitting a bug.

We did not see the same issue with the static mode operator.
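
For context, a minimal flow-mode configuration of the shape described above would look roughly like the sketch below; the component labels and the remote_write URL are placeholders, not the configuration actually in use.

prometheus.operator.podmonitors "pods" {
  // Send scraped series to the remote_write component below.
  forward_to = [prometheus.remote_write.default.receiver]

  // Distribute target discovery and scraping across the clustered agent instances.
  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    // Placeholder endpoint.
    url = "https://prometheus.example.com/api/v1/push"
  }
}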

Steps to reproduce

First, create two Deployments; these will be monitored by the PodMonitors below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: v1-otelcollector
  namespace: unified-relay
  labels:
    purpose: debug
spec:
  replicas: 10
  selector:
    matchLabels:
      app: v1-otelcollector
  template:
    metadata:
      labels:
        app: v1-otelcollector
    spec:
      containers:
      - name: v1-otelcollector
        image: otel/opentelemetry-collector-contrib:0.53.0
        ports:
          - containerPort: 4317
            name: main
          - containerPort: 8888
            name: prom-metrics
        resources:
          requests:
            cpu: 10m
            memory: 100M
          limits:
            cpu: 100m
            memory: 200M
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: v1-otelcollector-tempo
  namespace: unified-relay
  labels:
    purpose: debug
spec:
  replicas: 15
  selector:
    matchLabels:
      app: v1-otelcollector-tempo
  template:
    metadata:
      labels:
        app: v1-otelcollector-tempo
    spec:
      containers:
      - name: v1-otelcollector-tempo
        image: otel/opentelemetry-collector-contrib:0.53.0
        ports:
          - containerPort: 4317
            name: main
          - containerPort: 8888
            name: prom-metrics
        resources:
          requests:
            cpu: 10m
            memory: 100M
          limits:
            cpu: 100m
            memory: 200M

Then create the PodMonitors in the same namespace.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    purpose: debug
  name: v1-otelcollector
  namespace: unified-relay
spec:
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
      - unified-relay
  podMetricsEndpoints:
    - path: /metrics
      port: prom-metrics
  selector:
    matchLabels:
      app: v1-otelcollector
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    purpose: debug
  name: v1-otelcollector-tempo
  namespace: unified-relay
spec:
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
      - unified-relay
  podMetricsEndpoints:
    - path: /metrics
      port: prom-metrics
  selector:
    matchLabels:
      app: v1-otelcollector-tempo
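
Apply both manifests (for example with kubectl apply -f) and watch how many targets each PodMonitor's scrape job reports over time. With 10 and 15 replicas respectively, both counts should stay constant, but the v1-otelcollector-tempo job periodically drops to a subset of its pods.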

System information

No response

Software version

Grafana Agent v0.37.2

Configuration

No response

Logs

No response

@captncraig
Contributor

Thank you. This is a duplicate of #5679, which has a fix in #5862.
