prometheus.operator.podmonitors does not discover all targets consistently #5839

Closed
raphaelfan opened this issue Nov 22, 2023 · 1 comment · Fixed by #5862
Labels: bug, frozen-due-to-age

Comments

@raphaelfan

What's wrong?

We are encountering an issue with Grafana Agent v0.37.2 in flow mode using prometheus.operator.podmonitors. Clustering is enabled for both the agent and the component.

One of the PodMonitor CRs discovers a varying number of pods over time. We run this CR in multiple clusters. It discovers all targets, then after roughly 12 hours it discovers only some of them for about 12 hours, after which it discovers all targets again, and the pattern repeats. We are sure the number of pods stayed the same during that period.

Note that this only happens with one CR; all the other scrape jobs are fine.

One interesting detail is that we have another CR whose name is a prefix of the problematic one's: the working one is called “demo/v1-otelcollector” and the non-working one is called “demo/v1-otelcollector-tempo”. I wonder if this naming arrangement is hitting a bug.

We did not see the same issue with the static mode operator.
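
For context, a minimal flow-mode configuration of the shape described above would look roughly like the sketch below; the component labels and the remote_write URL are placeholders, not the configuration actually in use.

prometheus.operator.podmonitors "pods" {
  // Send scraped series to the remote_write component below.
  forward_to = [prometheus.remote_write.default.receiver]

  // Distribute target discovery and scraping across the clustered agent instances.
  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    // Placeholder endpoint.
    url = "https://prometheus.example.com/api/v1/push"
  }
}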

Steps to reproduce

First, create two Deployments; these will be monitored by the PodMonitors below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: v1-otelcollector
  namespace: unified-relay
  labels:
    purpose: debug
spec:
  replicas: 10
  selector:
    matchLabels:
      app: v1-otelcollector
  template:
    metadata:
      labels:
        app: v1-otelcollector
    spec:
      containers:
      - name: v1-otelcollector
        image: otel/opentelemetry-collector-contrib:0.53.0
        ports:
          - containerPort: 4317
            name: main
          - containerPort: 8888
            name: prom-metrics
        resources:
          requests:
            cpu: 10m
            memory: 100M
          limits:
            cpu: 100m
            memory: 200M
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: v1-otelcollector-tempo
  namespace: unified-relay
  labels:
    purpose: debug
spec:
  replicas: 15
  selector:
    matchLabels:
      app: v1-otelcollector-tempo
  template:
    metadata:
      labels:
        app: v1-otelcollector-tempo
    spec:
      containers:
      - name: v1-otelcollector-tempo
        image: otel/opentelemetry-collector-contrib:0.53.0
        ports:
          - containerPort: 4317
            name: main
          - containerPort: 8888
            name: prom-metrics
        resources:
          requests:
            cpu: 10m
            memory: 100M
          limits:
            cpu: 100m
            memory: 200M

Then create the PodMonitors in the same namespace.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    purpose: debug
  name: v1-otelcollector
  namespace: unified-relay
spec:
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
      - unified-relay
  podMetricsEndpoints:
    - path: /metrics
      port: prom-metrics
  selector:
    matchLabels:
      app: v1-otelcollector
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    purpose: debug
  name: v1-otelcollector-tempo
  namespace: unified-relay
spec:
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
      - unified-relay
  podMetricsEndpoints:
    - path: /metrics
      port: prom-metrics
  selector:
    matchLabels:
      app: v1-otelcollector-tempo
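
Apply both manifests (for example with kubectl apply -f) and watch how many targets each PodMonitor's scrape job reports over time. With 10 and 15 replicas respectively, both counts should stay constant, but the v1-otelcollector-tempo job periodically drops to a subset of its pods.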

System information

No response

Software version

Grafana Agent v0.37.2

Configuration

No response

Logs

No response

@captncraig
Contributor

Thank you. This is a duplicate of #5679, which has a fix in #5862.
