[NVIDIA GPU] Introduce Monitoring Integration #11931

strawgate · 2024-11-30T03:54:59Z

Proposed commit message

Introduce NVIDIA GPU Monitoring Integration

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.
I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

How to test this PR locally

Deploy the NVIDIA DCGM exporter on a host with an NVIDIA GPU to get a Prometheus metrics endpoint that you can provide to the integration.

If you have Docker, this only requires:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
curl localhost:9400/metrics
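
To sanity-check the endpoint before pointing the integration at it, the text returned by the curl command can be parsed into individual samples. The following is a minimal sketch, not part of this PR: `parse_metrics`, `fetch_metrics`, and the `SAMPLE` payload are hypothetical names and illustrative data, and the parser assumes sample lines carry no trailing timestamp.

```python
import re
import urllib.request

# Matches label pairs such as gpu="0" inside the {...} part of a sample line.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples.

    Assumption: sample lines have no trailing timestamp, so the value is
    everything after the last space.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name, brace, rest = series.partition("{")
        labels = dict(LABEL_RE.findall(rest)) if brace else {}
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:9400/metrics"):
    """Fetch and parse a live endpoint (requires the exporter to be running)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative payload only; real exporter output has more labels and metrics.
SAMPLE = """# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="Example GPU"} 45
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="Example GPU"} 12
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

Calling `fetch_metrics()` against the running container and checking that the list is non-empty is a quick way to confirm the exporter is serving GPU metrics.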

Configure the integration to point at the metrics endpoint on the host running the container, for example http://nvidiahost:9400/metrics

Some metrics are not enabled by default in the container; enabling all metrics requires some extra steps.

Related issues

Fixes #11930

Screenshots

WIP:

elasticmachine · 2024-12-01T04:54:44Z

💔 Build Failed

Buildkite Build
Commit: df1faa8

Failed CI Steps

✅ Check go sources

History

💔 Build #18922 failed 63667b9
💔 Build #18917 failed db78b03
💔 Build #18916 failed 179dd58
💔 Build #18903 failed d050f49

Initial commit for NVIDIA GPU Integration

23a66ce

strawgate added the enhancement and New Integration labels on Nov 30, 2024

strawgate added 5 commits November 29, 2024 21:56

Remove extra setup file

d050f49

Add GPU-Specific dashboard and update field descriptions.

179dd58

Add't updates to GPU-Specific Dashboard

db78b03

Add GPU Overview dashboard

63667b9

Small updates to Overview dashboard

df1faa8
