[Feat] Prometheus metric export #134

Open
wants to merge 57 commits into base: master
Changes from 46 commits
Commits
9bf5ce9
Add metric.py, prometheus configs, and modify pyproject.toml
sharonsyh Oct 18, 2024
dd881e8
Reformat metric.py with black
sharonsyh Oct 18, 2024
d21cfd1
Add metric monitoring section to documentation
sharonsyh Nov 9, 2024
4681796
Add unit tests for EnergyHistogram, EnergyCumulativeCounter, and Powe…
sharonsyh Nov 9, 2024
e8bfe7b
Add train_single.py for testing energy monitoring metrics
sharonsyh Nov 11, 2024
1b9e541
Update docs/measure/index.md
sharonsyh Nov 17, 2024
26e925e
Refactor metric initialization and multiprocessing logic in metric.py
sharonsyh Nov 29, 2024
3569b68
Update prometheus.yml
sharonsyh Nov 29, 2024
29e615b
Improve example training script to include Zeus metrics
sharonsyh Nov 29, 2024
2ae388f
Remove unintended file tests/test_metric.py from repository
sharonsyh Nov 29, 2024
6a9daa5
Update the doc on Metrics Monitoring and Assumptions
sharonsyh Nov 29, 2024
69c42da
Update index.md
sharonsyh Nov 29, 2024
4704a67
Update index.md
sharonsyh Nov 29, 2024
5666ba5
Add README for example training file with Zeus energy metrics integra…
sharonsyh Nov 29, 2024
863f257
Add Metric Name Construction section on index.md
sharonsyh Nov 29, 2024
35ab267
Update index.md
sharonsyh Nov 29, 2024
4aa0f39
Update README.md to include the dependency on prometheus_client
sharonsyh Nov 29, 2024
1e996a4
Update unit tests for the modified metric.py
sharonsyh Nov 30, 2024
8e1d35b
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
7e47fbb
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
52f2fa7
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
30b807e
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
c53c72e
Resolve unbound variable errors
sharonsyh Nov 30, 2024
ffa46d1
Resolve unbound variable errors
sharonsyh Nov 30, 2024
5f67d5c
Specify type for the args
sharonsyh Nov 30, 2024
b96b32e
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
9aa1888
Merge master
jaywonchung Nov 30, 2024
b62ed5d
Fix energy histogram to properly handle default bucket ranges
sharonsyh Nov 30, 2024
0dbb236
Add the mock_push_to_gateway Patch to each test
sharonsyh Nov 30, 2024
753f1de
Update gpu_bucket_range, cpu_bucket_range, and dram_bucket_range in t…
sharonsyh Nov 30, 2024
77ff075
Patch to mock urllib.request.urlopen preventing attempts to an actual…
sharonsyh Dec 1, 2024
793a186
Patch to mock urllib.request.urlopen preventing attempts to an actual…
sharonsyh Dec 1, 2024
30b7e1c
Patch to mock prometheus_client.exposition.push_to_gateway external c…
sharonsyh Dec 1, 2024
3ab4d89
Patch to http.client.HTTPConnection
sharonsyh Dec 1, 2024
f032488
Remove unneccessary mock
sharonsyh Dec 1, 2024
73d8c8c
Add zeus.metric to the list
sharonsyh Dec 6, 2024
c153ef2
Update reference link for each class
sharonsyh Dec 6, 2024
0aa03d4
Move line for prometheus-client
sharonsyh Dec 6, 2024
0e0e63c
feat: Add multiprocessing dict and sync execution for begin/end window
sharonsyh Dec 6, 2024
9228b74
Add error handling for queue
sharonsyh Dec 6, 2024
866365e
Add a call to train() in main
sharonsyh Dec 6, 2024
53f6c62
Refactor tests for the modified code
sharonsyh Dec 7, 2024
c014ff1
Reformat the metric monitoring section for consistency
sharonsyh Dec 9, 2024
f7e5d79
Setup Guide -> Local Setup Guide
sharonsyh Dec 9, 2024
f8d5b67
Add condition for using put() with empty queue
sharonsyh Dec 9, 2024
8c5456e
Import the SpawnProcess class from multiprocessing.context
sharonsyh Dec 9, 2024
4c2e794
Update docs/measure/index.md
sharonsyh Dec 10, 2024
d8a6f1c
Update docs/measure/index.md
sharonsyh Dec 10, 2024
d85f255
Update docs/measure/index.md
sharonsyh Dec 10, 2024
5f9cc6b
Update docs/measure/index.md
sharonsyh Dec 10, 2024
0f8d550
Update docs/measure/index.md
sharonsyh Dec 10, 2024
49acc9a
Update docs/measure/index.md
sharonsyh Dec 10, 2024
f18ecb9
Update docs/measure/index.md
sharonsyh Dec 10, 2024
2276ac2
Remove power_limit_optimizer and bring back the original code for ima…
sharonsyh Dec 10, 2024
dea8b5d
Add sync execution function to handle synchronization during monitori…
sharonsyh Jan 9, 2025
9411495
Changed variable name prometheus_url -> pushgateway_url
sharonsyh Jan 9, 2025
ef48d88
Adjust unit test functions to reflect changes in the sync_execution l…
sharonsyh Jan 9, 2025
25 changes: 25 additions & 0 deletions docker/prometheus/docker-compose.yml
@@ -0,0 +1,25 @@
version: '3.7'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - "./prometheus.yml:/etc/prometheus/prometheus.yml"
    networks:
      - localprom
    ports:
      - 9090:9090
  node-exporter:
    image: prom/node-exporter
    networks:
      - localprom
    ports:
      - 9100:9100
  pushgateway:
    image: prom/pushgateway
    networks:
      - localprom
    ports:
      - 9091:9091
networks:
  localprom:
    driver: bridge
14 changes: 14 additions & 0 deletions docker/prometheus/prometheus.yml
@@ -0,0 +1,14 @@
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

230 changes: 229 additions & 1 deletion docs/measure/index.md
@@ -88,7 +88,7 @@ To only measure the energy consumption of the CPU used by the current Python pro

You can pass in `cpu_indices=[]` or `gpu_indices=[]` to [`ZeusMonitor`][zeus.monitor.ZeusMonitor] to disable either CPU or GPU measurements.

```python hl_lines="2 5-7"
```python hl_lines="2 5-15"
from zeus.monitor import ZeusMonitor
from zeus.device.cpu import get_current_cpu_index

@@ -114,6 +114,214 @@ if __name__ == "__main__":
avg_energy = sum(map(lambda m: m.total_energy, steps)) / len(steps)
print(f"One step takes {avg_time} s and {avg_energy} J for the CPU.")
```
## Metric Monitoring

Zeus can export energy and power measurements for GPUs, CPUs, and DRAM to Prometheus. It exposes per-window energy consumption, cumulative energy consumption, and real-time power draw. Users define measurement windows to attribute energy to specific operations, which enables fine-grained analysis and optimization.

!!! Assumption
A Prometheus Push Gateway must be deployed and accessible at the URL provided in your Zeus configuration. **This ensures that metrics collected by Zeus can be pushed to Prometheus.**

### Local Setup Guide

#### Step 1: Install and Start the Prometheus Push Gateway
Choose either Option 1 (Binary) or Option 2 (Docker).

##### Option 1: Download Binary
1. Visit the [Prometheus Push Gateway Download Page](https://prometheus.io/download/#pushgateway).
2. Download the appropriate binary for your operating system.
3. Extract the binary:
```sh
tar -xvzf prometheus-pushgateway*.tar.gz
cd prometheus-pushgateway-*
```
4. Start the Push Gateway:
```sh
./prometheus-pushgateway --web.listen-address=:9091
```
5. Verify the Push Gateway is running by visiting http://localhost:9091 in your browser.

##### Option 2: Using Docker
1. Pull the official Prometheus Push Gateway Docker image:
```sh
docker pull prom/pushgateway
```
2. Run the Push Gateway in a container:
```sh
docker run -d -p 9091:9091 prom/pushgateway
```
3. Verify it is running by visiting http://localhost:9091 in your browser.
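The browser check above can also be scripted. The following is a minimal sketch, assuming the Push Gateway serves its standard `/-/healthy` endpoint at the address used in this guide; the helper names `health_url` and `pushgateway_is_healthy` are made up for illustration and are not part of Zeus.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Return the Push Gateway health endpoint for a base URL."""
    return base_url.rstrip("/") + "/-/healthy"


def pushgateway_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False


if __name__ == "__main__":
    # Address assumed from the setup steps above.
    print(pushgateway_is_healthy("http://localhost:9091"))
```

A check like this could run once before training starts, so that a missing Push Gateway is noticed early rather than when the first metric is pushed.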

#### Step 2: Install and Configure Prometheus
1. Visit the [Prometheus Download Page](https://prometheus.io/download/#prometheus).
2. Download the appropriate binary for your operating system.
3. Extract the binary:
```sh
tar -xvzf prometheus*.tar.gz
cd prometheus-*
```
4. Update the Prometheus configuration file `prometheus.yml` to scrape metrics from the Push Gateway:
```yaml
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']  # Replace with your Push Gateway host:port
```
5. Start Prometheus:
```sh
./prometheus --config.file=prometheus.yml
```
6. Verify Prometheus is running by visiting http://localhost:9090 in your browser, or by running `curl http://localhost:9090/api/v1/targets`.
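The response of Prometheus's `/api/v1/targets` endpoint can also be inspected from Python to confirm that the `pushgateway` job is being scraped. This is a rough sketch, not part of Zeus: the function names are hypothetical, and the sample payload is hand-written in the shape the API returns rather than captured from a real server.

```python
import json
import urllib.request


def target_health(targets_payload: dict, job: str = "pushgateway") -> list[str]:
    """Return the health string of every active target whose job label matches."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        target.get("health", "unknown")
        for target in active
        if target.get("labels", {}).get("job") == job
    ]


def fetch_targets(prometheus_url: str = "http://localhost:9090") -> dict:
    """Download the targets listing from a running Prometheus server."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written example payload (an assumption for illustration, not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "pushgateway", "instance": "localhost:9091"}, "health": "up"},
            {"labels": {"job": "prometheus", "instance": "localhost:9090"}, "health": "up"},
        ]
    },
}

print(target_health(sample))
```

Against a live server, `target_health(fetch_targets())` would be the equivalent call.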

### Metric Name Construction

Zeus organizes metrics using **static metric names** and **dynamic labels** for flexibility and ease of querying in Prometheus. Metric names are static and cannot be overridden, but users can customize the context of the metrics by naming the window when using `begin_window()` and `end_window()`.

#### Metric Name
- For Histogram: `energy_monitor_{component}_energy_joules`
- For Counter: `energy_monitor_{component}_energy_joules`
- For Gauge: `power_monitor_gpu_power_watts`

`{component}` is one of `gpu`, `cpu`, or `dram`.

#### Labels
- window: The user-defined window name provided during `begin_window()` and `end_window()` (e.g., `energy_histogram.begin_window("epoch_energy")`).
- index: The GPU index (e.g., `0` for GPU 0).
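To make the naming scheme concrete, here is a small sketch that assembles a metric name and a PromQL selector from the pieces listed above. The helpers `energy_metric_name` and `selector` are hypothetical and exist only for this illustration; Zeus builds its metric names internally.

```python
COMPONENTS = ("gpu", "cpu", "dram")


def energy_metric_name(component: str) -> str:
    """Fill the {component} slot of the static energy metric name."""
    if component not in COMPONENTS:
        raise ValueError(f"component must be one of {COMPONENTS}, got {component!r}")
    return f"energy_monitor_{component}_energy_joules"


def selector(metric: str, **labels: str) -> str:
    """Build a PromQL selector such as name{window="w",index="0"}."""
    if not labels:
        return metric
    matchers = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{matchers}}}"


print(selector(energy_metric_name("gpu"), window="epoch_energy", index="0"))
```

Because the name is static, the window name and device index are the only parts a user varies, and both appear as label matchers.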

### Usage and Initialization
[`EnergyHistogram`][zeus.metric.EnergyHistogram] records energy consumption data for GPUs, CPUs, and DRAM in Prometheus Histograms. This is ideal for observing how often energy usage falls within specific ranges.

```python hl_lines="2 5-15"
from zeus.metric import EnergyHistogram

if __name__ == "__main__":
    # Initialize EnergyHistogram
    energy_histogram = EnergyHistogram(
        cpu_indices=[0],
        gpu_indices=[0],
        prometheus_url='http://localhost:9091',
        job='training_energy_histogram'
    )

    for epoch in range(100):
        # Start monitoring energy for the entire epoch
        energy_histogram.begin_window("epoch_energy")
        # Perform epoch-level operations
        train_one_epoch(train_loader, model, optimizer, criterion, epoch, args)
        acc1 = validate(val_loader, model, criterion, args)
        # End monitoring energy for the epoch
        energy_histogram.end_window("epoch_energy")
        print(f"Epoch {epoch} completed. Validation Accuracy: {acc1}%")

```
You can use the `begin_window` and `end_window` methods to define a measurement window, similar to other ZeusMonitor operations. Energy consumption data will be recorded for the entire duration of the window.
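One way to make sure every `begin_window` is paired with an `end_window`, even when the code inside the window raises, is to wrap the pair in a context manager. This is a sketch only: `measurement_window` is a hypothetical helper and `RecordingMetric` is a stand-in used to keep the example runnable without Zeus or a Push Gateway; neither is provided by this PR.

```python
from contextlib import contextmanager


@contextmanager
def measurement_window(metric, name: str):
    """Hypothetical helper: begin the window on entry, end it on exit."""
    metric.begin_window(name)
    try:
        yield metric
    finally:
        # Runs on normal exit and on exceptions alike.
        metric.end_window(name)


class RecordingMetric:
    """Stand-in exposing the same two method names as the Zeus metric classes."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def begin_window(self, name: str) -> None:
        self.calls.append(("begin", name))

    def end_window(self, name: str) -> None:
        self.calls.append(("end", name))


metric = RecordingMetric()
try:
    with measurement_window(metric, "epoch_energy"):
        raise RuntimeError("simulated failure inside the epoch")
except RuntimeError:
    pass

print(metric.calls)
```

With a real `EnergyHistogram` in place of the stand-in, the `with` block would replace the explicit `begin_window` and `end_window` calls in the loop body.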

!!! Tip
You can customize the bucket ranges for GPUs, CPUs, and DRAM during initialization to tailor the granularity of energy monitoring. For example:
```python hl_lines="2 5-15"
energy_histogram = EnergyHistogram(
    cpu_indices=[0],
    gpu_indices=[0],
    prometheus_url='http://localhost:9091',
    job='training_energy_histogram',
    gpu_bucket_range=[10.0, 25.0, 50.0, 100.0],
    cpu_bucket_range=[5.0, 15.0, 30.0, 50.0],
    dram_bucket_range=[2.0, 8.0, 20.0, 40.0],
)
```

If no custom bucket ranges are specified, Zeus uses these default ranges (in joules):
```
- GPU: [50.0, 100.0, 200.0, 500.0, 1000.0]
- CPU: [10.0, 20.0, 50.0, 100.0, 200.0]
- DRAM: [5.0, 10.0, 20.0, 50.0, 150.0]
```
!!! Warning
Empty bucket ranges (e.g., `[]`) are not allowed and will raise an error. Ensure you provide a valid range for each device or use the defaults.
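To see what a bucket range means in practice, the sketch below mimics how a Prometheus histogram counts observations: each bucket is cumulative and counts every observation less than or equal to its upper bound, with an implicit `+Inf` bucket at the end. The function name and the choice of `ValueError` for an empty range are assumptions for illustration; this is not the Zeus implementation.

```python
import math


def cumulative_bucket_counts(observations, upper_bounds):
    """Count observations per cumulative bucket, Prometheus-style."""
    if not upper_bounds:
        # Mirrors the warning above; the exception type is an assumption.
        raise ValueError("bucket range must not be empty")
    bounds = sorted(upper_bounds) + [math.inf]
    return {bound: sum(1 for x in observations if x <= bound) for bound in bounds}


# Default GPU bucket range from the list above, with made-up energy readings in joules.
gpu_default = [50.0, 100.0, 200.0, 500.0, 1000.0]
counts = cumulative_bucket_counts([30.0, 75.0, 150.0, 1200.0], gpu_default)
print(counts)
```

In this example the 1200 J reading lands only in the `+Inf` bucket, which suggests that windows routinely exceeding the largest bound call for a wider custom range.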


[`EnergyCumulativeCounter`][zeus.metric.EnergyCumulativeCounter] monitors cumulative energy consumption. It tracks energy usage over time, without resetting the values, and is updated periodically.

```python hl_lines="2 5-15"

import time

from zeus.metric import EnergyCumulativeCounter

if __name__ == "__main__":

    cumulative_counter_metric = EnergyCumulativeCounter(
        cpu_indices=[0],
        gpu_indices=[0],
        update_period=2,
        prometheus_url='http://localhost:9091',
        job='energy_counter_job'
    )
    train_loader = range(10)
    val_loader = range(5)

    cumulative_counter_metric.begin_window("training_energy_monitoring")

    for epoch in range(100):
        print(f"\n--- Epoch {epoch} ---")
        train_one_epoch(train_loader, model, optimizer, criterion, epoch, args)
        acc1 = validate(val_loader, model, criterion, args)
        print(f"Epoch {epoch} completed. Validation Accuracy: {acc1:.2f}%.")

        # Simulate additional operations outside of training
        print("\nSimulating additional operations...")
        time.sleep(10)

    cumulative_counter_metric.end_window("training_energy_monitoring")
```
In this example, `cumulative_counter_metric` monitors energy usage throughout the entire training process rather than on a per-epoch basis. The `update_period` parameter defines how often the energy measurements are updated and pushed to Prometheus.
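The role of `update_period` can be pictured as a background loop that wakes up at a fixed interval, takes a measurement, and reports it. The sketch below illustrates that idea with a thread and a plain callback. It is not how Zeus implements it (the commit titles in this PR mention multiprocessing), and `PeriodicPoller` is a hypothetical name.

```python
import threading
import time


class PeriodicPoller:
    """Call `callback` every `update_period` seconds until stopped."""

    def __init__(self, update_period: float, callback) -> None:
        self._period = update_period
        self._callback = callback
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout (poll again) and True once stop is set.
        while not self._stop.wait(self._period):
            self._callback()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


ticks = []
poller = PeriodicPoller(0.01, lambda: ticks.append(time.monotonic()))
poller.start()   # plays the role of begin_window
time.sleep(0.2)  # stands in for the monitored work
poller.stop()    # plays the role of end_window
print(f"callback ran {len(ticks)} times")
```

A shorter period gives finer-grained data at the cost of more frequent measurements and pushes.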

[`PowerGauge`][zeus.metric.PowerGauge] tracks real-time power consumption using Prometheus Gauges, which are suited to values that fluctuate over time, such as power usage.

```python hl_lines="2 5-15"
import time

from zeus.metric import PowerGauge

if __name__ == "__main__":

    power_gauge_metric = PowerGauge(
        gpu_indices=[0],
        update_period=2,
        prometheus_url='http://localhost:9091',
        job='power_gauge_job'
    )
    train_loader = range(10)
    val_loader = range(5)

    power_gauge_metric.begin_window("training_power_monitoring")

    for epoch in range(100):
        print(f"\n--- Epoch {epoch} ---")
        train_one_epoch(train_loader, model, optimizer, criterion, epoch, args)
        acc1 = validate(val_loader, model, criterion, args)
        print(f"Epoch {epoch} completed. Validation Accuracy: {acc1:.2f}%.")

        # Simulate additional operations outside of training
        print("\nSimulating additional operations...")
        time.sleep(10)

    power_gauge_metric.end_window("training_power_monitoring")
```
The `update_period` parameter defines how often the power data is updated and pushed to Prometheus.


### How to Query Metrics in Prometheus

#### Query to View Energy for a Specific Window
```promql
energy_monitor_gpu_energy_joules{window="epoch_energy"}
```
#### Query to Sum Energy per Window
```promql
sum(energy_monitor_gpu_energy_joules) by (window)
```
#### Query to Sum Energy for Specific GPU Across All Windows
```promql
sum(energy_monitor_gpu_energy_joules{index="0"})
```
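The same queries can be issued from a script through Prometheus's HTTP API (`/api/v1/query`). The following is a minimal sketch under the assumption that Prometheus runs at `http://localhost:9090` as in the setup guide; the helper names are hypothetical and the sample response is hand-written in the API's instant-vector format, not real output.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    base = prometheus_url.rstrip("/")
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def values_by_label(payload: dict, label: str = "window") -> dict[str, float]:
    """Map each returned series' label value to its sample value."""
    results = payload.get("data", {}).get("result", [])
    return {r["metric"].get(label, ""): float(r["value"][1]) for r in results}


def run_query(prometheus_url: str, promql: str) -> dict:
    """Send the query to a running Prometheus server and decode the JSON reply."""
    with urllib.request.urlopen(instant_query_url(prometheus_url, promql), timeout=5) as resp:
        return json.load(resp)


url = instant_query_url(
    "http://localhost:9090", "sum(energy_monitor_gpu_energy_joules) by (window)"
)
print(url)

# Hand-written example response (an assumption for illustration).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"window": "epoch_energy"}, "value": [1700000000.0, "1234.5"]}],
    },
}
print(values_by_label(sample))
```

Sample values arrive as strings in the API response, which is why the sketch converts them with `float`.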

## CLI power and energy monitor

@@ -149,3 +357,23 @@ Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```

## Hardware Support
We currently support both NVIDIA (via NVML) and AMD GPUs (via AMDSMI, with ROCm 6.1 or later).

### `get_gpus`
The [`get_gpus`][zeus.device.get_gpus] function returns a [`GPUs`][zeus.device.gpu.GPUs] object, which can be either an [`NVIDIAGPUs`][zeus.device.gpu.NVIDIAGPUs] or [`AMDGPUs`][zeus.device.gpu.AMDGPUs] object depending on the availability of `nvml` or `amdsmi`. Each [`GPUs`][zeus.device.gpu.GPUs] object contains one or more [`GPU`][zeus.device.gpu.common.GPU] instances, which are specifically [`NVIDIAGPU`][zeus.device.gpu.nvidia.NVIDIAGPU] or [`AMDGPU`][zeus.device.gpu.amd.AMDGPU] objects.

These [`GPU`][zeus.device.gpu.common.GPU] objects directly call respective `nvml` or `amdsmi` methods, providing a one-to-one mapping of methods for seamless GPU abstraction and support for multiple GPU types. For example:
- [`NVIDIAGPU.getName`][zeus.device.gpu.nvidia.NVIDIAGPU.getName] calls `pynvml.nvmlDeviceGetName`.
- [`AMDGPU.getName`][zeus.device.gpu.amd.AMDGPU.getName] calls `amdsmi.amdsmi_get_gpu_asic_info`.

### Notes on AMD GPUs

#### AMD GPUs Initialization
`amdsmi.amdsmi_get_energy_count` sometimes returns invalid values on certain GPUs or ROCm versions (e.g., MI100 on ROCm 6.2). See [ROCm issue #38](https://github.com/ROCm/amdsmi/issues/38) for more details. During the [`AMDGPUs`][zeus.device.gpu.AMDGPUs] object initialization, we call `amdsmi.amdsmi_get_energy_count` twice for each GPU, with a 0.5-second delay between calls. This difference is compared to power measurements to determine if `amdsmi.amdsmi_get_energy_count` is stable and reliable. Initialization takes 0.5 seconds regardless of the number of AMD GPUs.
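The idea behind this check can be sketched as follows: the energy difference between the two reads should roughly equal average power multiplied by the elapsed time. The function name, the 20% tolerance, and the assumption that both readings are already in joules are illustrative choices, not the values or logic Zeus actually uses.

```python
def energy_counter_is_consistent(
    energy_before_j: float,
    energy_after_j: float,
    avg_power_w: float,
    interval_s: float = 0.5,
    rel_tolerance: float = 0.2,  # assumed tolerance, chosen for illustration
) -> bool:
    """Compare the measured energy delta with power multiplied by elapsed time."""
    measured = energy_after_j - energy_before_j
    expected = avg_power_w * interval_s
    if measured < 0 or expected <= 0:
        # A decreasing counter or a non-positive power reading is treated as unreliable.
        return False
    return abs(measured - expected) <= rel_tolerance * expected


# A GPU drawing about 100 W should accumulate about 50 J in 0.5 s.
print(energy_counter_is_consistent(1000.0, 1050.0, avg_power_w=100.0))
# A counter that does not advance fails the check.
print(energy_counter_is_consistent(1000.0, 1000.0, avg_power_w=100.0))
```

Running the two reads for all GPUs around a single shared delay is consistent with the statement that initialization takes 0.5 seconds regardless of GPU count.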

`amdsmi.amdsmi_get_power_info` provides "average_socket_power" and "current_socket_power" fields, but the "current_socket_power" field is sometimes not supported and returns "N/A." During the [`AMDGPUs`][zeus.device.gpu.AMDGPUs] object initialization, this method is checked, and if "N/A" is returned, the [`AMDGPU.getInstantPowerUsage`][zeus.device.gpu.amd.AMDGPU.getInstantPowerUsage] method is disabled. Instead, [`AMDGPU.getAveragePowerUsage`][zeus.device.gpu.amd.AMDGPU.getAveragePowerUsage] needs to be used.

#### Supported AMD SMI Versions
Only ROCm >= 6.1 is supported, as the AMDSMI APIs for power and energy return wrong values in earlier versions. For more information, see [ROCm issue #22](https://github.com/ROCm/amdsmi/issues/22). Ensure your `amdsmi` and ROCm versions are up to date.
2 changes: 1 addition & 1 deletion examples/pipeline_frequency_optimizer/profile_p2p.py
@@ -1,4 +1,4 @@
"""Profile the power cosumtion of the GPU while waiting on P2P communication."""
"""Profile the power consumption of the GPU while waiting on P2P communication."""

import os
import time
2 changes: 1 addition & 1 deletion examples/power_limit_optimizer/README.md
@@ -7,7 +7,7 @@ The former script is for simple single GPU training, whereas the latter is for d

## Dependencies

All packages (including torchvision) are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
All packages (including torchvision) are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/#using-docker).
You just need to download and extract the ImageNet data and mount it to the Docker container with the `-v` option (first step below).

1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
55 changes: 55 additions & 0 deletions examples/prometheus/README.md
@@ -0,0 +1,55 @@
# Integrating the power limit optimizer with ImageNet training

This example will demonstrate how to integrate Zeus with `torchvision` and the ImageNet dataset.

[`train_single.py`](train_single.py) and [`train_dp.py`](train_dp.py) were adapted and simplified from [PyTorch's example training code for ImageNet](https://github.com/pytorch/examples/blob/main/imagenet/main.py).
The former script is for simple single GPU training, whereas the latter is for data parallel training with PyTorch DDP and [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html).

## Dependencies

All packages (including torchvision and prometheus_client) are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
You just need to download and extract the ImageNet data and mount it to the Docker container with the `-v` option (first step below).

1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
1. Install `torchvision`:
```sh
pip install torchvision==0.15.2
```
1. Install `prometheus_client`:
```sh
pip install "zeus-ml[prometheus]"
```

## EnergyHistogram, PowerGauge, and EnergyCumulativeCounter
- [`EnergyHistogram`](https://ml.energy/zeus/reference/metric/#zeus.metric.EnergyHistogram): Records energy consumption data for GPUs, CPUs, and DRAM and pushes the data to Prometheus as histogram metrics. This is useful for tracking energy usage distribution over time.
- [`PowerGauge`](https://ml.energy/zeus/reference/metric/#zeus.metric.PowerGauge): Monitors real-time GPU power usage and pushes the data to Prometheus as gauge metrics, which are updated at regular intervals.
- [`EnergyCumulativeCounter`](https://ml.energy/zeus/reference/metric/#zeus.metric.EnergyCumulativeCounter): Tracks cumulative energy consumption over time for CPUs and GPUs and pushes the results to Prometheus as counter metrics.

## `ZeusMonitor` and `GlobalPowerLimitOptimizer`

- [`ZeusMonitor`](http://ml.energy/zeus/reference/monitor/#zeus.monitor.ZeusMonitor): Measures the GPU time and energy consumption of arbitrary code blocks.
- [`GlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.GlobalPowerLimitOptimizer): Online-profiles each power limit with `ZeusMonitor` and finds the cost-optimal power limit.

## Example command

You can specify the maximum training time slowdown factor (1.0 means no slowdown) by setting `ZEUS_MAX_SLOWDOWN`. The default is set to 1.1 in this example script, meaning the lowest power limit that keeps training time inflation within 10% will be automatically found.
`GlobalPowerLimitOptimizer` supports other optimal power limit selection strategies. See [here](https://ml.energy/zeus/reference/optimizer/power_limit).
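As a rough illustration of how a script could pick up this setting, the sketch below reads `ZEUS_MAX_SLOWDOWN` from the environment and falls back to 1.1 when it is unset. The helper name `max_slowdown_from_env` is hypothetical; the example scripts may read the variable differently.

```python
import os


def max_slowdown_from_env(default: float = 1.1) -> float:
    """Read ZEUS_MAX_SLOWDOWN, falling back to the default used in this example."""
    return float(os.environ.get("ZEUS_MAX_SLOWDOWN", default))


os.environ.pop("ZEUS_MAX_SLOWDOWN", None)
unset_value = max_slowdown_from_env()

os.environ["ZEUS_MAX_SLOWDOWN"] = "1.25"
set_value = max_slowdown_from_env()

print(unset_value, set_value)
```

In a shell, the equivalent is prefixing the training command with `ZEUS_MAX_SLOWDOWN=1.25`.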

```bash
# Single-GPU
python train_single.py \
[DATA_DIR] \
--gpu 0 `# Specify the GPU id to use`

# Multi-GPU Data Parallel
torchrun \
--nnodes 1 \
--nproc_per_node gpu `# Number of processes per node, should be equal to the number of GPUs.` \
`# When set to 'gpu', it means use all the GPUs available.` \
train_dp.py \
[DATA_DIR]
```


2 changes: 2 additions & 0 deletions examples/prometheus/requirements.txt
@@ -0,0 +1,2 @@
torch
torchvision