[Feat] Prometheus metric export #134

Open
wants to merge 57 commits into
base: master
Changes from 5 commits
9bf5ce9
Add metric.py, prometheus configs, and modify pyproject.toml
sharonsyh Oct 18, 2024
dd881e8
Reformat metric.py with black
sharonsyh Oct 18, 2024
d21cfd1
Add metric monitoring section to documentation
sharonsyh Nov 9, 2024
4681796
Add unit tests for EnergyHistogram, EnergyCumulativeCounter, and Powe…
sharonsyh Nov 9, 2024
e8bfe7b
Add train_single.py for testing energy monitoring metrics
sharonsyh Nov 11, 2024
1b9e541
Update docs/measure/index.md
sharonsyh Nov 17, 2024
26e925e
Refactor metric initialization and multiprocessing logic in metric.py
sharonsyh Nov 29, 2024
3569b68
Update prometheus.yml
sharonsyh Nov 29, 2024
29e615b
Improve example training script to include Zeus metrics
sharonsyh Nov 29, 2024
2ae388f
Remove unintended file tests/test_metric.py from repository
sharonsyh Nov 29, 2024
6a9daa5
Update the doc on Metrics Monitoring and Assumptions
sharonsyh Nov 29, 2024
69c42da
Update index.md
sharonsyh Nov 29, 2024
4704a67
Update index.md
sharonsyh Nov 29, 2024
5666ba5
Add README for example training file with Zeus energy metrics integra…
sharonsyh Nov 29, 2024
863f257
Add Metric Name Construction section on index.md
sharonsyh Nov 29, 2024
35ab267
Update index.md
sharonsyh Nov 29, 2024
4aa0f39
Update README.md to include the dependency on prometheus_client
sharonsyh Nov 29, 2024
1e996a4
Update unit tests for the modified metric.py
sharonsyh Nov 30, 2024
8e1d35b
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
7e47fbb
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
52f2fa7
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
30b807e
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
c53c72e
Resolve unbound variable errors
sharonsyh Nov 30, 2024
ffa46d1
Resolve unbound variable errors
sharonsyh Nov 30, 2024
5f67d5c
Specify type for the args
sharonsyh Nov 30, 2024
b96b32e
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
9aa1888
Merge master
jaywonchung Nov 30, 2024
b62ed5d
Fix energy histogram to properly handle default bucket ranges
sharonsyh Nov 30, 2024
0dbb236
Add the mock_push_to_gateway Patch to each test
sharonsyh Nov 30, 2024
753f1de
Update gpu_bucket_range, cpu_bucket_range, and dram_bucket_range in t…
sharonsyh Nov 30, 2024
77ff075
Patch to mock urllib.request.urlopen preventing attempts to an actual…
sharonsyh Dec 1, 2024
793a186
Patch to mock urllib.request.urlopen preventing attempts to an actual…
sharonsyh Dec 1, 2024
30b7e1c
Patch to mock prometheus_client.exposition.push_to_gateway external c…
sharonsyh Dec 1, 2024
3ab4d89
Patch to http.client.HTTPConnection
sharonsyh Dec 1, 2024
f032488
Remove unneccessary mock
sharonsyh Dec 1, 2024
73d8c8c
Add zeus.metric to the list
sharonsyh Dec 6, 2024
c153ef2
Update reference link for each class
sharonsyh Dec 6, 2024
0aa03d4
Move line for prometheus-client
sharonsyh Dec 6, 2024
0e0e63c
feat: Add multiprocessing dict and sync execution for begin/end window
sharonsyh Dec 6, 2024
9228b74
Add error handling for queue
sharonsyh Dec 6, 2024
866365e
Add a call to train() in main
sharonsyh Dec 6, 2024
53f6c62
Refactor tests for the modified code
sharonsyh Dec 7, 2024
c014ff1
Reformat the metric monitoring section for consistency
sharonsyh Dec 9, 2024
f7e5d79
Setup Guide -> Local Setup Guide
sharonsyh Dec 9, 2024
f8d5b67
Add condition for using put() with empty queue
sharonsyh Dec 9, 2024
8c5456e
Import the SpawnProcess class from multiprocessing.context
sharonsyh Dec 9, 2024
4c2e794
Update docs/measure/index.md
sharonsyh Dec 10, 2024
d8a6f1c
Update docs/measure/index.md
sharonsyh Dec 10, 2024
d85f255
Update docs/measure/index.md
sharonsyh Dec 10, 2024
5f9cc6b
Update docs/measure/index.md
sharonsyh Dec 10, 2024
0f8d550
Update docs/measure/index.md
sharonsyh Dec 10, 2024
49acc9a
Update docs/measure/index.md
sharonsyh Dec 10, 2024
f18ecb9
Update docs/measure/index.md
sharonsyh Dec 10, 2024
2276ac2
Remove power_limit_optimizer and bring back the original code for ima…
sharonsyh Dec 10, 2024
dea8b5d
Add sync execution function to handle synchronization during monitori…
sharonsyh Jan 9, 2025
9411495
Changed variable name prometheus_url -> pushgateway_url
sharonsyh Jan 9, 2025
ef48d88
Adjust unit test functions to reflect changes in the sync_execution l…
sharonsyh Jan 9, 2025
26 changes: 26 additions & 0 deletions docker/prometheus/docker-compose.yml
@@ -0,0 +1,26 @@
version: '3.7'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - "./prometheus.yml:/etc/prometheus/prometheus.yml"
    networks:
      - localprom
    ports:
      - 9090:9090
  node-exporter:
    image: prom/node-exporter
    networks:
      - localprom
    ports:
      - 9100:9100
  pushgateway:
    image: prom/pushgateway
    networks:
      - localprom
    ports:
      - 9091:9091
networks:
  localprom:
    driver: bridge

14 changes: 14 additions & 0 deletions docker/prometheus/prometheus.yml
@@ -0,0 +1,14 @@
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['zeus-pushgateway-1:9091']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

82 changes: 82 additions & 0 deletions docs/measure/index.md
@@ -114,6 +114,88 @@ if __name__ == "__main__":
    avg_energy = sum(map(lambda m: m.total_energy, steps)) / len(steps)
    print(f"One step takes {avg_time} s and {avg_energy} J for the CPU.")
```
## Metric Monitoring

Zeus can export energy and power measurements as Prometheus metrics (Histograms, Counters, and Gauges) and push them to a Prometheus Push Gateway for further analysis.
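
To make the push step concrete, here is a minimal sketch of what a push to a Push Gateway looks like on the wire, using only the Python standard library. It is not how Zeus implements pushing. The metric name `energy_joules`, the label `gpu_index`, and both helper functions are assumptions made for illustration; the `/metrics/job/<job>` path is the Push Gateway's push endpoint.

```python
from urllib.parse import quote
from urllib.request import Request


def format_sample(name: str, value: float, labels: dict[str, str]) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    if label_str:
        return f"{name}{{{label_str}}} {value}\n"
    return f"{name} {value}\n"


def build_push_request(gateway_url: str, job: str, body: str) -> Request:
    """Build (but do not send) a PUT request that replaces the job's metrics."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return Request(url, data=body.encode("utf-8"), method="PUT")


if __name__ == "__main__":
    # Hypothetical metric name and label, for illustration only.
    body = format_sample("energy_joules", 12.5, {"gpu_index": "0"})
    request = build_push_request("http://localhost:9091", "energy_histogram_job", body)
    print(request.get_method(), request.full_url)
    print(body, end="")
    # Actually sending it would be: urllib.request.urlopen(request)
```

The sketch only builds the request, so it can be run without a Push Gateway listening on port 9091.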

[`EnergyHistogram`][zeus.metric.EnergyHistogram] records energy consumption data for GPUs, CPUs, and DRAM in Prometheus Histograms. This is useful for observing how frequently energy usage reaches certain levels.

You can customize the bucket ranges for each component (GPU, CPU, and DRAM), or let Zeus use default ranges.

```python hl_lines="2 5-7"
from zeus.monitor import ZeusMonitor
from zeus.metric import EnergyHistogram

if __name__ == "__main__":
    # Initialize EnergyHistogram with custom bucket ranges
    histogram_metric = EnergyHistogram(
        energy_monitor=ZeusMonitor(),  # Monitor instance that takes the measurements
        prometheus_url='http://localhost:9091',
        job='energy_histogram_job',
        bucket_ranges={
            "gpu": [10.0, 25.0, 50.0, 100.0],
            "cpu": [5.0, 10.0, 25.0, 50.0],
            "dram": [1.0, 2.5, 5.0, 10.0]
        }
    )

    histogram_metric.begin_window("histogram_test")
    # Perform tasks
    histogram_metric.end_window("histogram_test")
```
You can use the `begin_window` and `end_window` methods to define a measurement window, similar to other ZeusMonitor operations. Energy consumption data will be recorded for the entire duration of the window.
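
As a rough illustration of what the bucket ranges mean, the sketch below counts observations the way a Prometheus Histogram does: each bucket is cumulative and counts every observation less than or equal to its upper bound, with an implicit `+Inf` bucket that counts everything. The function name `bucket_counts` and the sample energy values are made up for this example; it uses neither Zeus nor `prometheus_client`.

```python
import math


def bucket_counts(observations: list[float], upper_bounds: list[float]) -> dict[float, int]:
    """Return cumulative counts per bucket upper bound, Prometheus-style."""
    bounds = sorted(upper_bounds) + [math.inf]  # the +Inf bucket counts everything
    return {bound: sum(1 for x in observations if x <= bound) for bound in bounds}


if __name__ == "__main__":
    gpu_buckets = [10.0, 25.0, 50.0, 100.0]  # same GPU ranges as the example above
    window_energies = [8.0, 30.0, 42.0, 120.0]  # hypothetical joules per window
    for bound, count in bucket_counts(window_energies, gpu_buckets).items():
        print(f"le={bound}: {count}")
```

Because the buckets are cumulative, the count for a higher bound is never smaller than the count for a lower one.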

!!! Tip
    If no custom `bucket_ranges` are provided, Zeus uses default ranges for GPU, CPU, and DRAM.

    To specify custom bucket ranges only for the GPU while leaving CPU and DRAM on their defaults, you could write:

        bucket_ranges={
            "gpu": [10.0, 25.0, 50.0, 100.0]
        }
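
One way to picture this per-component fallback is a dictionary merge, sketched below. The `DEFAULT_BUCKET_RANGES` values and the `resolve_bucket_ranges` helper are hypothetical and are not taken from Zeus's source; only the `gpu`, `cpu`, and `dram` keys come from the example above.

```python
from typing import Optional

# Hypothetical defaults for illustration only; Zeus's real defaults may differ.
DEFAULT_BUCKET_RANGES = {
    "gpu": [50.0, 100.0, 200.0, 500.0, 1000.0],
    "cpu": [10.0, 20.0, 50.0, 100.0, 200.0],
    "dram": [5.0, 10.0, 20.0, 50.0, 150.0],
}


def resolve_bucket_ranges(custom: Optional[dict] = None) -> dict:
    """Overlay user-provided ranges on top of the defaults, key by key."""
    custom = custom or {}
    unknown = set(custom) - set(DEFAULT_BUCKET_RANGES)
    if unknown:
        raise ValueError(f"Unknown components: {sorted(unknown)}")
    return {**DEFAULT_BUCKET_RANGES, **custom}


if __name__ == "__main__":
    # Only the GPU ranges are customized; CPU and DRAM fall back to defaults.
    print(resolve_bucket_ranges({"gpu": [10.0, 25.0, 50.0, 100.0]}))
```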

[`EnergyCumulativeCounter`][zeus.metric.EnergyCumulativeCounter] monitors cumulative energy consumption. It tracks energy usage over time, without resetting the values, and is updated periodically.

```python hl_lines="4 7-9"
import time

from zeus.monitor import ZeusMonitor
from zeus.metric import EnergyCumulativeCounter

if __name__ == "__main__":

    cumulative_counter_metric = EnergyCumulativeCounter(
        energy_monitor=ZeusMonitor(),  # Monitor instance that takes the measurements
        update_period=2,  # Updates energy data every 2 seconds
        prometheus_url='http://localhost:9091',
        job='energy_counter_job'
    )

    cumulative_counter_metric.begin_window("counter_test_window")
    # Let the counter run
    time.sleep(10)  # Keep measuring for 10 seconds
    cumulative_counter_metric.end_window("counter_test_window")
```
The `update_period` parameter defines how often the energy measurements are updated and pushed to Prometheus.
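
The periodic update can be pictured as a polling loop that adds the energy consumed since the last poll to a running total. The sketch below is an illustration under stated assumptions rather than Zeus's implementation: `read_energy_delta` stands in for a real energy reading, `num_updates` replaces the open-ended window, and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def run_cumulative_counter(
    read_energy_delta: Callable[[], float],
    update_period: float,
    num_updates: int,
    sleep: Callable[[float], None] = time.sleep,
) -> list[float]:
    """Poll `read_energy_delta` every `update_period` seconds and accumulate."""
    total = 0.0
    history = []
    for _ in range(num_updates):
        sleep(update_period)
        delta = read_energy_delta()
        if delta < 0:
            raise ValueError("A counter can only increase")
        total += delta
        history.append(total)  # the value that would be pushed after this poll
    return history


if __name__ == "__main__":
    # Pretend each 2-second period consumed 100 J; skip the real sleeping.
    print(run_cumulative_counter(lambda: 100.0, 2, 5, sleep=lambda _: None))
```

A shorter `update_period` gives finer-grained data at the cost of more frequent measurements and pushes.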

[`PowerGauge`][zeus.metric.PowerGauge] tracks real-time power consumption using Prometheus Gauges, which are designed for values that fluctuate over time, such as power usage.

```python hl_lines="4 7-9"
import time

from zeus.monitor.power import PowerMonitor
from zeus.metric import PowerGauge

if __name__ == "__main__":

    power_gauge_metric = PowerGauge(
        power_monitor=PowerMonitor(),  # Monitor instance that samples power
        update_period=2,  # Updates power consumption every 2 seconds
        prometheus_url='http://localhost:9091',
        job='power_gauge_job'
    )

    power_gauge_metric.begin_window("gauge_test_window")
    # Monitor power consumption for 10 seconds
    time.sleep(10)
    power_gauge_metric.end_window("gauge_test_window")
```
The `update_period` parameter defines how often the power data is updated and pushed to Prometheus.
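
Unlike a counter, a gauge keeps only the most recent reading, so its value can go down as well as up. A minimal sketch of that behavior, using a hypothetical `LatestValueGauge` class and made-up power readings in watts:

```python
class LatestValueGauge:
    """Holds the latest reading per label, e.g. per GPU index."""

    def __init__(self) -> None:
        self._values: dict[str, float] = {}

    def set(self, label: str, value: float) -> None:
        self._values[label] = value  # overwrite, never accumulate

    def get(self, label: str) -> float:
        return self._values[label]


if __name__ == "__main__":
    gauge = LatestValueGauge()
    for watts in [250.0, 310.0, 180.0]:  # hypothetical samples for GPU 0
        gauge.set("gpu_0", watts)
    print(gauge.get("gpu_0"))  # only the last sample is kept
```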

## CLI power and energy monitor

2 changes: 2 additions & 0 deletions examples/prometheus/requirements.txt
@@ -0,0 +1,2 @@
torch
torchvision