Commit 32ed8c4
add guide on integrating with workload observability agent
jcyang43 committed Jan 3, 2025
1 parent 975e842 commit 32ed8c4
Showing 2 changed files with 208 additions and 0 deletions.
97 changes: 97 additions & 0 deletions getting_started/google_cloud_monitoring/Google_Cloud_Monitoring.md
# Getting started with Google Cloud monitoring
This guide explains how to integrate Google's [workload observability agent](https://us-west2-docker.pkg.dev/gce-ai-infra/workload-observability/model-workload-observability) container into your MaxText training workload.

## Overview
To give Google Cloud visibility into user workload performance, Google has added a customer workload performance monitoring feature for critical workloads that are sensitive to infrastructure changes.
Once integrated, the workload observability agent reports metrics to Google Cloud, enabling Google engineers to track workload performance.
If performance falls below a defined threshold, the Google Cloud on-call team is alerted.

The workload observability agent currently supports heartbeat and performance (training step time) metrics; support for the goodput metric will be added in the near future.
Work with your Customer Engineer (CE) and the Google team to define appropriate thresholds for the performance metrics.

The steps below walk through integrating the agent into a MaxText workload so that it sends these metrics to Google Cloud for monitoring.

## Prerequisites
Make sure you have the following before starting:
1. A GCP account with billing enabled
2. A GKE cluster ready in your project. For this example, we use a v6e TPU cluster with a 4x4 topology. If you choose a different cluster, please remember to modify your configurations accordingly.
3. A service account with the following permissions:
- Access to your Google Cloud Storage (GCS) bucket
- Access to your Artifact Repository
- Access to your GKE cluster

## Instructions
### 1. Authenticate with GCP
Verify you're authenticated with GCP in your environment:
```
gcloud auth login
gcloud auth configure-docker
```

### 2. Set up GCS bucket, artifact repository & cluster nodepool
Create a GCS bucket that will serve as the output directory for your MaxText training workload:
```
gsutil mb -l <your-region> gs://<your-bucket-name>/
```

Export the GCS bucket path as environment variable `GCS_BUCKET_PATH`:
```
export GCS_BUCKET_PATH=gs://<your-bucket-name>
```
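Since later steps substitute `${GCS_BUCKET_PATH}` into the Kubernetes config with `envsubst`, it is worth sanity-checking the value first. A minimal sketch (the bucket name here is a hypothetical example, not one from this guide):

```shell
# Hypothetical bucket name for illustration; use your own exported value.
GCS_BUCKET_PATH="gs://my-maxtext-bucket"

# Fail fast if the path does not look like a GCS URI.
case "$GCS_BUCKET_PATH" in
  gs://*) echo "bucket path looks valid: $GCS_BUCKET_PATH" ;;
  *)      echo "error: GCS_BUCKET_PATH must start with gs://" >&2; exit 1 ;;
esac
```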

Create an artifact repository for Docker images:
```
gcloud artifacts repositories create <repo-name> \
  --repository-format=docker \
  --location=<your-region> \
  --description="<your-description>"
```

Create a nodepool on the GKE cluster:
```
gcloud container node-pools create <pool-name> \
  --location=<your-zone> \
  --cluster=<your-gke-cluster-name> \
  --node-locations=<your-node-locations> \
  --machine-type=ct6e-standard-4t \
  --tpu-topology=4x4 \
  --num-nodes=4
```

### 3. Create a Kubernetes secret
Create a Kubernetes secret to provide access to your GCS bucket:
```
kubectl create secret generic gcs-key \
--from-file=/path/to/your/service-account-key.json
```

### 4. Build your MaxText Docker image
In the project root directory, run:
```
bash docker_build_dependency_image.sh DEVICE=tpu
# tag and push your image
docker tag maxtext_base_image:latest <your-docker-registry-path>/maxtext_base_tpu:latest
docker push <your-docker-registry-path>/maxtext_base_tpu:latest
```

On this [line](./tpu_v6e_with_gcp_monitoring.yaml#L39) of the config file, replace the placeholder image with the image you just built (`<your-docker-registry-path>/maxtext_base_tpu:latest`).
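If you prefer to script the substitution rather than edit the file by hand, a `sed` one-liner works. The sketch below runs on a throwaway copy of the placeholder line, and the registry path shown is purely illustrative:

```shell
# Create a throwaway file containing the same placeholder as the config.
cat > /tmp/sample_config.yaml <<'EOF'
        image: <replace with path to your maxtext docker image>
EOF

# Hypothetical registry path; substitute your own image reference.
IMAGE="us-docker.pkg.dev/my-project/my-repo/maxtext_base_tpu:latest"
sed -i "s|<replace with path to your maxtext docker image>|${IMAGE}|" /tmp/sample_config.yaml
cat /tmp/sample_config.yaml
```

Note: on macOS, `sed -i` requires an explicit backup suffix (`sed -i ''`).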

### 5. Set the training dataset
Export the path to your dataset in your GCS bucket:

```
export DATASET_PATH=<path-to-your-training-dataset>
```

If you don't have a training dataset or want to try with synthetic data, replace `dataset_path=${DATASET_PATH}` with `dataset_type=synthetic` on this [line](./tpu_v6e_with_gcp_monitoring.yaml#L52).
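The same scripted approach works for switching to synthetic data; again, the sketch below edits a throwaway copy rather than the real config:

```shell
# Throwaway file with the same dataset argument as the config.
cat > /tmp/sample_args.txt <<'EOF'
dataset_path=${DATASET_PATH}
EOF

# Swap the real dataset for MaxText's synthetic data source.
sed -i 's|dataset_path=${DATASET_PATH}|dataset_type=synthetic|' /tmp/sample_args.txt
cat /tmp/sample_args.txt
```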

### 6. Launch the workload
Finally, launch the workload with the [config](./tpu_v6e_with_gcp_monitoring.yaml) in this directory:
```
# in the current directory
export TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S") && envsubst < tpu_v6e_with_gcp_monitoring.yaml | kubectl apply -f -
```

Once deployed, the workload observability container will parse the TFevents files written to your chosen GCS bucket and report metrics to Google Cloud for monitoring.
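In the config, the two containers coordinate shutdown through a sentinel file on the shared `emptyDir` volume: the training container's `trap` creates `workload_terminated` on exit, and the observability container polls for that file before stopping its reporter. A local sketch of that handshake, using a temporary directory in place of the shared volume:

```shell
# Stand-in for the shared /usr/share/maxtext emptyDir volume.
SHARED_DIR=$(mktemp -d)

# "Observability container": poll for the sentinel file, then shut down.
(
  while [ ! -e "$SHARED_DIR/workload_terminated" ]; do
    sleep 0.1
  done
  echo "sidecar: workload finished, shutting down" > "$SHARED_DIR/sidecar.log"
) &
SIDECAR_PID=$!

# "Training container": finish work, then signal completion via the sentinel.
sleep 0.3
touch "$SHARED_DIR/workload_terminated"
wait "$SIDECAR_PID"
cat "$SHARED_DIR/sidecar.log"
```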
111 changes: 111 additions & 0 deletions getting_started/google_cloud_monitoring/tpu_v6e_with_gcp_monitoring.yaml
apiVersion: v1
kind: Service
metadata:
  name: v6e-maxtext
  namespace: default
spec:
  clusterIP: None
  selector:
    job-name: v6e-maxtext-workload
  type: ClusterIP
---
apiVersion: batch/v1
kind: Job
metadata:
  name: v6e-maxtext-workload
  namespace: default
spec:
  completionMode: Indexed # Required for TPU workloads
  backoffLimit: 0
  completions: 4 # number of nodes
  parallelism: 4
  template:
    metadata:
      labels:
        job-name: v6e-maxtext-workload
    spec:
      restartPolicy: Never
      subdomain: v6e-maxtext-workload
      tolerations:
      - key: "google.com/tpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
        cloud.google.com/gke-tpu-topology: 4x4
      dnsPolicy: ClusterFirstWithHostNet # Ensure proper name resolution for TPU pods
      containers:
      - name: training-workload
        image: <replace with path to your maxtext docker image>
        ports:
        - containerPort: 8471 # Default TPU communication port
        - containerPort: 9431 # TPU metrics port for monitoring
        command:
        - /bin/bash
        - -c
        - |
          env
          echo "run name: maxtext-llama2-tpu-${TIMESTAMP}"
          echo "gcs bucket path: ${GCS_BUCKET_PATH}"
          echo "Job starting!";
          trap 'echo "Exiting..."; touch /usr/share/maxtext/workload_terminated' EXIT
          python3 /deps/MaxText/train.py /deps/MaxText/configs/base.yml run_name=maxtext-llama2-tpu-${TIMESTAMP} model_name=llama2-7b attention=dot_product remat_policy=save_qkv_proj use_iota_embed=true max_target_length=1024 tokenizer_path=/deps/assets/tokenizer.llama2 dataset_path=${DATASET_PATH} per_device_batch_size=1 checkpoint_period=5 steps=100 base_output_directory=${GCS_BUCKET_PATH} enable_gcp_workload_monitoring=True
          echo "Job completed!";
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/secrets/service-account-key.json" # Must match the filename of the key used to create the gcs-key secret
        volumeMounts:
        - name: gcs-key
          mountPath: "/secrets"
          readOnly: true
        - name: "workload-shared-volume"
          mountPath: "/usr/share/maxtext"
        resources:
          requests:
            google.com/tpu: "4" # Adjust based on TPU topology
          limits:
            google.com/tpu: "4"
      - name: workload-observability
        image: us-west2-docker.pkg.dev/gce-ai-infra/workload-observability/model-workload-observability:heartbeat
        command:
        - /bin/bash
        - -c
        - |
          env
          echo "GCS_BUCKET_PATH: ${GCS_BUCKET_PATH}, timestamp: ${TIMESTAMP}"
          echo "MaxText logs are sent to ${GCS_BUCKET_PATH}/maxtext-llama2-tpu-${TIMESTAMP}"
          python -u /app/main.py --replica_id 0 --gpu_index 0 &
          while [ ! -e "/usr/share/maxtext/workload_terminated" ];
          do
            sleep 10;
          done
          pkill -f 'python -u /app/main.py --replica_id' || true
          sleep 10
        env:
        - name: JOB_TIMESTAMP
          value: "${TIMESTAMP}"
        - name: JOB_NAME
          value: "maxtext-llama2-tpu-${TIMESTAMP}"
        - name: TFEVENTS_PATH
          value: "${GCS_BUCKET_PATH}/maxtext-llama2-tpu-${TIMESTAMP}/tensorboard/maxtext-llama2-tpu-${TIMESTAMP}"
        - name: TFEVENTS_METRIC_TAG
          value: "perf/step_time_seconds"
        - name: REPORT_HEARTBEAT
          value: "false"
        - name: GLOBAL_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: gcs-key
          mountPath: "/secrets"
          readOnly: true
        - name: "workload-shared-volume"
          mountPath: "/usr/share/maxtext"
      volumes:
      - name: gcs-key
        secret:
          secretName: gcs-key
      - name: workload-shared-volume
        emptyDir: {}
