Commit 32ed8c4
add guide on integrating with workload observability agent
jcyang43 committed Jan 3, 2025
1 parent 975e842 commit 32ed8c4
Showing 2 changed files with 208 additions and 0 deletions.
97 changes: 97 additions & 0 deletions getting_started/google_cloud_monitoring/Google_Cloud_Monitoring.md
# Getting started with Google Cloud monitoring
This guide explains how to integrate Google's [workload observability agent](https://us-west2-docker.pkg.dev/gce-ai-infra/workload-observability/model-workload-observability) container into your MaxText training workload.

## Overview
To give Google Cloud visibility into user workload performance, Google has added a customer workload performance monitoring feature for critical workloads that are sensitive to infrastructure changes.
Once integrated, the workload observability agent reports metrics to Google Cloud, enabling Google engineers to track workload performance.
If performance falls below a defined threshold, the Google Cloud on-call team is alerted.

The workload observability agent currently supports heartbeat and performance (training step time) metrics; support for the goodput metric will be added in the near future.
Work with your Customer Engineer (CE) and the Google team to define appropriate thresholds for the performance metrics.

The steps below walk through integrating the agent into a MaxText workload so that it sends these metrics to Google Cloud for monitoring.

## Prerequisites
Make sure you have the following before starting:
1. A GCP account with billing enabled
2. A GKE cluster ready in your project. For this example, we use a v6e TPU cluster with a 4x4 topology. If you choose a different cluster, please remember to modify your configurations accordingly.
3. A service account with the following permissions:
- Access to your Google Cloud Storage (GCS) bucket
- Access to your Artifact Repository
- Access to your GKE cluster

## Instructions
### 1. Authenticate with GCP
Verify you're authenticated with GCP in your environment:
```
gcloud auth login
gcloud auth configure-docker
```

### 2. Set up GCS bucket, artifact repository & cluster nodepool
Create a GCS bucket that will serve as the output directory for your MaxText training workload:
```
gsutil mb -l <your-region> gs://<your-bucket-name>/
```

Export the GCS bucket path as environment variable `GCS_BUCKET_PATH`:
```
export GCS_BUCKET_PATH=gs://<your-bucket-name>
```
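Since later steps substitute `${GCS_BUCKET_PATH}` into the Kubernetes config with `envsubst`, it is worth sanity-checking the value first. A minimal sketch (the bucket name here is a hypothetical example, not one from this guide):

```shell
# Hypothetical bucket name for illustration; use your own exported value.
GCS_BUCKET_PATH="gs://my-maxtext-bucket"

# Fail fast if the path does not look like a GCS URI.
case "$GCS_BUCKET_PATH" in
  gs://*) echo "bucket path looks valid: $GCS_BUCKET_PATH" ;;
  *)      echo "error: GCS_BUCKET_PATH must start with gs://" >&2; exit 1 ;;
esac
```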

Create an artifact repository for Docker images:
```
gcloud artifacts repositories create <repo-name> \
  --repository-format=docker \
  --location=<your-region> \
  --description="<your-description>"
```

Create a nodepool on the GKE cluster:
```
gcloud container node-pools create <pool-name> \
  --location=<your-zone> \
  --cluster=<your-gke-cluster-name> \
  --node-locations=<your-node-locations> \
  --machine-type=ct6e-standard-4t \
  --tpu-topology=4x4 \
  --num-nodes=4
```

### 3. Create a Kubernetes secret
Create a Kubernetes secret to provide access to your GCS bucket:
```
kubectl create secret generic gcs-key \
--from-file=/path/to/your/service-account-key.json
```

### 4. Build your MaxText Docker image
In the project root directory, run:
```
bash docker_build_dependency_image.sh DEVICE=tpu
# tag and push your image
docker tag maxtext_base_image:latest <your-docker-registry-path>/maxtext_base_tpu:latest
docker push <your-docker-registry-path>/maxtext_base_tpu:latest
```

On this [line](./tpu_v6e_with_gcp_monitoring.yaml#L39) of the config file, replace the placeholder image with the image you just built (`<your-docker-registry-path>/maxtext_base_tpu:latest`).
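If you prefer to script the substitution rather than edit the file by hand, a `sed` one-liner works. The sketch below runs on a throwaway copy of the placeholder line, and the registry path shown is purely illustrative:

```shell
# Create a throwaway file containing the same placeholder as the config.
cat > /tmp/sample_config.yaml <<'EOF'
        image: <replace with path to your maxtext docker image>
EOF

# Hypothetical registry path; substitute your own image reference.
IMAGE="us-docker.pkg.dev/my-project/my-repo/maxtext_base_tpu:latest"
sed -i "s|<replace with path to your maxtext docker image>|${IMAGE}|" /tmp/sample_config.yaml
cat /tmp/sample_config.yaml
```

Note: on macOS, `sed -i` requires an explicit backup suffix (`sed -i ''`).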

### 5. Set the training dataset
Export the path to your dataset in your GCS bucket:

```
export DATASET_PATH=<path-to-your-training-dataset>
```

If you don't have a training dataset or want to try with synthetic data, replace `dataset_path=${DATASET_PATH}` with `dataset_type=synthetic` on this [line](./tpu_v6e_with_gcp_monitoring.yaml#L52).
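The same scripted approach works for switching to synthetic data; again, the sketch below edits a throwaway copy rather than the real config:

```shell
# Throwaway file with the same dataset argument as the config.
cat > /tmp/sample_args.txt <<'EOF'
dataset_path=${DATASET_PATH}
EOF

# Swap the real dataset for MaxText's synthetic data source.
sed -i 's|dataset_path=${DATASET_PATH}|dataset_type=synthetic|' /tmp/sample_args.txt
cat /tmp/sample_args.txt
```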

### 6. Launch the workload
Finally, launch the workload with the [config](./tpu_v6e_with_gcp_monitoring.yaml) in this directory:
```
# in the current directory
export TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S") && envsubst < tpu_v6e_with_gcp_monitoring.yaml | kubectl apply -f -
```

Once deployed, the workload observability container will parse the TFevents files written to your chosen GCS bucket and report metrics to Google Cloud for monitoring.
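In the config, the two containers coordinate shutdown through a sentinel file on the shared `emptyDir` volume: the training container's `trap` creates `workload_terminated` on exit, and the observability container polls for that file before stopping its reporter. A local sketch of that handshake, using a temporary directory in place of the shared volume:

```shell
# Stand-in for the shared /usr/share/maxtext emptyDir volume.
SHARED_DIR=$(mktemp -d)

# "Observability container": poll for the sentinel file, then shut down.
(
  while [ ! -e "$SHARED_DIR/workload_terminated" ]; do
    sleep 0.1
  done
  echo "sidecar: workload finished, shutting down" > "$SHARED_DIR/sidecar.log"
) &
SIDECAR_PID=$!

# "Training container": finish work, then signal completion via the sentinel.
sleep 0.3
touch "$SHARED_DIR/workload_terminated"
wait "$SIDECAR_PID"
cat "$SHARED_DIR/sidecar.log"
```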
111 changes: 111 additions & 0 deletions getting_started/google_cloud_monitoring/tpu_v6e_with_gcp_monitoring.yaml
apiVersion: v1
kind: Service
metadata:
  name: v6e-maxtext
  namespace: default
spec:
  clusterIP: None
  selector:
    job-name: v6e-maxtext-workload
  type: ClusterIP
---
apiVersion: batch/v1
kind: Job
metadata:
  name: v6e-maxtext-workload
  namespace: default
spec:
  completionMode: Indexed # Required for TPU workloads
  backoffLimit: 0
  completions: 4 # number of nodes
  parallelism: 4
  template:
    metadata:
      labels:
        job-name: v6e-maxtext-workload
    spec:
      restartPolicy: Never
      subdomain: v6e-maxtext-workload
      tolerations:
      - key: "google.com/tpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
        cloud.google.com/gke-tpu-topology: 4x4
      dnsPolicy: ClusterFirstWithHostNet # Ensure proper name resolution for TPU pods
      containers:
      - name: training-workload
        image: <replace with path to your maxtext docker image>
        ports:
        - containerPort: 8471 # Default TPU communication port
        - containerPort: 9431 # TPU metrics port for monitoring
        command:
        - /bin/bash
        - -c
        - |
          env
          echo "run name: maxtext-llama2-tpu-${TIMESTAMP}"
          echo "gcs bucket path: ${GCS_BUCKET_PATH}"
          echo "Job starting!";
          trap 'echo "Exiting..."; touch /usr/share/maxtext/workload_terminated' EXIT
          python3 /deps/MaxText/train.py /deps/MaxText/configs/base.yml run_name=maxtext-llama2-tpu-${TIMESTAMP} model_name=llama2-7b attention=dot_product remat_policy=save_qkv_proj use_iota_embed=true max_target_length=1024 tokenizer_path=/deps/assets/tokenizer.llama2 dataset_path=${DATASET_PATH} per_device_batch_size=1 checkpoint_period=5 steps=100 base_output_directory=${GCS_BUCKET_PATH} enable_gcp_workload_monitoring=True
          echo "Job completed!";
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/secrets/service-account-key.json" # Must match the filename of the key used to create the gcs-key secret
        volumeMounts:
        - name: gcs-key
          mountPath: "/secrets"
          readOnly: true
        - name: "workload-shared-volume"
          mountPath: "/usr/share/maxtext"
        resources:
          requests:
            google.com/tpu: "4" # Adjust based on TPU topology
          limits:
            google.com/tpu: "4"
      - name: workload-observability
        image: us-west2-docker.pkg.dev/gce-ai-infra/workload-observability/model-workload-observability:heartbeat
        command:
        - /bin/bash
        - -c
        - |
          env
          echo "GCS_BUCKET_PATH: ${GCS_BUCKET_PATH}, timestamp: ${TIMESTAMP}"
          echo "MaxText logs are sent to ${GCS_BUCKET_PATH}/maxtext-llama2-tpu-${TIMESTAMP}"
          python -u /app/main.py --replica_id 0 --gpu_index 0 &
          while [ ! -e "/usr/share/maxtext/workload_terminated" ];
          do
            sleep 10;
          done
          pkill -f 'python -u /app/main.py --replica_id' || true
          sleep 10
        env:
        - name: JOB_TIMESTAMP
          value: "${TIMESTAMP}"
        - name: JOB_NAME
          value: "maxtext-llama2-tpu-${TIMESTAMP}"
        - name: TFEVENTS_PATH
          value: "${GCS_BUCKET_PATH}/maxtext-llama2-tpu-${TIMESTAMP}/tensorboard/maxtext-llama2-tpu-${TIMESTAMP}"
        - name: TFEVENTS_METRIC_TAG
          value: "perf/step_time_seconds"
        - name: REPORT_HEARTBEAT
          value: "false"
        - name: GLOBAL_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: gcs-key
          mountPath: "/secrets"
          readOnly: true
        - name: "workload-shared-volume"
          mountPath: "/usr/share/maxtext"
      volumes:
      - name: gcs-key
        secret:
          secretName: gcs-key
      - name: workload-shared-volume
        emptyDir: {}
