forked from AI-Hypercomputer/maxtext

Commit: add guide on integrating with workload observability agent

Showing 2 changed files with 208 additions and 0 deletions.
getting_started/google_cloud_monitoring/Google_Cloud_Monitoring.md (97 additions)
# Getting started with Google Cloud monitoring
This guide explains how to integrate Google's [workload observability agent](https://us-west2-docker.pkg.dev/gce-ai-infra/workload-observability/model-workload-observability) container into your MaxText training workload.

## Overview
To address Google Cloud's lack of visibility into user workload performance, Google has added a customer workload performance monitoring feature for critical workloads that are sensitive to infrastructure changes.
Once integrated, the workload observability agent reports metrics to Google Cloud, enabling Google engineers to track workload performance.
If performance falls below a defined threshold, the Google Cloud on-call team is alerted.

The workload observability agent currently supports heartbeat and performance (training step time) metrics; support for a goodput metric is planned.
Work with your Customer Engineer (CE) and the Google team to define appropriate thresholds for the performance metrics.

This guide walks through an example of integrating the workload observability agent into a MaxText workload so that it sends metrics to Google Cloud for monitoring.
## Pre-requisites
Make sure you have the following before starting:
1. A GCP account with billing enabled
2. A GKE cluster ready in your project. This example uses a v6e TPU cluster with a 4x4 topology; if you choose a different cluster, adjust your configurations accordingly.
3. A service account with the following permissions:
   - Access to your Google Cloud Storage (GCS) bucket
   - Access to your Artifact Registry repository
   - Access to your GKE cluster

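One way to satisfy the third prerequisite is to bind predefined IAM roles to the service account. The sketch below is an assumption, not the only valid setup: the project ID, service-account name, and role choices are hypothetical, and your CE may recommend narrower custom roles. It echoes the `gcloud` commands as a dry run; remove `echo` to apply them.

```shell
# Dry-run sketch with hypothetical names; remove "echo" to actually apply.
PROJECT_ID="my-project"                                     # assumption
SA_EMAIL="maxtext-sa@${PROJECT_ID}.iam.gserviceaccount.com" # assumption
# One predefined role per access requirement above (an assumption;
# narrower custom roles also work):
for ROLE in roles/storage.objectAdmin \
            roles/artifactregistry.writer \
            roles/container.developer; do
  echo gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:${SA_EMAIL}" --role="$ROLE"
done
```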
## Instructions
### 1. Authenticate with GCP
Verify that you are authenticated with GCP in your environment:
```
gcloud auth login
gcloud auth configure-docker
```

### 2. Set up the GCS bucket, Artifact Registry repository & cluster node pool
Create a GCS bucket that will serve as the output directory for your MaxText training workload (note that `gsutil mb -l` takes a location such as a region, not a zone):
```
gsutil mb -l <your-location> gs://<your-bucket-name>/
```

Export the GCS bucket path as the environment variable `GCS_BUCKET_PATH`:
```
export GCS_BUCKET_PATH=gs://<your-bucket-name>
```

Create an Artifact Registry repository for Docker images:
```
gcloud artifacts repositories create <repo-name> \
  --repository-format=docker \
  --location=<your-region> \
  --description="<your-choice>"
```

Create a node pool on the GKE cluster:
```
gcloud container node-pools create <pool-name> \
  --location=<your-zone> \
  --cluster=<your-gke-cluster-name> \
  --node-locations=<your-node-locations> \
  --machine-type=ct6e-standard-4t \
  --tpu-topology=4x4 \
  --num-nodes=4
```

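The `--num-nodes=4` value follows from the topology: a 4x4 v6e slice has 16 chips, and each `ct6e-standard-4t` machine hosts 4 chips (as the `-4t` suffix suggests), so the pool needs 16 / 4 = 4 nodes. A quick sanity check of that arithmetic:

```shell
# 4x4 topology => 16 chips; ct6e-standard-4t => 4 chips per node
TOPOLOGY_CHIPS=$((4 * 4))
CHIPS_PER_NODE=4
NUM_NODES=$((TOPOLOGY_CHIPS / CHIPS_PER_NODE))
echo "num-nodes: ${NUM_NODES}"   # prints "num-nodes: 4"
```

If you use a different topology or machine type, recompute `--num-nodes` the same way.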
### 3. Create a Kubernetes secret
Create a Kubernetes secret to provide access to your GCS bucket:
```
kubectl create secret generic gcs-key \
  --from-file=/path/to/your/service-account-key.json
```

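For orientation (with illustrative filenames): `kubectl create secret generic --from-file` stores the key under the file's basename, and the job YAML mounts the `gcs-key` secret at `/secrets`, so the `GOOGLE_APPLICATION_CREDENTIALS` value in the config must be `/secrets/<key-filename>`, matching your key's actual name:

```shell
# Illustrative names: the key file's basename becomes the key inside the
# secret, and the secret is mounted at /secrets in the pod.
KEY_FILE="service-account-key.json"   # basename of the --from-file path
MOUNT_PATH="/secrets"                 # mountPath of the gcs-key volume
echo "GOOGLE_APPLICATION_CREDENTIALS=${MOUNT_PATH}/${KEY_FILE}"
```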
### 4. Build your MaxText Docker image
In the project root directory, run:
```
bash docker_build_dependency_image.sh DEVICE=tpu
# tag and push your image
docker tag maxtext_base_image:latest <your-docker-registry-path>/maxtext_base_tpu:latest
docker push <your-docker-registry-path>/maxtext_base_tpu:latest
```

On this [line](./tpu_v6e_with_gcp_monitoring.yaml#L39) of the config file, replace the placeholder image with the image you just built (`<your-docker-registry-path>/maxtext_base_tpu:latest`).

### 5. Set the training dataset
Export the path to your dataset on the GCS bucket:

```
export DATASET_PATH=<path-to-your-training-dataset>
```

If you don't have a training dataset or want to try synthetic data, replace `dataset_path=${DATASET_PATH}` with `dataset_type=synthetic` on this [line](./tpu_v6e_with_gcp_monitoring.yaml#L52).

### 6. Launch the workload
Finally, launch the workload with the [config](./tpu_v6e_with_gcp_monitoring.yaml) in this directory:
```
# in the current directory
export TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S") && envsubst < tpu_v6e_with_gcp_monitoring.yaml | kubectl apply -f -
```

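A note on the timestamp format: `%Y-%m-%dT%H-%M-%S` uses dashes where ISO 8601 uses colons, presumably because the value is embedded in Kubernetes resource names and GCS paths, where colons are not allowed; `envsubst` then splices it into every `${TIMESTAMP}` reference in the YAML. A quick check of the format:

```shell
export TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S")
# e.g. 2025-01-01T00-00-00 -- no colons, so it is safe inside
# Kubernetes names and GCS object paths
echo "run name: maxtext-llama2-tpu-${TIMESTAMP}"
```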
Once deployed, the workload observability container will parse the TFevents files written to your chosen GCS bucket and report metrics to Google Cloud for monitoring.
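Specifically, the agent watches the run's TensorBoard directory. With illustrative values, the `TFEVENTS_PATH` that the config points the agent at is derived from the run name like this (mirroring how the job YAML combines `base_output_directory` and `run_name` via `envsubst`):

```shell
# Illustrative values; the job YAML builds the same path via envsubst.
GCS_BUCKET_PATH="gs://my-maxtext-bucket"   # assumption: example bucket
TIMESTAMP="2025-01-01T00-00-00"            # example $TIMESTAMP value
RUN_NAME="maxtext-llama2-tpu-${TIMESTAMP}"
# The config expects TFevents under <output_dir>/<run_name>/tensorboard/<run_name>
TFEVENTS_PATH="${GCS_BUCKET_PATH}/${RUN_NAME}/tensorboard/${RUN_NAME}"
echo "$TFEVENTS_PATH"
```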
getting_started/google_cloud_monitoring/tpu_v6e_with_gcp_monitoring.yaml (111 additions)
apiVersion: v1
kind: Service
metadata:
  name: v6e-maxtext
  namespace: default
spec:
  clusterIP: None
  selector:
    job-name: v6e-maxtext-workload
  type: ClusterIP
---
apiVersion: batch/v1
kind: Job
metadata:
  name: v6e-maxtext-workload
  namespace: default
spec:
  completionMode: Indexed  # Required for TPU workloads
  backoffLimit: 0
  completions: 4  # number of nodes
  parallelism: 4
  template:
    metadata:
      labels:
        job-name: v6e-maxtext-workload
    spec:
      restartPolicy: Never
      subdomain: v6e-maxtext-workload
      tolerations:
      - key: "google.com/tpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
        cloud.google.com/gke-tpu-topology: 4x4
      dnsPolicy: ClusterFirstWithHostNet  # Ensure proper name resolution for TPU pods
      containers:
      - name: training-workload
        image: <replace with path to your maxtext docker image>
        ports:
        - containerPort: 8471  # Default TPU communication port
        - containerPort: 9431  # TPU metrics port for monitoring
        command:
        - /bin/bash
        - -c
        - |
          env
          echo "run name: maxtext-llama2-tpu-${TIMESTAMP}"
          echo "gcs bucket path: ${GCS_BUCKET_PATH}"
          echo "Job starting!"
          trap 'echo "Exiting..."; touch /usr/share/maxtext/workload_terminated' EXIT
          python3 /deps/MaxText/train.py /deps/MaxText/configs/base.yml run_name=maxtext-llama2-tpu-${TIMESTAMP} model_name=llama2-7b attention=dot_product remat_policy=save_qkv_proj use_iota_embed=true max_target_length=1024 tokenizer_path=/deps/assets/tokenizer.llama2 dataset_path=${DATASET_PATH} per_device_batch_size=1 checkpoint_period=5 steps=100 base_output_directory=${GCS_BUCKET_PATH} enable_gcp_workload_monitoring=True
          echo "Job completed!"
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/secrets/tpu-prod-env-one-vm-key.json"  # Path to the mounted service account key; must match the filename stored in the gcs-key secret
        volumeMounts:
        - name: gcs-key
          mountPath: "/secrets"
          readOnly: true
        - name: "workload-shared-volume"
          mountPath: "/usr/share/maxtext"
        resources:
          requests:
            google.com/tpu: "4"  # Adjust based on TPU topology
          limits:
            google.com/tpu: "4"
      - name: workload-observability
        image: us-west2-docker.pkg.dev/gce-ai-infra/workload-observability/model-workload-observability:heartbeat
        command:
        - /bin/bash
        - -c
        - |
          env
          echo "GCS_BUCKET_PATH: ${GCS_BUCKET_PATH}, timestamp: ${TIMESTAMP}"
          echo "MaxText logs are sent to ${GCS_BUCKET_PATH}/maxtext-llama2-tpu-${TIMESTAMP}"
          python -u /app/main.py --replica_id 0 --gpu_index 0 &
          while [ ! -e "/usr/share/maxtext/workload_terminated" ]; do
            sleep 10
          done
          pkill -f 'python -u /app/main.py --replica_id' || true
          sleep 10
        env:
        - name: JOB_TIMESTAMP
          value: "${TIMESTAMP}"  # substituted by envsubst at deploy time
        - name: JOB_NAME
          value: "maxtext-llama2-tpu-${TIMESTAMP}"
        - name: TFEVENTS_PATH
          value: "${GCS_BUCKET_PATH}/maxtext-llama2-tpu-${TIMESTAMP}/tensorboard/maxtext-llama2-tpu-${TIMESTAMP}"
        - name: TFEVENTS_METRIC_TAG
          value: "perf/step_time_seconds"
        - name: REPORT_HEARTBEAT
          value: "false"
        - name: GLOBAL_RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: gcs-key
          mountPath: "/secrets"
          readOnly: true
        - name: "workload-shared-volume"
          mountPath: "/usr/share/maxtext"
      volumes:
      - name: gcs-key
        secret:
          secretName: gcs-key
      - name: workload-shared-volume
        emptyDir: {}