This recipe outlines the steps for running a Llama-3-70B pretraining workload on A3 Mega GKE node pools by using the NVIDIA NeMo framework.
For this recipe, the following setup is used:
- Orchestration - Google Kubernetes Engine (GKE)
- Job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes Indexed Job that encapsulates the NVIDIA NeMo Megatron GPT pretraining workload. The chart generates the job's manifest following best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), including optimal settings for NVIDIA NCCL and the TCPXO NCCL plugin.
This recipe has been optimized for and tested with the following configuration:
- Clusters with 16 and 64 a3-megagpu-8g nodes
- Machine placement in the cluster is configured using a compact placement policy
- GPUDirect-TCPXO component versions:
  - NCCL plugin: v1.0.3
  - RxDM sidecar: v1.0.9
- NVIDIA NeMo NGC container image: 24.07
- FP8 and BF16 precision training
- A Wikipedia pretraining dataset, tokenized and available at gs://nemo-megatron-demo/training-data/tokenized/bpe2gpt/wikipedia/

By default, the job is configured to execute 50 training steps. To change the number of training steps, see Configure and submit a pretraining job.
Before running this recipe, ensure your environment is configured as follows:
- A GKE cluster with the following setup:
- An A3 Mega node pool (minimum of 16 nodes, 128 GPUs)
- Topology-aware scheduling enabled
- An Artifact Registry repository to store the Docker image.
- A Google Cloud Storage (GCS) bucket to store results. Important: This bucket must be in the same region as the GKE cluster.
- A client workstation with the following pre-installed:
- Google Cloud SDK
- Helm
- kubectl
To prepare the required environment, see GKE environment setup guide.
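If you still need to create the Cloud Storage bucket or the Artifact Registry repository, the following commands are one possible way to do so; the <REPOSITORY> name is a placeholder, and the setup guide above remains the authoritative reference.

# Create the results bucket in the same region as the GKE cluster.
gcloud storage buckets create gs://<GCS_BUCKET> --location=<CLUSTER_REGION>
# Create a Docker-format Artifact Registry repository for the workload image.
gcloud artifacts repositories create <REPOSITORY> \
--repository-format=docker \
--location=<CLUSTER_REGION>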
It is recommended to use Cloud Shell as your client to complete the steps. Cloud Shell comes pre-installed with the necessary utilities, including kubectl, the Google Cloud SDK, and Helm.
In the Google Cloud console, start a Cloud Shell Instance.
From your client, complete the following steps:
- Set the environment variables to match your environment:
export PROJECT_ID=<PROJECT_ID>
export REGION=<REGION>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
Replace the following values:
- <PROJECT_ID>: your Google Cloud project ID
- <REGION>: the region where you want to run Cloud Build
- <CLUSTER_REGION>: the region where your cluster is located
- <CLUSTER_NAME>: the name of your GKE cluster
- <GCS_BUCKET>: the name of your Cloud Storage bucket. Do not include the gs:// prefix
- <ARTIFACT_REGISTRY>: the full name of your Artifact Registry repository in the following format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY
- Set the default project:
gcloud config set project $PROJECT_ID
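For reference, a fully populated set of environment variables might look like the following; all values below are hypothetical and must be replaced with your own.

export PROJECT_ID=my-gcp-project
export REGION=us-central1
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a3mega-cluster
export GCS_BUCKET=my-training-results-bucket
export ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-gcp-project/my-docker-repo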
From your client, clone the gpu-recipes repository and set a reference to the recipe folder.
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama-3-70b/nemo-pretraining-gke
From your client, get the credentials for your cluster.
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
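You can optionally verify that kubectl now points at your cluster by listing its nodes; the a3-megagpu-8g machines in your node pool should appear in the output.

kubectl get nodes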
To build the container, complete the following steps from your client:
- Use Cloud Build to build and push the container image.
cd $REPO_ROOT/src/docker/nemo-24.07
gcloud builds submit --region=${REGION} \
--config cloudbuild.yml \
--substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY \
--timeout "2h" \
--machine-type=e2-highcpu-32 \
--quiet \
--async
This command outputs the build ID.
- You can monitor the build progress by streaming the logs for the build ID. To do this, run the following command. Replace <BUILD_ID> with your build ID.
BUILD_ID=<BUILD_ID>
gcloud beta builds log $BUILD_ID --region=$REGION
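When the build completes, you can optionally confirm that the image was pushed to your repository. The command below lists the images in the registry together with their tags; the nemo_workload image used later in this recipe should be present.

gcloud artifacts docker images list ${ARTIFACT_REGISTRY} --include-tags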
The default job setting is 50 training steps and fp8 precision. To execute the job with the default settings, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-70b-fp8.yaml \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
--set workload.gcsBucketForDataCataPath=${GCS_BUCKET} \
$USER-llama-3-70b-128-nemo \
$REPO_ROOT/src/helm-charts/a3mega/nemo-training
To run the workload on 512 GPUs (64 nodes), add the --set workload.gpus=512 flag:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-70b-fp8.yaml \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
--set workload.gcsBucketForDataCataPath=${GCS_BUCKET} \
--set workload.gpus=512 \
$USER-llama-3-70b-512-nemo \
$REPO_ROOT/src/helm-charts/a3mega/nemo-training
You can override any of the default NeMo configurations for this job. To do this, set the new arguments using --set workload.arguments.
Examples
- To set the number of training steps to 100, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-70b-fp8.yaml \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
--set workload.gcsBucketForDataCataPath=${GCS_BUCKET} \
--set workload.arguments="{trainer.max_steps=100}" \
$USER-llama-3-70b-128-nemo \
$REPO_ROOT/src/helm-charts/a3mega/nemo-training
- To run the training job using bf16 precision, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-70b-fp8.yaml \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
--set workload.gcsBucketForDataCataPath=${GCS_BUCKET} \
--set workload.arguments="{model.fp8=false,model.fp8_hybrid=false}" \
$USER-llama-3-70b-128-bf16-nemo \
$REPO_ROOT/src/helm-charts/a3mega/nemo-training
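- Overrides can also be combined. For example, the following sketch, which reuses the argument names from the two examples above, runs 100 training steps at bf16 precision:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file nemo_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-70b-fp8.yaml \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
--set workload.gcsBucketForDataCataPath=${GCS_BUCKET} \
--set workload.arguments="{trainer.max_steps=100,model.fp8=false,model.fp8_hybrid=false}" \
$USER-llama-3-70b-128-bf16-nemo \
$REPO_ROOT/src/helm-charts/a3mega/nemo-training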
To check the status of pods in the indexed job, run the following command from your client:
kubectl get pods | grep $USER-llama-3-70b-128-nemo
To get the logs for one of the pods, run the following command from your client:
kubectl logs "<pod_name>"
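To stream a pod's logs while the job is running, add the -f flag:

kubectl logs -f "<pod_name>"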
When completed, the job creates several artifacts, including logs and traces, and places them in the configured Google Cloud Storage bucket as follows:
gs://${GCS_BUCKET}/nemo-experiments/<JOB_ID>
├── hparams.yaml
├── lightning_logs.txt
├── nemo_error_logs.txt
├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
├── dllogger
│ ├── rank-0
│ │ ├── dllogger.json
...
- hparams.yaml: the NeMo configuration used by the pretraining script. This includes the combined configuration file and the command line overrides
- lightning_logs.txt: the log files generated by PyTorch Lightning, which is used by NeMo
- nemo_error_logs.txt: the warning and error logs generated by NeMo
- nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt: the NeMo logs for each rank
- dllogger/: the logs captured by [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger). DLLogger is configured to store logs on the rank 0 node. The log is in JSON format and includes loss, step_time, and other key metrics for each training step
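To browse all artifacts of a completed job, you can list the experiment folder recursively. Replace <JOB_ID> with the ID of your training session.

gcloud storage ls --recursive gs://${GCS_BUCKET}/nemo-experiments/<JOB_ID>/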
Here is an example of an entry in the DLLogger log:
DLLL {
"timestamp": "1728595441.952723",
"datetime": "2024-10-10 21:24:01.952723",
"elapsedtime": "2087.21432",
"type": "LOG",
"step": 36,
"data": {
"reduced_train_loss": 7.976484775543213,
"lr": 0.000008490565960528329,
"global_step": 36,
"consumed_samples": 37888,
"train_backward_timing in s": 0.00005416870044427924,
"train_step_timing in s": 45.81364059448242,
"epoch": 0
}
}
The DLLogger log can be used to calculate the Model FLOPS Utilization (MFU) metric, as described in the next section.
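If you want to inspect the raw per-step timings before running the tool, the following is a minimal sketch that assumes every entry in dllogger.json uses the DLLL-prefixed JSON format shown above:

# Strip the DLLL prefix, then print the step number and step time for each logged step.
sed 's/^DLLL //' dllogger.json | \
jq -r 'select(.data["train_step_timing in s"] != null) | [.step, .data["train_step_timing in s"]] | @tsv'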
This section explains how to calculate key training performance metrics, such as Model FLOPS Utilization (MFU), using the dllogger.json file generated during training.
We provide a tool called training_metrics to help you easily compute these metrics. This tool can calculate the following metrics:
- MFU: Model FLOPS Utilization
- Average training step time: the average time taken for each training step
- TFLOPS per GPU: the number of Tera Floating Point Operations per second achieved by each GPU
To calculate training performance metrics using the training_metrics tool, complete the following steps from your client:
- Download the dllogger.json file generated during the training session. To do this, run the following command. Replace <JOB_ID> with the ID of your training session.
gcloud storage cp gs://${GCS_BUCKET}/nemo-experiments/<JOB_ID>/dllogger/rank-0/dllogger.json \
/path/to/your/local/dllogger.json
- Run the process_training_results.py script:
cd $REPO_ROOT/src/utils/training_metrics
python3 process_training_results.py --file /path/to/your/local/dllogger.json \
--batch_size 1024 \
--num_accelerators 128 \
--model_type llama3-70b \
--accelerator_type h100 \
--precision fp8 \
--start_step=15 \
--end_step=40
Note: The batch_size, num_accelerators, precision, model_type, and accelerator_type shown above are the specific values for this recipe running the default configuration. The average step time is computed between steps 15 and 40. Set --num_accelerators=512 if you run on 512 GPUs, and --precision=bf16 if you run your training session using bf16 precision.
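For example, a metrics run for a training session on 512 GPUs with bf16 precision might look like the following. The --batch_size value shown is the one used by the default configuration; verify it against the NeMo configuration you actually ran, because it may differ at 512 GPUs.

python3 process_training_results.py --file /path/to/your/local/dllogger.json \
--batch_size 1024 \
--num_accelerators 512 \
--model_type llama3-70b \
--accelerator_type h100 \
--precision bf16 \
--start_step=15 \
--end_step=40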
For more detailed information and advanced usage instructions for this tool, see the full documentation.
You can delete the job and other resources created by the Helm chart. To do this, uninstall the Helm release by running the following command from your client:
helm uninstall $USER-llama-3-70b-128-nemo
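If you also launched the 512 GPU or bf16 variants, uninstall those releases as well, using the release names from the earlier commands:

helm uninstall $USER-llama-3-70b-512-nemo
helm uninstall $USER-llama-3-70b-128-bf16-nemo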