Merge pull request #572 from MicrosoftDocs/main
11/22/2024 PM Publish
Taojunshen authored Nov 22, 2024
2 parents f06444f + bdbd9e1 commit 6f9286d
Showing 3 changed files with 51 additions and 41 deletions.
13 changes: 12 additions & 1 deletion articles/aks/access-private-cluster.md
@@ -17,6 +17,8 @@ With the Azure CLI, you can use `command invoke` to access private clusters with

With the Azure portal, you can use the `Run command` feature to run commands on your private cluster. The `Run command` feature uses the same `command invoke` functionality to run commands on your cluster.

The pod created by the `Run command` provides `kubectl` and `helm` for operating your cluster. `jq`, `xargs`, `grep`, and `awk` are available for Bash support.
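
For example, a one-off check can be run against a private cluster straight from the CLI. The resource group and cluster names below are placeholders:

```azurecli-interactive
# Run a single kubectl command on the private cluster without direct network access to the API server.
az aks command invoke \
    --resource-group myResourceGroup \
    --name myAKSCluster \
    --command "kubectl get pods -n kube-system"
```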

## Before you begin

Before you begin, make sure you have the following resources and permissions:
@@ -27,7 +29,12 @@ Before you begin, make sure you have the following resources and permissions:

### Limitations

The pod created by the `run` command provides `helm` and the latest compatible version of `kubectl` for your cluster with `kustomize`.
This feature is designed to simplify cluster access and is ***not designed for programmatic access***. If a program invokes Kubernetes through `Run command`, the following disadvantages apply:

* You only get the *exitCode* and *text output*; API-level details are lost.
* The extra hop introduces additional failure points.

The pod created by the `Run command` is hard coded with a `200m CPU` and `500Mi memory` request, and a `500m CPU` and `1Gi memory` limit. In rare cases where all of your nodes are fully packed, the pod can't be scheduled within the ARM API limit of 60 seconds, which means the `Run command` fails even if the cluster is configured to autoscale.

`command invoke` runs the commands from your cluster, so any commands run in this manner are subject to your configured networking restrictions and any other configured restrictions. Make sure there are enough nodes and resources in your cluster to schedule this command pod.
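
As a rough capacity check before relying on `command invoke`, something like the following confirms current node utilization. The names are placeholders, and `kubectl top` assumes the metrics server is available (it's deployed by default on AKS):

```azurecli-interactive
# Check node utilization from inside the cluster to confirm the command pod can be scheduled.
az aks command invoke \
    --resource-group myResourceGroup \
    --name myAKSCluster \
    --command "kubectl top nodes"
```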

@@ -117,6 +124,10 @@ You can use the following kubectl commands with the `Run command` feature:
4. Select the file(s) you want to attach and then select **Attach**.
5. Enter the command you want to run and select **Run**.

## Disable `Run command`

Currently, the only way you can disable the `Run command` feature is by setting [`.properties.apiServerAccessProfile.disableRunCommand` to `true`](/rest/api/aks/managed-clusters/create-or-update).
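
There's no dedicated CLI flag for this property, so one approach is to round-trip the cluster resource through the REST API with `az rest`. The following is only a sketch: the subscription ID is a placeholder, the API version is an assumption, and it assumes `jq` is available for editing the resource body.

```azurecli-interactive
# Sketch: fetch the managed cluster resource, set disableRunCommand, and PUT the result back.
URL="https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster?api-version=2024-09-01"

az rest --method get --url "$URL" \
    | jq '.properties.apiServerAccessProfile.disableRunCommand = true' > cluster.json

az rest --method put --url "$URL" --body @cluster.json
```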

---

## Troubleshooting
76 changes: 37 additions & 39 deletions articles/aks/gpu-multi-instance.md
@@ -9,17 +9,18 @@ ms.subservice: aks-nodes

# Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS)

Nvidia's A100 GPU can be divided in up to seven independent instances. Each instance has its own memory and Stream Multiprocessor (SM). For more information on the Nvidia A100, see [Nvidia A100 GPU][Nvidia A100 GPU].
NVIDIA's A100 GPU can be divided into up to seven independent instances. Each instance has its own Stream Multiprocessor (SM), which is responsible for executing instructions in parallel, and its own GPU memory. For more information on the NVIDIA A100, see [NVIDIA A100 GPU][NVIDIA A100 GPU].

This article walks you through how to create a multi-instance GPU node pool in an Azure Kubernetes Service (AKS) cluster.
This article walks you through how to create a multi-instance GPU node pool using a MIG-compatible VM size in an Azure Kubernetes Service (AKS) cluster.

## Prerequisites and limitations

* An Azure account with an active subscription. If you don't have one, you can [create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
* Azure CLI version 2.2.0 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
* The Kubernetes command-line client, [kubectl](https://kubernetes.io/docs/reference/kubectl/), installed and configured. If you use Azure Cloud Shell, `kubectl` is already installed. If you want to install it locally, you can use the [`az aks install-cli`][az-aks-install-cli] command.
* Helm v3 installed and configured. For more information, see [Installing Helm](https://helm.sh/docs/intro/install/).
* You can't use Cluster Autoscaler with multi-instance node pools.
* Multi-instance GPU node pools are not currently supported on Azure Linux.
* You can't use Cluster Autoscaler with multi-instance GPU node pools.

## GPU instance profiles

@@ -33,9 +34,9 @@ GPU instance profiles define how GPUs are partitioned. The following table shows
| MIG 4g.20gb | 4/7 | 4/8 | 1 |
| MIG 7g.40gb | 7/7 | 8/8 | 1 |

As an example, the GPU instance profile of `MIG 1g.5gb` indicates that each GPU instance has 1g SM(Computing resource) and 5gb memory. In this case, the GPU is partitioned into seven instances.
As an example, the GPU instance profile of `MIG 1g.5gb` indicates that each GPU instance has 1g SM (streaming multiprocessors) and 5gb memory. In this case, the GPU is partitioned into seven instances.

The available GPU instance profiles available for this instance size include `MIG1g`, `MIG2g`, `MIG3g`, `MIG4g`, and `MIG7g`.
The GPU instance profiles available for this VM size include `MIG1g`, `MIG2g`, `MIG3g`, `MIG4g`, and `MIG7g`.

> [!IMPORTANT]
> You can't change the applied GPU instance profile after node pool creation.
@@ -53,25 +54,30 @@ The available GPU instance profiles available for this instance size include `MI
```azurecli-interactive
az aks create \
--resource-group myResourceGroup \
--name myAKSCluster \
--node-count 1 \
--generate-ssh-keys
```
3. Configure `kubectl` to connect to your AKS cluster using the [`az aks get-credentials`][az-aks-get-credentials] command.

```azurecli-interactive
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
```

## Create a multi-instance GPU node pool

You can use either the Azure CLI or an HTTP request to the ARM API to create the node pool.

### [Azure CLI](#tab/azure-cli)

* Create a multi-instance GPU node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command and specify the GPU instance profile.
* Create a multi-instance GPU node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command and specify the GPU instance profile. The example below creates a node pool with the `Standard_ND96asr_v4` MIG-compatible GPU VM size.

```azurecli-interactive
az aks nodepool add \
--name aks-mignode \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--node-vm-size Standard_ND96asr_v4 \
--node-count 1 \
--gpu-instance-profile MIG1g
```

@@ -94,12 +100,12 @@ You can use either the Azure CLI or an HTTP request to the ARM API to create the

## Determine multi-instance GPU (MIG) strategy

Before you install the Nvidia plugins, you need to specify which multi-instance GPU (MIG) strategy to use for GPU partitioning: *Single strategy* or *Mixed strategy*. The two strategies don't affect how you execute CPU workloads, but how GPU resources are displayed.
Before you install the NVIDIA plugins, you need to specify which multi-instance GPU (MIG) strategy to use for GPU partitioning: *Single strategy* or *Mixed strategy*. The two strategies don't affect how you execute CPU workloads, but they determine how GPU resources are displayed.

* **Single strategy**: The single strategy treats every GPU instance as a GPU. If you use this strategy, the GPU resources are displayed as `nvidia.com/gpu: 1`.
* **Mixed strategy**: The mixed strategy exposes the GPU instances and the GPU instance profile. If you use this strategy, the GPU resources are displayed as `nvidia.com/mig1g.5gb: 1`. The check below shows which names a node actually advertises.
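
To check which of these resource names a node advertises once the device plugin from the next section is installed, a quick query like the following may help (the node name is a placeholder):

```azurecli-interactive
# List the allocatable resources the device plugin exposes on a MIG-enabled node.
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```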

## Install the NVIDIA device plugin and GPU feature discovery
## Install the NVIDIA device plugin and GPU feature discovery (GFD) components

1. Set your MIG strategy as an environment variable. You can use either single or mixed strategy.

@@ -111,52 +117,40 @@ Before you install the Nvidia plugins, you need to specify which multi-instance
export MIG_STRATEGY=mixed
```

2. Add the Nvidia device plugin and GPU feature discovery helm repos using the `helm repo add` and `helm repo update` commands.
2. Add the NVIDIA device plugin helm repository using the `helm repo add` and `helm repo update` commands.

```azurecli-interactive
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update
```

3. Install the Nvidia device plugin using the `helm install` command.

```azurecli-interactive
helm install \
--version=0.14.0 \
--generate-name \
--set migStrategy=${MIG_STRATEGY} \
nvdp/nvidia-device-plugin
```

4. Install the GPU feature discovery using the `helm install` command.
4. Install the NVIDIA device plugin using the `helm install` command.

```azurecli-interactive
helm install nvdp nvdp/nvidia-device-plugin \
--version=0.15.0 \
--set migStrategy=${MIG_STRATEGY} \
--set gfd.enabled=true \
--namespace nvidia-device-plugin \
--create-namespace
```

> [!NOTE]
> Helm installation of the NVIDIA device plugin **[version 0.15.0](https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0) and above** consolidates the device plugin **and** GFD repositories. Separate helm installation of the GFD software component is not recommended in this tutorial.
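
After the install completes, a quick way to confirm that the device plugin and GFD pods are running is to list the pods in the namespace used above:

```azurecli-interactive
# The namespace matches the --namespace value passed to helm install.
kubectl get pods -n nvidia-device-plugin
```
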
## Confirm multi-instance GPU capability

1. Configure `kubectl` to connect to your AKS cluster using the [`az aks get-credentials`][az-aks-get-credentials] command.

```azurecli-interactive
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
```

2. Verify the connection to your cluster using the `kubectl get` command to return a list of cluster nodes.
1. Verify the `kubectl` connection to your cluster using the `kubectl get` command to return a list of cluster nodes.

```azurecli-interactive
kubectl get nodes -o wide
```

3. Confirm the node has multi-instance GPU capability using the `kubectl describe node` command. The following example command describes the node named *mignode*, which uses MIG1g as the GPU instance profile.
2. Confirm the node has multi-instance GPU capability using the `kubectl describe node` command. The following example command describes the node named *aks-mignode*, which uses MIG1g as the GPU instance profile.

```azurecli-interactive
kubectl describe node aks-mignode
```

Your output should resemble the following example output:
@@ -173,7 +167,7 @@ Before you install the Nvidia plugins, you need to specify which multi-instance

## Schedule work

The following examples are based on cuda base image version 12.1.1 for Ubuntu22.04, tagged as `12.1.1-base-ubuntu22.04`.
The following examples are based on CUDA base image **version 12.1.1** for Ubuntu **22.04**, tagged as `12.1.1-base-ubuntu22.04`.

### Single strategy

@@ -268,7 +262,11 @@ If you don't see multi-instance GPU capability after creating the node pool, con

## Next steps

For more information on AKS node pools, see [Manage node pools for a cluster in AKS](./manage-node-pools.md).
To learn more about GPUs on Azure Kubernetes Service, see:

* [Create a Linux GPU-enabled node pool on AKS](./gpu-cluster.md)
* [Create a Windows GPU-enabled node pool on AKS](./use-windows-gpu.md)
* [Learn about use cases for GPU workloads on AKS](https://learn.microsoft.com/azure/architecture/reference-architectures/containers/aks-gpu/gpu-aks)

<!-- LINKS - internal -->
[az-group-create]: /cli/azure/group#az_group_create
@@ -279,4 +277,4 @@ For more information on AKS node pools, see [Manage node pools for a cluster in
[az-aks-get-credentials]: /cli/azure/aks#az_aks_get_credentials

<!-- LINKS - external-->
[Nvidia A100 GPU]:https://www.nvidia.com/en-us/data-center/a100/
[NVIDIA A100 GPU]:https://www.nvidia.com/en-us/data-center/a100/
3 changes: 2 additions & 1 deletion articles/aks/long-term-support.md
@@ -71,7 +71,8 @@ If you want to carry out an in-place migration, the AKS service will migrate you
```

> [!NOTE]
> 1.30 is the next LTS version after 1.27. You can opt into LTS from a 1.30 version cluster through the steps given above. LTS version 1.27 will go end of life (EOL) by July 2025.
> Supported patches in LTS today: [1.27.100](https://github.com/aks-lts/kubernetes/blob/release-1.27-lts/CHANGELOG/CHANGELOG-1.27.md#v127100-akslts)
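
The opt-in itself, in a minimal sketch, looks like the following. The resource group and cluster names are placeholders, and the flags assume a current Azure CLI version:

```azurecli-interactive
# Move an existing cluster to the Premium tier and the long-term support plan.
az aks update \
    --resource-group myResourceGroup \
    --name myAKSCluster \
    --tier premium \
    --k8s-support-plan AKSLongTermSupport
```
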
## Disable long-term support on an existing cluster

Expand Down
