How to Set Up a GPU-Enabled Kubernetes Cluster on GKE: Step-by-Step Guide for AI & ML Workloads

Hrittik Roy
8 Minute Read

Graphics Processing Units (GPUs) are essential for powering demanding workloads such as artificial intelligence (AI), machine learning (ML), large language models (LLMs), and a variety of other applications. As more of these workloads run as distributed systems, the need to provision and share GPUs across many machines has grown as well.

Kubernetes has become the standard for orchestrating containers, particularly for workloads that rely heavily on GPUs. The Kubernetes ecosystem is extensive, with tools like Run:AI and Kai simplifying resource management. Given that GPUs represent a significant investment, it’s essential to understand how to set up and manage GPUs properly.

This blog guides you through setting up a GPU-enabled Kubernetes cluster on a managed Kubernetes distribution from Google, specifically Google Kubernetes Engine (GKE), for AI and ML workloads.

GPU Kubernetes Cluster on GKE

Google Kubernetes Engine (GKE) is the flagship managed Kubernetes engine from Google Cloud that helps you scale and deploy containerized applications. GKE offers a range of NVIDIA GPUs (including H100, A100, L4, B200, GB200 and T4 GPUs) that can be attached to nodes in your clusters, enabling efficient handling of compute-intensive tasks through two modes:

Standard Mode: In standard mode, you configure your cluster manually to attach GPU hardware to cluster nodes based on the workloads. If you don’t need GPUs, detach your GPU node pool; if you need them, attach one.

Autopilot Mode: For a hands-off approach, GKE Autopilot lets you deploy GPU workloads without managing the underlying node infrastructure. You pay the flat cluster management fee of $0.10 per hour per cluster, which covers the managed control plane, plus the compute resources your workloads request. A minimal sketch of this route follows below.
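
For reference, here is a minimal sketch of the Autopilot route (the cluster name and Pod below are illustrative only): you create an Autopilot cluster and simply request a GPU in the Pod spec, and GKE provisions a suitable node for you.

# Illustrative Autopilot example: GKE provisions GPU nodes on demand.
gcloud container clusters create-auto autopilot-gpu-demo --region us-central1

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: autopilot-gpu-test
spec:
  restartPolicy: OnFailure
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: cuda-vector-add
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF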

Depending on the GPU's architecture, GPUs can be exposed to workloads in passthrough mode (granting virtual machines direct control over the GPU hardware for maximum performance), shared between Pods via time-slicing, or partitioned with Multi-Instance GPU (MIG).
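
As a rough sketch of how those sharing modes are configured on GKE (the node pool names below are hypothetical, the gpu-cluster referenced is the one we create later in this guide, and the exact parameter names are worth verifying against your gcloud version), both are expressed through extra parameters on the --accelerator flag:

# Time-slicing: let up to 4 Pods share each physical T4 GPU.
gcloud container node-pools create timeshared-pool \
    --cluster gpu-cluster --zone us-central1 \
    --machine-type "n1-standard-4" \
    --accelerator "type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=4,gpu-driver-version=latest"

# MIG: partition an A100 into isolated 1g.5gb instances (MIG requires A100/H100-class GPUs).
gcloud container node-pools create mig-pool \
    --cluster gpu-cluster --zone us-central1 \
    --machine-type "a2-highgpu-1g" \
    --accelerator "type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb,gpu-driver-version=latest"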

Prerequisites

This blog will focus on the Standard Mode on GKE. Before we start, there are a few things you need to configure, considering that GPUs are in high demand and you’re using a managed provider.

GPU Quota: GPUs are in high demand, and it's quite possible your account doesn't have the required quota in your selected region or zone. Without adequate quota, you won't be able to create GPU nodes. You can view quotas under IAM & Admin > Quotas and request increases according to your requirements (a quick CLI check is sketched after this list). Read more about requesting quotas here.

Google Cloud SDK (gcloud): To create and manage GKE resources from the command line. Make sure you have a project with a billing account linked and that you are authenticated. Install the gcloud CLI. (You can use the Cloud Console UI if you prefer.)

Kubernetes CLI (kubectl): To manage Kubernetes clusters. Install tools

Basic Kubernetes Knowledge: Familiarity with nodes, node pools, and kubectl commands.
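
For a quick check from the terminal, one option (a sketch, assuming the Compute Engine API is already enabled in your project) is to describe your target region and look for GPU quota metrics such as NVIDIA_T4_GPUS:

# List the region's quotas and filter for GPU entries.
gcloud compute regions describe us-central1 --format="yaml(quotas)" | grep -i -B1 -A1 gpu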

Setting Up GPU on GKE Cluster

After completing your prerequisites, the next step is to configure a cluster, attach a GPU pool to it, and then test the workloads.

Step 1: Enable the Google Kubernetes Engine API

First and foremost, you need to enable the Google Kubernetes Engine API.

gcloud services enable container.googleapis.com

This allows you to create GKE clusters in your project.
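
If you want to confirm the API is enabled before moving on, a quick check is:

# container.googleapis.com should appear among the enabled services.
gcloud services list --enabled | grep container.googleapis.com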

Step 2: Create a GKE Cluster

With gcloud installed, you can create your cluster using the gcloud container clusters create command, along with a few parameters specifying the project and location of the deployment. To create a basic GKE cluster with a small default pool of non-GPU nodes (the control plane itself is managed by GKE), use the following command:

gcloud container clusters create gpu-cluster \
    --zone us-central1 \
    --release-channel "regular" \
    --machine-type "n1-standard-4" \
    --num-nodes 1

If you want autoscaling here, you can add the --enable-autoscaling flag, but for this demo we reserve autoscaling for the GPU node pool and keep the default (non-GPU) node pool minimal in us-central1. Note that us-central1 is a region rather than a single zone, so the cluster ends up regional and --num-nodes applies per zone, which is why the output below reports three nodes.

A successful run produces output similar to this:

Creating cluster gpu-cluster in us-central1... Cluster is being health-checked 
(Kubernetes Control Plane is healthy)...done.                                  
Created [https://container.googleapis.com/v1/projects/hrittik-project/zones/us-central1/clusters/gpu-cluster].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/gpu-cluster?project=hrittik-project
kubeconfig entry generated for gpu-cluster.

NAME         LOCATION     MASTER_VERSION      MASTER_IP      MACHINE_TYPE   NODE_VERSION        NUM_NODES  STATUS
gpu-cluster  us-central1  1.32.4-gke.1353003  35.222.72.130  n1-standard-4  1.32.4-gke.1353003  3          RUNNING
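
You can also confirm the cluster state from the CLI at any point:

# Prints RUNNING once the cluster is ready.
gcloud container clusters describe gpu-cluster --zone us-central1 --format="value(status)"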

With the default node pool running on regular CPU hardware, you can now create a separate node pool with GPUs attached for your GPU workloads.

The reason for keeping them separate is that distinguishing between GPU and non-GPU workloads is a best practice for cost efficiency: GPU nodes are expensive, so only workloads that actually need a GPU should land on them.

Step 3: Create a GPU Node Pool

For the GPU nodes, we will again use the gcloud command, but this time we include the --accelerator flag to attach GPUs to the nodes in the pool. In this guide we use three of its parameters:

type: This parameter specifies the GPU Type (here, NVIDIA Tesla T4)

count: This specifies how many GPUs of the specified type should be attached to each node in the node pool.

gpu-driver-version: With this parameter, you can choose the driver installation between manual installation with the NVIDIA GPU Operator or using GKE's managed GPU driver installation. For simplicity, we will opt for the managed offering, where GKE manages driver installation, device plugin deployment, and driver updates during node auto-upgrades. The manual approach is advantageous when you have specific requirements.

In this demo, we also configure the --enable-autoscaling flag so you can experiment, with a minimum of one GPU node always available and a maximum of three when demand spikes (for a regional cluster, these limits apply per zone).

You can use the following command to create the pool:

gcloud container node-pools create gpu-node-pool \
    --cluster gpu-cluster \
    --zone us-central1 \
    --machine-type "n1-standard-4" \
    --accelerator "type=nvidia-tesla-t4,count=1,gpu-driver-version=latest" \
    --num-nodes 1 \
    --min-nodes 1 \
    --max-nodes 3 \
    --enable-autoscaling

On success, the command returns the name of the GPU node pool along with other details:

Creating node pool gpu-node-pool...done.                                                                                                                                      
Created [https://container.googleapis.com/v1/projects/hrittik-project/zones/us-central1/clusters/gpu-cluster/nodePools/gpu-node-pool].
NAME           MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
gpu-node-pool  n1-standard-4  100           1.32.4-gke.1353003

Step 4: Verify your Nodes

The gcloud utility automatically configures the kubeconfig on your local machine. If that’s not the case for you, use the following command:

gcloud container clusters get-credentials gpu-cluster --zone us-central1

Once authentication is complete and both node pools are configured, the kubectl get nodes command will show you all the available nodes.

~ ❯ kubectl get nodes
NAME                                          STATUS   ROLES    AGE     VERSION
gke-gpu-cluster-default-pool-00369598-wrm6    Ready    <none>   15m     v1.32.4-gke.1353003
gke-gpu-cluster-default-pool-407e8ca0-xx5k    Ready    <none>   15m     v1.32.4-gke.1353003
gke-gpu-cluster-default-pool-680730b0-fc90    Ready    <none>   15m     v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-b13d6460-rtz2   Ready    <none>   4m39s   v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-d1e385d9-2fgs   Ready    <none>   4m34s   v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-ea752474-j96x   Ready    <none>   4m40s   v1.32.4-gke.1353003

If you want to list only the accelerator nodes, use the `kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4` command to filter them by label, as shown below:

~ ❯ kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4
NAME                                          STATUS   ROLES    AGE     VERSION
gke-gpu-cluster-gpu-node-pool-b13d6460-rtz2   Ready    <none>   7m3s    v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-d1e385d9-2fgs   Ready    <none>   6m58s   v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-ea752474-j96x   Ready    <none>   7m4s    v1.32.4-gke.1353003
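
To confirm that each GPU node actually exposes its GPU to the scheduler, a check like the following (a sketch; output formatting may vary by kubectl version) inspects the allocatable nvidia.com/gpu resource and the taint GKE places on GPU nodes to keep non-GPU workloads off them:

# Each GPU node should report 1 allocatable nvidia.com/gpu.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4 \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# GPU nodes carry a taint (nvidia.com/gpu=present:NoSchedule) so only GPU workloads land on them.
kubectl describe nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4 | grep -i taints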

You can also view the nodes easily from the Google Cloud console by navigating to your cluster and looking at its node pools.

Step 5: Test a Sample GPU Workload

The next step is to test whether the GPU cluster is functioning correctly. To do this, we run a simple cuda-vector-add Pod with an nvidia.com/gpu: 1 limit, which ensures it is scheduled onto a node with a GPU. Run the following command against your cluster:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
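
You can watch the Pod move from Pending to Running and finally Completed once the CUDA sample finishes (pulling the image may take a minute or two):

kubectl get pod gpu-test-pod -w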

If everything is configured correctly, the Pod will be scheduled onto a GPU node and run to completion. Using kubectl logs gpu-test-pod, you can inspect the Pod's logs:

~ ❯ kubectl logs gpu-test-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

"Test PASSED" signifies successful GPU computation, allowing your workloads to operate on GPUs in your cluster. With this, you can deploy various models to your GPU-enabled cluster.

If you want to explore more details, the GCP Logs Explorer is the best option. For example, the screenshot above shows how our Pod requested a GPU for the duration of the CUDA vector-add run.
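
Once you are done inspecting, you can remove the test Pod:

kubectl delete pod gpu-test-pod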

Step 6: Clean up Resources

The final step is to delete your GPU-enabled Kubernetes cluster, freeing up resources and preventing unnecessary charges. To do this, use the following command:

gcloud container clusters delete gpu-cluster --zone=us-central1

The output would look like this once you confirm deletion:

~ ❯ gcloud container clusters delete gpu-cluster --zone=us-central1   
                                                                                                                                                        
The following clusters will be deleted.
 - [gpu-cluster] in [us-central1]
Do you want to continue (Y/n)?  y
Deleting cluster gpu-cluster...done.                                                                                                                                                                                                           
Deleted [https://container.googleapis.com/v1/projects/hrittik-project/zones/us-central1/clusters/gpu-cluster].       
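
If you would rather keep the cluster around and only release the costly GPU capacity, you can delete just the GPU node pool instead:

gcloud container node-pools delete gpu-node-pool --cluster gpu-cluster --zone us-central1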

Managing GPU Resources for Multiple Teams

While GKE provides excellent GPU support, organizations often encounter challenges with GPU utilization and multi-tenancy. This is where vCluster, an open-source Kubernetes multi-tenancy solution, can significantly enhance your GPU infrastructure.

The main issues organizations face with expensive hardware like GPUs are underutilization and over-sharing. vCluster creates virtual Kubernetes clusters that operate within your physical GKE cluster. Each virtual cluster has its own API server, offering better isolation than namespaces while being more cost-effective than separate clusters. The benefits include:

  1. Teams Gain Control: Each team gets its own virtual cluster while the underlying GPUs are shared across teams, reducing the waste that comes from underutilized hardware.

  2. Cost Savings: You don't need to pay a separate control plane fee to your cloud provider for every team and can rely on vCluster to handle isolation. Moreover, you can sync shared resources from the host cluster to remove duplication in your stack and save on licensing costs.

  3. Rapid Provisioning: Launch environments in seconds instead of hours, enabling your team to ship quickly.
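
As a minimal sketch of what this looks like in practice (assuming the vcluster CLI is installed and your kubeconfig points at the GKE cluster; the team name and manifest file are illustrative):

# Create a virtual cluster for a team inside its own namespace on the host GKE cluster.
vcluster create team-a --namespace team-a

# Run commands against the virtual cluster; Pods created here are synced to the host,
# where they can request nvidia.com/gpu like any other workload.
vcluster connect team-a -- kubectl apply -f gpu-test-pod.yaml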

A recent example comes from a Platform Lead at a Fortune 500 finance company, who stated, “We replaced our previous setup of one cluster per team with virtual clusters and virtual nodes on a shared fleet. Now, each team feels like they have their own dedicated GPU platform, resulting in significantly better utilization.”

Final Thoughts

To wrap up, we've covered how to set up a GKE cluster for GPU workloads, how to test GPU usage, and which options exist for building a flexible GPU cluster. As sharing GPUs across teams becomes more important, Kubernetes offers the tools to balance efficiency and isolation for modern AI and ML workloads.

Having private, dedicated, and shared GPU tenancy options within a single cluster allows you to support various workloads and teams without sacrificing isolation. This flexibility helps maximize your GPU return on investment (ROI). As a result, vCluster is becoming the preferred solution for teams of different sizes looking to achieve maximum savings and utilization.

More Questions? Join our Slack to talk to the team behind vCluster!
