Graphics Processing Units (GPUs) are essential for powering demanding workloads such as artificial intelligence (AI), machine learning (ML), large language models (LLMs), and a variety of other applications. With more workloads being distributed, the need to serve GPUs in a distributed way has also emerged.
Kubernetes has become the standard for orchestrating containers, particularly for workloads that rely heavily on GPUs. The Kubernetes ecosystem is extensive, with tools like Run:AI and Kai simplifying resource management. Given that GPUs represent a significant investment, it’s essential to understand how to set up and manage GPUs properly.
This blog guides you through setting up a GPU-enabled Kubernetes cluster on a managed Kubernetes distribution from Google, specifically Google Kubernetes Engine (GKE), for AI and ML workloads.
GPU Kubernetes Cluster on GKE
Google Kubernetes Engine (GKE) is the flagship managed Kubernetes engine from Google Cloud that helps you scale and deploy containerized applications. GKE offers a range of NVIDIA GPUs (including H100, A100, L4, B200, GB200 and T4 GPUs) that can be attached to nodes in your clusters, enabling efficient handling of compute-intensive tasks through two modes:
Standard Mode: In standard mode, you configure your cluster manually to attach GPU hardware to cluster nodes based on the workloads. If you don’t need GPUs, detach your GPU node pool; if you need them, attach one.
Autopilot Mode: For a hands-off approach, GKE Autopilot allows GPU workload deployment without the need for infrastructure management. However, you will incur a flat fee of $0.10 per hour for each cluster, in addition to the control plane cost.
Depending on the GPU architecture, GPUs can be configured in passthrough mode (granting virtual machines direct control over the GPU hardware for maximum performance), shared via time-slicing, or partitioned with Multi-Instance GPU (MIG).
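On GKE, time-slicing and MIG are selected through extra parameters on the same --accelerator flag used when creating a GPU node pool (the full node pool workflow is covered later in this guide). As a rough sketch, assuming the gpu-sharing-strategy and max-shared-clients-per-gpu parameter names from the GKE documentation (verify them against your gcloud version), a time-sliced T4 pool could look like this:
# Hedged sketch: a node pool whose T4 GPUs are time-sliced between up to two pods each.
# MIG partitioning uses a gpu-partition-size parameter instead and requires MIG-capable GPUs such as the A100.
gcloud container node-pools create timeshared-pool \
  --cluster gpu-cluster \
  --zone us-central1 \
  --machine-type "n1-standard-4" \
  --accelerator "type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=2,gpu-driver-version=latest" \
  --num-nodes 1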
Prerequisites
This blog will focus on the Standard Mode on GKE. Before we start, there are a few things you need to configure, considering that GPUs are in high demand and you’re using a managed provider.
GPU Quota: GPUs are in high demand, and it’s quite possible your account doesn’t have the required quota in your selected region/zone. Without adequate quota, you won't be able to create GPU nodes. You can view your quotas under IAM & Admin > Quotas and adjust them according to your requirements. Read more about requesting quotas here.
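As a quick check from the CLI (a hedged sketch; GPU quota metric names such as NVIDIA_T4_GPUS come from the Compute Engine API), you can inspect a region's GPU quotas like this:
# List the region's quotas and pull out the GPU-related entries
gcloud compute regions describe us-central1 | grep -i -B1 -A1 gpu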
Google Cloud SDK (gcloud): To create and manage GKE resources from the command line. Make sure you have a project linked to a billing account and are authenticated. Install the gcloud CLI [Note: You can use the UI if you prefer]
Kubernetes CLI (kubectl): To manage Kubernetes clusters. Install tools
Basic Kubernetes Knowledge: Familiarity with NodePool, Nodes, and kubectl commands.
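A quick way to verify the prerequisites before continuing (my-project-id is a placeholder for your own project ID):
gcloud --version                          # confirm the gcloud CLI is installed
gcloud auth login                         # authenticate your Google account
gcloud config set project my-project-id   # placeholder: point gcloud at your billing-linked project
kubectl version --client                  # confirm kubectl is installed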
Setting Up GPU on GKE Cluster
After completing your prerequisites, the next step is to configure a cluster, attach a GPU pool to it, and then test the workloads.
Step 1: Enable the Google Kubernetes Engine API
First and foremost, you need to enable the Google Kubernetes Engine API.
gcloud services enable container.googleapis.com
This will allow you to create GKE Clusters on your account.
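As an optional sanity check, you can confirm the API is enabled for the active project:
# The service should appear in the list of enabled APIs
gcloud services list --enabled | grep container.googleapis.com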
Step 2: Create a GKE Cluster
With gcloud installed, you can create your cluster using the gcloud container clusters create command, along with a few parameters that specify the project and location of the deployment. To create a basic GKE cluster with a minimal default pool of non-GPU nodes (the control plane itself is managed by Google), use the following command:
gcloud container clusters create gpu-cluster \
--zone us-central1 \
--release-channel "regular" \
--machine-type "n1-standard-4" \
--num-nodes 1
If you want autoscaling for this pool, you can add the --enable-autoscaling flag (a sketch follows below), but for this demo we keep the default node pool minimal in the us-central1 region and enable autoscaling only on the GPU node pool later.
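For reference, a variant of the same command with autoscaling enabled on the default pool might look like this (the node counts are illustrative):
# Illustrative sketch: the cluster create command with autoscaling bounds on the default node pool
gcloud container clusters create gpu-cluster \
  --zone us-central1 \
  --release-channel "regular" \
  --machine-type "n1-standard-4" \
  --num-nodes 1 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 3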
A successful run of the cluster create command produces output similar to this:
Creating cluster gpu-cluster in us-central1... Cluster is being health-checked
(Kubernetes Control Plane is healthy)...done.
Created [https://container.googleapis.com/v1/projects/hrittik-project/zones/us-central1/clusters/gpu-cluster].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/gpu-cluster?project=hrittik-project
kubeconfig entry generated for gpu-cluster.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
gpu-cluster us-central1 1.32.4-gke.1353003 35.222.72.130 n1-standard-4 1.32.4-gke.1353003 3 RUNNING
With the default node pool running on regular CPU hardware, you can now create a separate node pool for GPU workloads. (The output above lists three nodes even though we requested one: us-central1 is a region, so GKE provisions the node pool in each of the region's zones and the --num-nodes value applies per zone.)

Keeping GPU and non-GPU workloads on separate node pools is a best practice for cost efficiency, since GPU nodes are expensive and should only run workloads that actually need them; a minimal sketch of pinning general workloads to the default pool follows below.
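As a minimal sketch of that separation (the Deployment name and image are placeholders), general workloads can be pinned to the default pool via the GKE-provided node pool label:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: default-pool   # keep this workload on the non-GPU pool
      containers:
        - name: web
          image: nginx:1.27     # placeholder image
EOF
GKE also typically applies an nvidia.com/gpu taint to GPU nodes, so pods without a matching toleration won't land there anyway; the nodeSelector simply makes the intent explicit.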
Step 3: Create a GPU Node Pool
For GPU nodes, we will again use the gcloud command, but this time with the --accelerator flag to attach GPUs to the nodes in the pool. This flag takes three parameters:
type: This parameter specifies the GPU type (here, NVIDIA Tesla T4).
count: This specifies how many GPUs of the specified type should be attached to each node in the node pool.
gpu-driver-version: With this parameter, you can choose the driver installation between manual installation with the NVIDIA GPU Operator or using GKE's managed GPU driver installation. For simplicity, we will opt for the managed offering, where GKE manages driver installation, device plugin deployment, and driver updates during node auto-upgrades. The manual approach is advantageous when you have specific requirements.
In this demo, we also configure the --enable-autoscaling flag so you can experiment with a few GPUs, keeping a minimum of one GPU node always available and scaling up to a maximum of three.
You can use the following command to create the pool:
gcloud container node-pools create gpu-node-pool \
--cluster gpu-cluster \
--zone us-central1 \
--machine-type "n1-standard-4" \
--accelerator "type=nvidia-tesla-t4,count=1,gpu-driver-version=latest" \
--num-nodes 1 \
--min-nodes 1 \
--max-nodes 3 \
--enable-autoscaling
A successful output would return the name of the GPU pool along with other details:
Creating node pool gpu-node-pool...done.
Created [https://container.googleapis.com/v1/projects/hrittik-project/zones/us-central1/clusters/gpu-cluster/nodePools/gpu-node-pool].
NAME MACHINE_TYPE DISK_SIZE_GB NODE_VERSION
gpu-node-pool n1-standard-4 100 1.32.4-gke.1353003
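You can also inspect the pool directly to confirm the accelerator and autoscaling settings:
# Show the node pool's configuration, including accelerators and autoscaling
gcloud container node-pools describe gpu-node-pool \
  --cluster gpu-cluster \
  --zone us-central1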
Step 4: Verify your Nodes
The gcloud utility automatically configures the kubeconfig on your local machine. If that’s not the case for you, use the following command:
gcloud container clusters get-credentials gpu-cluster --zone us-central1
Once the authentication is complete and the two node pools are configured, the kubectl get nodes command will show you all the available nodes.
~ ❯ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-gpu-cluster-default-pool-00369598-wrm6 Ready <none> 15m v1.32.4-gke.1353003
gke-gpu-cluster-default-pool-407e8ca0-xx5k Ready <none> 15m v1.32.4-gke.1353003
gke-gpu-cluster-default-pool-680730b0-fc90 Ready <none> 15m v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-b13d6460-rtz2 Ready <none> 4m39s v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-d1e385d9-2fgs Ready <none> 4m34s v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-ea752474-j96x Ready <none> 4m40s v1.32.4-gke.1353003
If you want to list only the accelerator nodes, use labels and selectors with the `kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4` command, as shown below:
~ ❯ kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4
NAME STATUS ROLES AGE VERSION
gke-gpu-cluster-gpu-node-pool-b13d6460-rtz2 Ready <none> 7m3s v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-d1e385d9-2fgs Ready <none> 6m58s v1.32.4-gke.1353003
gke-gpu-cluster-gpu-node-pool-ea752474-j96x Ready <none> 7m4s v1.32.4-gke.1353003
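To confirm the managed driver and device plugin have exposed the GPUs to Kubernetes, check that a GPU node advertises the nvidia.com/gpu resource (substitute one of your own node names):
# Capacity and Allocatable should both show nvidia.com/gpu: 1
kubectl describe node gke-gpu-cluster-gpu-node-pool-b13d6460-rtz2 | grep nvidia.com/gpu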
You can easily view the nodes using the GCP Dashboard! Just navigate to your cluster and take a look at your NodePool:

Step 5: Test a Sample GPU Workload
The next step is to test whether the GPU cluster is functioning correctly. To do this, we use a simple cuda-vector-add Pod and run it with an nvidia.com/gpu: 1 limit, which schedules it onto a node with a GPU. Run the following command against your cluster:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
If everything is configured correctly, the Pod will be scheduled onto a GPU node and run successfully. Using kubectl logs gpu-test-pod, you can look at the Pod's logs:
~ ❯ kubectl logs gpu-test-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
"Test PASSED" signifies successful GPU computation, allowing your workloads to operate on GPUs in your cluster. With this, you can deploy various models to your GPU-enabled cluster.

If you want to explore more details, using the GCP Logs Explorer is the best option. For example, the picture above shows how our Pod requested a GPU during the specified duration to run the CUDA add operation.
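Long-running workloads request a GPU the same way the test Pod did. As a hedged sketch (the name and image below are placeholders for whatever model server you actually deploy):
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: your-registry/your-model-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one full T4 per replica
EOF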
Step 6: Clean up Resources
The final step is to delete your GPU-enabled Kubernetes cluster, freeing up resources and preventing unnecessary charges. To do this, use the following command:
gcloud container clusters delete gpu-cluster --zone=us-central1
The output would look like this once you confirm deletion:
~ ❯ gcloud container clusters delete gpu-cluster --zone=us-central1
The following clusters will be deleted.
- [gpu-cluster] in [us-central1]
Do you want to continue (Y/n)? y
Deleting cluster gpu-cluster...done.
Deleted [https://container.googleapis.com/v1/projects/hrittik-project/zones/us-central1/clusters/gpu-cluster].
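Alternatively, if you only want to detach the GPU capacity (as described under Standard Mode earlier) while keeping the cluster running, you can delete just the GPU node pool:
# Remove only the GPU node pool; the cluster and default pool keep running
gcloud container node-pools delete gpu-node-pool \
  --cluster gpu-cluster \
  --zone us-central1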
Managing GPU Resources for Multiple Teams
While GKE provides excellent GPU support, organizations often encounter challenges with GPU utilization and multi-tenancy. This is where vCluster, an open-source Kubernetes multi-tenancy solution, can significantly enhance your GPU infrastructure.
The main issues organizations face with expensive hardware like GPUs are underutilization and over-sharing. vCluster creates virtual Kubernetes clusters that operate within your physical GKE cluster. Each virtual cluster has its own API server, offering better isolation than namespaces while being more cost-effective than separate clusters. The benefits include:
Teams Gain Control: GPUs can be shared between virtual clusters across teams, reducing underutilization while each team keeps control of its own environment.
Cost Savings: You don’t need to pay cloud providers for a separate control plane per team and can rely on vCluster to manage isolation. Moreover, you can sync resources from the host cluster to remove duplication in your stack and save on licensing costs.
Rapid Provisioning: Launch environments in seconds instead of hours, enabling your team to ship quickly.
A recent example comes from a Platform Lead at a Fortune 500 finance company, who stated, “We replaced our previous setup of one cluster per team with virtual clusters and virtual nodes on a shared fleet. Now, each team feels like they have their own dedicated GPU platform, resulting in significantly better utilization.”
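If you want to try this on the GKE cluster from this guide, a minimal sketch with the vCluster CLI looks like the following (assuming the CLI is installed and using flags per the vCluster docs; team-a is a placeholder name):
# Create an isolated virtual cluster for a team inside the shared GKE cluster
vcluster create team-a --namespace team-a

# Connect to it; subsequent kubectl commands target the virtual cluster's own API server
vcluster connect team-a --namespace team-a
Pods created inside the virtual cluster are synced down to the host GKE cluster, so a workload that requests nvidia.com/gpu: 1 still lands on the shared GPU node pool.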
Final Thoughts
To wrap up, we’ve covered how to set up a GKE cluster for GPU workloads, tested GPU usage, and explored the options for building a flexible GPU cluster. As sharing GPUs across teams becomes more important, Kubernetes offers the tools to balance efficiency and isolation for modern AI and ML workloads.
Having private, dedicated, and shared GPU tenancy options within a single cluster allows you to support various workloads and teams without sacrificing isolation. This flexibility helps maximize your GPU return on investment (ROI). As a result, vCluster is becoming the preferred solution for teams of different sizes looking to achieve maximum savings and utilization.
More Questions? Join our Slack to talk to the team behind vCluster!