The demand for running AI and ML workloads is exploding. As platform teams throughout the industry try to keep up with this growing appetite for compute, they are running into a familiar pain point: GPUs are expensive, and they are rarely used efficiently. If you have ever watched a single pod lock down a powerful A100 to run inference for a few minutes, then sit idle while another team waits in the queue, you have felt the sting.
That’s where GPU sharing in Kubernetes comes in. It’s no longer just a nice-to-have; it’s quickly becoming a critical part of building scalable, cost-efficient AI infrastructure.
Let’s walk through what GPU sharing actually means in Kubernetes, how it works, and what tradeoffs you’ll want to be aware of before rolling it out.
Why We Care About GPU Sharing
Kubernetes was built for running processes on CPUs and RAM, scaling application workloads with uniform replicas and round-robin load balancing. That pattern works well for most HTTP traffic because requests are generally short-lived, relatively uniform, and can be handled equally well by any replica.
Scheduling, quota enforcement, and overcommit strategies all assume those hardware resources are relatively divisible. But working with GPUs isn't so easy.
LLM inference workloads are a different story: requests are expensive, and their size and processing time vary widely from one to the next. The scaling and load balancing patterns built for HTTP traffic tend to fall short when applied to them.
By default, Kubernetes allocates GPUs at the device level. With the NVIDIA device plugin (typically installed via the GPU Operator), you request nvidia.com/gpu: 1 and you get the whole thing: memory, compute, and all. There’s no native support for slicing that resource up across pods.
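To make that concrete, here’s a minimal sketch of a whole-device request; the pod name and image tag are just placeholders, and any CUDA-capable image would do:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod                 # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
    command: ["nvidia-smi", "-L"]     # just list the GPU the pod was handed
    resources:
      limits:
        nvidia.com/gpu: 1             # claims one entire physical GPU
```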
That’s a problem for multi-tenant clusters or teams running mixed workloads. Training a model might need a full graphics card, but inference or experimentation might need only a fraction. Without sharing, your cluster burns money and stalls productivity.
The Challenge: GPUs Don’t Like to Share
The problem isn’t Kubernetes, it’s the GPU hardware itself.
GPUs weren’t originally designed for fine-grained partitioning. They are stateful, sensitive to context switching, and come with vendor-specific tooling. That has made sharing them safely and predictably a hard technical problem. But the ecosystem is catching up.
How GPU Sharing Actually Works in Kubernetes
Depending on your hardware, Kubernetes setup, and workload type, there are a few main approaches to GPU sharing.
1. NVIDIA MPS (Multi-Process Service)
MPS is a binary-compatible, client-server implementation of the CUDA API. It allows multiple CUDA applications, running as separate processes, to share the same GPU concurrently. It’s a software-level solution that works well for parallelizable inference or small batch jobs.

Pros:
Doesn’t require new hardware.
Supported in NVIDIA’s device plugin stack.
Cons:
Weaker isolation than hardware partitioning; a misbehaving client can affect the others.
Not ideal for jobs with strict performance isolation needs.
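If you’re using a recent NVIDIA device plugin (v0.15 and later document MPS sharing), the sharing is driven by a plugin config along these lines; the replica count is an arbitrary example, and the exact schema may differ by version:

```yaml
# Sketch of a device plugin config enabling MPS sharing: each physical GPU is
# advertised as 4 nvidia.com/gpu resources served through a shared MPS daemon.
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```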
2. MIG (Multi-Instance GPU)
Newer NVIDIA GPUs (like the A100 or H100) support MIG, which lets you carve a single GPU into multiple hardware-backed “slices” with isolated memory and compute.

Pros:
Hardware-enforced isolation.
Each slice can be scheduled independently by Kubernetes.
Cons:
Only works on MIG-capable GPUs.
Fixed partition sizes, less flexible than full dynamic scheduling.
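On the Kubernetes side, a MIG slice is requested much like a whole GPU; this sketch assumes the GPU Operator’s “mixed” MIG strategy, which exposes each profile as its own resource name (with the “single” strategy, slices still appear as nvidia.com/gpu):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference                 # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # one 1g.5gb hardware slice of an A100
```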
3. Timeslicing or Custom Scheduling
There are also solutions that use timeslicing or custom scheduling logic to simulate GPU sharing, from the time-slicing support built into NVIDIA’s device plugin to third-party projects like Volcano, Kubeflow, and Run:ai.
Volcano is a cloud native batch scheduler designed for high-performance workloads—great if you’re managing queues of jobs across teams and want fine-grained control over priorities and fairness. Run:ai is a software platform from NVIDIA that is designed to manage and optimize GPU resources for AI workloads. Kubeflow, on the other hand, is an open-source toolkit focused on managing the end-to-end machine learning lifecycle on Kubernetes, and while it doesn’t directly handle GPU sharing, it often runs on clusters where efficient GPU orchestration is critical.
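To give a flavor of the batch-scheduling approach, here’s roughly what a Volcano job could look like; the queue name and container image are placeholders, and the queue itself would need to exist in your cluster:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: team-a-training               # placeholder name
spec:
  schedulerName: volcano              # hand the job to Volcano, not the default scheduler
  queue: team-a                       # hypothetical per-team queue for fairness/priority
  minAvailable: 1
  tasks:
  - name: worker
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.05-py3   # example image
          resources:
            limits:
              nvidia.com/gpu: 1
```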

Pros:
Can run on a wide range of hardware.
Enables fairness or job prioritization across teams.
Cons:
Adds scheduler complexity.
May result in unpredictable latency.
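As for time-slicing itself, the NVIDIA device plugin (and the GPU Operator that deploys it) takes a config along these lines; the replica count below is an arbitrary example, and oversubscribed pods share the GPU with no memory isolation between them:

```yaml
# Sketch of a time-slicing config (typically applied as a ConfigMap referenced
# by the GPU Operator). Each physical GPU is advertised as 4 nvidia.com/gpu
# resources, so up to 4 pods can take turns on it.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```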
GPU Sharing with Virtual Machines
Something else to consider is whether your cluster runs on virtual machines (VMs). How much this matters depends on how the GPUs are exposed to the VMs and what kind of sharing mechanism you’re using: GPU passthrough or vGPU.
GPU passthrough gives a VM direct access to a physical GPU, as if the card were plugged straight into that VM. The hypervisor assigns the GPU exclusively to the VM, which makes it fully available to workloads inside it. The downside is that you can’t easily share the GPU across multiple VMs.
A virtual GPU (vGPU), by contrast, lets the hypervisor virtualize a physical GPU and assign slices of it to multiple VMs at once. Each VM sees its own virtualized portion of the GPU and can run its own workloads, and a cluster inside the VM can treat that vGPU slice as a device. NVIDIA’s vGPU offering is proprietary and requires a license, compatible GPU hardware, and sometimes vendor-specific tooling.
MPS and MIG can still work inside a VM, with some caveats. Some hypervisors block MIG mode, so you may not be able to use MIG if the hypervisor doesn’t expose the full GPU. MPS requires the VM to have full access to the GPU’s driver stack and compute runtime, which isn’t always straightforward depending on the host setup.
GPU Sharing with Bare Metal
The preferred option for GPU sharing with Kubernetes is to run directly on bare metal. Bare metal gives your clusters full control of the GPU with no hypervisor restrictions, and all the approaches mentioned above (MIG, MPS, and custom scheduling) are easier to support. You’ll also get lower latency and better performance running directly on your servers.
Multi-tenancy on bare metal can be achieved through a few different methods: shared nodes, dedicated nodes, and private nodes.
Shared nodes – tenants share nodes dynamically, with shared CNI and CSI
Dedicated nodes – nodes are assigned to tenants with labels, while CNI and CSI remain shared
Private nodes – a hosted control plane model where private nodes are assigned to separate clusters with entirely separate CNI and CSI
Saiyam Pathak dives into these three methods of creating multi-tenancy in much greater detail in his post Bare Metal Kubernetes with GPU: Challenges and Multi-Tenancy Solutions. Take a look at that piece to learn more about these three main approaches to multi-tenancy while getting the most out of your GPU hardware.
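As a small illustration of the dedicated-node pattern, a tenant-specific label and taint on a GPU node can be matched by that tenant’s pods; the label and taint keys here are made up for the example:

```yaml
# Hypothetical example: pin team-a's GPU workloads to nodes labeled and tainted
# for that tenant, e.g. after running:
#   kubectl label node gpu-node-01 tenant=team-a
#   kubectl taint node gpu-node-01 tenant=team-a:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: team-a-gpu-job                # placeholder name
spec:
  restartPolicy: Never
  nodeSelector:
    tenant: team-a                    # only land on team-a's dedicated GPU nodes
  tolerations:
  - key: tenant
    operator: Equal
    value: team-a
    effect: NoSchedule                # tolerate the tenant taint
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
```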
Which would you pick?
That’s a lot of information, and as you can see, there are several factors to weigh based on your situation. Which scenario best suits your needs? If your GPU doesn't support MIG, timeslicing may be the way to go. If you have a MIG-capable GPU, you can even timeslice in addition to MIG. It's complicated and depends on your use case, but every day there are more tools to help.
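If you do combine the two, the same time-slicing config shown earlier can reference MIG resource names instead of whole GPUs; this sketch assumes the mixed MIG strategy and a 1g.5gb profile, and the replica count is arbitrary:

```yaml
# Sketch: time-slice each 1g.5gb MIG instance so two pods can share it.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/mig-1g.5gb
      replicas: 2
```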
GPU sharing in Kubernetes isn’t magic, but it is real, and it can transform the efficiency of your platform if you set it up thoughtfully.
If your teams are waiting on GPU access or your budget is ballooning with underused hardware, it’s time to dig in. Whether you start with MPS, MIG, or a hybrid approach, the tools are here, and getting better fast.