GPU VM vs bare metal: what we measured on Compute with Hivenet

Customers ask a fair question before they move serious GPU work into a virtual machine: will the VM slow me down?

For light workloads, the answer may not matter much. For multi-GPU training and inference, it can matter a lot. A fast GPU does not help if the communication path between GPUs is poorly exposed, if the guest sees the wrong PCIe topology, or if the virtualization layer limits the hardware details that libraries such as NCCL depend on.

So we tested a workload that can expose those problems quickly: NCCL AllReduce on a single host with 8× NVIDIA GeForce RTX 5090 GPUs.

The bare-metal baseline measured 19.25 GB/s. The VM on Compute with Hivenet measured 19.34 GB/s on the same benchmark.

That is a +0.5% delta in the VM’s favor, but we do not read it as “the VM is faster.” The difference sits inside normal run-to-run variance.
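The quoted delta is plain arithmetic on the two measured figures. A minimal sketch, where the helper name `percent_delta` is ours and only the two bandwidth values come from this article:

```python
def percent_delta(baseline: float, candidate: float) -> float:
    """Relative difference of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


bare_metal_gb_s = 19.25  # GB/s, bare-metal busbw reported in this article
vm_gb_s = 19.34          # GB/s, VM busbw reported in this article

delta = percent_delta(bare_metal_gb_s, vm_gb_s)
print(f"VM vs bare metal: {delta:+.2f}%")  # about +0.47%, quoted as +0.5%
```

A gap this small only means something when it is compared against the spread between repeated runs, which is why the delta is not read as a win for the VM.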

The useful conclusion is simpler: on this benchmark, the Compute with Hivenet VM matched bare-metal bandwidth. This result has also held across thousands of continuous benchmark runs on the same 8-GPU host configuration.

What we tested

We used NVIDIA’s NCCL AllReduce benchmark, all_reduce_perf, on a single host with eight NVIDIA GeForce RTX 5090 GPUs.
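For readers who want to run a similar test, here is a hedged sketch that assembles an `all_reduce_perf` command line. The flags `-b`, `-e`, `-g`, `-n`, and `-w` are options of the open-source nccl-tests suite. The binary path, iteration counts, and the choice to pin the minimum and maximum size to the same value are assumptions for illustration, not the exact invocation behind the numbers in this article.

```python
from typing import List


def all_reduce_perf_command(
    binary: str = "./build/all_reduce_perf",  # assumed nccl-tests build location
    size: str = "1G",         # -b and -e set to the same value: one message size
    gpus: int = 8,            # -g: GPUs driven from a single process
    iters: int = 20,          # -n: timed iterations (illustrative value)
    warmup_iters: int = 5,    # -w: untimed warm-up iterations (illustrative value)
) -> List[str]:
    """Build an argument list suitable for subprocess.run()."""
    return [
        binary,
        "-b", size,
        "-e", size,
        "-g", str(gpus),
        "-n", str(iters),
        "-w", str(warmup_iters),
    ]


print(" ".join(all_reduce_perf_command()))
```

Running the same command on bare metal and inside the VM, with identical driver and NCCL versions, keeps the comparison like for like.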

NCCL, the NVIDIA Collective Communications Library, is widely used in multi-GPU AI workloads. AllReduce is a collective operation where GPUs exchange and combine data. It appears often in distributed training, tensor-parallel inference, and other workloads where GPUs need to coordinate rather than work alone.

That makes it a useful test for GPU VM performance. A single-GPU benchmark can miss problems in the system around the GPU. Multi-GPU communication is less forgiving. If PCIe passthrough is incomplete, if topology is exposed poorly, or if traffic is routed inefficiently, NCCL is likely to show it.
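To make the AllReduce operation concrete, the sketch below simulates its result in plain Python: every rank contributes a buffer, and every rank ends up with the elementwise sum. It illustrates the semantics only. It does not model how NCCL moves data between GPUs, and the function name is ours.

```python
from typing import List


def all_reduce_sum(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate AllReduce(sum): each rank receives the elementwise sum of all buffers."""
    if not buffers:
        return []
    length = len(buffers[0])
    if any(len(buf) != length for buf in buffers):
        raise ValueError("all ranks must contribute buffers of equal length")
    reduced = [sum(values) for values in zip(*buffers)]
    # Every rank gets its own copy of the reduced buffer.
    return [list(reduced) for _ in buffers]


ranks = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
result = all_reduce_sum(ranks)
print(result)  # three identical copies of [111.0, 222.0]
```

On real hardware, the cost of this operation is dominated by how quickly the buffers can travel between GPUs, which is the path the benchmark stresses.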

Area	What we measured
Benchmark	NCCL `all_reduce_perf`
Metric	AllReduce bus bandwidth (`busbw`)
Message size	1 GB
Hardware	8× NVIDIA GeForce RTX 5090
Scope	Single host, 8 GPUs
Bare metal	19.25 GB/s
Compute with Hivenet VM	19.34 GB/s
Reading	+0.5%, within run-to-run variance
Validation	Thousands of continuous benchmark runs on the same 8-GPU host configuration
Not covered	Application-level training time, storage, CPU bottlenecks, and multi-node performance

For bus bandwidth, higher is better.

The difference was +0.5%, which is within normal run-to-run variance. The responsible reading is that virtualization did not create a measurable bandwidth penalty in this test.
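The `busbw` metric in the table is derived rather than directly timed. In the nccl-tests convention, algorithm bandwidth is message size divided by elapsed time, and AllReduce bus bandwidth scales that by 2(n-1)/n for n ranks. A minimal sketch of that conversion follows; the timing in the example is a hypothetical input, not a measurement from our runs.

```python
def allreduce_busbw(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """AllReduce bus bandwidth in GB/s, following the nccl-tests convention."""
    if n_ranks < 2:
        raise ValueError("AllReduce bus bandwidth needs at least two ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    algbw = size_bytes / seconds / 1e9          # algorithm bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks  # AllReduce correction factor


# Hypothetical timing: a 1e9-byte message reduced in 0.1 s across 8 ranks.
example = allreduce_busbw(1e9, 0.1, 8)
print(f"{example:.2f} GB/s")  # 10 GB/s algbw x 1.75 = 17.50 GB/s busbw
```

The correction factor makes the figure comparable across GPU counts, since it reflects traffic on the links rather than the size of the user buffer.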

Why this benchmark matters

Virtualization has a mixed reputation in GPU computing, and some of that skepticism is earned.

A VM can underperform when GPU passthrough is treated as a checkbox rather than a full system configuration problem. The guest may detect the GPU, but still miss important details about topology, routing, or device behavior. Drivers may load correctly while communication performance falls short of what the hardware can deliver.

That distinction matters for AI workloads. Multi-GPU jobs often spend meaningful time moving data, synchronizing tensors, or coordinating parallel work. When communication slows down, the whole workload can slow down with it.

That is why we tested NCCL AllReduce instead of relying on a more forgiving single-GPU benchmark. It asks a direct question: can the VM use the multi-GPU path properly, or does virtualization get in the way?

In this measurement, the Compute with Hivenet VM delivered the same practical bandwidth as the bare-metal baseline.

How Compute with Hivenet gets there

The result depends on host configuration. Passing a GPU through to a VM is only the starting point.

Compute with Hivenet hosts use a tuned configuration with NUMA-aware PCIe topology. GPUs are bound to the correct CPU socket and exposed under the right PCIe root complexes, so NCCL can see real link speeds and route traffic as expected.

That detail matters. GPU performance is shaped by the whole system: CPU socket, PCIe layout, GPU placement, driver behavior, and how the guest operating system sees the hardware.
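One way to see what the guest was given is to read the NUMA node that Linux reports for each GPU's PCI device. The sketch below walks the standard sysfs attributes `vendor`, `class`, and `numa_node`. It is a generic Linux check written for illustration, not a Compute with Hivenet tool, and the function name is ours. `nvidia-smi topo -m` shows a related view, including the link type between each GPU pair.

```python
from pathlib import Path
from typing import Dict

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpu_numa_nodes(sysfs_root: str = "/sys/bus/pci/devices") -> Dict[str, int]:
    """Map each NVIDIA display-class PCI address to the NUMA node the kernel reports."""
    nodes: Dict[str, int] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            numa = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this device
        # Class codes beginning with 0x03 are display controllers (VGA, 3D).
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            nodes[dev.name] = numa  # -1 means no NUMA affinity is reported
    return nodes
```

Calling `gpu_numa_nodes()` inside a guest returns a dictionary keyed by PCI address. Values of -1 across the board suggest the guest has no NUMA information for those devices, which is worth investigating before blaming NCCL.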

The goal is to give the VM a clean, predictable GPU environment. Inside the guest, teams should be able to use standard NVIDIA tooling such as drivers, NVML, nvidia-smi, and profiling tools. Hardware-specific features are available where the assigned GPU supports them.

That last qualifier is important. We do not want to imply that every feature exists on every GPU type. The promise is not “all NVIDIA features everywhere.” The point is that assigned devices are passed through with the access expected for that hardware.

What this result supports, and where it stops

A benchmark is only useful when its limits are clear.

This test measured NCCL AllReduce bus bandwidth on one host with eight NVIDIA GeForce RTX 5090 GPUs, using busbw at a 1 GB message size. It does not prove that every workload will perform identically to bare metal. Your application may depend on CPU behavior, memory pressure, storage, data loading, driver versions, model architecture, batch size, precision, or framework settings.

It also does not measure multi-node training performance. Once a job spans multiple hosts, networking becomes a major factor.

Compute with Hivenet provides 50 Gbps cross-host networking without RDMA by default. Cross-host RDMA is available on request. If your workload depends on multi-node training, talk to us before scoping and benchmark the exact pattern you plan to run.
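When scoping a multi-node job, it helps to convert the network line rate into the same units as the bandwidth figures above. The sketch below does that conversion and computes an idealized lower bound on transfer time. It assumes decimal units and ignores protocol overhead, congestion, and collective algorithm effects, so real transfers will take longer. The function names are ours.

```python
def gbit_to_gbyte_per_s(gigabits_per_second: float) -> float:
    """Convert a line rate from Gbit/s to GB/s (8 bits per byte, decimal units)."""
    return gigabits_per_second / 8.0


def ideal_transfer_seconds(size_bytes: float, gigabits_per_second: float) -> float:
    """Lower bound on transfer time at full line rate, with zero overhead assumed."""
    return size_bytes / (gbit_to_gbyte_per_s(gigabits_per_second) * 1e9)


line_rate_gb_s = gbit_to_gbyte_per_s(50)                  # 6.25 GB/s
seconds_per_gb = ideal_transfer_seconds(1e9, 50)          # 0.16 s for 1e9 bytes
print(f"{line_rate_gb_s:.2f} GB/s, {seconds_per_gb:.2f} s per GB")
```

Numbers like these are a ceiling for planning, not a prediction. The only reliable figure is a benchmark of the actual multi-node pattern.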

The result should also not be treated as a generic claim about all GPU VMs. Virtualization performance depends on how the host is configured. A poorly tuned GPU VM can still underperform.

This supports	This does not prove
The VM did not show a measurable NCCL AllReduce bandwidth penalty in this test.	Every workload will perform identically to bare metal.
Tuned PCIe passthrough can expose the GPU path cleanly for single-host multi-GPU work.	Multi-node training will behave the same without separate testing.
Compute with Hivenet is a credible option for single-host training, inference, and fine-tuning workloads.	Storage, networking, CPU, data loading, or framework choices will never become bottlenecks.
The result has held across thousands of continuous benchmark runs on the same 8-GPU host configuration.	Every possible GPU configuration, software stack, or application pattern will show the same result.

The narrower claim is the stronger one: with Compute with Hivenet’s tuned passthrough configuration, this single-host multi-GPU communication benchmark matched bare metal within run-to-run variance.

What this means for training and inference

For single-host multi-GPU workloads, this benchmark is a useful signal.

If your workload runs across several GPUs on one Compute with Hivenet instance, the virtualization layer should not be the first thing you suspect when measuring NCCL communication performance. On the tested 8× NVIDIA GeForce RTX 5090 setup, the VM matched bare-metal bandwidth within measurement noise, and the result has held across thousands of continuous benchmark runs on the same host configuration.

That is relevant for workloads such as multi-GPU model training, tensor-parallel inference, pipeline-parallel workloads, fine-tuning, and evaluation runs that need direct GPU access with control over the software environment.

A VM also gives teams more control over the guest operating system. You can bring a CUDA-compatible Linux image, install system packages, use Docker or containerd, and set up NVIDIA tooling around your own workflow.

That control is one reason teams choose VMs for longer-running services, custom runtime environments, and workloads where the operating system matters.

The benchmark answers the main concern: choosing VM-level control on Compute with Hivenet did not create a measured single-host NCCL bandwidth penalty in this test.

How to benchmark your own workload

NCCL AllReduce is a good starting point, but it should not be the only test you run.

For training, compare end-to-end step time, GPU utilization, communication time, and data loading behavior. For inference, measure latency, throughput, concurrency behavior, and cold starts if they matter to your application. For fine-tuning, test with the same model, batch size, precision, sequence length, and optimizer settings you expect to use in production.
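A small timing harness is often enough to collect step time or request latency in a comparable way. The sketch below is framework-agnostic: `step` is any callable that runs one training step or one inference request. The function name, default counts, and the dummy workload are assumptions for illustration. For GPU work, the callable should wait for the device to finish before returning (in PyTorch, for example, by calling `torch.cuda.synchronize()`), otherwise the timer only measures kernel launch.

```python
import statistics
import time
from typing import Callable, Dict


def time_steps(step: Callable[[], None], warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Time repeated calls to `step`, discarding warm-up iterations."""
    if iters < 1:
        raise ValueError("iters must be at least 1")
    for _ in range(warmup):
        step()  # untimed: lets caches, allocators, and clocks settle
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_s": statistics.fmean(samples),
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
    }


# Dummy CPU workload standing in for a real training step or request.
stats = time_steps(lambda: sum(range(10_000)))
print(stats)
```

Running the same harness with the same callable on both environments gives step-time and tail-latency figures that can be compared directly.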

Keep the comparison tight. Use the same GPU count, driver stack, CUDA version, framework version, model, batch size, and benchmark command. Run several iterations instead of trusting a single number. Watch for warm-up effects. Record the exact instance type and configuration.
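Once you have several runs per environment, a simple check can tell you whether a gap is larger than the noise. The sketch below uses a deliberately rough heuristic: the difference between means must not exceed `k` times the larger sample standard deviation. It is not a formal statistical test, the threshold `k` is an assumption, and the sample values are made-up placeholders rather than our measurements.

```python
import statistics
from typing import Sequence


def within_run_to_run_variance(
    baseline: Sequence[float],
    candidate: Sequence[float],
    warmup: int = 0,
    k: float = 2.0,
) -> bool:
    """Heuristic: is the gap between means no larger than k standard deviations?"""
    b = list(baseline)[warmup:]   # drop the first `warmup` samples of each series
    c = list(candidate)[warmup:]
    if len(b) < 2 or len(c) < 2:
        raise ValueError("need at least two post-warm-up samples per environment")
    gap = abs(statistics.fmean(c) - statistics.fmean(b))
    spread = max(statistics.stdev(b), statistics.stdev(c))
    return gap <= k * spread


# Placeholder series in arbitrary units, for illustration only.
env_a = [100.0, 101.0, 99.0, 100.5, 99.5]
env_b = [100.4, 100.9, 99.6, 100.2, 99.9]
env_c = [90.0, 90.5, 89.5, 90.2, 89.8]

print(within_run_to_run_variance(env_a, env_b))  # True: gap is inside the noise
print(within_run_to_run_variance(env_a, env_c))  # False: gap is far outside it
```

For decisions that matter, collect more runs and use a proper statistical comparison. The heuristic is only meant to stop a single lucky or unlucky number from driving the conclusion.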

The goal is not to make the VM win a lab contest. The goal is to know whether the environment behaves predictably under your workload.

Our NCCL result gives a useful baseline: for single-host 8× NVIDIA GeForce RTX 5090 communication, the Compute with Hivenet VM matched bare metal within variance across thousands of continuous benchmark runs.

Common questions

Does multi-GPU training run slower under a Compute with Hivenet VM?

On this benchmark, no. NCCL AllReduce on the tested 8× NVIDIA GeForce RTX 5090 setup matched bare-metal bandwidth within run-to-run variance.

That does not mean every training job will show identical end-to-end timing. Training performance depends on the model, data pipeline, precision, batch size, framework settings, storage behavior, and how much time the job spends in communication. For the part this benchmark measures, we did not see a virtualization penalty.

Do I get full GPU access inside the VM?

The assigned GPU is passed through to the VM, so you can use standard NVIDIA drivers, NVML, nvidia-smi, and profiling tools in the guest environment.

Feature availability depends on the GPU you choose. If a specific hardware feature matters to your workload, check that it is supported on the assigned GPU type before you build around it.
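A quick way to record what the guest actually sees is to query `nvidia-smi` and keep the output with your benchmark results. The sketch below uses the `--query-gpu` and `--format=csv,noheader` options with the fields `index`, `name`, `driver_version`, and `memory.total`. The wrapper functions are ours, and the set of fields is an arbitrary choice for illustration.

```python
import csv
import io
import subprocess
from typing import Dict, List

QUERY_FIELDS = ["index", "name", "driver_version", "memory.total"]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    rows: List[Dict[str, str]] = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return rows


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi in the guest and return one dict per visible GPU."""
    completed = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_csv(completed.stdout)
```

Calling `query_gpus()` inside the VM should return one entry per assigned GPU. Saving that output next to each benchmark run makes it easier to explain differences later.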

Can I run my own kernel or container runtime?

Yes. With a VM, you control the guest environment. You can use a CUDA-compatible Linux image, install your preferred container runtime, and configure the software stack around your workload.

Compute with Hivenet can provide a tuned default image, but you are not limited to a narrow container-only setup. That is useful for teams with existing Docker, containerd, Kubernetes, or custom runtime requirements.

What about multi-node workloads?

Multi-node workloads should be tested separately. This benchmark measured a single host with eight GPUs.

Once a job spans multiple hosts, network behavior becomes a major factor. Compute with Hivenet provides 50 Gbps cross-host networking without RDMA by default, and cross-host RDMA is available on request. For multi-node training, talk to us before scoping and test the actual training pattern you plan to run.

Should I choose a VM or a container for GPU work?

Choose a VM when you need operating system control, custom system packages, longer-running services, stricter isolation, or an environment that mirrors your own infrastructure.

Choose a container-first setup when you want fast job-based iteration and do not need to manage the operating system.

The performance question should still be tested against your workload, but this benchmark shows that a Compute with Hivenet VM can deliver bare-metal-level single-host multi-GPU communication performance when the host is configured correctly.

Same silicon. Same speed. Measured.

GPU buyers and engineering teams should be skeptical of loose performance claims. “Bare-metal-like” is easy to say and often too vague to help anyone make a decision.

So we measured a workload where virtualization overhead would be hard to hide.

On 8× NVIDIA GeForce RTX 5090 GPUs, Compute with Hivenet delivered 19.34 GB/s on NCCL AllReduce inside a VM, compared with 19.25 GB/s on the bare-metal baseline. The difference is within run-to-run variance, and the result has held across thousands of continuous benchmark runs on the same 8-GPU host configuration.

For teams running single-host multi-GPU training, inference, or fine-tuning, the takeaway is clear: Compute with Hivenet gives you VM-level control without a measured bandwidth penalty on this communication-heavy benchmark.

NVIDIA GeForce RTX 5090 instances on Compute with Hivenet start at €0.75 per GPU-hour, with per-second on-demand billing. For larger production training runs, contact sales about committed capacity.
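For a rough budget, the starting rate can be turned into a per-run estimate. The sketch below assumes cost scales linearly with GPU count and seconds used at the quoted starting rate, and it leaves out taxes, storage, and any other charges. Treat it as back-of-the-envelope arithmetic and check the pricing page for actual terms. The function name is ours.

```python
RATE_EUR_PER_GPU_HOUR = 0.75  # starting on-demand rate quoted in this article


def estimate_cost_eur(gpus: int, seconds: float, rate: float = RATE_EUR_PER_GPU_HOUR) -> float:
    """Estimate on-demand cost, assuming linear per-second proration."""
    if gpus < 1 or seconds < 0:
        raise ValueError("gpus must be >= 1 and seconds must be non-negative")
    return gpus * rate * seconds / 3600.0


one_hour_8_gpus = estimate_cost_eur(8, 3600)
print(f"8 GPUs for one hour: EUR {one_hour_8_gpus:.2f}")  # EUR 6.00
```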

Explore Compute with Hivenet if you need GPU capacity with VM control, assigned GPU access, and measured single-host performance. Start with the docs to launch an instance, or check pricing if you already know the GPU configuration you need.