October 3, 2025

A friendly guide to multi‑GPU LLM serving

Most apps can stay on one GPU longer than you think. Move to multiple GPUs when the model no longer fits on a single card, when long contexts push memory past safe limits, or when throughput goals demand it. Parallelism adds communication costs and new failure modes: plan it, test it, and keep your caps tight.

Try Compute today: Launch a vLLM inference server on Compute with 2×, 4×, or 8× GPU presets. Choose France or UAE, stream by default, and keep max_tokens and context caps sensible while you test batch shapes.

When multi‑GPU actually pays off

  • Model does not fit. Even with int8, model size (weights + KV‑cache) can exceed a single card’s VRAM.
  • Context is long. High concurrency plus long prompts/outputs (longer sequences) push the KV cache past safe headroom.
  • Throughput ceiling. You need more tokens/second at the same latency target than one card can deliver.
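A quick back-of-envelope check makes the first two bullets concrete. The sketch below uses the standard KV-cache estimate (2 × layers × KV heads × head dim × sequence length × batch × bytes per element); the model dimensions are invented for a hypothetical ~13B model, so substitute your own:

```python
# Back-of-envelope check: do weights + KV cache fit in one GPU's VRAM?
# All model dimensions below are illustrative, not tied to any real model.

def weights_bytes(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; fp16 elements (2 bytes) by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3
w = weights_bytes(13e9, 1)                   # int8 weights: ~12.1 GiB
kv = kv_cache_bytes(40, 40, 128, 8192, 16)   # 40 layers, 8k context, batch 16
total_gib = (w + kv) / GIB
print(f"weights {w / GIB:.1f} GiB + kv {kv / GIB:.1f} GiB = {total_gib:.1f} GiB")
print("fits in 80 GiB?", total_gib < 80)
```

Note how the KV cache, not the weights, dominates at long context and high batch: even int8 weights cannot save you once the cache alone passes the card's capacity.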

In multi‑GPU serving, tune the global batch size deliberately: it is the main lever on throughput and GPU utilization, but oversizing it trades latency and KV‑cache headroom for marginal gains.

If your issue is mostly queueing or oversized caps, fix those first. Multi‑GPU won’t save a bad scheduler.

Should I go multi‑GPU or add more single‑GPU replicas?

If the model fits on one card, prefer replicas first. It is simpler, isolates failures, and scales well for many workloads.

Why did latency get worse after going multi‑GPU?

Likely communication overhead or a batch shape that triggers cache pressure. Check interconnect bandwidth, trim caps, and re‑measure.

Can multi‑GPU help with long context?

Yes, by spreading memory across cards. But also consider RAG and quantization before adding complexity.

How do I know it’s time to upgrade?

When TTFT p95 rises and TPS flattens at steady traffic despite clean caps and healthy memory headroom on one GPU.
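One way to watch that signal, assuming you already log per-request time-to-first-token. The helper below uses the nearest-rank percentile; the sample values are made up:

```python
import math

# Minimal p95 helper for spotting the "TTFT p95 rises, TPS flattens" signal.
# Assumes you log per-request time-to-first-token (seconds) somewhere.

def p95(samples):
    xs = sorted(samples)
    # nearest-rank percentile: index ceil(0.95 * n) - 1
    return xs[math.ceil(0.95 * len(xs)) - 1]

ttft = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.18, 0.12, 0.95, 0.13]
print(f"TTFT p95: {p95(ttft):.2f}s")  # one slow outlier dominates the tail
```

Track this at steady traffic: a mean that looks fine while p95 climbs is exactly the pattern that precedes the upgrade decision.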

What is the role of the embedding layer in pipeline parallelism?

The embedding layer maps input vocabulary to hidden states. In pipeline parallelism, the embedding layer is often placed at the start of the pipeline and may be tied or shared across model stages to ensure consistency and efficiency.
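As a toy illustration (all shapes invented), the embedding layer is just a lookup table, and "tying" means the output head reuses that same table:

```python
import numpy as np

# Toy embedding lookup: token ids -> hidden states.
# Shapes are invented for illustration: vocab 10, hidden size 4.
rng = np.random.default_rng(0)
embedding = rng.standard_normal((10, 4))  # the embedding layer is a table

token_ids = np.array([3, 1, 4])
hidden_states = embedding[token_ids]      # first pipeline stage's output

# A "tied" output head reuses the same table to map back to vocabulary logits.
logits = hidden_states @ embedding.T
print(hidden_states.shape, logits.shape)
```

In a real pipeline the lookup runs on the first stage and the tied head on the last, which is why frameworks must keep the two copies of the table in sync.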

How are transformer blocks and transformer layers distributed across GPUs?

Transformer blocks (the model's repeated layers) are split across GPUs differently by each strategy: pipeline parallelism gives each GPU a contiguous subset of layers, while tensor parallelism splits the weights inside each layer across GPUs. Both let the model scale beyond what a single card can hold.
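A minimal sketch of the pipeline-parallel side: splitting N layers into contiguous chunks, one per stage. The layer and stage counts are arbitrary:

```python
# Sketch: split n_layers transformer layers into contiguous chunks,
# one chunk per GPU (pipeline stage), balancing any remainder.

def partition_layers(n_layers: int, n_stages: int):
    base, extra = divmod(n_layers, n_stages)
    stages, start = [], 0
    for s in range(n_stages):
        size = base + (1 if s < extra else 0)  # early stages absorb the remainder
        stages.append(list(range(start, start + size)))
        start += size
    return stages

print(partition_layers(32, 4))  # 4 stages of 8 layers each
```

Real frameworks refine this by cost: the first and last stages also host the embedding and output head, so they often get fewer transformer layers.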

How are expert layers distributed in Mixture of Experts (MoE) models?

Expert layers in MoE architectures are distributed across multiple GPUs. This distribution enables parallel computation of different experts, improving scalability and computational efficiency during training and inference.
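A simple placement scheme, shown for illustration only, is round-robin assignment of experts to GPUs:

```python
# Sketch: round-robin placement of MoE experts onto GPUs (expert parallelism).
# Tokens routed to an expert are sent to whichever GPU hosts it.

def assign_experts(n_experts: int, n_gpus: int):
    placement = {g: [] for g in range(n_gpus)}
    for e in range(n_experts):
        placement[e % n_gpus].append(e)
    return placement

print(assign_experts(8, 4))  # e.g. GPU 0 hosts experts 0 and 4
```

The hard part in practice is not placement but load balance: if the router sends most tokens to a few popular experts, their GPUs become the bottleneck.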

What are the challenges of training LLM with large activation memory?

Training LLMs (large language models) requires managing significant activation memory. Frameworks like NeMo help distribute activations and optimize memory use, which is critical for efficient multi‑GPU training.

How do the sequence and sequence dimension affect parallelism strategies?

Parallelism strategies like sequence parallelism partition and distribute activation data along the sequence dimension. This allows efficient handling of long input sequences and better utilization of GPU memory and compute resources.
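As a shape-level sketch (dimensions invented), sequence parallelism is a split along the sequence axis, so each GPU holds activations for only a slice of the tokens:

```python
import numpy as np

# Sketch: shard activations along the sequence dimension across "GPUs".
# Invented shapes: batch 2, sequence length 8, hidden size 4, 4 shards.
acts = np.zeros((2, 8, 4))
shards = np.split(acts, 4, axis=1)  # axis 1 is the sequence dimension
print([s.shape for s in shards])    # each shard holds 2 of the 8 positions
```

Operations that act per position (LayerNorm, dropout) run independently on each shard; attention still needs communication, since every position attends across the whole sequence.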

What does linear propagation of data through pipeline stages mean?

Linear propagation means data moves sequentially through each pipeline stage: every stage consumes the previous stage's output, infers its input shapes from it, and passes its own result forward, with no skip connections or complex routing.
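In code, linear propagation is plain function composition. The stages below are placeholder arithmetic functions standing in for per-GPU model chunks:

```python
# Sketch: "linear propagation" is just running stages in order.
# Each lambda stands in for a model chunk that would live on its own GPU.

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def run_pipeline(x):
    for stage in stages:  # data flows strictly forward, no skips or routing
        x = stage(x)
    return x

print(run_pipeline(5))  # ((5 + 1) * 2) - 3 = 9
```

This strict linearity is what makes pipeline partitioning easy to automate: any cut point between stages is a valid GPU boundary.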

How is pipeline parallelism implemented in popular frameworks?

Pipeline parallelism is implemented in frameworks like Megatron-LM and DeepSpeed by integrating with Data Parallelism (DP), Tensor Parallelism (TP), ZeRO, and various pipeline schedules. These frameworks provide practical configurations and codebases for deploying pipeline parallelism effectively.
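The scheduling idea can be sketched abstractly: in a GPipe-style forward schedule, stage s works on microbatch t − s at tick t, so microbatches fill the pipeline front to back. This is a simplification of what Megatron-LM and DeepSpeed actually implement (their schedules interleave forward and backward passes):

```python
# Sketch of a GPipe-style forward schedule: at tick t, stage s processes
# microbatch t - s, when that microbatch exists.

def schedule(n_microbatches: int, n_stages: int):
    ticks = []
    for t in range(n_microbatches + n_stages - 1):
        ticks.append({s: t - s for s in range(n_stages)
                      if 0 <= t - s < n_microbatches})
    return ticks

for t, work in enumerate(schedule(4, 3)):
    print(t, work)  # ticks where a stage is missing are "pipeline bubbles"
```

The partially empty ticks at the start and end are the pipeline bubble; more microbatches per global batch shrink its relative cost.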