October 3, 2025

A friendly guide to multi‑GPU LLM serving

Most apps can stay on one GPU longer than you think. Move to multiple GPUs when memory or throughput targets demand it, or when the model size exceeds the available GPU memory. Parallelism adds communication costs and new failure modes: plan, test, and keep your caps tight.

Try Compute today: Launch a vLLM inference server on Compute with 2×, 4×, or 8× GPU presets. Choose France or UAE, stream by default, and keep the max context and max_tokens caps sensible while you test batch shapes.
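As a rough illustration of what a capped multi‑GPU launch looks like, the sketch below assembles a `vllm serve` command line with an explicit tensor‑parallel degree, context cap, and concurrency cap. The model name and the specific numbers are placeholders, not recommendations, and the helper function is hypothetical. Note that max_tokens is a per‑request field sent by the client, so it does not appear among the server flags.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel_size, max_model_len, max_num_seqs):
    """Return the argv list for a `vllm serve` launch with explicit caps.

    This helper is hypothetical; it only formats arguments and does not
    start a server.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "vllm", "serve", model,
        # Number of GPUs each layer's weights are sharded across.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Hard cap on prompt + output tokens per request.
        "--max-model-len", str(max_model_len),
        # Hard cap on sequences batched together at once.
        "--max-num-seqs", str(max_num_seqs),
    ]


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder model name.
    command = build_vllm_serve_command("my-org/my-model", 2, 8192, 64)
    print(shlex.join(command))
```

Starting with conservative caps and raising them one at a time makes it easier to see which change moved latency.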

When multi‑GPU actually pays off

  • The model doesn't fit. Even with int8, the model footprint (weights + KV‑cache) can exceed the VRAM of a single card.
  • The context is long. High concurrency and long prompts/outputs (greater sequence length) push the cache past safe headroom.
  • Throughput ceiling. You need more tokens/second at the same latency target than one card can deliver.
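The first two cases can be checked with back‑of‑the‑envelope arithmetic before any hardware is provisioned. The sketch below estimates weight and KV‑cache memory for a standard transformer decoder; the model dimensions in the example are hypothetical, the 90% headroom factor is an assumption, and the estimate ignores activations and runtime overhead.

```python
def weight_bytes(num_params, bytes_per_param):
    """Memory for the weights alone (int8 is 1 byte/param, fp16 is 2)."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   num_sequences, bytes_per_value=2):
    """KV-cache size for `num_sequences` concurrent sequences of `seq_len` tokens.

    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * num_sequences * bytes_per_value)


def fits_on_one_gpu(total_bytes, vram_bytes, headroom=0.9):
    """True if the footprint stays within an assumed safe fraction of VRAM."""
    return total_bytes <= vram_bytes * headroom


if __name__ == "__main__":
    # Hypothetical model: 70B parameters in int8, 80 layers,
    # 8 KV heads of dimension 128, fp16 cache.
    weights = weight_bytes(70_000_000_000, 1)
    cache = kv_cache_bytes(80, 8, 128, seq_len=8192, num_sequences=32)
    total = weights + cache
    print(f"weights: {weights / 1e9:.1f} GB, KV cache: {cache / 1e9:.1f} GB")
    print("fits on one 80 GB card:", fits_on_one_gpu(total, 80 * 10**9))
```

Because the cache term scales with both sequence length and concurrency, tightening either cap is often the cheapest way to get back under the limit.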

Tuning the global batch size matters for throughput and GPU utilization in multi‑GPU setups.

If the problem is mostly queueing or oversized caps, fix those first. Multi‑GPU won't save a bad scheduler.

first. It is simpler, isolates failures, and scales well for many workloads.

Why did latency get worse after going multi‑GPU?

Likely communication overhead or a batch shape that triggers cache pressure. Check interconnect bandwidth, trim caps, and re‑measure.
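To get a feel for the communication term, the sketch below estimates the time spent in all‑reduce per forward pass under tensor parallelism. It assumes a ring all‑reduce, in which each GPU sends about 2(N−1)/N times the message size, and a Megatron‑style layout with two all‑reduces per transformer layer. It counts bandwidth only and ignores per‑call latency, so treat the result as a lower bound. All function and parameter names are illustrative.

```python
def ring_allreduce_bytes_per_gpu(message_bytes, num_gpus):
    """Bytes each GPU sends in a ring all-reduce of `message_bytes`."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def tp_comm_seconds_per_forward(num_layers, hidden_size, tokens_in_batch,
                                num_gpus, link_bytes_per_second,
                                bytes_per_value=2, allreduces_per_layer=2):
    """Bandwidth-only estimate of all-reduce time for one forward pass.

    Assumes every all-reduce moves one activation tensor of shape
    [tokens_in_batch, hidden_size].
    """
    message = tokens_in_batch * hidden_size * bytes_per_value
    per_call = ring_allreduce_bytes_per_gpu(message, num_gpus) / link_bytes_per_second
    return num_layers * allreduces_per_layer * per_call


if __name__ == "__main__":
    # Hypothetical numbers: 80 layers, hidden size 8192, 64 tokens per
    # decode step, 4 GPUs, and two different link speeds.
    for name, bandwidth in [("slow link", 25e9), ("fast link", 300e9)]:
        seconds = tp_comm_seconds_per_forward(80, 8192, 64, 4, bandwidth)
        print(f"{name}: {seconds * 1e3:.3f} ms per forward pass")
```

If this estimate is a noticeable share of your per‑token latency budget, the interconnect is a likely culprit.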

Can multi‑GPU help with long context?

Yes, by spreading memory across cards. But also consider RAG and quantization before adding complexity.

How do I know it’s time to upgrade?

When TTFT p95 rises and TPS flattens at steady traffic despite clean caps and healthy memory headroom on one GPU.
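One way to turn that rule into a check is to compare measurement windows taken at steady traffic. The sketch below computes a nearest‑rank p95 and flags the case where TTFT p95 has risen while TPS has barely moved. The 20% and 5% thresholds are assumptions to tune against your own service, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 1..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_consider_upgrade(ttft_p95_by_window, tps_by_window,
                            ttft_rise=1.2, tps_gain=1.05):
    """Compare the first and last windows of a steady-traffic run.

    Returns True when TTFT p95 grew by at least `ttft_rise` while
    tokens/second grew by no more than `tps_gain`.
    """
    ttft_rising = ttft_p95_by_window[-1] >= ttft_rise * ttft_p95_by_window[0]
    tps_flat = tps_by_window[-1] <= tps_gain * tps_by_window[0]
    return ttft_rising and tps_flat


if __name__ == "__main__":
    # Hypothetical TTFT samples (seconds) for two windows.
    early = [0.21, 0.25, 0.22, 0.30, 0.24]
    late = [0.35, 0.41, 0.38, 0.52, 0.40]
    p95s = [percentile(early, 95), percentile(late, 95)]
    print(should_consider_upgrade(p95s, tps_by_window=[1000, 1010]))
```

Run the check only after caps and memory headroom have been verified, since both can produce the same symptom.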

What is the role of the embedding layer in pipeline parallelism?

The embedding layer maps input token IDs to hidden states. In pipeline parallelism it usually sits in the first stage. When its weights are tied to the output projection in the last stage, the two copies must be kept in sync across stages.

How are transformer blocks and transformer layers distributed across GPUs?

Pipeline and tensor parallelism split the transformer layers in different ways. With pipeline parallelism, each GPU holds a contiguous subset of the layers. With tensor parallelism, each GPU holds a slice of the weights in every layer. Both let the model scale to architectures that do not fit on one card.
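The pipeline side of this split can be sketched as a simple partition of layer indices into contiguous stages. This is a minimal illustration that balances stages by layer count only; real frameworks may also weigh the embedding and output layers, and the function name is hypothetical.

```python
def split_layers_into_stages(num_layers, num_stages):
    """Assign layer indices 0..num_layers-1 to contiguous pipeline stages.

    Earlier stages receive one extra layer when the split is uneven.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need between 1 and num_layers stages")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    for gpu, layers in enumerate(split_layers_into_stages(10, 4)):
        print(f"GPU {gpu}: layers {list(layers)}")
```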

How are expert layers distributed in Mixture of Experts (MoE) models?

Expert layers in MoE architectures are distributed across multiple GPUs. This distribution enables parallel computation of different experts, improving scalability and computational efficiency during training and inference.
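A minimal sketch of expert placement, assuming experts are assigned to GPUs in contiguous blocks and that the expert count divides evenly. The second helper counts how many routed tokens land on each GPU, which shows why uneven routing can leave some cards idle. Both function names are hypothetical.

```python
def place_experts(num_experts, num_gpus):
    """Map each expert index to a GPU, in contiguous equal-sized blocks."""
    if num_gpus < 1 or num_experts % num_gpus != 0:
        raise ValueError("num_experts must be a multiple of num_gpus")
    per_gpu = num_experts // num_gpus
    return {expert: expert // per_gpu for expert in range(num_experts)}


def tokens_per_gpu(expert_choices, placement, num_gpus):
    """Count tokens routed to each GPU given one chosen expert per token."""
    counts = [0] * num_gpus
    for expert in expert_choices:
        counts[placement[expert]] += 1
    return counts


if __name__ == "__main__":
    placement = place_experts(num_experts=8, num_gpus=4)
    # Hypothetical router output: the expert chosen for each of 5 tokens.
    print(tokens_per_gpu([0, 1, 1, 7, 2], placement, num_gpus=4))
```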

What are the challenges of training LLMs with large activation memory?

Training large language models (LLMs) requires managing significant activation memory. Specialized frameworks like NeMo help distribute activation data and optimize memory use, which is critical for efficient multi‑GPU training.

How do the sequence and sequence dimension affect parallelism strategies?

Parallelism strategies like sequence parallelism partition and distribute activation data along the sequence dimension. This allows efficient handling of long input sequences and better utilization of GPU memory and compute resources.
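The partitioning itself is simple to sketch. The example below splits a token sequence into equal chunks, one per GPU, and estimates the per‑GPU size of a single activation tensor of shape [seq_len, hidden_size]. It assumes the sequence length is a multiple of the GPU count (padding would be needed otherwise), and the names are illustrative.

```python
def split_sequence(tokens, num_gpus):
    """Split a sequence into `num_gpus` equal contiguous chunks."""
    if num_gpus < 1 or len(tokens) % num_gpus != 0:
        raise ValueError("pad the sequence to a multiple of num_gpus first")
    chunk = len(tokens) // num_gpus
    return [tokens[i * chunk:(i + 1) * chunk] for i in range(num_gpus)]


def activation_bytes_per_gpu(seq_len, hidden_size, num_gpus, bytes_per_value=2):
    """Per-GPU bytes for one [seq_len, hidden_size] activation tensor."""
    return (seq_len // num_gpus) * hidden_size * bytes_per_value


if __name__ == "__main__":
    print(split_sequence(list(range(8)), num_gpus=4))
    # Hypothetical sizes: 8192 tokens, hidden size 4096, fp16 activations.
    for gpus in (1, 2, 4):
        mib = activation_bytes_per_gpu(8192, 4096, gpus) / 2**20
        print(f"{gpus} GPU(s): {mib:.0f} MiB per activation tensor")
```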

What does linear propagation of data through pipeline stages mean?

Linear propagation means data moves sequentially through each pipeline stage, with each stage inferring shapes and processing outputs in order, without skip connections or complex routing.
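As a toy illustration, the sketch below models each stage as a function from input shape to output shape and walks the stages strictly in order, so each stage only ever sees the output of the one before it. The stage constructors and dimensions are hypothetical and stand in for real modules.

```python
def embedding_stage(hidden_size):
    """(batch, seq) -> (batch, seq, hidden_size)"""
    return lambda shape: (shape[0], shape[1], hidden_size)


def transformer_stage():
    """Transformer blocks keep the shape unchanged."""
    return lambda shape: shape


def lm_head_stage(vocab_size):
    """(batch, seq, hidden) -> (batch, seq, vocab_size)"""
    return lambda shape: (shape[0], shape[1], vocab_size)


def infer_shapes(input_shape, stages):
    """Propagate a shape through the stages in order.

    Returns the input shape followed by the output shape of every stage.
    """
    shapes = [input_shape]
    for stage in stages:
        shapes.append(stage(shapes[-1]))
    return shapes


if __name__ == "__main__":
    pipeline = [embedding_stage(1024), transformer_stage(),
                transformer_stage(), lm_head_stage(32000)]
    for shape in infer_shapes((4, 128), pipeline):
        print(shape)
```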

How is pipeline parallelism implemented in popular frameworks?

Pipeline parallelism is implemented in frameworks like Megatron-LM and DeepSpeed by integrating with Data Parallelism (DP), Tensor Parallelism (TP), ZeRO, and various pipeline schedules. These frameworks provide practical configurations and codebases for deploying pipeline parallelism effectively.
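One number worth computing when comparing schedules is the pipeline bubble, the share of time stages spend idle while the pipeline fills and drains. For a simple GPipe‑style schedule with p equally sized stages and m microbatches, that share is (p − 1) / (m + p − 1). The sketch below evaluates it; the equal‑stage‑time assumption rarely holds exactly, and interleaved schedules reduce the bubble further.

```python
from fractions import Fraction


def bubble_fraction(num_stages, num_microbatches):
    """Idle share of a GPipe-style schedule with equal-time stages."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one microbatch")
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for microbatches in (4, 8, 32):
        share = bubble_fraction(4, microbatches)
        print(f"4 stages, {microbatches} microbatches: {float(share):.1%} idle")
```

More microbatches shrink the bubble, which is one reason pipeline parallelism interacts with batch size choices.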