
Real traffic is messy. New requests arrive while others are mid‑generation. If your server waits for a full batch to finish before starting the next one, GPUs sit idle and users wait. That is why continuous batching matters for large language model (LLM) inference, where serving efficiency and GPU utilization are critical.
Continuous batching keeps the queue moving so the GPU rarely pauses, which is crucial for efficient text generation. Published benchmarks have reported throughput improvements of up to 23x over naive request‑level batching in LLM inference scenarios. The key idea is that continuous batching schedules work at the iteration (token) level rather than at the request level: sequences can join or leave the batch at every decode step.
Prefill is the first pass, where the model reads the entire prompt and builds the key/value (KV) cache. Decode is the step‑by‑step phase that generates one token at a time for each sequence.
During prefill, input tokens are processed in parallel; during decode, each sequence advances one token per step. Prefill favors large parallel work, while decode benefits from packing many small steps together. Good schedulers treat the two phases differently.
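To make iteration‑level scheduling concrete, here is a toy simulation (illustrative only; the function and variable names are ours, not any library's). Finished sequences free their batch slot at each decode step, and queued requests are admitted into the freed slots immediately instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching_sim(arrivals, max_batch=4, max_steps=100):
    """Toy iteration-level scheduler. Each decode step, every active
    sequence emits one token; finished sequences leave the batch and
    queued requests are admitted into the freed slots."""
    queue = deque(arrivals)   # (request_id, tokens_to_generate)
    active = {}               # request_id -> tokens still to generate
    finished = []             # (request_id, step it completed on)
    for step in range(max_steps):
        # Admit queued requests whenever a batch slot is free.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode iteration across the whole batch.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]           # slot frees this very step
                finished.append((rid, step))
        if not queue and not active:
            break
    return finished

# "c" starts as soon as "a" finishes, instead of waiting for "b".
done = continuous_batching_sim([("a", 2), ("b", 5), ("c", 3)], max_batch=2)
```

Under static batching, "c" would not start until both "a" and "b" had fully finished; here it completes on the same step as "b".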
Continuous batching requires no modification to the model and composes well with advanced memory optimizations. PagedAttention, for example, stores the KV cache in fixed‑size pages that need not be contiguous, allocating GPU memory slots on demand instead of in advance, which reduces internal fragmentation.
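A minimal sketch of the bookkeeping behind paged allocation, loosely modeled on PagedAttention (the class and method names here are hypothetical, not vLLM's API): each sequence maps to a list of fixed‑size physical pages allocated only when the previous page fills up.

```python
class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV cache: pages are
    fixed-size, allocated on demand, and need not be contiguous."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # free physical page ids
        self.tables = {}                    # seq_id -> [page ids]
        self.lengths = {}                   # seq_id -> tokens cached

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:         # current page is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Finished sequences return their pages to the free pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8, page_size=4)
for _ in range(5):                          # 5 tokens -> 2 pages of 4
    cache.append_token("seq-1")
```

Because the last page is only partially used, the worst‑case waste per sequence is one page, rather than the entire over‑provisioned contiguous buffer a request‑level allocator would reserve.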
In practice, continuous batching improves both throughput and latency across percentiles. This is because LLM decoding is typically memory‑IO bound, not compute bound: it takes longer to move model weights and KV‑cache data to the GPU's compute units than to perform the calculations, so keeping the batch full amortizes each weight load across more tokens.
This section describes key steps to create an efficient batching setup for your models.
Start with defaults, then change one thing at a time.
For thorough validation, run these tests over several iterations, varying one parameter per run, for example prompt length or concurrency level, to observe how the system behaves across a range of scenarios.
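A minimal harness for such runs might look like the following, with a stubbed streaming endpoint standing in for your real server (the stub, delays, and names are all hypothetical placeholders):

```python
import time

def fake_stream(prompt, n_tokens=20, delay=0.001):
    """Stand-in for a streaming endpoint; swap in real client calls."""
    for i in range(n_tokens):
        time.sleep(delay)           # simulated per-token generation time
        yield f"tok{i}"

def measure(stream):
    """Consume a token stream and record TTFT and overall TPS."""
    start = time.perf_counter()
    ttft, count = None, 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first token
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "tps": count / total}

# Vary prompt length or concurrency between runs and compare the stats.
stats = measure(fake_stream("a short prompt", n_tokens=10))
```

Collect the dictionaries across runs and compare percentiles rather than single samples; tail latency is where batching misconfiguration shows up first.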
Track these at minimum: TTFT (time to first token), TPS (tokens per second), and GPU KV‑cache utilization.
Continuous batching is not magic. It is a practical way to keep GPUs busy and users happy when traffic is uneven. Start with safe caps, measure TTFT and TPS, and adjust batch limits where the numbers say it matters.
Try Compute today
When you are ready, launch a vLLM endpoint on Compute. Choose hardware, set caps, and get an HTTPS URL that works with OpenAI SDKs.
Prefill reads the entire prompt once to set up memory. Decode generates tokens step by step using that memory.
Big enough to keep the GPU busy during decode without causing cache thrash. Test with your real prompts and cap max tokens.
Usually memory pressure or oversized outputs. Trim prompts, cap outputs, and check cache hit rate.
Yes. Streaming is the default for many servers. Users see tokens while the scheduler keeps admitting other requests.
No. Single‑GPU nodes benefit a lot. Multi‑GPU helps when memory or throughput needs exceed one card.
ChatGPT's generation speed depends on the underlying hardware and load. A single stream typically renders tens of tokens per second, while an optimized server can sustain hundreds to thousands of tokens per second in aggregate across concurrent requests.
On average, 1,000 tokens correspond to about 750 words, though this can vary depending on the language and tokenization method.
A token is a unit of text used in natural language processing, often a word or subword piece that the model processes during inference or training.
Humans generally read around 200 to 300 words per minute, which roughly translates to 250 to 400 tokens per minute, or about 4 to 7 tokens per second.
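The arithmetic behind those numbers, using the rough 1,000‑tokens‑per‑750‑words rule from above:

```python
TOKENS_PER_WORD = 1000 / 750   # rule of thumb for English text

def reading_speed_in_tps(words_per_minute):
    """Convert a human reading speed to approximate tokens per second."""
    return words_per_minute * TOKENS_PER_WORD / 60

slow = reading_speed_in_tps(200)   # ~4.4 tokens/s
fast = reading_speed_in_tps(300)   # ~6.7 tokens/s
```

Anything your server streams faster than this range is faster than most users can read.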
TTFT stands for Time To First Token, the latency measured from receiving a request to the moment the first token of the model output is generated.
TTFT is measured by timing the interval between when a model server receives a request and when it outputs the first token, often tracked in benchmarking scripts.
TPOT stands for Time Per Output Token, the average time taken to generate each token after the first; lower TPOT means smoother streaming.
TPS means Tokens Per Second, a measure of the model's throughput during inference, indicating how many tokens are generated each second.
KV cache refers to the key-value cache that stores intermediate key and value tensors computed during the attention mechanism to speed up subsequent token generation.
GPU KV cache is the storage of key-value pairs in GPU memory used during model inference to optimize attention computations and reduce redundant calculations.
In large language models, the KV cache holds cached key and value vectors from previous tokens to efficiently compute attention for new tokens without recalculating past states.
A key-value store cache is a data storage method where data is stored as pairs of keys and corresponding values, enabling fast retrieval; in LLMs, this concept applies to caching intermediate computations.
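A bare‑bones decode loop illustrates the reuse (pure Python, single head, toy two‑dimensional vectors; everything here is illustrative rather than a real model):

```python
import math

def attend(q, keys, values):
    """Scaled-down attention: one query against all cached keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

k_cache, v_cache, outputs = [], [], []
steps = [([1.0, 0.0], [2.0, 0.0], [1.0, 0.0]),
         ([0.0, 1.0], [0.0, 3.0], [0.0, 1.0])]
for k, v, q in steps:
    k_cache.append(k)   # K and V are computed once per token, then cached,
    v_cache.append(v)   # so each new token only attends over stored tensors
    outputs.append(attend(q, k_cache, v_cache))
```

Without the cache, every decode step would recompute K and V for all previous tokens, which is exactly the redundant work the KV cache eliminates.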
Dynamic batching groups incoming inference requests into batches dynamically based on arrival times and batch size limits, running batches either when full or after a timeout to balance latency and throughput.
Static batching processes requests only once a set number of them has been received, waiting for the batch to fill completely before running. This can substantially increase latency, which limits its use cases to workloads where latency is not a concern, and it requires a well‑managed request queue to keep the model fed. Dynamic batching instead runs a batch once it is full or after a set time has elapsed, improving latency and resource utilization.
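The full‑or‑timeout policy can be sketched in a few lines (a simplified single‑threaded sketch with names of our choosing; real servers do this concurrently):

```python
import time
from queue import Empty, Queue

def collect_batch(requests, max_batch=8, timeout_s=0.05):
    """Block for the first request, then keep gathering until the
    batch is full or timeout_s has elapsed, whichever comes first."""
    batch = [requests.get()]                 # wait for work to arrive
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                            # timeout: flush partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch

q = Queue()
for req in ["r1", "r2", "r3"]:
    q.put(req)
full = collect_batch(q, max_batch=2)                     # flushes when full
partial = collect_batch(q, max_batch=8, timeout_s=0.01)  # flushes on timeout
```

The timeout bounds worst‑case queueing delay for a lone request, while the size cap keeps busy periods efficient.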