
Users feel delay before they see words. Systems fail when queues stretch and memory runs out. A small set of metrics can warn you early and point to the right fix. Keep the list short, wire it well, and make alerts boring.
On Compute, a vLLM endpoint gives you an HTTPS URL with OpenAI‑style routes and predictable capacity. Place it close to users, then watch TTFT, TPS, and cache headroom.
These are the core performance metrics for LLM inference. Prompt length, load, and system configuration all influence them, so measure against your own traffic.
Time to first token (TTFT). Gap between request send and the first token. The most important number for perceived responsiveness: before anything streams, the model must process the input tokens during prefill and build its key-value cache. Track p50 and p95.
Tokens per second (TPS). Throughput once tokens start; the average gap between consecutive tokens is the inter-token latency. Higher is better for UX and capacity. Total TPS sums output tokens per second across all in-flight requests. Track p50 and p95.
Queue length. Requests waiting for a decode slot. Rising length with flat traffic equals trouble.
GPU memory headroom. Free VRAM during peaks. Low headroom predicts failures, slow starts, and evictions.
Cache hit rate. Health of the KV‑cache and any prompt caches. Lower hit rates often explain TTFT creep.
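TTFT, inter-token latency, and TPS all fall out of two kinds of timestamps: when you sent the request and when each streamed token arrived. A minimal sketch of the arithmetic, assuming you record those client-side (the `RequestTrace` name is illustrative, not a standard API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestTrace:
    send_ts: float          # when the request was sent (seconds)
    token_ts: List[float]   # arrival time of each streamed token

    @property
    def ttft(self) -> float:
        """Time to first token: first token arrival minus send time."""
        return self.token_ts[0] - self.send_ts

    @property
    def tpot(self) -> float:
        """Time per output token: mean gap between consecutive tokens."""
        gaps = [b - a for a, b in zip(self.token_ts, self.token_ts[1:])]
        return sum(gaps) / len(gaps)

    @property
    def tps(self) -> float:
        """Decode throughput: tokens per second after the first token."""
        return (len(self.token_ts) - 1) / (self.token_ts[-1] - self.token_ts[0])

trace = RequestTrace(send_ts=0.0, token_ts=[0.5, 1.0, 1.5, 2.0, 2.5])
print(trace.ttft)  # 0.5
print(trace.tps)   # 2.0
```

With an OpenAI-style streaming response, each chunk arrival gives you one `token_ts` entry, so this can run entirely on the client.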
Request rate (RPS). Requests per second. Pairs with queue length to explain pressure.
Prefill vs decode time. Splitting these isolates long prompts from slow generation: prefill scales with the number of input tokens, decode with output length and batch shape.
Error rate by type. OOM, timeouts, 4xx/5xx. Group by route and model so root causes stand out.
Network latency. Client ↔ endpoint RTT by region. Spikes here can mimic server slowness.
Thermals and clocks. Throttling shows up as flat TPS and rising TTFT.
Build dashboards that answer specific questions:
“Are users feeling slow right now?”
TTFT p50/p95 over time with traffic overlay. Drill by region and model.
“Can we take more load?”
TPS p50/p95, queue length, and GPU memory headroom. Flat TPS + rising queue = not yet.
“Why did latency spike?”
Split prefill vs decode, add network RTT. Long prefill → prompts too big. Long decode → caps or batch shape.
“Are we wasting tokens?”
Distribution of output tokens vs your caps. Big tail = loose settings.
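Most panels above are p50/p95 lines. If your metrics store keeps raw samples, the nearest-rank percentile takes only a few lines; a sketch, not a production estimator:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: value at rank ceil(p/100 * n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# TTFT samples in milliseconds (made-up numbers for illustration)
ttft_ms = [120, 140, 95, 300, 180, 110, 900, 150, 130, 160]
print(percentile(ttft_ms, 50))  # 140  (median)
print(percentile(ttft_ms, 95))  # 900  (the tail your unhappiest users see)
```

The gap between the two numbers is the point: a healthy p50 can hide a painful p95, which is why both belong on the dashboard.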
Set thresholds you can defend. Examples to start:
Define a service level objective (SLO), such as “TTFT p95 ≤ 800 ms and error rate ≤ 1% over 28 days.” An SLO often backs a service level agreement (SLA), the formal contract between provider and user. Track the error budget and page when you burn it too fast.
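The SLO example above implies an error budget and a burn-rate check. A minimal sketch of the arithmetic (function names are illustrative):

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Allowed failures in the window: (1 - SLO target) x total requests."""
    return (1 - slo_target) * total_requests

def burn_rate(failures: int, budget: float, window_elapsed: float) -> float:
    """Fraction of budget consumed divided by fraction of window elapsed.
    Greater than 1.0 means you are burning faster than the window allows."""
    return (failures / budget) / window_elapsed

# 99% success SLO over 1M requests -> 10,000 failures allowed
budget = error_budget(0.99, 1_000_000)
# 2,000 failures only 5% of the way through the window
rate = burn_rate(failures=2_000, budget=budget, window_elapsed=0.05)
print(round(budget))    # 10000
print(round(rate, 2))   # 4.0 -- burning 4x too fast; page someone
```

Paging on burn rate rather than raw error count catches fast burns early while tolerating slow, survivable ones.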
A few stable metrics beat a wall of charts. Watch TTFT and TPS, keep queues short, and leave headroom in memory. Fix prompts and caps before you change hardware.
Try Compute today
Prefer predictable ops? Launch a vLLM endpoint on Compute in France or UAE, cap tokens, and track TTFT/TPS from day one.
Time to first token: the delay before the model's output begins, measured from request send to the first generated token. Users feel it more than any other metric.
Enough to keep chat smooth and queues short at your traffic; measure with your own prompts. Total TPS reflects throughput across all concurrent requests, so optimize batch shape and token caps to raise it.
Five core ones cover most issues: TTFT, TPS, queue length, GPU memory headroom, and cache hit rate. Add request rate and error types for context.
Use synthetic traffic in a staging environment, or temporarily lower thresholds in production and fire a controlled burst to confirm the alerts trip.
Helpful once you run multiple nodes or a gateway. Start with request IDs and clear spans; add distributed tracing as you grow.
TTFT, or Time to First Token, measures the delay from when a request is sent until the first token of the model’s output is received. This includes the prefill time, during which the attention mechanism processes the input and builds the key-value cache needed to generate the first token. It is the critical metric for perceived responsiveness.
TTFT is calculated by recording the timestamp when a request is sent and the timestamp when the first output token arrives, then taking the difference.
TPOT (Time per Output Token) is the average time to generate each token after the first one: the mean inter-token latency between consecutive tokens. It reflects steady-state decoding speed, which efficient attention computation keeps low.
TPS stands for Tokens Per Second: how many tokens a model generates each second during inference, a measure of throughput. Total TPS covers all requests being handled simultaneously, and pairs with requests per second when evaluating capacity.
A token per second is a unit of output generation speed, calculated by dividing the number of tokens generated by the elapsed time.
ChatGPT typically generates tokens at a rate of about 20 to 30 tokens per second, depending on server load and model version.
Approximately 750 words correspond to 1,000 tokens, since a token averages about 0.75 English words; the exact ratio depends on the tokenizer and the text.
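As arithmetic, the rule of thumb is tokens ≈ words / 0.75. A tiny sketch (the helper names and the 0.75 constant are the rough average from above, not a tokenizer):

```python
WORDS_PER_TOKEN = 0.75  # rough English average; varies by tokenizer and text

def estimate_tokens(word_count: int) -> int:
    """Approximate token count from a word count."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count: int) -> int:
    """Approximate word count from a token count."""
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_tokens(750))   # 1000
print(estimate_words(1000))   # 750
```

Use a real tokenizer when the number matters (billing, context limits); this estimate is for quick sizing only.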
Seven tokens per second is relatively slow for many real-time applications but may be acceptable for longer, less time-sensitive generations.
Inference is measured by tracking latency (TTFT, total generation time) and throughput (TPS).
Examples include text completion, question answering, translation, summarization, and code generation.
The inference process begins with the input prompt, which is tokenized into input tokens. During prefill, the attention mechanism processes these tokens and builds the key-value cache; this phase ends with the first new token, which marks the start of output generation. The model then keeps decoding output tokens until the response is complete.
In court, inference refers to a conclusion reached based on evidence and reasoning rather than explicit statements.
Observability is the ability to understand a system’s internal state by analyzing its outputs, such as logs, metrics, and traces.
Monitoring uses predefined metrics and alerts to track system health. Observability goes further: it enables exploratory analysis without knowing the question in advance, giving deeper insight into what the system is doing.
Metrics, logs, and traces are the three pillars that provide comprehensive system visibility.
In DevOps, observability helps teams proactively detect, diagnose, and resolve issues by collecting and analyzing telemetry data.
A KV (key-value) cache stores intermediate key and value tensors during model inference, speeding up token generation by enabling efficient attention computation.
GPU KV cache refers to storing the attention key-value pairs in GPU memory to optimize attention computation during LLM inference.
The KV cache holds previously computed attention keys and values so they are not recalculated when generating new tokens.
A key-value store cache is a data structure that stores data indexed by unique keys for fast retrieval.
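To make the avoid-redundant-work idea concrete, here is a toy sketch. It is not real attention math; the "projections" are stand-ins, and all names are illustrative. The point is the structure: each decode step computes key/value pairs only for tokens not already cached.

```python
class ToyKVCache:
    """Toy illustration of the KV-cache idea: K/V for earlier tokens are
    stored and reused, so each step only processes the newest token(s)."""

    def __init__(self):
        self.keys, self.values = [], []
        self.computations = 0  # counts per-token K/V computations

    def _project(self, token: str):
        """Stand-in for the real key/value projections of attention."""
        self.computations += 1
        return (f"K({token})", f"V({token})")

    def step(self, tokens):
        """Compute K/V only for tokens beyond what is already cached."""
        for token in tokens[len(self.keys):]:
            k, v = self._project(token)
            self.keys.append(k)
            self.values.append(v)
        return self.keys, self.values

cache = ToyKVCache()
cache.step(["The", "cat"])          # computes 2 pairs
cache.step(["The", "cat", "sat"])   # computes only 1 new pair
print(cache.computations)           # 3, not 5
```

Without the cache, every decode step would recompute keys and values for the whole prefix, which is exactly the quadratic waste the KV cache removes.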
Throughput is the amount of work done or output produced by a system in a given time, such as tokens generated per second. It is calculated as output divided by time.
Throughput measures actual processed data over time, while bandwidth is the maximum possible data transfer rate. Both throughput and bandwidth are important performance metrics when evaluating system efficiency.
Throughput for LLM inference is expressed as tokens per second; generating 100 tokens in one second is a throughput of 100 TPS.
Output rate or processing rate.
A 95th percentile latency under 500 milliseconds is generally considered good for real-time applications.
P95 tail latency is the latency value below which 95% of requests complete, indicating worst-case performance for most users.
P50 latency is the median latency, meaning half of the requests complete faster and half slower than this value.
No. p95 is a percentile: the value below which 95% of observations fall, not an average.
An error budget defines the allowable threshold of errors or latency breaches within a service period, balancing reliability against the pace of change. Error budgets typically derive from the SLOs in a service level agreement (SLA) between provider and user.
The error budget is calculated as: error budget = (1 − SLO target) × total requests (or time period).
An SLO (Service Level Objective) sets the performance target; the error budget quantifies the allowable deviation from that target.
A budget error occurs when the error budget is exceeded, indicating a service reliability problem such as elevated latency or error rates.
GPU memory is the dedicated RAM on a graphics processing unit used to store data and computations during processing. Monitoring it is a core part of tracking resource usage in LLM observability.
The needed GPU memory depends on model size, batch size, precision, and sequence length; larger models and batches require more memory.
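A hedged back-of-envelope estimate: weights take parameters × bytes per parameter, and the KV cache takes 2 (keys and values) × layers × KV heads × head dimension × tokens × batch × bytes per value. The example numbers below resemble a 7B-class model but are illustrative, not vendor specs:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Model weights: parameters x bytes per parameter (2 for FP16/BF16)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_val: int = 2) -> float:
    """KV cache: 2 tensors (K and V) x layers x heads x head_dim x tokens x batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

weights = weight_memory_gb(7)   # ~14 GB of weights in FP16
cache = kv_cache_gb(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8)
print(round(weights, 1))  # 14.0
print(round(cache, 1))    # 17.2 -- the cache can rival the weights at scale
```

Note how quickly the KV cache grows with batch size and sequence length; this is why headroom vanishes under load long before the weights themselves are the problem.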
Using GPU memory refers to allocating GPU RAM for model parameters, the KV cache, and intermediate activations during inference.
You can check GPU memory usage with nvidia-smi on NVIDIA GPUs, with system monitoring software, or with observability tooling that tracks usage over time and alerts on low headroom.
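For example, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints one `used, total` line per GPU in MiB. A small sketch of turning that into headroom numbers (the sample output is made up; run the real command on a machine with NVIDIA drivers):

```python
def parse_headroom(csv_lines: str):
    """Parse 'used, total' MiB pairs (one line per GPU) into free-MiB values."""
    headroom = []
    for line in csv_lines.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        headroom.append(total - used)
    return headroom

# On a live machine you would capture the output with subprocess:
#   import subprocess
#   out = subprocess.check_output(
#       ["nvidia-smi", "--query-gpu=memory.used,memory.total",
#        "--format=csv,noheader,nounits"], text=True)
sample = "61234, 81920\n40000, 81920\n"   # made-up numbers for illustration
print(parse_headroom(sample))  # [20686, 41920]
```

Feed the per-GPU headroom into your metrics pipeline and alert on the minimum across devices, since the fullest GPU is the one that OOMs first.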