
Streaming is the easiest win for UX and cost. Users see words sooner, which makes the app feel faster; they can cancel as soon as they have enough; and you spend fewer tokens on output nobody reads. You only need two things: a server that streams, and a client that reads chunks without buffering.
Try Compute today
Launch a vLLM inference server on Compute. You get an HTTPS endpoint with OpenAI‑style routes that stream by default. Point your existing OpenAI SDK at the new base URL and start measuring time to first token (TTFT).
Server‑Sent Events (SSE). A one‑way stream from server to client over a plain HTTP connection. Simple, proxy‑friendly, and a natural fit for token streaming. The EventSource API is standardized in the WHATWG HTML Living Standard, and most HTTP clients can read SSE as a streaming response body.
WebSockets. Two‑way messages over a persistent socket, exposed in the browser through the WebSocket object. Useful when the client must send events mid‑stream (typing, cursor sync, collaborative edits).
Rule of thumb: use SSE for chat unless you truly need bi‑directional messaging. Token streaming means the server returns tokens one by one as the model generates them, so the response starts arriving before the whole completion exists.
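On the wire, an OpenAI‑style SSE stream is a series of data: lines, one JSON chunk per delta, ending with a [DONE] sentinel. The trace below is abridged; real chunks also carry fields like id, created, model, and finish_reason:

data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: [DONE]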
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://YOUR-ENDPOINT/v1", apiKey: process.env.KEY });

const stream = await client.chat.completions.create({
  model: "f3-7b-instruct",
  messages: [{ role: "user", content: "Draft a short update about project status." }],
  stream: true,
  max_tokens: 200,
});

// Each chunk carries a delta with the newly generated text.
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": "Write a one‑paragraph summary."}],
    max_tokens=200,
    stream=True,
)

# Each chunk carries a delta with the newly generated text.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Cancel fast when users stop reading:

const controller = new AbortController();
// Pass the signal as a request option (the second argument in the Node SDK):
// client.chat.completions.create({ ..., stream: true }, { signal: controller.signal })
// later on
controller.abort();

Breaking out of the for await loop also cancels the underlying request.
A WebSocket connection begins life as an HTTP request. During the handshake, client and server exchange headers, including Sec-WebSocket-Key, Sec-WebSocket-Version, and optionally Sec-WebSocket-Protocol, to upgrade the connection from HTTP to a persistent, bidirectional WebSocket.
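For illustration, the upgrade exchange looks like this on the wire (key and accept values are the sample values from RFC 6455):

GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=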
// Server sketch using ws
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  ws.on("message", async (msg) => {
    const { prompt } = JSON.parse(msg.toString());
    // call your OpenAI‑compatible endpoint with stream=true
    for await (const token of generateStream(prompt)) {
      ws.send(token); // backpressure: check ws.bufferedAmount
    }
  });

  ws.on("close", () => {
    // Cancel in-flight generation and free per-connection resources here
  });
});
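A matching browser client is only a few lines. A sketch, assuming the server above runs on localhost:8080 and the page has an element with id "out":

// Browser-side counterpart to the server sketch above
const ws = new WebSocket("ws://localhost:8080");

ws.onopen = () => {
  ws.send(JSON.stringify({ prompt: "Summarize the launch plan." }));
};

ws.onmessage = (event) => {
  // Each message is one token sent by the server loop
  document.getElementById("out").textContent += event.data;
};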
When a connection closes, the server emits a close event for that socket; use it to cancel in‑flight generation and free anything tied to the connection.
Handle backpressure: pause sending when ws.bufferedAmount grows large and resume once it drains. In browsers, read streaming HTTP responses with the Streams API: take a reader from response.body and call reader.read() until it reports done.
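A minimal browser sketch of that pattern, assuming an OpenAI‑style endpoint (endpoint, key, and model are placeholders, and the SSE parsing of each decoded chunk is left out):

const res = await fetch("https://YOUR-ENDPOINT/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json", "Authorization": "Bearer YOUR_KEY" },
  body: JSON.stringify({
    model: "f3-7b-instruct",
    messages: [{ role: "user", content: "Stream me a haiku." }],
    stream: true,
  }),
});

const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // value is a Uint8Array chunk; decode it, then parse the SSE data: lines
  console.log(decoder.decode(value, { stream: true }));
}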
Try Compute today
Deploy a vLLM endpoint on Compute. Streaming is on by default. Place it near users, set strict output caps, and watch TTFT and TPS improve.
Use SSE for most chat. Reach for WebSockets only when you need two‑way messages. Cancel promptly, cap outputs, and turn off proxy buffering. Measure time to first token and tokens per second, then tune caps and batch limits before changing hardware.
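To get those two numbers, you can time the streaming loop itself. A rough sketch (measureStream is a hypothetical helper; it counts content chunks as a proxy for tokens and assumes at least one arrives):

// Pass any chat-completions stream, e.g. the one from the Node example above
async function measureStream(stream) {
  const start = performance.now();
  let firstTokenAt = null;
  let tokens = 0;

  for await (const chunk of stream) {
    const delta = chunk.choices?.[0]?.delta?.content;
    if (!delta) continue;
    if (firstTokenAt === null) firstTokenAt = performance.now();
    tokens += 1; // chunk count approximates token count
  }

  const seconds = (performance.now() - firstTokenAt) / 1000;
  console.log(`TTFT: ${Math.round(firstTokenAt - start)} ms`);
  console.log(`TPS (approx): ${(tokens / seconds).toFixed(1)}`);
}

Note that the helper consumes the stream, so in a real app you would measure and render in the same loop.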
FAQ

What is SSE?
A one‑way HTTP stream the server pushes to the client. Token streaming maps well to SSE.
How do SSE and WebSockets differ?
SSE is one‑way and simple; WebSockets are two‑way and better for interactive apps. For chat output, SSE is usually enough.
Can I stream completions from an OpenAI‑compatible API?
Yes: use stream: true or an SSE client. You will receive incremental tokens until the model stops or you cancel.
How do I cancel a stream?
Abort the HTTP request (SSE) or close the WebSocket. Always free server resources on cancel.
Why does my stream arrive all at once?
An intermediary is buffering the response. Disable buffering for the route and keep the connection alive.
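For example, with nginx in front of your endpoint, a location block like this keeps chunks flowing (the upstream name is a placeholder):

location /v1/ {
    proxy_pass http://your_vllm_upstream;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;  # pass chunks through as they arrive
}

Alternatively, have the application send an X-Accel-Buffering: no response header on just the streaming routes.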
Can I consume a stream directly in the browser?
Yes: EventSource for SSE, or fetch() with the Streams API to read chunks.