Streaming (LLMs)
Definition
Streaming means returning LLM output token by token (or chunk by chunk) as it is generated, instead of waiting for the full response. Users see text appear incrementally, which lowers perceived latency and suits chat and assistant use cases.
It is supported by most LLM APIs (OpenAI, Anthropic, Gemini, and open-source servers such as vLLM), typically via Server-Sent Events (SSE) or similar protocols. The same prompt engineering, RAG, and agent patterns apply; only the response delivery is incremental.
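A minimal sketch of consuming such an SSE stream. The `data: {...}` event lines and the `data: [DONE]` terminator follow the OpenAI-style convention; the exact payload field names (`choices`, `delta`, `content`) are assumptions that vary across providers.

```python
import json

def parse_sse(lines):
    """Yield the text delta from each SSE data event."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        event = json.loads(payload)
        yield event["choices"][0]["delta"].get("content", "")

# Example raw stream, as a server might send it (one event per chunk):
raw = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse(raw)))  # prints "Hello"
```

In a real client the `lines` iterable would come from an open HTTP response body rather than a list.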
How it works
The client sends a request with the prompt (and, optionally, RAG context or tool results). The server runs the model autoregressively and, instead of buffering the full output, pushes each new token (or a small chunk of tokens) to the client as soon as it is generated. The client renders tokens as they arrive (e.g. in a chat UI). The connection stays open until the model emits an end-of-sequence token or the client stops the stream.
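The loop above can be sketched with a Python generator standing in for both sides: `generate_tokens` plays the role of the autoregressive model (the token list is purely illustrative), and the consumer renders each piece the moment it arrives rather than waiting for the whole string.

```python
def generate_tokens(prompt):
    # Stand-in for autoregressive decoding: emit one token at a time
    # instead of buffering the full completion.
    for token in ["Stream", "ing ", "lowers ", "latency."]:
        yield token

def render_stream(prompt):
    pieces = []
    for token in generate_tokens(prompt):
        print(token, end="", flush=True)  # client renders incrementally
        pieces.append(token)
    return "".join(pieces)

render_stream("explain streaming")  # text appears piece by piece
```

The key property is that the consumer never holds the full response before display; each `yield` corresponds to one pushed SSE event in a real deployment.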
Use cases
Streaming is the default for chat and any interactive use where users expect to see progress immediately.
- Chat UIs and assistants where text should appear as it is generated
- Long-form generation (summaries, code) to show progress and allow early cancellation
- Reducing perceived latency when the full response would take several seconds
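Early cancellation, mentioned above, falls out naturally from the streaming model: the client simply stops consuming and closes the connection, so tokens past that point are never rendered (and the server can stop generating). A toy sketch, with all names illustrative rather than any specific API:

```python
def model_stream():
    # Stand-in for a server-side token stream.
    for token in ["chunk1 ", "chunk2 ", "chunk3 ", "chunk4 "]:
        yield token

def read_until(stream, max_tokens):
    """Consume the stream, cancelling once enough output has arrived."""
    out = []
    for i, token in enumerate(stream):
        out.append(token)
        if i + 1 >= max_tokens:
            stream.close()  # client-side cancel; later chunks are never read
            break
    return "".join(out)

print(read_until(model_stream(), 2))  # prints "chunk1 chunk2 "
```

With a real HTTP stream, closing the response (or aborting the request) plays the role of `close()` here.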