Streaming (LLMs)
Definition
Streaming means returning LLM output token by token (or chunk by chunk) as it is generated, instead of waiting for the complete response. Users see text appear incrementally, which lowers perceived latency and benefits chat and assistant use cases.
It is supported by most LLM APIs (OpenAI, Anthropic, Gemini, open-source servers such as vLLM) via Server-Sent Events (SSE) or similar protocols. The same prompt-engineering, RAG, and agent patterns apply; only the response delivery is incremental.
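As a sketch of what an SSE response looks like on the wire: the stream is a series of `data:` lines, each carrying one chunk. The exact payload shape varies by provider; the `{"delta": ...}` field and the `[DONE]` terminator below are assumptions modeled on the OpenAI-style convention, not a universal format.

```python
import json

def parse_sse_stream(lines):
    """Yield decoded JSON payloads from SSE lines.

    Assumes each event is a single `data: {...}` line and the stream
    ends with `data: [DONE]` (an OpenAI-style convention; other
    providers terminate differently).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# A captured (hypothetical) stream of two token chunks.
raw = [
    'data: {"delta": "Hel"}',
    '',
    'data: {"delta": "lo"}',
    '',
    'data: [DONE]',
]
chunks = [event["delta"] for event in parse_sse_stream(raw)]
print("".join(chunks))  # -> Hello
```

In a real client the lines would arrive incrementally over an open HTTP connection rather than from a list.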
How it works
The client sends a request with the prompt (and optional RAG context or tool results). The server runs the model autoregressively and, instead of buffering the complete output, pushes each new token (or a small chunk of tokens) to the client as soon as it is generated. The client renders tokens as they arrive (e.g. in a chat UI). The connection stays open until the model emits an end-of-sequence token or the client stops the stream.
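The server-side loop described above can be sketched with a generator: each token is yielded (pushed to the client) the moment it exists, and the loop ends at the end-of-sequence token. The `next_token` callable here is a hypothetical stand-in for one forward pass of the model, not a real API.

```python
def decode(next_token):
    """Minimal autoregressive streaming loop.

    `next_token` is a hypothetical callable standing in for one forward
    pass of the model; it returns the next token string, or None when
    the model emits the end-of-sequence token.
    """
    while True:
        token = next_token()
        if token is None:   # end-of-sequence: close the stream
            return
        yield token         # push immediately instead of buffering

# Fake model: emits a fixed completion one token at a time.
completion = iter(["Hello", ",", " world", "!"])
for token in decode(lambda: next(completion, None)):
    print(token, end="", flush=True)  # render as tokens arrive
print()
```

A real server would write each yielded token to the open SSE connection; the client-side rendering is the `print` above.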
Use cases
Streaming is the default for chat and any interactive use where users expect to see progress immediately.
- Chat UIs and assistants where text should appear as it is generated
- Long-form generation (summaries, code) to show progress and allow early cancellation
- Reducing perceived latency when the full response would take several seconds
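The early-cancellation case can be sketched as follows: the client renders tokens as they arrive and closes the stream once it has seen enough, which mirrors dropping the connection so the server can stop generating. The token stream and the stop condition here are illustrative assumptions.

```python
def token_stream():
    """Hypothetical server stream of a long-form answer."""
    for token in ["This ", "summary ", "is ", "quite ", "long."]:
        yield token

def render_until(stream, stop_after):
    """Render tokens as they arrive, cancelling once enough is shown.

    Closing the generator mirrors the client dropping the connection,
    which lets the server stop decoding and save compute.
    """
    shown = []
    for token in stream:
        shown.append(token)
        if len(shown) >= stop_after:
            stream.close()  # early cancellation
            break
    return "".join(shown)

print(render_until(token_stream(), 3))  # first three tokens only
```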