Streaming (LLMs)
Definition
Streaming means returning the LLM's output token by token (or chunk by chunk) as it is generated, instead of waiting for the full response. Users see text appear incrementally, which lowers perceived latency and improves chat and assistant use cases.
It is supported by most LLM APIs (OpenAI, Anthropic, Gemini, open-source servers such as vLLM) via Server-Sent Events (SSE) or similar protocols. The same prompt engineering, RAG, and agent patterns apply; only the response delivery is incremental.
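On the wire, SSE delivers the stream as a sequence of `data:` lines, usually terminated by a sentinel event. A minimal sketch of consuming such a stream, assuming the raw lines are already available as an iterable and that the sentinel is the string `[DONE]` (the exact format and sentinel vary by provider):

```python
def parse_sse(lines):
    """Yield the payload of each SSE `data:` event.

    Stops at a `[DONE]` sentinel; the sentinel and payload format
    here are illustrative, not any specific provider's protocol.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield payload

# Example: each event carries one text fragment
raw = ["data: Hel", "", "data: lo", "data: [DONE]", "data: never reached"]
print("".join(parse_sse(raw)))  # prints "Hello"
```

In a real client, `lines` would come from an open HTTP response and each payload would typically be a JSON chunk to decode before rendering.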
How it works
The client sends a request with the prompt (and optional RAG context or tool results). The server runs the model autoregressively and, instead of buffering the full output, pushes each new token (or a small chunk of tokens) to the client as soon as it is generated. The client renders tokens as they arrive (e.g. in a chat UI). The connection stays open until the model emits an end-of-sequence token or the client stops the stream.
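The server-side difference can be sketched as a generator that forwards each token the moment it is produced, versus buffering the whole output first. `fake_model` below is a toy stand-in for the autoregressive decoding loop, not a real model API:

```python
def fake_model(prompt):
    """Stand-in for autoregressive decoding: emits one token at a time."""
    for token in ["Streaming", " lowers", " perceived", " latency", "."]:
        yield token  # in a real server, each token becomes available here

def stream_response(prompt):
    # Streaming: forward each token as generated; the client sees the
    # first token almost immediately.
    for token in fake_model(prompt):
        yield token

def buffered_response(prompt):
    # Non-streaming: nothing is sent until the whole output exists.
    return "".join(fake_model(prompt))

# Same final text either way; only the delivery differs.
print("".join(stream_response("hi")) == buffered_response("hi"))  # prints True
```

The design point is that streaming changes time-to-first-token, not the generated text itself.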
Use cases
Streaming is the default for chat and any interactive use where users expect to see progress immediately.
- Chat UIs and assistants where text should appear as it is generated
- Long-form generation (summaries, code) to show progress and allow early cancellation
- Reducing perceived latency when the full response would take several seconds
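Early cancellation from the list above can be sketched as the client simply abandoning the stream: closing it is analogous to dropping the HTTP connection, after which the server can stop decoding (the generator here is a hypothetical stand-in for a long response):

```python
import itertools

def token_stream():
    """Hypothetical unbounded token generator standing in for a long response."""
    for i in itertools.count():
        yield f"tok{i} "

# The client reads until it has seen enough, then closes the stream;
# tokens beyond that point are never generated.
stream = token_stream()
first_five = [next(stream) for _ in range(5)]
stream.close()  # analogous to the client dropping the connection
print("".join(first_five))  # prints "tok0 tok1 tok2 tok3 tok4 "
```

This is why streaming pairs well with long-form generation: the user can stop a bad summary or code draft without paying for the remaining tokens.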