Ollama supports streaming responses, which is essential for applications that need to display generated content in real time.
1. Enable Streaming Response:
Set the "stream": true parameter in the API call:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Tell me a story about AI",
  "stream": true
}'
```
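With "stream": true, the server returns newline-delimited JSON instead of a single object: each line carries a small fragment of the answer, and the final line has "done": true along with timing statistics. The values below are purely illustrative:

```json
{"model":"llama3.1","created_at":"2024-01-01T00:00:00Z","response":"Once","done":false}
{"model":"llama3.1","created_at":"2024-01-01T00:00:01Z","response":" upon","done":false}
{"model":"llama3.1","created_at":"2024-01-01T00:00:02Z","response":"","done":true,"total_duration":123456789,"eval_count":42}
```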
2. Python Streaming Response Example:
```python
import ollama

# Stream text generation
for chunk in ollama.generate(model='llama3.1', prompt='Tell me a story', stream=True):
    print(chunk['response'], end='', flush=True)

# Stream chat
messages = [
    {'role': 'user', 'content': 'Explain quantum computing'}
]
for chunk in ollama.chat(model='llama3.1', messages=messages, stream=True):
    if 'message' in chunk:
        print(chunk['message']['content'], end='', flush=True)
```
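Note the different chunk shapes: generate chunks expose the text under the 'response' key, while chat chunks nest it under 'message' and 'content'; in both cases the final chunk reports done as true.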
3. JavaScript Streaming Response Example:
```javascript
import { Ollama } from 'ollama'

const client = new Ollama()

// Stream generation
const stream = await client.generate({
  model: 'llama3.1',
  prompt: 'Tell me a story',
  stream: true
})
for await (const chunk of stream) {
  process.stdout.write(chunk.response)
}
```
4. Handle Streaming Response with the requests Library:
```python
import requests
import json

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1',
        'prompt': 'Hello, how are you?',
        'stream': True
    },
    stream=True
)

# Each line of the response body is a separate JSON object
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get('response', ''), end='', flush=True)
```
5. Advantages of Streaming Response:
- Better User Experience: Content appears as it is generated, reducing perceived waiting time
- Lower Memory Usage: No need to buffer the complete response before showing it
- Faster Time to First Token: The first words are displayed almost immediately
- Better Interactivity: Users can see partial results early and react to them
6. Considerations for Handling Streaming Response:
- Parse the response as JSON lines; each line is a separate JSON object
- Handle connection interruption and reconnection logic
- Consider adding timeout mechanisms
- Implement cancellation so generation can be stopped early (a combined timeout-and-cancellation sketch follows this list)
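As a rough sketch of the last three points, the snippet below wraps the requests-based approach with a connect/read timeout, catches interruptions, and uses a threading.Event as an illustrative cancellation flag (the stream_generate helper and the flag name are assumptions for this example, not part of any Ollama API):

```python
import json
import threading

import requests

stop_requested = threading.Event()  # illustrative cancellation flag

def stream_generate(prompt, model='llama3.1', timeout=(5, 60)):
    """Yield response fragments, honoring a timeout and a cancellation flag."""
    try:
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': True},
            stream=True,
            timeout=timeout,  # (connect, read) timeouts in seconds
        )
        response.raise_for_status()
        for line in response.iter_lines():
            if stop_requested.is_set():
                response.close()  # stop reading; closing the connection abandons the stream
                break
            if not line:
                continue
            data = json.loads(line)
            yield data.get('response', '')
            if data.get('done'):
                break
    except requests.exceptions.Timeout:
        print('\n[request timed out]')
    except requests.exceptions.ConnectionError:
        print('\n[connection interrupted]')

if __name__ == '__main__':
    for fragment in stream_generate('Tell me a story'):
        print(fragment, end='', flush=True)
```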
7. Advanced Streaming Processing:
```python
import ollama
from queue import Queue
from threading import Thread

def stream_to_queue(queue, model, prompt):
    for chunk in ollama.generate(model=model, prompt=prompt, stream=True):
        queue.put(chunk['response'])
    queue.put(None)  # End marker

# Use a queue to process the streaming response
queue = Queue()
thread = Thread(target=stream_to_queue, args=(queue, 'llama3.1', 'Tell me a story'))
thread.start()

while True:
    chunk = queue.get()
    if chunk is None:
        break
    print(chunk, end='', flush=True)

thread.join()
```
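This producer/consumer pattern decouples model generation from display, which is useful when the consumer is a GUI event loop, a WebSocket handler, or any code that must not block while waiting for the next token.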