Ollama supports running multiple models concurrently, which is useful for applications that need to serve several requests at once or route different tasks to different models.
1. View Running Models:
```bash
# View currently loaded models
ollama ps
```
Output example:
```shell
NAME        ID            SIZE    PROCESSOR    UNTIL
llama3.1    1234567890    4.7GB   100% GPU     4 minutes from now
mistral     0987654321    4.2GB   100% GPU     2 minutes from now
```
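If you need the same information programmatically, the Ollama server also exposes it over HTTP. A minimal sketch, assuming the server runs at the default address and your version provides the `/api/ps` endpoint:

```python
import requests

# Ask the local Ollama server which models are currently loaded
# (default address http://localhost:11434 assumed)
resp = requests.get("http://localhost:11434/api/ps", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    # Each entry includes the model name, size, and expiry time
    print(model["name"], model.get("size"), model.get("expires_at"))
```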
2. Concurrent Request Processing:
Recent versions of Ollama handle concurrent requests out of the box, so client code can simply issue requests in parallel:
```python
import ollama
import concurrent.futures

def generate_response(prompt, model):
    response = ollama.generate(model=model, prompt=prompt)
    return response['response']

# Execute multiple requests concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [
        executor.submit(generate_response, "Tell me a joke", "llama3.1"),
        executor.submit(generate_response, "Explain AI", "mistral"),
        executor.submit(generate_response, "Write code", "codellama")
    ]
    for future in concurrent.futures.as_completed(futures):
        print(future.result())
```
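If your application is already asynchronous, the same fan-out works without a thread pool. A sketch using `ollama.AsyncClient` (check that your installed version of the `ollama` package provides it; the exact response shape is an assumption):

```python
import asyncio
import ollama

async def generate(client, model, prompt):
    # Each coroutine sends one request; the server schedules them in parallel
    response = await client.generate(model=model, prompt=prompt)
    return model, response["response"]

async def main():
    client = ollama.AsyncClient()
    results = await asyncio.gather(
        generate(client, "llama3.1", "Tell me a joke"),
        generate(client, "mistral", "Explain AI"),
        generate(client, "codellama", "Write code"),
    )
    for model, text in results:
        print(f"{model}: {text[:80]}")

asyncio.run(main())
```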
3. Configure Concurrency Parameters:
Concurrency is configured on the Ollama server with environment variables set before starting `ollama serve`, not in the Modelfile:

```bash
# Number of requests each loaded model handles in parallel
export OLLAMA_NUM_PARALLEL=4
# Number of models kept loaded in memory at the same time
export OLLAMA_MAX_LOADED_MODELS=3
ollama serve
```

Per-model options such as the batch size can still be set in a Modelfile:

```dockerfile
FROM llama3.1
# Set batch size
PARAMETER num_batch 512
```
4. Use Different Models for Different Tasks:
```python
import ollama

# Use different models to process different types of tasks
def process_request(task_type, input_text):
    if task_type == "chat":
        return ollama.generate(model="llama3.1", prompt=input_text)
    elif task_type == "code":
        return ollama.generate(model="codellama", prompt=input_text)
    elif task_type == "analysis":
        return ollama.generate(model="mistral", prompt=input_text)
    else:
        raise ValueError(f"Unknown task type: {task_type}")
```
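For example, a few incoming requests can be routed through the dispatcher above:

```python
# Route example requests to the appropriate model
tasks = [
    ("chat", "What's the weather like on Mars?"),
    ("code", "Write a Python function that reverses a string"),
    ("analysis", "Summarize the pros and cons of remote work"),
]

for task_type, text in tasks:
    result = process_request(task_type, text)
    print(f"[{task_type}] {result['response'][:80]}...")
```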
5. Model Switching and Unloading:
```bash
# Manually unload a model (free memory)
ollama stop llama3.1

# Reload the model
ollama run llama3.1
```
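You can also unload a model through the HTTP API by sending a generate request with `keep_alive` set to 0, which tells the server to evict the model right away. A sketch against the default server address:

```python
import requests

# An empty generate call with keep_alive=0 asks the server to unload
# the model immediately after the request completes
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "keep_alive": 0},
    timeout=30,
)
```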
6. Resource Management Strategies:
Memory Management:
- Monitor memory usage
- Adjust concurrency based on available hardware resources (see the sizing sketch after this list)
- Regularly unload unused models
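One way to put the first two points into practice is to size the worker pool from the memory that is actually free. A rough sketch using the third-party `psutil` package; the 4 GB-per-model budget is an illustrative assumption, not a measured figure:

```python
import psutil

def pick_worker_count(per_model_gb=4, max_workers=4):
    """Choose a concurrency level based on currently available RAM."""
    available_gb = psutil.virtual_memory().available / 1024 ** 3
    # Run at most as many workers as whole models fit into free memory,
    # but always at least one
    return max(1, min(max_workers, int(available_gb // per_model_gb)))

print(f"Using {pick_worker_count()} concurrent workers")
```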
GPU Allocation:
```dockerfile
# Offload a specific number of layers to the GPU
PARAMETER num_gpu 35

# Or offload as many layers as possible
PARAMETER num_gpu 99
```
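Remember that a Modelfile only takes effect once you build a model from it and run that model; the model name below is just an example:

```bash
# Build a custom model from the Modelfile and run it
ollama create my-llama-gpu -f ./Modelfile
ollama run my-llama-gpu
```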
7. Advanced Concurrency Patterns:
```python
import ollama
from queue import Queue
import threading

class ModelPool:
    def __init__(self, models):
        self.models = models
        self.queue = Queue()

    def worker(self):
        while True:
            task = self.queue.get()
            if task is None:
                break
            model, prompt = task
            response = ollama.generate(model=model, prompt=prompt)
            print(f"{model}: {response['response'][:50]}...")
            self.queue.task_done()

    def start_workers(self, num_workers=3):
        for _ in range(num_workers):
            threading.Thread(target=self.worker, daemon=True).start()

    def add_task(self, model, prompt):
        self.queue.put((model, prompt))

# Use the model pool
pool = ModelPool(["llama3.1", "mistral", "codellama"])
pool.start_workers(3)
pool.add_task("llama3.1", "Hello")
pool.add_task("mistral", "Hi")
pool.add_task("codellama", "Write code")
pool.queue.join()  # wait for all queued tasks to finish before exiting
```
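Because every request passes through a single queue, this pattern also acts as a throttle: the number of worker threads caps how many requests reach the Ollama server at once, no matter how quickly tasks are added.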