Ollama performance can be optimized from several angles:
1. Model Quantization Selection: Ollama supports models with different quantization levels, balancing precision and performance:
- `q4_0` / `q4_k_m`: 4-bit quantization, fast, slightly lower precision
- `q5_0` / `q5_k_m`: 5-bit quantization, balanced speed and precision
- `q8_0`: 8-bit quantization, high precision, slower
- `f16`: 16-bit floating point, highest precision, slowest
```bash
# Download different quantization versions
ollama pull llama3.1:8b-q4_k_m
ollama pull llama3.1:8b-q8_0
```
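To confirm which quantization a pulled model actually uses, `ollama show` prints its metadata (parameter count, context length, quantization level); the tag below is simply the one pulled above:

```bash
# Inspect model details, including the quantization level
ollama show llama3.1:8b-q4_k_m
```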
2. Parameter Tuning:
```dockerfile
# Set in the Modelfile
PARAMETER temperature 0.7      # Control randomness, 0-1
PARAMETER top_p 0.9            # Nucleus sampling, 0-1
PARAMETER top_k 40             # Number of sampling candidates
PARAMETER num_ctx 4096         # Context window size (tokens)
PARAMETER repeat_penalty 1.1   # Repetition penalty
PARAMETER num_gpu 1            # Number of model layers offloaded to GPU
```
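As a minimal end-to-end sketch, the parameters live in a Modelfile alongside a base model, and `ollama create` builds a named variant from it (the name `llama3.1-tuned` is just an illustrative choice):

```bash
# Write a Modelfile with the tuned parameters, then build and run a named variant
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.1
EOF

ollama create llama3.1-tuned -f Modelfile
ollama run llama3.1-tuned
```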
3. GPU Acceleration: Ensure the GPU is properly configured:
```bash
# Linux (NVIDIA): select which GPU Ollama uses
export CUDA_VISIBLE_DEVICES=0
# macOS (Apple Silicon) uses Metal automatically, no configuration needed

# Check GPU usage
nvidia-smi
```
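To confirm that generation is actually running on the GPU and to get a concrete throughput number, `ollama run --verbose` prints timing statistics (load time, prompt eval rate, eval rate in tokens/s) after the response:

```bash
# Print timing statistics, including eval rate in tokens/s, after the reply
ollama run llama3.1 --verbose "Explain GPU offloading in one sentence."
```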
4. Batch Processing Optimization:
```bash
# Set batch size in the request options
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "options": {
    "num_batch": 512,
    "num_gpu": 99
  }
}'
```
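The same `options` block also works with the /api/chat endpoint, so request-level tuning is not limited to /api/generate (the values below are illustrative):

```bash
# Per-request options on the chat endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello"}],
  "options": {
    "num_batch": 512,
    "num_thread": 8
  }
}'
```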
5. Memory Management:
```dockerfile
# Limit context length to reduce memory usage
PARAMETER num_ctx 2048

# Use a smaller model
FROM llama3.1:8b   # instead of the 70b variant
```
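Server-side environment variables also help: limiting how many models stay resident and how long they remain loaded prevents several large models from occupying RAM/VRAM at once (the values are illustrative):

```bash
# Keep at most one model loaded and unload it 5 minutes after the last request
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_KEEP_ALIVE=5m
ollama serve
```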
6. Concurrent Requests:
Ollama can process multiple requests concurrently, but throughput is limited by hardware resources. Concurrency is configured on the server through the OLLAMA_NUM_PARALLEL environment variable (it is not set in the Modelfile):

```bash
# Allow the server to handle 2 requests per loaded model in parallel
export OLLAMA_NUM_PARALLEL=2
ollama serve
```
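If Ollama runs as a systemd service on Linux, the variable belongs in a service override rather than the shell:

```bash
# Add the variable to the service unit, then restart Ollama
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=2"
sudo systemctl restart ollama
```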
7. Caching Strategy: Ollama keeps recently used models loaded in memory, so the first request after startup is slower while the model loads, and subsequent requests are faster. By default a model stays loaded for about 5 minutes after the last request; this is controlled by the keep_alive setting.
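The keep-alive window can also be set per request, which is useful when calls arrive in bursts; here `30m` is just an illustrative value:

```bash
# Keep the model loaded for 30 minutes after this request completes
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "keep_alive": "30m"
}'
```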
Performance Monitoring:
```bash
# View loaded models and resource usage
ollama ps

# View server logs (Linux with systemd)
journalctl -u ollama -f

# macOS: logs are written to ~/.ollama/logs/server.log
```
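The same information as `ollama ps` is available over the HTTP API, which is handy for scripting a simple monitor:

```bash
# List running models, their memory footprint, and when they will be unloaded
curl http://localhost:11434/api/ps
```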