Performance Tuning

Optimize Ollama for your hardware and workload.

Understanding Performance Metrics

Tokens Per Second (TPS)

The primary measure of inference speed:

  • Prompt processing - How fast input is processed
  • Generation - How fast output is produced

# Benchmark a model
ollama run qwen3:32b "Write a 500-word essay about AI" --verbose
# Look for: eval rate: XX tokens/s
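The `--verbose` output reports an `eval count` (tokens generated) and an `eval duration`; the eval rate is simply their ratio. A minimal sketch of the arithmetic (the sample numbers are illustrative, not real benchmark output):

```python
# Compute generation speed from the timing fields `--verbose` prints.
# Sample values are made up for illustration.
eval_count = 180        # tokens generated ("eval count")
eval_duration_s = 4.0   # seconds spent generating ("eval duration")

eval_rate = eval_count / eval_duration_s
print(f"eval rate: {eval_rate:.1f} tokens/s")  # eval rate: 45.0 tokens/s
```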

Typical Performance

Model    RTX 4090   RTX 5090
7B Q4    120 t/s    150 t/s
32B Q4   45 t/s     60 t/s
70B Q4   15 t/s     28 t/s
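These figures translate directly into wall-clock time, since generation scales linearly with output length. A rough estimate, assuming ~1.3 tokens per English word (a rule of thumb, not an exact tokenizer ratio):

```python
# Estimate generation time for a 500-word response from a tokens/s figure.
# The 1.3 tokens-per-word ratio is an approximation.
words = 500
tokens = int(words * 1.3)   # ~650 tokens
tps = 45                    # 32B Q4 on an RTX 4090, from the table above

seconds = tokens / tps
print(f"{tokens} tokens at {tps} t/s = {seconds:.1f}s")
```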

Environment Variables

Essential Settings

# Number of GPU layers (default: all that fit)
$env:OLLAMA_NUM_GPU = 99

# Maximum loaded models (memory permitting)
$env:OLLAMA_MAX_LOADED_MODELS = 3

# Parallel request handling
$env:OLLAMA_NUM_PARALLEL = 4

# Keep models loaded (default: 5m)
$env:OLLAMA_KEEP_ALIVE = "30m"

Memory Management

# CPU layers for overflow (slower, but allows larger models)
$env:OLLAMA_NUM_CPU = 4

# Enable flash attention (reduces KV-cache memory use)
$env:OLLAMA_FLASH_ATTENTION = 1

Persistence

Set variables permanently:

[Environment]::SetEnvironmentVariable("OLLAMA_KEEP_ALIVE", "30m", "User")

Context Window Optimization

What is Context?

Context window = how much text the model can "see" at once. Larger context:

  • Allows longer conversations
  • Uses more VRAM
  • Slightly slower inference

Default Context

Model      Default Context   Max Context
llama3.1   4096              131072
qwen3      4096              32768
mistral    4096              32768

Adjusting Context

# Context is set per session; inside an `ollama run` prompt, use:
# Increase for long documents
/set parameter num_ctx 16384

# Reduce for faster responses
/set parameter num_ctx 2048

VRAM Impact

Context   Additional VRAM
4096      Base
8192      +500MB
16384     +1.5GB
32768     +4GB
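The extra VRAM is mostly KV cache, which grows linearly with context length. A back-of-envelope formula, using assumed dimensions for a hypothetical 32B-class model (FP16 cache, 2 bytes per element):

```python
# KV cache size = 2 (K and V) * layers * KV heads * head_dim * ctx * bytes
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical 32B-class dimensions: 64 layers, 8 KV heads, head_dim 128
gib = kv_cache_bytes(64, 8, 128, 16384) / 2**30
print(f"KV cache at 16K context: {gib:.1f} GiB")  # KV cache at 16K context: 4.0 GiB
```

Real models vary in layer count and KV-head count (grouped-query attention shrinks the cache considerably), so treat this as an order-of-magnitude estimate.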

Batch Processing

Concurrent Requests

For API-heavy workloads:

# Allow 4 parallel requests
$env:OLLAMA_NUM_PARALLEL = 4

# Each request gets its own context
# VRAM usage multiplies!
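Each parallel slot carries its own context, so the KV-cache overhead scales with `OLLAMA_NUM_PARALLEL`. A quick estimate using the +500MB-per-8K figure from the VRAM table above:

```python
# KV-cache overhead multiplies with the number of parallel request slots.
kv_overhead_mb = 500   # approx. extra VRAM for one 8192-token context
num_parallel = 4

total_mb = kv_overhead_mb * num_parallel
print(f"{num_parallel} slots x 8K context = ~{total_mb} MB extra VRAM")
```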

Batch Size Tuning

# Larger batches = faster throughput, more VRAM
$env:OLLAMA_BATCH_SIZE = 512 # Default varies by model

Model Loading

Keep Alive Settings

# Never unload (fast responses, uses VRAM)
$env:OLLAMA_KEEP_ALIVE = -1

# Unload after 5 minutes (default)
$env:OLLAMA_KEEP_ALIVE = "5m"

# Unload immediately (saves VRAM)
$env:OLLAMA_KEEP_ALIVE = 0

Preloading Models

# Load model at startup
ollama run qwen3:32b "" # Empty prompt just loads

# Or via API
curl http://localhost:11434/api/generate -d '{"model":"qwen3:32b"}'
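The same preload can be scripted: per the API call above, a generate request with no prompt just loads the model. A sketch using only the Python standard library (assumes the default local server address):

```python
import json
import urllib.request

def preload_request(model, keep_alive="30m", host="http://localhost:11434"):
    """Build a /api/generate request with no prompt, which only loads the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# urllib.request.urlopen(preload_request("qwen3:32b"))  # blocks until loaded
```

Setting `keep_alive` in the request overrides the server-wide `OLLAMA_KEEP_ALIVE` for that model.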

LRU Eviction

When VRAM fills, least-recently-used models are unloaded:

Model A loaded → Model B loaded → Model C loaded
↓ (VRAM full)
Model A unloaded (LRU) ← Model C used
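The policy above can be sketched with an ordered map: touching a model moves it to the back of the queue, and the front (least recently used) is evicted when the cache is full. A toy model of the behavior, not Ollama's actual implementation:

```python
from collections import OrderedDict

class ModelCache:
    """Toy LRU cache mirroring least-recently-used model unloading."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()

    def use(self, model):
        if model in self.loaded:
            self.loaded.move_to_end(model)  # mark as most recently used
            return
        if len(self.loaded) >= self.capacity:
            evicted, _ = self.loaded.popitem(last=False)  # drop the LRU entry
            print(f"unloaded {evicted}")
        self.loaded[model] = True

cache = ModelCache(capacity=2)
for m in ["A", "B", "C"]:   # loading C evicts A, the least recently used
    cache.use(m)
print(list(cache.loaded))   # ['B', 'C']
```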

Quantization Trade-offs

Quality vs Speed

Quantization   Quality   Speed     VRAM
Q8_0           99%       Slower    More
Q6_K           97%       Medium    Medium
Q4_K_M         95%       Fast      Less
Q4_0           92%       Fastest   Least

When to Use Each

  • Q8: When quality is critical (coding, reasoning)
  • Q4_K_M: General use (best balance)
  • Q4_0: Speed priority, acceptable quality loss
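Weight VRAM can be estimated from parameter count and bits per weight. The bits-per-weight figures below are rough approximations for each quantization family, not exact values:

```python
# Approximate effective bits per weight for common quantization formats.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "Q4_0": 4.5}

def weight_gb(params_billion, quant):
    """Rough weight size in GB (excludes KV cache and runtime overhead)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"32B {quant}: ~{weight_gb(32, quant):.0f} GB")
```

This is why a 32B model fits on a 24GB card at Q4 but not at Q8.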

GPU-Specific Optimization

NVIDIA Settings

# Check current GPU state
nvidia-smi

# Lock GPU clocks for consistent performance
nvidia-smi -pm 1 # Persistence mode
nvidia-smi -lgc 2100 # Lock graphics clock

Power Management

Mode          Performance   Power
Maximum       100%          100%
Balanced      95%           80%
Power Saver   70%           50%

# Set maximum performance
nvidia-smi -pl 575 # Set power limit (watts)

Monitoring Performance

Real-time GPU Stats

# Live monitoring
nvidia-smi -l 1

# Or more detailed
nvidia-smi dmon -s pucvmet

Ollama Metrics

# Check loaded models
ollama ps

# Model info including parameters
ollama show qwen3:32b

Benchmark Script

# Simple benchmark
$prompt = "Write a detailed explanation of quantum computing in 500 words."
$models = @("llama3.1:8b", "qwen3:32b")

foreach ($model in $models) {
    $start = Get-Date
    ollama run $model $prompt | Out-Null
    $duration = (Get-Date) - $start
    Write-Host "$model : $($duration.TotalSeconds)s"
}

Common Performance Issues

Slow First Response

Cause: Model loading from disk.

Fix: Increase keep-alive or preload:

$env:OLLAMA_KEEP_ALIVE = "1h"

Degrading Performance Over Time

Cause: Thermal throttling.

Fix: Improve cooling or reduce power limit:

nvidia-smi -pl 500  # Reduce from 575W to 500W

Out of Memory Errors

Cause: Too many models or too large context.

Fix:

# Reduce loaded models
$env:OLLAMA_MAX_LOADED_MODELS = 1

# Or reduce context (inside an ollama run session)
/set parameter num_ctx 2048

Slow Streaming

Cause: Network buffering or client-side issues.

Fix: Check API client settings:

# Test direct API
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:32b",
"prompt": "Hello",
"stream": true
}'
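With `"stream": true` the API returns newline-delimited JSON, one chunk per line, with `"done": true` on the final chunk. A minimal client-side parser (the sample chunks are illustrative):

```python
import json

def collect_stream(lines):
    """Reassemble a streamed /api/generate response from NDJSON lines."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Chunks in the shape the API streams back
sample = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo!", "done": true}',
]
print(collect_stream(sample))  # Hello!
```

If tokens arrive in bursts rather than steadily here, the server is fine and the problem is client-side buffering.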

Recommended Configurations

Interactive Chat (Single User)

$env:OLLAMA_KEEP_ALIVE = "30m"
$env:OLLAMA_MAX_LOADED_MODELS = 2
$env:OLLAMA_NUM_PARALLEL = 1

API Server (Multiple Users)

$env:OLLAMA_KEEP_ALIVE = "1h"
$env:OLLAMA_MAX_LOADED_MODELS = 1 # Dedicate VRAM
$env:OLLAMA_NUM_PARALLEL = 4

Batch Processing

$env:OLLAMA_KEEP_ALIVE = "-1"  # Never unload
$env:OLLAMA_NUM_PARALLEL = 8
$env:OLLAMA_BATCH_SIZE = 1024