Hardware Guide

Understanding GPU requirements and optimizing for your hardware.

Why VRAM Matters

Large Language Models (LLMs) load their weights into GPU memory (VRAM). The more VRAM you have, the larger the model you can run, and larger models generally produce noticeably better output.

VRAM vs Model Size

Model Parameters | FP16 VRAM | Q8 VRAM | Q4 VRAM
7B               | 14GB      | 8GB     | 4GB
13B              | 26GB      | 14GB    | 7GB
32B              | 64GB      | 34GB    | 18GB
70B              | 140GB     | 75GB    | 40GB

Quantization explained:

  • FP16 - Full precision, best quality
  • Q8 - 8-bit quantization, ~95% quality
  • Q4 - 4-bit quantization, ~90% quality

Rule of Thumb

For Q4 quantization: VRAM needed ≈ Parameters (B) × 0.6

Example: 32B model needs ~19GB VRAM at Q4.
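
As a sanity check, the rule can be wrapped in a small PowerShell helper. Get-VramEstimate is a hypothetical name, and the bytes-per-parameter figures are read off the table above rather than an official formula:

# Rough VRAM estimate from parameter count and quantization level
# (bytes per parameter derived from the table above: FP16 ≈ 2.0, Q8 ≈ 1.1, Q4 ≈ 0.6)
function Get-VramEstimate {
    param([double]$ParamsB, [string]$Quant = "Q4")
    $bytesPerParam = @{ FP16 = 2.0; Q8 = 1.1; Q4 = 0.6 }[$Quant]
    [math]::Round($ParamsB * $bytesPerParam, 1)   # result in GB
}

Get-VramEstimate -ParamsB 32 -Quant "Q4"   # -> 19.2, matching the ~19GB example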

GPU Recommendations

Budget: RTX 3060 (12GB) - ~$300

Best for: Learning, small models, quick experiments

Can run:

  • 7-8B models at Q8 (llama3.1:8b, mistral:7b)
  • 13-14B models at Q4 (qwen3:14b-q4)

Cannot run:

  • 32B+ models (insufficient VRAM)

Mid-Range: RTX 4090 (24GB) - ~$1,600

Best for: Serious local AI work, coding assistants

Can run:

  • 7-13B models at FP16
  • 32B models at Q4-Q6 (qwen3:32b, deepseek-r1:32b)
  • 70B models at Q2-Q3 (usable but degraded)

Sweet spot: 32B models with excellent quality.

High-End: RTX 5090 (32GB) - ~$2,000

Best for: Professional use, maximum model quality

Can run:

  • 32B models at Q8 (near-full quality)
  • 70B models at Q4 (llama3.3:70b-instruct-q4_K_M)
  • Multiple models loaded simultaneously

Why 32GB matters: The jump from 24GB to 32GB unlocks 70B models and allows running multiple smaller models concurrently.

Tested Configurations

Configuration 1: RTX 5090 (32GB)

Model           | Quantization | Load Time | Tokens/sec
qwen3:32b       | Q8           | 3.2s      | 45
deepseek-r1:32b | Q8           | 3.5s      | 42
llama3.3:70b    | Q4_K_M       | 8.1s      | 28

Configuration 2: RTX 4090 (24GB)

Model       | Quantization | Load Time | Tokens/sec
qwen3:32b   | Q4_K_M       | 2.8s      | 52
llama3.1:8b | Q8           | 0.9s      | 95
mistral:7b  | Q8           | 0.8s      | 105
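
To reproduce numbers like these on your own hardware, start ollama run with --verbose, which prints load duration and eval rate (tokens/sec) after each response:

# Print timing stats (load time, eval rate) after each response
ollama run qwen3:32b --verbose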

Multi-GPU Setups

Ollama automatically splits a model across multiple GPUs when it does not fit on one. To force layers to spread across all available GPUs even when the model would fit on a single card:

# Spread model layers across all available GPUs
$env:OLLAMA_SCHED_SPREAD = 1
ollama serve

Diminishing Returns

Multi-GPU adds PCIe transfer overhead. Two RTX 4090s are often slower than one RTX 5090 for single-model inference.
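
If the overhead outweighs the benefit, you can pin Ollama to a single card using the standard NVIDIA device selector:

# Restrict Ollama to GPU 0 before starting the server
$env:CUDA_VISIBLE_DEVICES = 0
ollama serve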

CPU Offloading

When VRAM is insufficient, Ollama automatically keeps overflow layers in system RAM. You can also cap the GPU layer count yourself; the knob is the num_gpu parameter, which counts layers on the GPU (the rest run on CPU):

# Offload only 30 layers to the GPU; the remaining layers run on CPU
ollama run llama3.3:70b
>>> /set parameter num_gpu 30

Trade-off: CPU layers are 10-50x slower than GPU layers. Use this only when you must run a specific model.
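
The same cap can also be applied per request through the REST API. A minimal sketch, with the model, prompt, and layer count as illustrative values:

# Cap GPU layers for a single request via the API (values are illustrative)
$body = @{
    model   = "llama3.3:70b"
    prompt  = "Why is the sky blue?"
    stream  = $false
    options = @{ num_gpu = 30 }
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -Body $body -ContentType "application/json"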

Storage Considerations

Model Storage

Models are stored in ~/.ollama/models/ (on Windows, %USERPROFILE%\.ollama\models). Typical sizes:

Model  | Disk Space
7B Q4  | 4GB
13B Q4 | 8GB
32B Q4 | 18GB
70B Q4 | 40GB
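
To check what is already installed and how much disk space each model occupies:

# List installed models and their on-disk sizes
ollama list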

Recommendation: Use NVMe SSD for model storage. Load times are 2-3x faster than SATA SSD.

Moving Model Storage

# Point Ollama at a different drive for the current session
$env:OLLAMA_MODELS = "D:\ollama-models"
# Persist the setting for future sessions (User scope)
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama-models", "User")
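
Note that the variable only changes where Ollama looks; move any previously downloaded models as well (assuming the default Windows path and that D:\ollama-models does not already exist):

# Relocate existing models to the new directory
Move-Item "$env:USERPROFILE\.ollama\models" "D:\ollama-models"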

Performance Optimization

Flash Attention

Reduces VRAM usage and improves speed for long contexts. Depending on your Ollama version, it may need to be enabled explicitly rather than being on by default.
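
The documented toggle is the OLLAMA_FLASH_ATTENTION environment variable:

# Enable flash attention before starting the server
$env:OLLAMA_FLASH_ATTENTION = 1
ollama serve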

Context Window

Larger context = more VRAM. Default is typically 2048-4096 tokens.

# Increase context from inside the REPL (uses more VRAM)
ollama run qwen3:32b
>>> /set parameter num_ctx 8192
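
To make the larger context permanent, it can be baked into a model variant with a Modelfile (qwen3-8k is a hypothetical name):

# Modelfile
FROM qwen3:32b
PARAMETER num_ctx 8192

# Build and run the variant
ollama create qwen3-8k -f Modelfile
ollama run qwen3-8k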

Batch Size

For API-heavy workloads:

# Keep up to 3 models loaded in VRAM simultaneously
$env:OLLAMA_MAX_LOADED_MODELS = 3
# Serve up to 4 requests per model in parallel (each adds KV-cache VRAM)
$env:OLLAMA_NUM_PARALLEL = 4

Cooling and Power

Power Requirements

GPU      | TDP  | PSU Recommendation
RTX 3060 | 170W | 550W+
RTX 4090 | 450W | 850W+
RTX 5090 | 575W | 1000W+

Thermal Throttling

LLM inference is sustained load. Monitor temperatures:

# Real-time GPU stats
nvidia-smi -l 1
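
For a compact log of just the throttling-relevant values, nvidia-smi's query mode works well:

# Log temperature, utilization, and VRAM use every second
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv -l 1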

Target: Keep GPU under 80°C for consistent performance. Consider aftermarket cooling for sustained workloads.