Hardware Guide
Understanding GPU requirements and optimizing for your hardware.
Why VRAM Matters
Large Language Models (LLMs) load their weights into GPU memory. More VRAM lets you run larger models, or the same model at higher precision, and larger models generally produce better results.
VRAM vs Model Size
| Model Parameters | FP16 VRAM | Q8 VRAM | Q4 VRAM |
|---|---|---|---|
| 7B | 14GB | 8GB | 4GB |
| 13B | 26GB | 14GB | 7GB |
| 32B | 64GB | 34GB | 18GB |
| 70B | 140GB | 75GB | 40GB |
Quantization explained:
- FP16 - Full 16-bit precision, best quality, largest memory footprint
- Q8 - 8-bit quantization, near-lossless for most tasks (~95% of FP16 quality)
- Q4 - 4-bit quantization, noticeable but usually acceptable quality loss (~90%)
For Q4 quantization: VRAM needed ≈ Parameters (B) × 0.6
Example: 32B model needs ~19GB VRAM at Q4.
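The same back-of-the-envelope math as a reusable snippet (Get-VramEstimate is a hypothetical helper, not part of Ollama): roughly 2.0 bytes per parameter for FP16, ~1.1 for Q8, and ~0.6 for Q4, before KV-cache and runtime overhead.
# Hypothetical helper: rough VRAM estimate in GB from parameter count and bytes per weight
function Get-VramEstimate {
    param([double]$ParamsBillions, [double]$BytesPerParam = 0.6)
    [math]::Round($ParamsBillions * $BytesPerParam, 1)
}
Get-VramEstimate -ParamsBillions 32 -BytesPerParam 0.6   # ~19.2 GB (Q4)
Get-VramEstimate -ParamsBillions 70 -BytesPerParam 2.0   # ~140 GB (FP16)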
GPU Recommendations
Budget: RTX 3060 (12GB) - ~$300
Best for: Learning, small models, quick experiments
Can run:
- 7-8B models at Q8 (llama3.1:8b, mistral:7b)
- 13-14B models at Q4 (qwen3:14b-q4)
Cannot run:
- 32B+ models (insufficient VRAM)
Mid-Range: RTX 4090 (24GB) - ~$1,600
Best for: Serious local AI work, coding assistants
Can run:
- 7-13B models at FP16
- 32B models at Q4-Q6 (qwen3:32b, deepseek-r1:32b)
- 70B models at Q2-Q3 (usable but degraded)
Sweet spot: 32B models with excellent quality.
High-End: RTX 5090 (32GB) - ~$2,000
Best for: Professional use, maximum model quality
Can run:
- 32B models at Q8 (near-full quality)
- 70B models at Q4 (llama3.3:70b-instruct-q4_K_M; Q4 weights run ~40GB, so a few layers spill to system RAM)
- Multiple models loaded simultaneously
Why 32GB matters: The jump from 24GB to 32GB unlocks 70B models and allows running multiple smaller models concurrently.
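To get a specific quantization rather than the default tag, pull the full tag; the llama3.3 tag below is the one referenced above.
# Pull an explicit quantization tag instead of the default
ollama pull llama3.3:70b-instruct-q4_K_M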
Tested Configurations
Configuration 1: RTX 5090 (32GB)
| Model | Quantization | Load Time | Tokens/sec |
|---|---|---|---|
| qwen3:32b | Q8 | 3.2s | 45 |
| deepseek-r1:32b | Q8 | 3.5s | 42 |
| llama3.3:70b | Q4_K_M | 8.1s | 28 |
Configuration 2: RTX 4090 (24GB)
| Model | Quantization | Load Time | Tokens/sec |
|---|---|---|---|
| qwen3:32b | Q4_K_M | 2.8s | 52 |
| llama3.1:8b | Q8 | 0.9s | 95 |
| mistral:7b | Q8 | 0.8s | 105 |
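To measure the same numbers on your own hardware, run a model with --verbose; Ollama prints load duration and eval rate (tokens/sec) after each response:
# Print timing statistics (load duration, prompt eval rate, eval rate) after the reply
ollama run qwen3:32b --verbose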
Multi-GPU Setups
Ollama automatically splits a model across multiple GPUs when it does not fit in a single card's VRAM. To spread layers across all GPUs even when a model would fit on one:
# Always spread the model across all available GPUs
$env:OLLAMA_SCHED_SPREAD = 1
ollama serve
Multi-GPU adds PCIe transfer overhead. Two RTX 4090s are often slower than one RTX 5090 for single-model inference.
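To confirm a model is actually being split, check per-GPU memory use while it is loaded:
# Per-GPU memory use
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv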
CPU Offloading
When VRAM is insufficient, Ollama automatically offloads layers to system RAM. You can also cap how many layers stay on the GPU with the num_gpu parameter, set inside an interactive session:
# Keep 40 layers on the GPU; the remaining layers run on the CPU
ollama run llama3.3:70b
/set parameter num_gpu 40
Trade-off: CPU layers are 10-50x slower than GPU layers. Use this only when you must run a specific model.
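ollama ps shows how a loaded model is split; the PROCESSOR column reads something like "100% GPU" or "25%/75% CPU/GPU":
# Show loaded models and their CPU/GPU split
ollama ps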
Storage Considerations
Model Storage
Models are stored in ~/.ollama/models/. Typical sizes:
| Model | Disk Space |
|---|---|
| 7B Q4 | 4GB |
| 13B Q4 | 8GB |
| 32B Q4 | 18GB |
| 70B Q4 | 40GB |
Recommendation: Use NVMe SSD for model storage. Load times are 2-3x faster than SATA SSD.
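ollama list shows what is already on disk and how much space each model takes:
# List downloaded models with their on-disk size
ollama list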
Moving Model Storage
# Move to different drive
$env:OLLAMA_MODELS = "D:\ollama-models"
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama-models", "User")
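The new path takes effect the next time the server starts. If you already have models downloaded, copy them over or re-pull them; a minimal sketch, assuming the D:\ path above:
# One-time migration of existing models to the new location
New-Item -ItemType Directory -Force -Path "D:\ollama-models" | Out-Null
Copy-Item "$HOME\.ollama\models\*" "D:\ollama-models\" -Recurse
ollama serve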
Performance Optimization
Flash Attention
Flash attention reduces VRAM usage (mainly the KV cache) and improves speed for long contexts. Depending on your Ollama version it may already be enabled by default.
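If your version does not enable it automatically, the OLLAMA_FLASH_ATTENTION variable (listed in ollama serve --help) turns it on explicitly:
# Enable flash attention explicitly, then restart the server
$env:OLLAMA_FLASH_ATTENTION = 1
ollama serve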
Context Window
Larger context = more VRAM. Default is typically 2048-4096 tokens.
# Increase context (uses more VRAM); set num_ctx inside the interactive session
ollama run qwen3:32b
/set parameter num_ctx 8192
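For API calls, the context window is set per request via options.num_ctx; this example assumes the default server address of localhost:11434:
# Request an 8K context window for a single API call
$body = @{
    model   = "qwen3:32b"
    prompt  = "Summarize the following text..."
    stream  = $false
    options = @{ num_ctx = 8192 }
} | ConvertTo-Json -Depth 3
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body $body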
Concurrency
For API-heavy workloads, keep several models resident (OLLAMA_MAX_LOADED_MODELS) and serve multiple requests per model in parallel (OLLAMA_NUM_PARALLEL):
$env:OLLAMA_MAX_LOADED_MODELS = 3
$env:OLLAMA_NUM_PARALLEL = 4
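Each parallel slot gets its own context allocation, so higher parallelism roughly multiplies KV-cache VRAM use. A quick way to exercise it is to fire overlapping requests (hypothetical prompts, default server address assumed):
# Fire two overlapping requests at the same model
1..2 | ForEach-Object {
    Start-Job -ScriptBlock {
        $body = @{ model = "llama3.1:8b"; prompt = "Say hello"; stream = $false } | ConvertTo-Json
        Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body $body
    }
} | Receive-Job -Wait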
Cooling and Power
Power Requirements
| GPU | TDP | PSU Recommendation |
|---|---|---|
| RTX 3060 | 170W | 550W+ |
| RTX 4090 | 450W | 850W+ |
| RTX 5090 | 575W | 1000W+ |
Thermal Throttling
LLM inference is sustained load. Monitor temperatures:
# Real-time GPU stats
nvidia-smi -l 1
Target: Keep GPU under 80°C for consistent performance. Consider aftermarket cooling for sustained workloads.
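For a focused log of throttling-relevant numbers (temperature, power draw, clocks), nvidia-smi's query mode works well:
# Log temperature, power draw, SM clock, and utilization every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm,utilization.gpu --format=csv -l 5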