How to Diagnose and Fix CUDA Out of Memory (OOM) Errors in Local LLMs (Error 901)
Dealing with PyTorch CUDA allocation errors while running Llama 3, DeepSeek-R1, or Mistral? Learn how to configure PYTORCH_CUDA_ALLOC_CONF, choose a GGUF quantization level, and offload layers to system RAM.
# Understanding CUDA Out-of-Memory (OOM) in Local AI Models
When loading large language models (LLMs) locally, developers frequently encounter the dreaded torch.cuda.OutOfMemoryError. It is raised when the combined size of the model weights, KV cache, and activation tensors exceeds the VRAM still free on your graphics card (e.g., an NVIDIA RTX series GPU).
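To make that budget concrete, here is a rough back-of-the-envelope sketch in Python. The function names are hypothetical, and the example configuration (32 layers, 8 KV heads, head dimension 128, the published shape of Llama 3 8B) and the 16 GiB weight figure are assumptions for illustration. Activation memory and framework overhead are not modeled separately; they are covered only by the headroom fraction.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token (fp16 = 2 bytes per value)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def fits_in_vram(weight_bytes, cache_bytes, vram_bytes, headroom=0.10):
    """True if weights plus KV cache fit while leaving `headroom` of VRAM free."""
    return weight_bytes + cache_bytes <= vram_bytes * (1.0 - headroom)


# Assumed example: an 8B-class model at an 8192-token context.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache: {cache / GIB:.2f} GiB")

# Assumed ~16 GiB of fp16 weights on a 24 GiB card versus a 16 GiB card.
print(fits_in_vram(16 * GIB, cache, 24 * GIB))
print(fits_in_vram(16 * GIB, cache, 16 * GIB))
```

The KV cache grows linearly with context length, so a long context window can push a model that otherwise fits over the limit.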
# Step 1: Optimize PyTorch Allocator Settings
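PyTorch's caching allocator holds on to freed VRAM blocks and splits them to serve smaller requests. Over time this can fragment memory, so an allocation fails even though enough total VRAM is free. Setting max_split_size_mb in the PYTORCH_CUDA_ALLOC_CONF environment variable stops the allocator from splitting blocks larger than the given size (in MB), which can reduce fragmentation.
For Linux (including WSL):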
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128"
For Windows PowerShell:
$env:PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128"
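The same setting can be applied from inside a Python script, provided it happens before PyTorch initializes CUDA. The following is a minimal sketch: the helper name `configure_allocator` is hypothetical, and the `torch` import is left commented out so the snippet runs on a machine without PyTorch installed.

```python
import os


def configure_allocator(max_split_size_mb=128):
    """Set PYTORCH_CUDA_ALLOC_CONF unless a value was already exported.

    Returns the value that is in effect afterwards.
    """
    value = f"max_split_size_mb:{max_split_size_mb}"
    # setdefault keeps any value the user already set in the shell.
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", value)


active = configure_allocator(128)
print(f"PYTORCH_CUDA_ALLOC_CONF={active}")

# Import torch only after the variable is set; this is the safe ordering,
# since the allocator reads its configuration when CUDA memory is first used.
# import torch
```

Using `setdefault` means a value exported in the shell takes precedence over the script's default, which keeps the script easy to override while experimenting.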
# Step 2: Leverage GGUF Quantization Levels
Running a model in 16-bit floating point (fp16) requires roughly 2 GB of VRAM per billion parameters for the weights alone. Quantized weights in GGUF format reduce the number of bits stored per weight: Q8_0 roughly halves the footprint, and Q4_K_M cuts it by about 70%, usually with a small quality loss that is more noticeable at 4 bits than at 8.
1. For a 7B parameter model, fp16 needs 14GB VRAM.
2. Q4_K_M quantization drops this down to ~4.5GB VRAM.
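The arithmetic behind these figures can be sketched in a few lines of Python. The fp16 and Q8_0 values follow from their storage layouts (Q8_0 stores 32 eight-bit weights plus one 16-bit scale per block, giving 8.5 bits per weight). The Q4_K_M value is an assumed average that varies from model to model, and the function name `weight_gb` is hypothetical.

```python
# Approximate bits per weight. The Q4_K_M entry is an assumed average;
# real files differ because the format mixes quantization types per tensor.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weight_gb(params_billion, fmt):
    """Estimated weight size in GB (10^9 bytes) for a given format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8.0


baseline = weight_gb(7, "fp16")
for fmt in BITS_PER_WEIGHT:
    size = weight_gb(7, fmt)
    saving = 1.0 - size / baseline
    print(f"{fmt:7s} {size:5.2f} GB  ({saving:.0%} smaller than fp16)")
```

These estimates cover the weights only. The KV cache and runtime buffers come on top, which is why observed VRAM usage is somewhat higher than the file size.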
# Step 3: Configure Layer Offloading in llama.cpp or Ollama
If your GPU's VRAM is slightly smaller than the model requires, you can keep some of the model's transformer layers in system memory (RAM) and run them on the CPU, at the cost of slower generation.
With Ollama, this split is calculated automatically. On standalone llama.cpp deployments (where recent builds name the binary llama-cli rather than main), set the number of layers placed on the GPU with the -ngl or --n-gpu-layers flag:
```
./main -m Llama-3-8B-Q4_K_M.gguf -ngl 32 -p "Tell me about neural nets"
```
Raise the value until VRAM utilization (as reported by nvidia-smi, for example) sits at around 90-95%, and lower it if loading or generation fails with an OOM error.
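A starting value for the layer count can be estimated instead of guessed. The sketch below is only a heuristic: it assumes all layers are the same size, ignores the embedding and output tensors, and reserves a fixed amount of VRAM for the KV cache and runtime buffers. The function name, the 1.5 GB reserve, and the roughly 4.9 GB model size are assumptions for illustration.

```python
def suggest_gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit on the GPU.

    Assumes equal-sized layers and keeps `reserve_gb` free for the
    KV cache and runtime buffers. Treat the result as a first guess.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed example: a ~4.9 GB, 32-layer model on a 4 GB card.
print(suggest_gpu_layers(model_gb=4.9, n_layers=32, vram_gb=4.0))
```

Pass the result to -ngl as a first guess, then tune it up or down while watching actual VRAM usage.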