OffloadKqv
`OffloadKqv` controls whether the KQV (attention) computation and the KV cache itself live in GPU memory. On GPU-enabled builds the default is to offload; disable it only when fighting for VRAM.
Quick reference
| | |
|---|---|
| Type | `bool?` |
| Default | `null` (uses native default — typically `true` on GPU builds) |
| Category | KV cache / GPU |
| Field on | `ContextParameters.OffloadKqv` |
What it does
- `true` — KV cache tensors and attention computation live on the GPU. Improves throughput; uses more VRAM.
- `false` — the KV cache stays on the CPU even when layers are offloaded via `GpuLayers`. Reduces VRAM usage; slower, because the GPU must read KV data from host memory.
- `null` — native default (usually `true` on GPU builds; irrelevant on CPU builds).
Disabling it is useful on GPUs where the weights fit but the KV cache would push you into an out-of-memory failure at long contexts.
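To see the scale involved, you can estimate the cache size directly. A minimal sketch, assuming a common grouped-query-attention shape (32 layers, 8 KV heads, head dimension 128) and an f16 cache; none of these numbers come from any particular model:

```csharp
// Rough KV cache footprint: K and V, per layer, per position, per KV head.
// All model dimensions below are assumed for illustration, not taken from
// any specific model.
long layers = 32, kvHeads = 8, headDim = 128, bytesPerElem = 2; // f16 cache
long contextSize = 131072;                                      // 128K tokens

long kvBytes = 2 * layers * contextSize * kvHeads * headDim * bytesPerElem;
Console.WriteLine($"KV cache ≈ {kvBytes / (1024.0 * 1024 * 1024):F1} GiB"); // ≈ 16.0 GiB
```

At a 131072-token context this works out to roughly 16 GiB for the cache alone, which is why long contexts can OOM even when the weights fit comfortably.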
When to change it
| Scenario | Value |
|---|---|
| Default GPU inference | `null` (resolves to `true`) |
| Short on VRAM at long context — trade speed for memory | `false` |
| CPU-only | `null` (irrelevant) |
Example
```csharp
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999; // full weight offload
preset.ContextParameters.OffloadKqv = false;         // but keep KV on CPU to save VRAM
preset.ContextParameters.ContextSize = 131072;       // long context
```
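If context size varies at runtime, the same estimate can drive the setting programmatically. A sketch under stated assumptions: `ChooseOffloadKqv` is a hypothetical helper, not part of the library API, and the free-VRAM and weight-size figures must come from your own measurements:

```csharp
// Hypothetical helper: keep KQV on the GPU only if the estimated KV cache
// fits into the VRAM left over after the weights, with a ~10% safety margin.
static bool? ChooseOffloadKqv(long freeVramBytes, long weightsBytes, long kvBytes)
{
    long headroom = freeVramBytes - weightsBytes;
    // null defers to the native default (offload on GPU builds);
    // false forces the cache into host memory.
    return kvBytes <= headroom * 9 / 10 ? (bool?)null : false;
}
```

Wired into the example above: `preset.ContextParameters.OffloadKqv = ChooseOffloadKqv(freeVram, weights, kvBytes);`.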
Interactions
- `GpuLayers` — with `OffloadKqv = false`, GPU layers access KV from the CPU: slower, but saves VRAM.
- `TypeK`, `TypeV` — quantizing the KV cache reduces its memory footprint regardless of placement.
- `FlashAttentionMode` — flash attention reduces KV memory pressure.
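Before pushing the whole cache to the CPU, it is often enough to combine the milder levers above and keep KQV on the GPU. A sketch; the enum member names (`GgmlType.Q8_0`, `FlashAttentionMode.Enabled`) are assumptions about this API, so check the actual spellings in your version. Note that in llama.cpp-based backends, quantizing the V cache generally requires flash attention to be enabled.

```csharp
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;  // full weight offload
preset.ContextParameters.OffloadKqv = null;           // keep KV on GPU (native default)
preset.ContextParameters.TypeK = GgmlType.Q8_0;       // assumed enum name: ~halves K cache vs f16
preset.ContextParameters.TypeV = GgmlType.Q8_0;       // assumed enum name: ~halves V cache vs f16
preset.ContextParameters.FlashAttentionMode = FlashAttentionMode.Enabled; // assumed member name
```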
What’s next
- TypeK, TypeV — KV dtype.
- GpuLayers — weight offload.
- Out of memory troubleshooting — memory pressure recipes.