OffloadKqv

OffloadKqv controls whether the KQV (attention) computation and the KV cache itself live in GPU memory. On GPU-enabled builds, the default is to offload; disable it only when fighting for VRAM.

Quick reference

Type      bool?
Default   null (uses native default — typically true on GPU builds)
Category  KV cache / GPU
Field     ContextParameters.OffloadKqv

What it does

  • true — KV cache tensors and attention computation live on the GPU. Benefits throughput; uses more VRAM.
  • false — KV cache stays in host (CPU) memory even when layers are offloaded via GpuLayers. Reduces VRAM usage; slower, since attention must work against KV data in host memory.
  • null — native default (usually true on GPU builds, irrelevant on CPU builds).

Disabling is useful on GPUs where the weights fit but the KV cache would push you into OOM at long contexts.
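A back-of-the-envelope estimate shows why. For a hypothetical 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128; illustrative numbers, not taken from any specific model), an f16 KV cache at the 131072-token context used in the example below works out to:

```csharp
using System;

// KV-cache size estimate (hypothetical 7B-class model; all numbers illustrative).
long nLayers = 32, nCtx = 131072, nKvHeads = 8, headDim = 128, bytesPerElem = 2; // f16 K/V
long kvBytes = 2 * nLayers * nCtx * nKvHeads * headDim * bytesPerElem;           // factor 2: K and V
Console.WriteLine($"KV cache: {kvBytes / (1L << 30)} GiB");                      // prints "KV cache: 16 GiB"
```

16 GiB for the cache alone can easily exceed whatever VRAM remains after loading the weights; OffloadKqv = false moves that cost into system RAM instead.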

When to change it

Scenario                                                 Value
Default GPU inference                                    null (true)
Short on VRAM at long context (trade speed for memory)   false
CPU-only                                                 null (irrelevant)

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;    // full offload
preset.ContextParameters.OffloadKqv = false;            // but keep KV on CPU to save VRAM
preset.ContextParameters.ContextSize = 131072;          // long context

Interactions

  • GpuLayers — with OffloadKqv = false, even fully offloaded layers compute attention against host-memory KV, which is slower but saves VRAM.
  • TypeK, TypeV — quantizing KV reduces memory regardless of placement.
  • FlashAttentionMode — FA reduces KV memory pressure.
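If the goal is only to shrink the KV cache rather than move it off the GPU, quantizing K and V via TypeK/TypeV is often the better first step, since attention stays on the GPU. A sketch, assuming a GGMLType enum with a q8_0 value exposed by the underlying llama.cpp bindings (verify the exact names in your build):

```csharp
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;       // full offload
preset.ContextParameters.OffloadKqv = true;                // keep KV (and attention) on the GPU
// Shrink the cache instead: q8_0 K/V is roughly half the size of f16.
// The enum and value names below are assumptions; check your bindings.
preset.ContextParameters.TypeK = GGMLType.GGML_TYPE_Q8_0;
preset.ContextParameters.TypeV = GGMLType.GGML_TYPE_Q8_0;  // quantized V typically requires flash attention in llama.cpp
```

Either approach trades a little quality or speed for memory; quantized KV keeps throughput closer to the all-GPU default than OffloadKqv = false does.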

What’s next