TypeK

TypeK is the data type used to store the K (keys) tensor of the KV cache. Choosing a smaller dtype reduces KV cache memory at the cost of some quality loss, which grows as precision drops.

Quick reference

Type            GgmlType? enum (39 values)
Default         null (use native default, usually F16)
Common values   F32, F16, BF16, Q8_0, Q5_1, Q4_0
Category        KV cache
Field on        ContextParameters.TypeK

What it does

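For each layer, the engine stores one K tensor of shape (kv_heads × seq_len × head_dim), where kv_heads can be smaller than the attention head count in models that use grouped-query attention. The dtype controls memory per element: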

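Dtype   Nominal bits   Size vs F32   Quality impact
F32     32             1.0×          None. Rarely worth the memory.
F16     16             0.5×          Default. Minimal impact.
BF16    16             0.5×          Alternative to F16. Same size, wider exponent range, fewer mantissa bits.
Q8_0    8              0.25×         Very small quality loss for substantial savings.
Q5_1    5              ~0.16×        Noticeable at long contexts.
Q4_0    4              ~0.125×       Aggressive; use only under strong memory pressure.

Bit widths are nominal. Block-quantized types also store per-block scale data, so in the standard ggml layouts the effective cost is about 8.5 (Q8_0), 6 (Q5_1), and 4.5 (Q4_0) bits per element.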

Rule of thumb: K is more sensitive to precision than V. Prefer to quantize V more aggressively than K.
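
To put numbers on the dtype table, the following sketch estimates the total K cache size for each dtype. It is an illustration only, not part of the Aspose.LLM API: the class KCacheSize, its members, and the model shape (32 layers, 8 KV heads, head_dim 128, 32,768 tokens) are assumptions chosen for the example. It uses effective rather than nominal bit widths, assuming the standard ggml block layouts in which every 32-element block also stores scale data (34 bytes per block for Q8_0, 24 for Q5_1, 18 for Q4_0).

```csharp
using System;
using System.Collections.Generic;

public static class KCacheSize
{
    // Effective bits per element, including per-block overhead.
    // Assumes standard ggml block layouts (32 elements per block).
    public static readonly IReadOnlyDictionary<string, double> BitsPerElement =
        new Dictionary<string, double>
        {
            ["F32"] = 32.0,
            ["F16"] = 16.0,
            ["BF16"] = 16.0,
            ["Q8_0"] = 8.5, // 34 bytes / 32 elements
            ["Q5_1"] = 6.0, // 24 bytes / 32 elements
            ["Q4_0"] = 4.5, // 18 bytes / 32 elements
        };

    // Bytes needed to hold the K tensors of all layers for one sequence.
    public static double KBytes(string dtype, int layers, int kvHeads, int headDim, int seqLen)
    {
        long elements = (long)layers * kvHeads * headDim * seqLen;
        return elements * BitsPerElement[dtype] / 8.0;
    }

    public static void Main()
    {
        // Hypothetical model shape, chosen only for illustration.
        const int layers = 32, kvHeads = 8, headDim = 128, seqLen = 32768;

        foreach (var dtype in BitsPerElement.Keys)
        {
            double mib = KBytes(dtype, layers, kvHeads, headDim, seqLen) / (1024.0 * 1024.0);
            Console.WriteLine($"{dtype,-5} {mib,10:F1} MiB");
        }
    }
}
```

With this hypothetical shape the K cache holds 2^30 elements, so F16 needs 2048 MiB and Q8_0 needs 1088 MiB. The V cache adds its own cost on top, governed by TypeV.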

When to change it

Scenario                             Value
Default                              null (F16)
Long context, mild memory pressure   F16 (keep K high precision)
Long context, tight memory           Q8_0
Extreme memory constraint            Q5_1 (accept quality drop)
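
The scenarios above amount to picking the most precise dtype that still fits the memory you can spare for K. A minimal sketch of that selection follows. The class KTypePicker, the TryPick method, and the byte budget are hypothetical and not part of the Aspose.LLM API; the candidate names mirror the GgmlType values in the table, and the effective bit widths assume the standard ggml block layouts.

```csharp
using System;

public static class KTypePicker
{
    // Candidates ordered from most to least precise, with effective bits per element
    // (assumption: standard ggml block layouts with 32 elements per block).
    private static readonly (string Name, double Bits)[] Candidates =
    {
        ("F16", 16.0),
        ("Q8_0", 8.5),
        ("Q5_1", 6.0),
    };

    // Finds the most precise candidate whose K cache fits within budgetBytes.
    // Returns false (and an empty name) when no candidate fits.
    public static bool TryPick(long elements, double budgetBytes, out string name)
    {
        foreach (var (candidate, bits) in Candidates)
        {
            if (elements * bits / 8.0 <= budgetBytes)
            {
                name = candidate;
                return true;
            }
        }
        name = string.Empty;
        return false;
    }

    public static void Main()
    {
        long elements = 1L << 30;                      // hypothetical: layers × KV heads × head_dim × seq_len
        double budget = 1.5 * 1024 * 1024 * 1024;      // hypothetical 1.5 GiB budget for the K cache

        if (TryPick(elements, budget, out var name))
            Console.WriteLine($"Use TypeK = {name}");
        else
            Console.WriteLine("No candidate fits; reduce ContextSize instead.");
    }
}
```

With these hypothetical numbers F16 would need 2 GiB and does not fit, so the sketch selects Q8_0. A size-only rule like this ignores quality, so treat its answer as a starting point and validate output quality on your own prompts.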

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.TypeK = GgmlType.F16;   // keep K precise
preset.ContextParameters.TypeV = GgmlType.Q8_0;  // quantize V

using var api = AsposeLLMApi.Create(preset);

Interactions

  • TypeV — companion V-cache dtype.
  • ContextSize — cache cost scales with context; quantization pays off more at long contexts.
  • FlashAttentionMode — works with quantized KV on recent backends.

What’s next