TypeK

TypeK is the data type used to store the K (keys) tensor of the KV cache. Choosing a smaller dtype reduces the memory taken by the K half of the cache, at the cost of quality loss that grows as precision drops.

Quick reference


Type	`GgmlType?` enum (39 values)
Default	`null` (use native default, usually `F16`)
Common values	`F32`, `F16`, `BF16`, `Q8_0`, `Q5_1`, `Q4_0`
Category	KV cache
Field on	`ContextParameters.TypeK`

What it does

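For each layer, the engine stores one K tensor of shape (heads × seq_len × head_dim), where heads is the number of key/value heads (smaller than the number of query heads in grouped-query attention models). The dtype controls memory per element: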

Dtype	Bits	Relative size	Quality impact
`F32`	32	1.0×	None. Rarely worth the memory.
`F16`	16	0.5×	Default. Minimal impact.
`BF16`	16	0.5×	Alternative to F16. Same exponent range as F32, but fewer mantissa bits than F16.
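`Q8_0`	8.5	~0.27×	Very small quality loss for substantial savings.
`Q5_1`	6	~0.19×	Noticeable at long contexts.
`Q4_0`	4.5	~0.14×	Aggressive; use only under strong memory pressure.

Bits for the quantized types are effective values: each block of 32 elements also stores a 16-bit scale (and, for `Q5_1`, a 16-bit offset), so the real cost is slightly above the nominal 8, 5, and 4 bits.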
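To make the table concrete, here is a minimal sketch that estimates the K cache size for one model shape. The layer count, head count, head dimension, and context size are hypothetical values picked for illustration, and the helper names `BitsPerElement` and `KCacheBytes` are not part of the library. The bits-per-element figures assume the standard ggml block layouts (32 elements per block plus a per-block scale), which is why they sit slightly above the nominal bit widths.

```csharp
using System;

// Hypothetical model shape, chosen only for illustration (not read from any preset).
const int modelLayers = 32;      // transformer layers
const int modelKvHeads = 8;      // K heads per layer (grouped-query attention)
const int modelHeadDim = 128;    // elements per head
const int modelContext = 32768;  // tokens held in the cache

// Effective bits per element, assuming standard ggml block layouts:
// Q8_0 = 34 bytes per 32 elements, Q5_1 = 24 bytes, Q4_0 = 18 bytes.
static double BitsPerElement(string dtype) => dtype switch
{
    "F32" => 32.0,
    "F16" or "BF16" => 16.0,
    "Q8_0" => 8.5,
    "Q5_1" => 6.0,
    "Q4_0" => 4.5,
    _ => throw new ArgumentException($"Unknown dtype: {dtype}")
};

// Bytes needed for the K tensors of all layers at full context.
static double KCacheBytes(int nLayers, int nKvHeads, int dim, int nTokens, string dtype)
{
    double elements = (double)nLayers * nKvHeads * dim * nTokens;
    return elements * BitsPerElement(dtype) / 8.0;
}

foreach (var dtype in new[] { "F32", "F16", "Q8_0", "Q5_1", "Q4_0" })
{
    double mib = KCacheBytes(modelLayers, modelKvHeads, modelHeadDim, modelContext, dtype)
                 / (1024.0 * 1024.0);
    Console.WriteLine($"{dtype,-5} {mib,8:F0} MiB");
}
```

For this shape the K cache holds 2^30 elements, so the estimate comes to 4096 MiB for `F32`, 2048 MiB for `F16`, 1088 MiB for `Q8_0`, 768 MiB for `Q5_1`, and 576 MiB for `Q4_0`. The V cache adds a similar amount, sized by TypeV.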

Rule of thumb: K is more sensitive to precision than V. Prefer to quantize V more aggressively than K.

When to change it

Scenario	Value
Default	`null` (F16)
Long context, mild memory pressure	`F16` (keep K high precision)
Long context, tight memory	`Q8_0`
Extreme memory constraint	`Q5_1`, or `Q4_0` as a last resort (accept quality drop)
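The scenarios above can be turned into an explicit budget check. The following is a rough sketch, not library behavior: `PickTypeK` is a hypothetical helper that returns the most precise candidate whose K cache fits a given byte budget. It assumes the standard ggml block layouts for the effective bits per element, and it looks only at the K cache, so the V cache and model weights must be budgeted separately.

```csharp
#nullable enable
using System;

// Candidate K dtypes, highest precision first, with effective bits per element
// (assumes standard ggml block layouts, including per-block scales).
(string Name, double Bits)[] candidates =
{
    ("F16", 16.0), ("Q8_0", 8.5), ("Q5_1", 6.0), ("Q4_0", 4.5)
};

// Returns the most precise dtype whose K cache fits in budgetBytes, or null if none fits.
// kElements = layers * KV heads * head_dim * context size.
static string? PickTypeK((string Name, double Bits)[] options, double kElements, double budgetBytes)
{
    foreach (var (name, bits) in options)
    {
        if (kElements * bits / 8.0 <= budgetBytes) return name;
    }
    return null;
}

// Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 32768-token context.
double elements = 32.0 * 8 * 128 * 32768;
double gib = 1024.0 * 1024 * 1024;

Console.WriteLine(PickTypeK(candidates, elements, 4 * gib));    // F16
Console.WriteLine(PickTypeK(candidates, elements, 1.5 * gib));  // Q8_0
Console.WriteLine(PickTypeK(candidates, elements, 0.6 * gib));  // Q4_0
Console.WriteLine(PickTypeK(candidates, elements, 0.25 * gib) ?? "none fits");
```

The returned name would then be mapped to the matching `GgmlType` member before assigning `ContextParameters.TypeK`. When nothing fits, reducing ContextSize is usually a better lever than going below `Q4_0`.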

Example

```csharp
using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.TypeK = GgmlType.F16;   // keep K precise
preset.ContextParameters.TypeV = GgmlType.Q8_0;  // quantize V

using var api = AsposeLLMApi.Create(preset);
```

Interactions

TypeV — companion V-cache dtype.
ContextSize — cache cost scales with context; quantization pays off more at long contexts.
FlashAttentionMode — works with quantized KV on recent backends.

What’s next

TypeV — V-cache dtype.
Low memory tuning — practical quantization recipes.
Context parameters hub — all context knobs.
