# ContextSize
`ContextSize` is the maximum number of tokens the loaded model can attend to at once. It caps how long a conversation — system prompt plus all turns plus the response being generated — can be before the engine must trim via `CacheCleanupStrategy`.
## Quick reference

| Property | Value |
|---|---|
| Type | `int?` |
| Default | `null` (use the model’s training maximum from GGUF metadata) |
| Range | 1 to the model’s maximum; typical 4096 – 262144 |
| Category | Context size and batching |
| Field on | `ContextParameters.ContextSize` |
## What it does
The engine reserves KV cache for exactly `ContextSize` tokens at model load time. Every token of the system prompt, chat history, current user message, and generated output must fit within this window. When a new token would overflow it, the active `CacheCleanupStrategy` trims older content.

Accepted values:

- `null` or `0` — use the model’s trained maximum as recorded in GGUF metadata.
- A positive integer up to the trained maximum — reserve exactly that many tokens of KV space.
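A minimal sketch of both modes, reusing the preset API from the Example section below (the choice of `Gemma3Preset` here is illustrative):

```csharp
var preset = new Gemma3Preset();

// null (the default) keeps the trained maximum recorded in GGUF metadata.
preset.ContextParameters.ContextSize = null;

// A positive value reserves exactly that many tokens of KV space at load.
preset.ContextParameters.ContextSize = 4096;

using var api = AsposeLLMApi.Create(preset);
```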
Built-in presets set `ContextSize` to a model-appropriate default. For example:
| Preset | Default ContextSize |
|---|---|
| Gemma3Preset | 8 192 |
| Phi4Preset | 16 384 |
| Qwen25Preset | 32 768 |
| Qwen3Preset | 32 768 |
| Llama32Preset | 131 072 |
| Oss20Preset | 131 072 |
| DeepSeekCoder2Preset | 163 840 |
## Memory cost
KV cache scales linearly with context size. At the default KV dtype (F16), a 7B-parameter model typically claims ~2 MB of KV per 1 024 tokens per attention layer. For a 32-layer model at 32K context, that’s ~2 GB of KV cache in addition to the model weights.
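A quick back-of-the-envelope helper for this arithmetic; the ~2 MB per 1 024 tokens per layer figure is the F16, 7B-class estimate quoted above, not an engine API:

```csharp
// Rough KV-cache estimate from the figures above: ~2 MB per 1 024 tokens
// per attention layer at F16 for a 7B-class model. Illustrative only.
static double EstimateKvCacheGb(int contextSize, int layers,
                                double mbPer1KTokensPerLayer = 2.0)
{
    double totalMb = mbPer1KTokensPerLayer * layers * (contextSize / 1024.0);
    return totalMb / 1024.0;
}

// 32 layers at a 32 768-token context -> ~2 GB, matching the text above.
Console.WriteLine(EstimateKvCacheGb(contextSize: 32768, layers: 32));
```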
To save memory, reduce `ContextSize` or quantize the KV cache via `TypeK` and `TypeV`.
## When to change it
| Scenario | Suggested `ContextSize` |
|---|---|
| Short-form Q&A | 4096 – 8192 |
| Moderate chat | 16384 – 32768 |
| Long documents, extended conversations | 65536 – 131072 |
| Very long context (needs YaRN scaling) | 262144+ |
| Memory-constrained deployment | Smallest that fits your use case |
A larger `ContextSize` always costs memory; raise it only when you actually use the extra window.
## Example

```csharp
var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 8192;    // reduce from default 32768
preset.ContextParameters.TypeV = GgmlType.Q8_0; // further cut V-cache memory
using var api = AsposeLLMApi.Create(preset);
```
## Interactions

- `CacheCleanupStrategy` — when `ContextSize` is exhausted, this strategy trims older content.
- `TypeK`, `TypeV` — the KV cache dtype multiplies the per-token context cost.
- `FlashAttentionMode` — reduces memory at long contexts.
- `RopeScalingType` — needed to push beyond the model’s trained maximum.
- `NBatch` — batch size for prompt processing.
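A hedged sketch combining several of these knobs. `TypeK` is assumed to mirror the `TypeV` assignment from the Example above, and `NBatch` is assumed to be an integer field on `ContextParameters`; verify the exact field locations in your version:

```csharp
var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 16384;   // half the preset default
preset.ContextParameters.TypeK = GgmlType.Q8_0; // assumed: quantize K-cache like TypeV
preset.ContextParameters.TypeV = GgmlType.Q8_0; // quantize V-cache (as in the Example)
preset.ContextParameters.NBatch = 512;          // assumed int: prompt-processing batch size
using var api = AsposeLLMApi.Create(preset);
```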
## What’s next

- `TypeK`, `TypeV` — KV quantization.
- Low memory tuning — shrinking context and KV together.
- Long context tuning — pushing toward 128K+.
- Estimate memory requirements — predict KV cost.