ContextSize

ContextSize is the maximum number of tokens the loaded model can attend to at once. It caps how long a conversation — system prompt plus all turns plus the response being generated — can be before the engine must trim via CacheCleanupStrategy.

Quick reference

Type int?
Default null (use the model’s training maximum from GGUF metadata)
Range 1 to the model’s maximum; typical 4 096–262 144
Category Context size and batching
Field on ContextParameters.ContextSize

What it does

The engine reserves KV cache for exactly ContextSize tokens at model load time. Every token of the system prompt, chat history, current user message, and generated output must fit in this window. When generating a new token would overflow it, the active CacheCleanupStrategy trims older content.

  • null or 0 — use the model’s trained maximum as recorded in GGUF metadata.
  • A positive integer up to the trained maximum — reserve exactly that many tokens of KV space.
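The two modes above can be sketched as follows (a minimal illustration, assuming the Qwen25Preset and ContextParameters types used in the Example section below):

```csharp
var preset = new Qwen25Preset();

// null (or 0): reserve KV space for the model's trained maximum,
// as recorded in its GGUF metadata.
preset.ContextParameters.ContextSize = null;

// A positive value: reserve exactly that many tokens of KV space.
// Must not exceed the trained maximum.
preset.ContextParameters.ContextSize = 8192;
```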

Built-in presets set ContextSize to a model-appropriate default. For example:

Preset Default ContextSize
Gemma3Preset 8 192
Phi4Preset 16 384
Qwen25Preset 32 768
Qwen3Preset 32 768
Llama32Preset 131 072
Oss20Preset 131 072
DeepSeekCoder2Preset 163 840

Memory cost

KV cache scales linearly with context size. At the default KV dtype (F16), a 7B-parameter model typically claims ~2 MB of KV per 1 024 tokens per attention layer. For a 32-layer model at 32K context, that’s ~2 GB of KV cache in addition to the model weights.
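The arithmetic behind that estimate can be checked with a small stand-alone calculation. The ~2 MB per 1 024 tokens per attention layer figure is the rule of thumb stated above, not an exact measurement; `EstimateKvMegabytes` is a hypothetical helper, not part of the API:

```csharp
using System;

class KvCacheEstimate
{
    // Rule of thumb: ~2 MB of F16 KV cache per 1 024 tokens per layer.
    static double EstimateKvMegabytes(int layers, int contextTokens) =>
        2.0 * layers * (contextTokens / 1024.0);

    static void Main()
    {
        // A 32-layer model at 32K (32 768-token) context:
        double mb = EstimateKvMegabytes(32, 32 * 1024);
        Console.WriteLine($"{mb} MB (~{mb / 1024} GB)"); // 2048 MB (~2 GB)
    }
}
```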

To save memory, reduce ContextSize or quantize the KV cache via TypeK and TypeV.
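Both levers can be combined; a sketch, assuming TypeK accepts GgmlType values the same way TypeV does in the Example section below:

```csharp
var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 16384;   // halve the 32 768 default
preset.ContextParameters.TypeK = GgmlType.Q8_0; // quantize K-cache
preset.ContextParameters.TypeV = GgmlType.Q8_0; // quantize V-cache
```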

When to change it

Scenario Value
Short-form Q&A 4 096–8 192
Moderate chat 16 384–32 768
Long documents, extended conversations 65 536–131 072
Very long context (needs YaRN scaling) 262 144+
Memory-constrained deployment Smallest that fits your use case

A larger ContextSize always costs memory; raise it only when you actually use the extra window.

Example

var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 8192;  // reduce from default 32768
preset.ContextParameters.TypeV = GgmlType.Q8_0; // further cut V-cache memory

using var api = AsposeLLMApi.Create(preset);

Interactions

What’s next