# ContextSize
`ContextSize` is the maximum number of tokens the loaded model can attend to at once. It caps how long a conversation — system prompt plus all turns plus the response being generated — can be before the engine must trim via `CacheCleanupStrategy`.
## Quick reference

| Property | Value |
|---|---|
| Type | `int?` |
| Default | `null` (use the model’s training maximum from GGUF metadata) |
| Range | 1 to the model’s maximum; typical 4096 – 262144 |
| Category | Context size and batching |
| Field on | `ContextParameters.ContextSize` |
## What it does
The engine reserves KV cache for exactly `ContextSize` tokens at model load time. Every token of the system prompt, chat history, current user message, and generated output must fit within this window. When a new token would overflow it, the active `CacheCleanupStrategy` trims older content.

Accepted values:

- `null` or `0` — use the model’s trained maximum as recorded in GGUF metadata.
- A positive integer up to the trained maximum — reserve exactly that many tokens of KV space.
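A minimal sketch of both modes, reusing the preset API from the Example section below (the choice of `Gemma3Preset` here is illustrative):

```csharp
var preset = new Gemma3Preset();

// null (the default) keeps the trained maximum recorded in GGUF metadata.
preset.ContextParameters.ContextSize = null;

// A positive value reserves exactly that many tokens of KV space at load.
preset.ContextParameters.ContextSize = 4096;

using var api = AsposeLLMApi.Create(preset);
```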
Built-in presets set `ContextSize` to a model-appropriate default. For example:
| Preset | Default ContextSize |
|---|---|
| Gemma3Preset | 8 192 |
| Phi4Preset | 16 384 |
| Qwen25Preset | 32 768 |
| Qwen3Preset | 32 768 |
| Llama32Preset | 131 072 |
| Oss20Preset | 131 072 |
| DeepSeekCoder2Preset | 163 840 |
## Memory cost
KV cache scales linearly with context size. At the default KV dtype (F16), a 7B-parameter model typically claims ~2 MB of KV per 1 024 tokens per attention layer. For a 32-layer model at 32K context, that’s ~2 GB of KV cache in addition to the model weights.
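A quick back-of-the-envelope helper for this arithmetic; the ~2 MB per 1 024 tokens per layer figure is the F16, 7B-class estimate quoted above, not an engine API:

```csharp
// Rough KV-cache estimate from the figures above: ~2 MB per 1 024 tokens
// per attention layer at F16 for a 7B-class model. Illustrative only.
static double EstimateKvCacheGb(int contextSize, int layers,
                                double mbPer1KTokensPerLayer = 2.0)
{
    double totalMb = mbPer1KTokensPerLayer * layers * (contextSize / 1024.0);
    return totalMb / 1024.0;
}

// 32 layers at a 32 768-token context -> ~2 GB, matching the text above.
Console.WriteLine(EstimateKvCacheGb(contextSize: 32768, layers: 32));
```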
To save memory, reduce `ContextSize` or quantize the KV cache via `TypeK` and `TypeV`.
## When to change it
| Scenario | Suggested `ContextSize` |
|---|---|
| Short-form Q&A | 4096 – 8192 |
| Moderate chat | 16384 – 32768 |
| Long documents, extended conversations | 65536 – 131072 |
| Very long context (needs YaRN scaling) | 262144+ |
| Memory-constrained deployment | Smallest that fits your use case |
A larger `ContextSize` always costs memory; raise it only when you actually use the extra window.
## Example

```csharp
var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 8192;    // reduce from default 32768
preset.ContextParameters.TypeV = GgmlType.Q8_0; // further cut V-cache memory
using var api = AsposeLLMApi.Create(preset);
```
## Interactions

- `CacheCleanupStrategy` — when `ContextSize` is exhausted, this strategy trims older content.
- `TypeK`, `TypeV` — the KV cache dtype multiplies the per-token context cost.
- `FlashAttentionMode` — reduces memory at long contexts.
- `RopeScalingType` — needed to push beyond the model’s trained maximum.
- `NBatch` — batch size for prompt processing.
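A hedged sketch combining several of these knobs. `TypeK` is assumed to mirror the `TypeV` assignment from the Example above, and `NBatch` is assumed to be an integer field on `ContextParameters`; verify the exact field locations in your version:

```csharp
var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 16384;   // half the preset default
preset.ContextParameters.TypeK = GgmlType.Q8_0; // assumed: quantize K-cache like TypeV
preset.ContextParameters.TypeV = GgmlType.Q8_0; // quantize V-cache (as in the Example)
preset.ContextParameters.NBatch = 512;          // assumed int: prompt-processing batch size
using var api = AsposeLLMApi.Create(preset);
```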
## What’s next

- `TypeK`, `TypeV` — KV quantization.
- Low memory tuning — shrinking context and KV together.
- Long context tuning — pushing toward 128K+.
- Estimate memory requirements — predict KV cost.