Documentation – Context parameters

Net: ContextSize

Thu, 23 Apr 2026 00:00:00 +0000

ContextSize is the maximum number of tokens the loaded model can attend to at once. It caps how long a conversation — system prompt plus all turns plus the response being generated — can be before the engine must trim via CacheCleanupStrategy.

Quick reference


Type	`int?`
Default	`null` (use the model’s training maximum from GGUF metadata)
Range	`1` to the model’s maximum (`0` is treated like `null`); typical `4096` – `262144`
Category	Context size and batching
Field on	`ContextParameters.ContextSize`

What it does

The engine reserves KV cache for exactly ContextSize tokens at model load time. Every token of the system prompt, chat history, current user message, and generated output must fit in this window. When a new token would overflow it, the active CacheCleanupStrategy trims older content.

null or 0 — use the model’s trained maximum as recorded in GGUF metadata.
A positive integer up to the trained maximum — reserve exactly that many tokens of KV space.

Built-in presets set ContextSize to a model-appropriate default. For example:

Preset	Default `ContextSize`
`Gemma3Preset`	8 192
`Phi4Preset`	16 384
`Qwen25Preset`	32 768
`Qwen3Preset`	32 768
`Llama32Preset`	131 072
`Oss20Preset`	131 072
`DeepSeekCoder2Preset`	163 840

Memory cost

KV cache scales linearly with context size. At the default KV dtype (F16), a 7B-parameter model typically needs about 2 MB of KV cache per 1 024 tokens per attention layer. For a 32-layer model at 32K context, that is 32 layers × 32 blocks of 1 024 tokens × 2 MB ≈ 2 GB of KV cache in addition to the model weights.

To save memory, reduce ContextSize or quantize the KV cache via TypeK and TypeV.
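
The arithmetic above can be reproduced with a short, self-contained C# sketch. The helper name EstimateKvCacheBytes and the 2 048 bytes-per-token-per-layer figure (derived from the ~2 MB per 1 024 tokens estimate) are illustrative assumptions, not part of the SDK; real models differ with head count and KV dtype.

```csharp
using System;

// Hypothetical helper, not an SDK API: rough KV cache size in bytes.
// kvBytesPerTokenPerLayer covers K and V together for one token in one layer.
static long EstimateKvCacheBytes(int contextSize, int layers, long kvBytesPerTokenPerLayer)
    => (long)contextSize * layers * kvBytesPerTokenPerLayer;

// ~2 MB per 1 024 tokens per layer at F16 is 2 048 bytes per token per layer.
const long BytesPerTokenPerLayer = 2048;

long at32K = EstimateKvCacheBytes(32768, 32, BytesPerTokenPerLayer);
long at8K = EstimateKvCacheBytes(8192, 32, BytesPerTokenPerLayer);

Console.WriteLine($"32K context: {at32K / (1024 * 1024)} MiB"); // 2048 MiB
Console.WriteLine($" 8K context: {at8K / (1024 * 1024)} MiB");  // 512 MiB
```

Under these assumptions, dropping from 32K to 8K cuts the KV cache to a quarter, which is why reducing ContextSize is the first lever for memory-constrained deployments.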

When to change it

Scenario	Value
Short-form Q&A	`4096` – `8192`
Moderate chat	`16384` – `32768`
Long documents, extended conversations	`65536` – `131072`
Very long context (needs YaRN scaling)	`262144+`
Memory-constrained deployment	Smallest that fits your use case

A larger ContextSize always costs memory; raise it only when you actually use the extra window.

Example

var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 8192;  // reduce from default 32768
preset.ContextParameters.TypeV = GgmlType.Q8_0; // further cut V-cache memory

using var api = AsposeLLMApi.Create(preset);

Interactions

CacheCleanupStrategy — when ContextSize is exhausted, this strategy trims.
TypeK, TypeV — KV cache dtype multiplies context cost.
FlashAttentionMode — reduces memory at long contexts.
RopeScalingType — needed to push beyond the model’s trained maximum.
NBatch — batch size for prompt processing.

What’s next

TypeK, TypeV — KV quantization.
Low memory tuning — shrinking context and KV together.
Long context tuning — pushing toward 128K+.
Estimate memory requirements — predict KV cost.

Net: NBatch

NBatch is the logical maximum batch size — the upper bound on the number of tokens submitted in one call to the native llama_decode function. Larger batch sizes speed up prompt processing at the cost of more temporary memory.

Quick reference


Type	`uint?`
Default	`null` (native default, typically 2048)
Range	`512` – `8192` typical; power-of-two values recommended
Category	Context size and batching
Field on	`ContextParameters.NBatch`

What it does

When the engine processes a prompt (system message + conversation history + new user turn), it feeds tokens to the model in batches. NBatch caps the largest batch sent in one call.

Smaller NBatch (512) — lower memory footprint, slower prompt processing.
Larger NBatch (4096, 8192) — faster prompt processing, more temporary memory.

NBatch affects prompt processing time, not generation throughput. Once the first output token is produced, subsequent tokens come one at a time regardless of batch size.

When to change it

Scenario	Value
Default	`null` (use native default)
Fast prompt processing, ample memory	`4096`
Memory-constrained	`512` or `1024`
Very long prompts (summarization, long context)	`4096` – `8192`

Built-in presets set NBatch based on the model’s needs — Qwen25Preset uses 3072, Llama32Preset uses 2048, vision presets often use 4096.

Example

var preset = new Qwen25Preset();
preset.ContextParameters.NBatch = 4096;  // faster prompt processing
preset.ContextParameters.NUbatch = 4096;

using var api = AsposeLLMApi.Create(preset);

Interactions

NUbatch — physical batch size; typically set equal to or less than NBatch.
ContextSize — NBatch should not exceed ContextSize.
NThreadsBatch — threads that process the batch.

What’s next

NUbatch — physical batch size.
NThreadsBatch — prompt-processing threads.
Reduce first-token latency — batch size’s role in TTFT.

Net: NUbatch

NUbatch is the physical maximum batch size — the largest chunk actually processed in a single kernel call. Normally set equal to or smaller than NBatch.

Quick reference


Type	`uint?`
Default	`null` (native default, typically equal to `NBatch`)
Range	`≤ NBatch`
Category	Context size and batching
Field on	`ContextParameters.NUbatch`

What it does

NBatch defines the logical batch — the largest number of tokens submitted at once. NUbatch defines the largest chunk the engine processes in a single kernel invocation. When NUbatch < NBatch, the engine splits one logical batch into multiple kernel calls.

NUbatch = NBatch (simplest case) — one logical batch = one kernel call.
NUbatch < NBatch — one logical batch dispatched as several smaller kernel invocations.

The split matters mainly in specific multi-sequence scenarios where sequential processing of sub-batches is required. For single-sequence chat, NUbatch = NBatch is typical.
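
The relationship between the two batch sizes comes down to two ceiling divisions. The sketch below is illustrative only: the helper names are hypothetical, and the engine performs this splitting internally.

```csharp
using System;

// Hypothetical helpers for illustration; not SDK APIs.
static int CeilDiv(int a, int b) => (a + b - 1) / b;

// Logical batches needed to feed a prompt of promptTokens tokens.
static int LogicalBatches(int promptTokens, int nBatch) => CeilDiv(promptTokens, nBatch);

// Kernel-level chunks that one logical batch of batchTokens is split into.
static int PhysicalChunks(int batchTokens, int nUbatch) => CeilDiv(batchTokens, nUbatch);

Console.WriteLine(LogicalBatches(10000, 4096)); // 3 (4096 + 4096 + 1808 tokens)
Console.WriteLine(PhysicalChunks(4096, 4096));  // 1 when NUbatch = NBatch
Console.WriteLine(PhysicalChunks(4096, 512));   // 8 when NUbatch < NBatch
```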

When to change it

Scenario	Value
Default — match `NBatch`	same as `NBatch`
Advanced multi-sequence workflows	Smaller than `NBatch`

Most deployments set NUbatch = NBatch and never touch this field.

Example

var preset = new Qwen25Preset();
preset.ContextParameters.NBatch = 4096;
preset.ContextParameters.NUbatch = 4096;  // match the logical batch

using var api = AsposeLLMApi.Create(preset);

Interactions

NBatch — upper bound; NUbatch ≤ NBatch.
NSeqMax — parallel sequence cap, related in multi-sequence scenarios.

What’s next

NBatch — logical batch cap.
NSeqMax — parallel sequences.
Context parameters hub — all context knobs.

Net: NSeqMax

NSeqMax is the maximum number of distinct sequences the engine handles in parallel, each with its own state. It matters for recurrent or state-tracking models. Standard transformer chat presets do not require tuning it.

Quick reference


Type	`uint?`
Default	`null` (native default, typically 1 for transformer models)
Range	`1` and above; power-of-two values recommended for advanced scenarios
Category	Context size and batching
Field on	`ContextParameters.NSeqMax`

What it does

For recurrent or state-space models (Mamba, RWKV, hybrid architectures), each independent sequence carries its own recurrent state. NSeqMax caps how many such states the engine maintains simultaneously.

NSeqMax = 1 (default for standard transformers) — no parallel state tracking needed.
NSeqMax = 4+ — enables parallel recurrent-model sequences.

Transformer models (Qwen, Llama, Gemma, Phi, etc.) do not maintain per-sequence hidden state in this sense. NSeqMax = 1 is correct for them.

When to change it

Scenario	Value
Standard transformer chat presets	Leave `null` or `1`
Recurrent / state-space model	Set to the number of parallel sequences you serve

If you are not building against a recurrent-model-specific preset, leave NSeqMax at the default.

Example

// Standard transformer use case — no change needed.
var preset = new Qwen25Preset();
// preset.ContextParameters.NSeqMax = null; // (default)

using var api = AsposeLLMApi.Create(preset);

Interactions

NBatch, NUbatch — batch sizes interact with sequence count in multi-sequence scenarios.

What’s next

Context parameters hub — all context knobs.
NBatch — batch size for prompts.

Net: NThreads

NThreads is the number of CPU threads the engine uses during generation, that is, while producing output tokens one at a time. Generation is memory-bandwidth-bound and often does not benefit from using all available cores.

Quick reference


Type	`int?`
Default	`null` (falls back to `EngineParameters.DefaultThreads`)
Range	`1` and above
Category	Threading
Field on	`ContextParameters.NThreads`

What it does

During the generation phase (token-by-token decode), the engine distributes matrix multiplications across NThreads CPU threads. When null, it uses EngineParameters.DefaultThreads, which defaults to ProcessorCount - 1.

NThreads = 4 — decent for 4-core machines; use most cores.
NThreads = 8 — common sweet spot on mainstream desktop CPUs.
NThreads = 16+ — diminishing returns; sometimes slower due to cache contention and memory-bandwidth saturation.

Unlike prompt processing (which scales well with more threads), generation often peaks at 8-12 threads and degrades with more. Benchmark on your hardware.

When to change it

Scenario	Value
Default	`null` (use `DefaultThreads`)
Laptop / 4-8 core	`4` – `6`
Mainstream desktop	`8` – `10`
High-core server (but avoid over-allocation)	`10` – `16`
Competing with other CPU workloads	Cap explicitly to half `ProcessorCount`

Set NThreads and NThreadsBatch separately — generation and prompt processing have different optima.
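
As a starting point before benchmarking, the guidance in the tables can be folded into a small heuristic. SuggestThreadCounts is a hypothetical helper, not an SDK API, and the cap of 8 generation threads is only the "common sweet spot" mentioned above, not a measured optimum.

```csharp
using System;

// Hypothetical starting-point heuristic; benchmark before relying on it.
static (int Generation, int Batch) SuggestThreadCounts(int processorCount, bool sharedMachine)
{
    if (processorCount < 1)
        throw new ArgumentOutOfRangeException(nameof(processorCount));

    // Prompt processing: all cores on a dedicated machine, half on a shared one.
    int batch = sharedMachine ? Math.Max(1, processorCount / 2) : processorCount;

    // Generation: start at 8 threads, or fewer when fewer are available.
    int generation = Math.Min(batch, 8);

    return (generation, batch);
}

var (generationThreads, batchThreads) =
    SuggestThreadCounts(Environment.ProcessorCount, sharedMachine: false);
Console.WriteLine($"NThreads = {generationThreads}, NThreadsBatch = {batchThreads}");
```

The two returned values map to ContextParameters.NThreads and ContextParameters.NThreadsBatch respectively.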

Example

var preset = new Qwen25Preset();
preset.ContextParameters.NThreads = 8;
preset.ContextParameters.NThreadsBatch = 16;  // more threads for prompt processing

using var api = AsposeLLMApi.Create(preset);

Interactions

EngineParameters.DefaultThreads — fallback when NThreads is null.
NThreadsBatch — prompt-processing threads.
CPU acceleration — NThreads has no effect when GPU offload is active for every layer.

What’s next

NThreadsBatch — prompt-processing variant.
CPU acceleration — how threading interacts with AVX variants.
Performance issues — thread-related throughput issues.

Net: NThreadsBatch

NThreadsBatch is the number of CPU threads the engine uses during prompt processing (the initial prefill phase). Prompt processing is embarrassingly parallel and benefits from using most or all available cores.

Quick reference


Type	`int?`
Default	`null` (falls back to `EngineParameters.DefaultThreads`)
Range	`1` and above
Category	Threading
Field on	`ContextParameters.NThreadsBatch`

What it does

During prompt processing, the engine runs matrix multiplications over many tokens at once. This workload parallelizes well: more threads directly translate to higher throughput, up to memory-bandwidth limits.

NThreadsBatch = ProcessorCount — typical. Use all cores for fast prompt ingestion.
NThreadsBatch = half ProcessorCount — leave room for other CPU workloads.
NThreadsBatch < NThreads — unusual, almost always wrong for modern CPUs.

Prompt processing happens once per incoming user message. Generation (NThreads) happens once per output token. For chat, prompt-processing time dominates when the prompt is long; generation time dominates when the output is long.

When to change it

Scenario	Value
Default	`null` (use `DefaultThreads`)
Dedicated inference machine	`ProcessorCount` (all cores)
Shared machine	`ProcessorCount / 2`
Very long prompts, memory-bound hardware	Benchmark — adding threads may not help past 16

Example

var preset = new Qwen25Preset();
preset.ContextParameters.NThreads = 8;                             // generation
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount; // prompt prefill

using var api = AsposeLLMApi.Create(preset);

Interactions

NThreads — generation threads; typically different from NThreadsBatch.
NBatch — larger batch sizes better utilize high NThreadsBatch.
EngineParameters.DefaultThreads — fallback when null.

What’s next

NThreads — generation-phase threads.
NBatch — batch size.
Reduce first-token latency — prompt-processing throughput’s role in TTFT.

Net: RopeScalingType

RopeScalingType selects the algorithm used to scale RoPE (Rotary Position Embedding) when the effective context size exceeds the model’s training window. Different algorithms produce different quality trade-offs at long contexts.

Quick reference


Type	`RopeScalingType?` enum
Default	`null` (use model default)
Values	`Unspecified`, `None`, `Linear`, `Yarn`, `LongRope`
Category	Position encoding
Field on	`ContextParameters.RopeScalingType`

What it does

Transformer models use RoPE to encode token positions. During training, a model only sees the rotation angles produced by positions up to its maximum training context. To go beyond that trained maximum, the position encoding must be scaled.

Value	Behavior
`Unspecified` (`-1`)	Use whatever the model’s GGUF metadata specifies.
`None` (`0`)	No scaling; use raw RoPE. Only valid within the trained context.
`Linear` (`1`)	Linear interpolation of positions. Simple, moderate quality loss.
`Yarn` (`2`)	YaRN (Yet another RoPE extensioN) — higher quality at long contexts.
`LongRope` (`3`)	LongRoPE algorithm for very extended contexts.

Most built-in presets leave this as Unspecified — the model’s metadata declares its own preferred scaling. Override only when you push the model past its declared maximum.

When to change it

Scenario	Value
Default	`Unspecified` (model’s metadata wins)
Run a model within its native context window	`None` or `Unspecified`
Extend context 2-4× training window, simple scaling acceptable	`Linear`
Extend context with better quality	`Yarn`
Push toward 1M+ contexts	`LongRope` (if model supports it)

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Llama32Preset();
preset.ContextParameters.ContextSize = 131072;
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 8192;  // the model's original training context

using var api = AsposeLLMApi.Create(preset);

Interactions

RopeFreqBase, RopeFreqScale — apply on top of the chosen scaling.
YarnExtFactor, YarnAttnFactor, YarnBetaFast, YarnBetaSlow, YarnOrigCtx — only used when RopeScalingType = Yarn.
ContextSize — larger than the model’s training window requires RoPE scaling.

What’s next

YarnOrigCtx — the model’s native context length.
Long context tuning — practical recipes.
Context parameters hub — all context knobs.

Net: RopeFreqBase

RopeFreqBase is the base frequency (often denoted θ, theta) used in RoPE’s positional encoding. Overriding it changes how position indices map to rotation angles. Most users leave this at the model default.

Quick reference


Type	`float?`
Default	`null` (use model metadata; equivalent to the value from the GGUF file)
Range	Typical `10000.0` – `10000000.0`
Category	Position encoding
Field on	`ContextParameters.RopeFreqBase`

What it does

In RoPE, each attention head’s rotation frequencies are freq_base^(-2i/d) for i = 0, 1, …, d/2 − 1, where d is the head dimension. The conventional freq_base = 10000.0 suits shorter training contexts. Models trained for very long contexts often use much larger bases; for example, RopeFreqBase = 1_000_000 for some 128K-trained models.
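
To make the effect of the base concrete, here is a minimal sketch of the standard RoPE frequency formula, in which pair i of a head with dimension d rotates at freq_base^(-2i/d). The helper name and the head dimension of 128 are assumptions for illustration; the engine computes these values natively.

```csharp
using System;

// Illustrative only: per-pair rotation frequencies for one attention head.
static double[] RopeFrequencies(double freqBase, int headDim)
{
    var freqs = new double[headDim / 2];
    for (int i = 0; i < freqs.Length; i++)
        freqs[i] = Math.Pow(freqBase, -2.0 * i / headDim);
    return freqs;
}

double[] small = RopeFrequencies(10_000.0, 128);
double[] large = RopeFrequencies(1_000_000.0, 128);

// Wavelength, in tokens, of the slowest-rotating pair: 2π / frequency.
Console.WriteLine($"base 1e4: {2 * Math.PI / small[^1]:E2} tokens");
Console.WriteLine($"base 1e6: {2 * Math.PI / large[^1]:E2} tokens");
```

A larger base lowers the slowest frequencies, so the longest wavelength covers many more positions before it wraps. That is the property long-context models rely on.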

null (default) — use the model’s trained value from GGUF metadata.
A positive float — override. Use only when you know the correct value for your target context.

Built-in presets leave this as null — the metadata is usually correct.

When to change it

Scenario	Value
Default (recommended)	`null`
Running a model in a non-standard context regime	Set per upstream model card
Extending a short-context model with hand-tuned RoPE	Non-trivial — refer to papers

Changing RopeFreqBase without understanding the effect usually degrades quality. Prefer RopeScalingType approaches for context extension.

Example

var preset = new Qwen25Preset();
preset.ContextParameters.RopeFreqBase = 1_000_000f;  // only if the model documents this

Interactions

RopeScalingType — overall scaling algorithm.
RopeFreqScale — multiplicative scaler applied on top.

What’s next

RopeScalingType — algorithm selector.
RopeFreqScale — scale factor.
Context parameters hub — all context knobs.

Net: RopeFreqScale

RopeFreqScale is a multiplicative scaling factor applied to RoPE frequencies. It implements simple linear scaling of positions: the same mechanism that RopeScalingType = Linear uses, with this value as the factor.

Quick reference


Type	`float?`
Default	`null` (use model default)
Range	`0` – `1.0`; `< 1.0` stretches the context window
Category	Position encoding
Field on	`ContextParameters.RopeFreqScale`

What it does

RoPE frequencies are scaled by RopeFreqScale. A scale of 1.0 is no scaling. Smaller values stretch the effective context:

RopeFreqScale = 1.0 — no scaling.
RopeFreqScale = 0.5 — effective context doubled (2× training window). Moderate quality loss.
RopeFreqScale = 0.25 — 4× training window. More quality loss.

This is the simplest context-extension approach. More sophisticated algorithms (Yarn, LongRope) produce better quality at the same effective extension.
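
The values listed above follow from dividing the training window by the target window. A minimal sketch, assuming a hypothetical helper named LinearRopeFreqScale (not an SDK API):

```csharp
using System;

// Hypothetical helper: linear RoPE scale for stretching trainedContext to targetContext.
static float LinearRopeFreqScale(int trainedContext, int targetContext)
{
    if (trainedContext <= 0) throw new ArgumentOutOfRangeException(nameof(trainedContext));
    if (targetContext <= 0) throw new ArgumentOutOfRangeException(nameof(targetContext));

    // Inside the trained window no stretching is needed.
    if (targetContext <= trainedContext) return 1.0f;

    return (float)trainedContext / targetContext;
}

float twoX = LinearRopeFreqScale(32768, 65536);   // 0.5f: 2x the training window
float fourX = LinearRopeFreqScale(32768, 131072); // 0.25f: 4x the training window
float none = LinearRopeFreqScale(32768, 8192);    // 1.0f: already fits

Console.WriteLine($"{twoX} {fourX} {none}");
```

The result is the value that would be assigned to ContextParameters.RopeFreqScale together with RopeScalingType = Linear.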

When to change it

Scenario	Value
Default (use model metadata)	`null`
Simple 2× extension	`0.5` with `RopeScalingType = Linear`
Prefer better algorithms	Use `RopeScalingType = Yarn` instead

Modern built-in presets target models that already ship with proper scaling metadata. Override only when adapting a model without adequate metadata.

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.RopeScalingType = RopeScalingType.Linear;
preset.ContextParameters.RopeFreqScale = 0.5f;  // 2x linear extension

Interactions

RopeScalingType — Linear uses this scale; Yarn/LongRope have their own knobs.
RopeFreqBase — base frequency.
ContextSize — the target extended context size.

What’s next

RopeScalingType — algorithm selector.
YarnOrigCtx — better long-context extension via YaRN.
Long context tuning — practical recipes.

Net: YarnExtFactor

YarnExtFactor is the YaRN extrapolation mix factor. It sets how strongly YaRN mixes extrapolated (unscaled) RoPE positions with interpolated ones. Relevant only when RopeScalingType is Yarn.

Quick reference


Type	`float?`
Default	`null` (same as a negative value: use the model default)
Range	`0.0` – `1.0`; negative means “from model”
Category	YaRN position encoding
Field on	`ContextParameters.YarnExtFactor`

What it does

YaRN combines position interpolation and extrapolation. YarnExtFactor controls the mix between the two. The model’s GGUF metadata usually sets this correctly for the intended scaling factor; overriding is rarely useful.

null or negative — use the model’s value.
0.0 — pure interpolation.
1.0 — pure extrapolation.
Intermediate — blend.

When to change it

Scenario	Value
Default (recommended)	`null`
Experimental YaRN tuning	Per upstream YaRN paper recipe

Example

var preset = new Llama32Preset();
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
// preset.ContextParameters.YarnExtFactor = null; // default — use model's value

Interactions

RopeScalingType — must be Yarn.
YarnAttnFactor, YarnBetaFast, YarnBetaSlow, YarnOrigCtx — other YaRN knobs.

What’s next

YarnOrigCtx — the most commonly set YaRN knob.
RopeScalingType — enables YaRN.
Context parameters hub — all context knobs.

Net: YarnAttnFactor

YarnAttnFactor scales attention logit magnitudes as part of the YaRN algorithm. It compensates for the attention-softmax becoming too flat at extreme positions. Relevant only when RopeScalingType is Yarn.

Quick reference


Type	`float?`
Default	`null` (use model default)
Range	Typical `1.0` – `1.5`
Category	YaRN position encoding
Field on	`ContextParameters.YarnAttnFactor`

What it does

As the context is stretched beyond the training window, YaRN rescales attention magnitudes to compensate. YarnAttnFactor controls this rescaling. The YaRN paper recommends a value of about 0.1 × ln(scale) + 1.0, where scale is the context extension factor; the model’s metadata usually carries the correct value.

null — use model default (recommended).
Specific float — override.
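
For reference, the paper formula can be evaluated as follows. YarnAttentionScale is a hypothetical helper, and the logarithm is the natural logarithm. This page does not state whether the engine already applies the formula internally, so treat the result as a reference number rather than a value to paste into YarnAttnFactor.

```csharp
using System;

// Hypothetical helper: 0.1 * ln(scale) + 1.0, with scale = target / original context.
static double YarnAttentionScale(int originalContext, int targetContext)
{
    if (originalContext <= 0) throw new ArgumentOutOfRangeException(nameof(originalContext));
    if (targetContext <= 0) throw new ArgumentOutOfRangeException(nameof(targetContext));

    double scale = (double)targetContext / originalContext;
    return scale <= 1.0 ? 1.0 : 0.1 * Math.Log(scale) + 1.0;
}

Console.WriteLine(YarnAttentionScale(8192, 32768));  // scale 4, roughly 1.14
Console.WriteLine(YarnAttentionScale(8192, 131072)); // scale 16, roughly 1.28
```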

When to change it

Scenario	Value
Default	`null`
Research / YaRN tuning	Per YaRN paper formula

Rarely touched in production.

Example

var preset = new Llama32Preset();
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
// preset.ContextParameters.YarnAttnFactor = null; // default — from model

Interactions

RopeScalingType — must be Yarn.
Other YaRN knobs operate together.

What’s next

YarnOrigCtx — the primary YaRN field you might touch.
Context parameters hub — all context knobs.

Net: YarnBetaFast

YarnBetaFast is the “fast” boundary of YaRN’s correction range: RoPE dimensions that complete more than this many full rotations over the original context are treated as high-frequency and left uncorrected. Relevant only when RopeScalingType is Yarn.

Quick reference


Type	`float?`
Default	`null` (use model default)
Range	Typical `30` – `50` (rotations over the original context, not a dimension index)
Category	YaRN position encoding
Field on	`ContextParameters.YarnBetaFast`

What it does

YaRN applies different treatment to different RoPE dimensions based on their frequency (wavelength). Dimensions that rotate more than YarnBetaFast times within the original context keep raw extrapolation (no correction). Dimensions that rotate fewer than YarnBetaSlow times are fully interpolated. In between, YaRN blends the two regimes.

null — use model default (typical value in YaRN papers is 32).
Specific float — override.

Rarely touched. The model’s metadata carries sensible values.

When to change it

Scenario	Value
Default	`null`
Research / YaRN tuning	Per paper recipe

Example

// Default — do not override.
var preset = new Llama32Preset();
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;

Interactions

RopeScalingType — must be Yarn.
YarnBetaSlow — the other boundary of the blend range (low-frequency end).

What’s next

YarnBetaSlow — companion boundary at the low-frequency end.
Context parameters hub — all context knobs.

Net: YarnBetaSlow

YarnBetaSlow is the “slow” boundary of YaRN’s correction range: RoPE dimensions that complete fewer than this many full rotations over the original context are treated as low-frequency and fully interpolated. Relevant only when RopeScalingType is Yarn.

Quick reference


Type	`float?`
Default	`null` (use model default)
Range	Typical `1`
Category	YaRN position encoding
Field on	`ContextParameters.YarnBetaSlow`

What it does

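Pairs with YarnBetaFast to define the transition window between extrapolation and interpolation inside YaRN. Both values count the full rotations a RoPE dimension completes over the original context:

More than YarnBetaFast rotations: pure extrapolation.
Between YarnBetaSlow and YarnBetaFast rotations: blend.
Fewer than YarnBetaSlow rotations: pure interpolation.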

Typical YaRN recipe values: YarnBetaFast = 32, YarnBetaSlow = 1.
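
In the YaRN paper, the two thresholds are compared against the number of full rotations a RoPE dimension pair completes over the original context. The sketch below follows that paper-style ramp; the function name, head dimension, and base are illustrative assumptions, and the native engine may compute the ramp differently in detail.

```csharp
using System;

// Blend weight for RoPE pair pairIndex: 1 = pure extrapolation, 0 = pure interpolation.
// Paper-style ramp; assumes betaFast > betaSlow.
static double YarnExtrapolationWeight(int pairIndex, int headDim, double freqBase,
                                      int origCtx, double betaFast, double betaSlow)
{
    double freq = Math.Pow(freqBase, -2.0 * pairIndex / headDim);
    double wavelength = 2 * Math.PI / freq;

    // Full turns this pair completes across the original training window.
    double rotations = origCtx / wavelength;

    double t = (rotations - betaSlow) / (betaFast - betaSlow);
    return Math.Clamp(t, 0.0, 1.0);
}

// Head dimension 128, base 10000, original context 8192, typical betas 32 and 1.
foreach (int pair in new[] { 0, 40, 63 })
{
    double w = YarnExtrapolationWeight(pair, 128, 10_000.0, 8192, 32.0, 1.0);
    Console.WriteLine($"pair {pair}: weight {w:F3}");
}
```

With these inputs the fastest pair (index 0) is fully extrapolated, the slowest pair (index 63) is fully interpolated, and pair 40 falls inside the blend range.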

When to change it

Scenario	Value
Default (recommended)	`null`
Research / YaRN tuning	Per paper recipe

Example

var preset = new Llama32Preset();
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
// preset.ContextParameters.YarnBetaSlow = null; // default

Interactions

YarnBetaFast — the other boundary of the blend range (high-frequency end).
RopeScalingType — must be Yarn.

What’s next

YarnBetaFast — companion boundary at the high-frequency end.
Context parameters hub — all context knobs.

Net: YarnOrigCtx

YarnOrigCtx tells YaRN the model’s original trained context length. The scaling factor is derived from the ratio between the target ContextSize and YarnOrigCtx. This is the most commonly set YaRN knob.

Quick reference


Type	`uint?`
Default	`null` (use model default from GGUF metadata)
Range	Positive integer; the model’s trained max context
Category	YaRN position encoding
Field on	`ContextParameters.YarnOrigCtx`

What it does

YaRN extends context by a factor ContextSize / YarnOrigCtx. If the model was trained at 8K and you target 32K, the scale factor is 32768 / 8192 = 4. YaRN’s quality depends on knowing both values accurately.

null (default) — YaRN reads the value from the GGUF metadata. This is what most presets use.
Specific integer — override. Set to the model’s actual training context length when the metadata is missing or wrong.

When to change it

Scenario	Value
Default (GGUF has correct metadata)	`null`
Custom GGUF without trained-context metadata	Set to the model’s native context
Running an older model at extended context	Set to the original training window

For a Qwen 2.5 7B model (trained at 32K) targeting 128K:

preset.ContextParameters.ContextSize = 131072;
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 32768;  // Qwen2.5 training context

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.ContextSize = 65536;
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 32768;

using var api = AsposeLLMApi.Create(preset);

Interactions

ContextSize — target; scale = ContextSize / YarnOrigCtx.
RopeScalingType — must be Yarn.
Other YaRN knobs (YarnExtFactor, YarnAttnFactor, YarnBetaFast, YarnBetaSlow) — usually null; model defaults work.

What’s next

RopeScalingType — enables YaRN.
ContextSize — the extended target.
Long context tuning — full recipe.

Net: AttentionType

AttentionType selects between causal (autoregressive) and non-causal (bidirectional) attention. Standard chat models use causal attention; some embedding models use non-causal.

Quick reference


Type	`AttentionType?` enum
Default	`null` (use model default)
Values	`Unspecified`, `Causal`, `NonCausal`
Category	Attention
Field on	`ContextParameters.AttentionType`

What it does

Value	Behavior
`Unspecified` (`-1`)	Use the model’s metadata-declared type.
`Causal` (`0`)	Each token attends only to earlier tokens. Standard for chat / autoregressive generation.
`NonCausal` (`1`)	Each token attends to all tokens. Used for some embedding models and masked-language workflows.

All built-in chat presets use Causal implicitly (via model metadata). Change to NonCausal only for embedding extraction with a model trained for bidirectional attention.
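
The difference between the two modes is easiest to see as an attention mask. The following sketch is purely illustrative (the engine builds its masks natively, and both helper names are hypothetical). A 1 means the query token in that row may attend to the key token in that column; in causal mode that includes the token itself and everything before it.

```csharp
using System;
using System.Text;

// Illustrative mask builder: row = query token, column = key token.
static bool[,] BuildAttentionMask(int tokens, bool causal)
{
    var mask = new bool[tokens, tokens];
    for (int query = 0; query < tokens; query++)
        for (int key = 0; key < tokens; key++)
            mask[query, key] = !causal || key <= query;
    return mask;
}

static string Render(bool[,] mask)
{
    var sb = new StringBuilder();
    for (int q = 0; q < mask.GetLength(0); q++)
    {
        for (int k = 0; k < mask.GetLength(1); k++)
            sb.Append(mask[q, k] ? '1' : '0');
        sb.AppendLine();
    }
    return sb.ToString();
}

Console.Write(Render(BuildAttentionMask(4, causal: true)));
// 1000
// 1100
// 1110
// 1111
Console.Write(Render(BuildAttentionMask(4, causal: false)));
// 1111 on every row
```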

When to change it

Scenario	Value
Default — chat / text generation	`Unspecified` (model wins)
Bidirectional embedding extraction	`NonCausal`

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.AttentionType = AttentionType.NonCausal;
preset.ContextParameters.Embeddings = true;
preset.ContextParameters.PoolingType = PoolingType.Mean;
// Embedding-only configuration. Chat generation is not meaningful here.
// Illustrative: NonCausal only suits a model trained with bidirectional attention.

Interactions

Embeddings — embedding extraction usually pairs with NonCausal.
PoolingType — how embeddings are pooled.

What’s next

Embeddings — extraction mode flag.
PoolingType — embedding pooling.
Context parameters hub — all context knobs.

Net: FlashAttentionMode

FlashAttentionMode controls flash attention — a fused-kernel optimization that reduces memory usage and speeds up attention, especially at long contexts. Prefer FlashAttentionMode over the legacy FlashAttention boolean.

Quick reference


Type	`FlashAttentionType?` enum
Default	`null` (use model / runtime default)
Values	`Auto` (-1), `Disabled` (0), `Enabled` (1)
Category	Attention
Field on	`ContextParameters.FlashAttentionMode`

What it does

Flash attention implements attention in a single fused kernel that tiles the computation. This avoids materializing the full N × N attention matrix in memory, which is a big win at long contexts.

Value	Behavior
`Auto` (`-1`)	Runtime picks based on backend support. Recommended.
`Disabled` (`0`)	Never use flash attention. Slower and more memory at long contexts.
`Enabled` (`1`)	Force flash attention. Fails on backends that do not support it.

On long contexts (> 8K tokens), flash attention is typically 20-40 % faster and meaningfully reduces peak memory. At short contexts the benefit is small.
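
To see why avoiding the N × N matrix matters, consider how large it would be if it were materialized. The sketch below gives an upper bound for a naive implementation that scores an entire prompt against itself for a single head in F32. The helper name is hypothetical, and real engines process queries in smaller batches, so actual peaks are lower.

```csharp
using System;

// Bytes needed to hold one N x N attention score matrix for one head.
static long AttentionMatrixBytes(int tokens, int bytesPerScore)
    => (long)tokens * tokens * bytesPerScore;

Console.WriteLine(AttentionMatrixBytes(4096, 4));  // 67108864 bytes = 64 MiB
Console.WriteLine(AttentionMatrixBytes(32768, 4)); // 4294967296 bytes = 4 GiB
```

The cost grows with the square of the context length, so an 8× longer context needs 64× more memory for this matrix. A tiled kernel never stores it in full.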

When to change it

Scenario	Value
Default (recommended)	`Auto`
Explicit opt-in for benchmarking	`Enabled`
Debugging / disabling suspected flash-attention bug	`Disabled`
Backend without flash attention support	Runtime picks `Disabled` automatically when `Auto`

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.ContextSize = 32768;  // long context benefits most from FA

using var api = AsposeLLMApi.Create(preset);

Interactions

ContextSize — larger contexts benefit more from flash attention.
TypeK, TypeV — flash attention works with quantized KV cache.
FlashAttention — legacy boolean; prefer FlashAttentionMode.
Acceleration backends — CUDA, Metal, HIP, Vulkan all support flash attention on recent drivers.

What’s next

FlashAttention — legacy field; documented for completeness.
ContextSize — the axis where FA matters most.
Long context tuning — practical recipe.

Net: FlashAttention (legacy)

FlashAttention is the legacy boolean toggle for flash attention. It predates the more granular FlashAttentionMode enum. Prefer FlashAttentionMode for new code.

Quick reference


Type	`bool?`
Default	`null` (use model default)
Category	Attention (legacy)
Field on	`ContextParameters.FlashAttention`

What it does

null — no explicit override; runtime / model default applies.
true — request flash attention (equivalent to FlashAttentionMode = Enabled).
false — disable flash attention (equivalent to FlashAttentionMode = Disabled).

FlashAttentionMode supersedes this field. Which one takes precedence when both are set is not specified here; to avoid ambiguity, set only one.

When to change it

Scenario	Value
Default — prefer `FlashAttentionMode` instead	`null`
Legacy code using this field	Keep for backwards compatibility

For new code, use FlashAttentionMode which offers the three-way Auto / Disabled / Enabled choice.

Example

var preset = new Qwen25Preset();

// Legacy style (kept for compatibility):
// preset.ContextParameters.FlashAttention = true;

// Preferred (modern); set only one of the two fields:
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

Interactions

FlashAttentionMode — newer enum replacement.

What’s next

FlashAttentionMode — recommended replacement.
Context parameters hub — all context knobs.

Net: PoolingType

PoolingType selects the strategy the engine uses to reduce per-token embeddings to a single vector for the full input. Relevant only when Embeddings is true.

Quick reference


Type	`PoolingType?` enum
Default	`null` (use model default)
Values	`Unspecified`, `None`, `Mean`, `Cls`, `Last`, `Rank`
Category	Embeddings
Field on	`ContextParameters.PoolingType`

What it does

Value	Behavior
`Unspecified` (`-1`)	Use model default.
`None` (`0`)	Return per-token embeddings without reduction.
`Mean` (`1`)	Average all token embeddings. Good default for sentence-level semantic similarity.
`Cls` (`2`)	Use the first (CLS) token’s embedding. Common for BERT-family.
`Last` (`3`)	Use the last token’s embedding. Common for causal-LM embeddings.
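`Rank` (`4`)	Rank pooling (experimental). Intended for reranker-style models that produce a relevance score rather than a general-purpose embedding.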

Pick the pooling strategy the model was trained with. Mismatched pooling produces embeddings of degraded quality.
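
The three common strategies differ only in how they collapse the per-token vectors. The sketch below shows the idea on a toy input. The Pool helper and its string argument are illustrative assumptions, since the engine performs pooling natively according to PoolingType.

```csharp
using System;
using System.Linq;

// Illustrative pooling over per-token embeddings (one float[] per token).
static float[] Pool(float[][] tokenEmbeddings, string strategy)
{
    if (tokenEmbeddings.Length == 0)
        throw new ArgumentException("At least one token is required.", nameof(tokenEmbeddings));

    return strategy switch
    {
        // Average each dimension across all tokens.
        "Mean" => Enumerable.Range(0, tokenEmbeddings[0].Length)
                            .Select(d => tokenEmbeddings.Average(t => t[d]))
                            .ToArray(),
        // First token (CLS position).
        "Cls" => tokenEmbeddings[0],
        // Last token.
        "Last" => tokenEmbeddings[^1],
        _ => throw new ArgumentException($"Unknown strategy: {strategy}", nameof(strategy))
    };
}

var tokens = new[]
{
    new[] { 1f, 2f },
    new[] { 3f, 4f },
    new[] { 5f, 6f },
};

Console.WriteLine(string.Join(", ", Pool(tokens, "Mean"))); // 3, 4
Console.WriteLine(string.Join(", ", Pool(tokens, "Cls")));  // 1, 2
Console.WriteLine(string.Join(", ", Pool(tokens, "Last"))); // 5, 6
```

The same input yields three different vectors, which is why an embedding index built with one pooling strategy cannot be queried with another.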

When to change it

Scenario	Value
Default chat — not used	`null`
Causal-LM embeddings	`Last`
BERT-style embedder	`Cls`
Sentence-transformer-style	`Mean`

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.Embeddings = true;
preset.ContextParameters.PoolingType = PoolingType.Mean;
preset.ContextParameters.AttentionType = AttentionType.NonCausal;

using var api = AsposeLLMApi.Create(preset);
// Embedding-only configuration.

Interactions

Embeddings — must be true for PoolingType to take effect.
AttentionType — usually NonCausal with embedding-specific pooling.

What’s next

Embeddings — flag that enables this pipeline.
AttentionType — companion choice.
Context parameters hub — all context knobs.

Net: Embeddings

Embeddings is a boolean flag. When true, the engine extracts embedding vectors alongside (or instead of) logits. Use it with a PoolingType that matches the model’s training regime.

Quick reference


Type	`bool?`
Default	`null` (disabled)
Category	Embeddings
Field on	`ContextParameters.Embeddings`

What it does

null or false — standard generation mode. Only logits are produced; no embedding extraction.
true — the engine configures the pipeline to output embeddings per input.

Embeddings are typically used for semantic search, clustering, classification, or as retrieval keys in RAG systems. The SDK’s current chat API (SendMessageAsync) focuses on text generation; embedding workflows require reaching into the Engine and ChatSession APIs directly.
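
As an illustration of the semantic-search use case, the sketch below scores candidate vectors against a query vector with cosine similarity. It assumes you have already obtained the embedding vectors by some means; `CosineSimilarity` is a hypothetical helper, not an SDK method, and the 3-dimensional vectors are stand-ins for real embeddings.

```csharp
using System;

// Hypothetical helper: cosine similarity between two embedding vectors.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }

    if (normA == 0 || normB == 0) return 0;   // zero vector: treat similarity as 0
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Made-up vectors standing in for real embeddings.
var query = new[] { 1f, 0f, 1f };
var docs = new (string Name, float[] Vector)[]
{
    ("doc-a", new[] { 1f, 0f, 1f }),
    ("doc-b", new[] { 0f, 1f, 0f }),
    ("doc-c", new[] { 1f, 1f, 0f }),
};

foreach (var d in docs)
    Console.WriteLine($"{d.Name}: {CosineSimilarity(query, d.Vector):F3}");
```

Higher scores mean closer direction in embedding space; a retrieval step would sort candidates by this score and keep the top few.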

When to change it

Scenario	Value
Default chat	`null`
Extract embeddings	`true`, paired with `PoolingType` and often `AttentionType = NonCausal`

A dedicated use-case guide for embeddings is on the documentation roadmap but is not included in this version.

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.Embeddings = true;
preset.ContextParameters.PoolingType = PoolingType.Mean;
preset.ContextParameters.AttentionType = AttentionType.NonCausal;

using var api = AsposeLLMApi.Create(preset);
// Direct chat methods do not surface embeddings; use Engine/ChatSession internals.

Interactions

PoolingType — reducer for token-level embeddings.
AttentionType — usually NonCausal for embedding-only models.

What’s next

PoolingType — pooling strategy.
AttentionType — attention direction.
Context parameters hub — all context knobs.

Net: TypeK

TypeK is the data type used to store the K (keys) tensor of the KV cache. Choosing a smaller dtype reduces KV cache memory at the cost of some quality loss, ranging from very small at `Q8_0` to significant at `Q4_0`.

Quick reference


Type	`GgmlType?` enum (39 values)
Default	`null` (use native default, usually `F16`)
Common values	`F32`, `F16`, `BF16`, `Q8_0`, `Q5_1`, `Q4_0`
Category	KV cache
Field on	`ContextParameters.TypeK`

What it does

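For each layer, the engine stores one K tensor of shape (kv_heads × seq_len × head_dim); on models that use grouped-query attention, kv_heads is smaller than the number of attention heads. The dtype controls memory per element: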

Dtype	Bits	Relative size	Quality impact
`F32`	32	1.0×	None. Rarely worth the memory.
`F16`	16	0.5×	Default. Minimal impact.
`BF16`	16	0.5×	Alternative to F16. Slightly different numerical range.
`Q8_0`	8	0.25×	Very small quality loss for substantial savings.
`Q5_1`	5	~0.16×	Noticeable at long contexts.
`Q4_0`	4	~0.125×	Aggressive; use only under strong memory pressure.

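Relative size is measured against `F32`. Bit widths for the quantized types are nominal: ggml's block formats also store a scale (and, for `Q5_1`, a minimum) per 32-element block, so the effective cost is about 8.5 bits per element for `Q8_0`, 6 for `Q5_1`, and 4.5 for `Q4_0`.

Rule of thumb: K is more sensitive to precision than V. Prefer to quantize V more aggressively than K.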
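
To see what the dtype choice means in bytes, the sketch below estimates the K cache size for a whole model. `KCacheBytes` is a hypothetical helper, the model dimensions are assumed for illustration rather than read from any preset, and the bits-per-element figures are the effective ggml block costs (payload plus per-block scale), which are slightly higher than the nominal widths.

```csharp
using System;

// Hypothetical helper: bytes needed for the K cache across all layers.
// bitsPerElement is the effective storage cost of the chosen dtype.
static long KCacheBytes(int layers, int kvHeads, int headDim, int seqLen, double bitsPerElement)
{
    long elements = (long)layers * kvHeads * headDim * seqLen;
    return (long)Math.Ceiling(elements * bitsPerElement / 8.0);
}

// Assumed model shape (illustrative only):
// 32 layers, 8 KV heads, head dimension 128, 32768-token context.
const int Layers = 32, KvHeads = 8, HeadDim = 128, SeqLen = 32768;

// Effective bits per element, including per-block scale overhead.
var dtypes = new (string Name, double Bits)[]
{
    ("F32", 32), ("F16", 16), ("Q8_0", 8.5), ("Q5_1", 6), ("Q4_0", 4.5),
};

foreach (var (name, bits) in dtypes)
{
    double mib = KCacheBytes(Layers, KvHeads, HeadDim, SeqLen, bits) / (1024.0 * 1024.0);
    Console.WriteLine($"{name,-5} {mib,6:F0} MiB");
}
```

With these assumed dimensions the cache holds 2^30 K elements, so `F16` needs 2048 MiB for K alone and `Q8_0` about 1088 MiB. The V cache adds its own cost on top, governed by TypeV.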

When to change it

Scenario	Value
Default	`null` (F16)
Long context, mild memory pressure	`F16` (keep K high precision)
Long context, tight memory	`Q8_0`
Extreme memory constraint	`Q5_1` (accept quality drop)

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.TypeK = GgmlType.F16;   // keep K precise
preset.ContextParameters.TypeV = GgmlType.Q8_0;  // quantize V

using var api = AsposeLLMApi.Create(preset);

Interactions

TypeV — companion V-cache dtype.
ContextSize — cache cost scales with context; quantization pays off more at long contexts.
FlashAttentionMode — works with quantized KV on recent backends.

What’s next

TypeV — V-cache dtype.
Low memory tuning — practical quantization recipes.
Context parameters hub — all context knobs.

Net: TypeV

TypeV is the data type for the V (values) tensor of the KV cache. V tolerates more aggressive quantization than K — Q8_0 is often a safe default when memory is tight.

Quick reference


Type	`GgmlType?` enum (39 values)
Default	`null` (use native default, usually `F16`)
Common values	`F32`, `F16`, `BF16`, `Q8_0`, `Q5_1`, `Q4_0`
Category	KV cache
Field on	`ContextParameters.TypeV`

What it does

Identical mechanics to TypeK — stored shape, memory scaling, and options are the same. The difference is sensitivity: attention output depends on V through a weighted sum, which averages over many tokens, so quantization error averages out. Attention scores depend on K through a dot product where individual errors survive more.

Rule of thumb: Quantize V at least as aggressively as K, and typically one step further.

Configuration	K	V
Default balanced	`F16`	`F16`
Save memory with minimal quality loss	`F16`	`Q8_0`
Tight memory	`Q8_0`	`Q8_0`
Very tight	`Q8_0`	`Q5_1`
Extreme	`Q8_0`	`Q4_0`
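
One way to use the table is as a ladder: walk it from top to bottom and take the first K/V pair whose combined cache fits your memory budget. The sketch below does that. `Bits`, `KvBytes`, and `PickConfig` are hypothetical helpers, the element count and budget are assumed values, and the bits-per-element figures are effective ggml block costs rather than nominal widths.

```csharp
using System;

// Effective bits per element (ggml block formats include per-block scales).
static double Bits(string dtype) => dtype switch
{
    "F16" => 16,
    "Q8_0" => 8.5,
    "Q5_1" => 6,
    "Q4_0" => 4.5,
    _ => throw new ArgumentException($"Unknown dtype: {dtype}"),
};

// Combined K + V cache size in bytes, given the element count of one tensor.
static double KvBytes(long elementsPerTensor, string k, string v) =>
    elementsPerTensor * (Bits(k) + Bits(v)) / 8.0;

// Walk the table top to bottom; return the first (highest-quality) pair that fits.
static (string K, string V)? PickConfig(long elementsPerTensor, double budgetBytes)
{
    var ladder = new[]
    {
        ("F16", "F16"), ("F16", "Q8_0"), ("Q8_0", "Q8_0"), ("Q8_0", "Q5_1"), ("Q8_0", "Q4_0"),
    };
    foreach (var (k, v) in ladder)
        if (KvBytes(elementsPerTensor, k, v) <= budgetBytes) return (k, v);
    return null;   // nothing fits: reduce ContextSize instead
}

// Assumed workload: 2^30 cached elements per tensor, 3 GiB available for KV.
var choice = PickConfig(1L << 30, 3.0 * 1024 * 1024 * 1024);
Console.WriteLine(choice.HasValue
    ? $"TypeK = {choice.Value.K}, TypeV = {choice.Value.V}"
    : "No configuration fits; lower ContextSize.");
```

For the assumed numbers, `F16`/`F16` (4 GiB) and `F16`/`Q8_0` (about 3.06 GiB) exceed the budget, so the sketch settles on `Q8_0`/`Q8_0` at about 2.13 GiB.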

When to change it

Scenario	Value
Default	`null` (F16)
Save memory with minimal quality loss	`Q8_0`
Very tight memory	`Q5_1`
Extreme memory pressure	`Q4_0`

V quantization at Q8_0 is often indistinguishable from F16 on benchmark tasks. Start there when memory is tight.

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.TypeV = GgmlType.Q8_0;
// V quantized; K left at default F16.

using var api = AsposeLLMApi.Create(preset);

Interactions

TypeK — companion K-cache dtype.
ContextSize — memory savings scale with context.
FlashAttentionMode — compatible with quantized V.

Note that in llama.cpp a quantized V cache requires flash attention, so do not disable FlashAttentionMode when TypeV is set to a quantized type.

What’s next

TypeK — companion.
Low memory tuning — recipes for tight-memory deployments.
Context parameters hub — all context knobs.

Net: OffloadKqv

OffloadKqv controls whether the KQV (attention) computation runs on the GPU and whether the KV cache itself is stored in GPU memory. On GPU-enabled builds, the default is to offload; disable it only when VRAM is the limiting resource.

Quick reference


Type	`bool?`
Default	`null` (uses native default — typically `true` on GPU builds)
Category	KV cache / GPU
Field on	`ContextParameters.OffloadKqv`

What it does

true — KV cache tensors and attention computation live on the GPU. Benefits throughput; uses more VRAM.
false — KV cache stays on CPU even when layers are offloaded via GpuLayers. Reduces VRAM usage; slower because GPU must read KV from host memory.
null — native default (usually true on GPU builds, irrelevant on CPU builds).

Disabling is useful on GPUs where the weights fit in VRAM but adding the KV cache would cause an out-of-memory (OOM) error at long contexts.
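
The decision can be reduced to a simple budget check, sketched below. `ShouldOffloadKqv` is a hypothetical helper, not an SDK API, and all sizes are illustrative; in practice you would estimate the KV cache size from ContextSize, TypeK, and TypeV, and leave headroom for compute buffers.

```csharp
using System;

// Hypothetical helper: returns the value you might assign to OffloadKqv.
// True only if weights, KV cache, and a safety margin all fit in free VRAM.
static bool ShouldOffloadKqv(long freeVramBytes, long weightBytes, long kvCacheBytes, long headroomBytes)
{
    return weightBytes + kvCacheBytes + headroomBytes <= freeVramBytes;
}

const long GiB = 1024L * 1024 * 1024;

// Illustrative numbers: 12 GiB free, 5 GiB of weights, 1 GiB headroom.
bool shortContext = ShouldOffloadKqv(12 * GiB, 5 * GiB, 4 * GiB, 1 * GiB);  // 10 GiB total: fits
bool longContext = ShouldOffloadKqv(12 * GiB, 5 * GiB, 8 * GiB, 1 * GiB);   // 14 GiB total: does not fit

Console.WriteLine($"4 GiB KV cache -> OffloadKqv = {shortContext}");
Console.WriteLine($"8 GiB KV cache -> OffloadKqv = {longContext}");
```

When the check fails, quantizing the KV cache or lowering ContextSize may bring it back under budget without giving up GPU placement.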

When to change it

Scenario	Value
Default GPU inference	`null` (true)
Short on VRAM at long context — trade speed for memory	`false`
CPU-only	`null` (irrelevant)

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;    // full offload
preset.ContextParameters.OffloadKqv = false;            // but keep KV on CPU to save VRAM
preset.ContextParameters.ContextSize = 131072;          // long context

Interactions

GpuLayers — with OffloadKqv = false, GPU layers access KV from CPU — slower but saves VRAM.
TypeK, TypeV — quantizing KV reduces memory regardless of placement.
FlashAttentionMode — FA reduces KV memory pressure.

What’s next

TypeK, TypeV — KV dtype.
GpuLayers — weight offload.
Out of memory troubleshooting — memory pressure recipes.

Net: DefragThreshold

DefragThreshold is the fraction of KV cache holes above which the engine triggers defragmentation. Useful for long-running services where repeated cleanup creates fragmentation.

Quick reference


Type	`float?`
Default	`null` (disabled — same as negative value)
Range	`< 0` = disabled; `0.0` – `1.0` enables
Category	KV cache maintenance
Field on	`ContextParameters.DefragThreshold`

What it does

When messages are evicted from the KV cache (by CacheCleanupStrategy), their slots become holes. Over many cycles, the cache may hold scattered used slots interspersed with holes, wasting capacity.

If DefragThreshold is set, the engine monitors the hole fraction. When it crosses the threshold, the engine compacts the cache — moves live tokens together and frees the tail.

null or negative — disabled. Cache is never compacted.
0.1 – 0.5 — typical active values. Compact when 10-50 % of the cache is holes.
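
The sketch below models the idea on a toy cache: measure the hole fraction, compare it to the threshold, and compact if needed. It is an illustration only. `HoleFraction` and `Compact` are hypothetical helpers, and the definition of hole fraction used here (empty cells up to the last occupied cell) is an assumption; the native engine's exact accounting may differ.

```csharp
using System;
using System.Linq;

// Fraction of empty cells within the occupied span (cell 0 through the last used cell).
static double HoleFraction(int?[] cells)
{
    int last = Array.FindLastIndex(cells, c => c.HasValue);
    if (last < 0) return 0.0;   // empty cache: nothing to compact
    int holes = cells.Take(last + 1).Count(c => !c.HasValue);
    return (double)holes / (last + 1);
}

// Move live tokens to the front, preserving order; the tail becomes free.
static int?[] Compact(int?[] cells)
{
    var live = cells.Where(c => c.HasValue).ToArray();
    var result = new int?[cells.Length];
    Array.Copy(live, result, live.Length);
    return result;
}

// Token ids in an 8-cell cache; null marks a hole left by eviction.
int?[] cache = { 10, null, 11, null, null, 12, null, null };
const double Threshold = 0.3;

double fraction = HoleFraction(cache);   // 3 holes in the first 6 cells = 0.5
if (fraction > Threshold) cache = Compact(cache);

Console.WriteLine($"hole fraction before: {fraction}");
Console.WriteLine(string.Join(" ", cache.Select(c => c?.ToString() ?? "_")));
```

After compaction the three live tokens occupy the first three cells and the remaining five form one contiguous free region.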

When to change it

Scenario	Value
Default (short-lived or bounded sessions)	`null`
Long-running service with many evictions	`0.3`
Aggressive compaction	`0.1`

Compaction is not free: each time it triggers, the engine pays to move live entries. For bursty workloads where cache usage oscillates, defragmentation helps sustain throughput.

Example

var preset = new Qwen25Preset();
preset.ContextParameters.DefragThreshold = 0.3f;
// Compact when >30 % of KV slots are holes.

Interactions

CacheCleanupStrategy — the policy that creates the holes defrag compacts.
ContextSize — larger caches benefit more from defrag.

What’s next

Cache management — cleanup strategies and compaction together.
Context parameters hub — all context knobs.

Net: SwaFull

SwaFull controls whether the engine stores the full, uncompressed SWA (sliding-window attention) cache for models that use sliding-window attention. Only relevant for models with SWA layers.

Quick reference


Type	`bool?`
Default	`null` (use native default)
Category	KV cache (SWA-specific)
Field on	`ContextParameters.SwaFull`

What it does

Sliding-window attention (used by some Mistral, Gemma, and other architectures) attends only to a bounded recent window. The engine can store this window either:

Compressed (SwaFull = false or null) — smaller memory footprint, typical default.
Full (SwaFull = true) — uncompressed, larger memory footprint, may be faster in specific workloads.

For models without SWA, this field has no effect.
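
To illustrate what "a bounded recent window" means, the sketch below computes which key positions a query token can see under a sliding window, and how many positions still matter after a long sequence. Both helpers are hypothetical, the window size is an assumed value, and whether the window counts the current token varies between models.

```csharp
using System;

// Key positions visible to the query at position `pos` under a sliding window.
// Assumes the window includes the current token; conventions vary by model.
static (int First, int Last) VisibleRange(int pos, int window)
{
    int first = Math.Max(0, pos - window + 1);
    return (first, pos);
}

// Number of most-recent positions that can still be attended to
// after `seqLen` tokens have been processed.
static int PositionsInWindow(int seqLen, int window) => Math.Min(seqLen, window);

var r = VisibleRange(pos: 10000, window: 4096);
Console.WriteLine($"token 10000 attends to positions {r.First}..{r.Last}");
Console.WriteLine($"positions still in window after 32768 tokens: {PositionsInWindow(32768, 4096)}");
```

Because only the most recent window is ever attended to, a sliding-window layer can in principle keep far fewer cached positions than the full context, which is the trade-off this setting exposes.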

When to change it

Scenario	Value
Default	`null`
Benchmarking SWA performance	`true` to test uncompressed path
Memory constrained on SWA model	`null` or `false`

Few models currently on the built-in preset list use SWA extensively. If you are unsure, leave it as null.

Example

var preset = new Qwen25Preset();  // not SWA — SwaFull has no effect
preset.ContextParameters.SwaFull = null;  // default

Interactions

Only relevant for SWA-architected models.
TypeK, TypeV — the dtype applies regardless.

What’s next

Context parameters hub — all context knobs.
Supported presets — check which presets use SWA.

Net: KvUnified

KvUnified is an internal llama.cpp flag controlling whether the engine uses a single unified buffer for the KV cache across input sequences during attention. Leave at default unless specifically instructed by SDK guidance.

Quick reference


Type	`bool?`
Default	`null` (use native default)
Category	KV cache (internal)
Field on	`ContextParameters.KvUnified`

What it does

The unified buffer layout can optimize some multi-sequence scenarios by colocating K and V for all sequences in one memory block. Whether this helps or hurts depends on backend and workload.

null — native default. Usually correct.
true — force unified buffer.
false — force separate buffers.

When to change it

Scenario	Value
Default	`null`
Specific backend-tuning guidance from SDK docs	As instructed

Most workloads never touch this.

Example

var preset = new Qwen25Preset();
// preset.ContextParameters.KvUnified = null; // default

Interactions

NSeqMax — multi-sequence scenarios may interact with unified-buffer layout.

What’s next

Context parameters hub — all context knobs.

Net: OpOffload

OpOffload toggles offloading of host-side tensor operations to the GPU device. This is supplementary to GpuLayers and affects specific small operations that would otherwise run on CPU even with GPU offload active.

Quick reference


Type	`bool?`
Default	`null` (use native default)
Category	GPU offload (auxiliary)
Field on	`ContextParameters.OpOffload`

What it does

Some tensor operations (embedding lookups, small reductions) are relatively cheap and traditionally run on the host. OpOffload lets the engine offload them to the device too, in exchange for minimal host-device overhead.

null — native default. Modern GPU backends usually benefit from true.
true — offload.
false — keep on host.

When to change it

Scenario	Value
Default	`null`
Benchmarking GPU-centric paths	`true`
Debugging device-specific issues	`false`

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.ContextParameters.OpOffload = true;  // ensure all operations on device

Interactions

GpuLayers — primary offload control.
OffloadKqv — KV specific.

What’s next

GpuLayers — primary layer offload.
Context parameters hub — all context knobs.

Net: NoPerf

NoPerf disables collection of performance timings inside the native layer. The savings are small but non-zero; useful in very high-throughput production loops.

Quick reference


Type	`bool?`
Default	`null` (timings collected)
Category	Performance
Field on	`ContextParameters.NoPerf`

What it does

null or false — timings collected. Minor overhead per call but you can inspect them via native logging.
true — timings disabled. Slightly faster; nothing to inspect.

This is a micro-optimization. The savings are usually below measurement noise on a single request. On a high-throughput server processing many requests per second, the savings add up.

When to change it

Scenario	Value
Default (keeps timings for debugging)	`null`
High-throughput production squeezing every last cycle	`true`
Debugging performance	`null` or `false`

Example

var preset = new Qwen25Preset();
preset.ContextParameters.NoPerf = true;           // production
preset.EngineParameters.EnableDebugLogging = false;

Interactions

EnableDebugLogging — turning on debug logs defeats the purpose of NoPerf = true.

What’s next

Performance issues troubleshooting — throughput-focused tuning.
Context parameters hub — all context knobs.