Context parameters
ContextParameters mirrors llama_context_params in llama.cpp. It controls the shape of the runtime context — how many tokens the model can attend to, how batching is sized, how threads are split, how RoPE scaling stretches the context window, how flash attention is used, and how the KV cache is stored.
This is the largest parameter bag. Most fields are nullable; null means “use the native default” or “derive from the model”. Touch these values only when you know what you are changing.
Class reference
namespace Aspose.LLM.Abstractions.Models;
public partial class ContextParameters
{
// Context size and batching
public int? ContextSize { get; set; }
public uint? NBatch { get; set; }
public uint? NUbatch { get; set; }
public uint? NSeqMax { get; set; }
// Threading
public int? NThreads { get; set; }
public int? NThreadsBatch { get; set; }
// RoPE scaling
public RopeScalingType? RopeScalingType { get; set; }
public float? RopeFreqBase { get; set; }
public float? RopeFreqScale { get; set; }
// YaRN
public float? YarnExtFactor { get; set; }
public float? YarnAttnFactor { get; set; }
public float? YarnBetaFast { get; set; }
public float? YarnBetaSlow { get; set; }
public uint? YarnOrigCtx { get; set; }
// Attention
public AttentionType? AttentionType { get; set; }
public FlashAttentionType? FlashAttentionMode { get; set; }
public bool? FlashAttention { get; set; } // legacy
// Pooling and embeddings
public PoolingType? PoolingType { get; set; }
public bool? Embeddings { get; set; }
// KV cache
public GgmlType? TypeK { get; set; }
public GgmlType? TypeV { get; set; }
public bool? OffloadKqv { get; set; }
public float? DefragThreshold { get; set; }
public bool? SwaFull { get; set; }
public bool? KvUnified { get; set; }
// Other
public bool? OpOffload { get; set; }
public bool? NoPerf { get; set; }
}
Detailed field reference
Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions. The rest of this page is an inline overview of the same content; follow the links for the deeper treatment.
Context size and batching: ContextSize, NBatch, NUbatch, NSeqMax.
Threading: NThreads, NThreadsBatch.
RoPE and YaRN: RopeScalingType, RopeFreqBase, RopeFreqScale, YarnExtFactor, YarnAttnFactor, YarnBetaFast, YarnBetaSlow, YarnOrigCtx.
Attention: AttentionType, FlashAttentionMode, FlashAttention (legacy).
Pooling and embeddings: PoolingType, Embeddings.
KV cache: TypeK, TypeV, OffloadKqv, DefragThreshold, SwaFull, KvUnified.
Other: OpOffload, NoPerf.
Context size and batching
ContextSize
Length of the context window in tokens — the maximum number of tokens the model sees at once. Set to null (or 0) to use the model’s maximum from its GGUF metadata.
Built-in presets set this explicitly: Qwen25Preset uses 32,768; Llama32Preset and Oss20Preset both use 131,072. See Supported presets for each default.
Trade-off: larger context allows longer conversations and documents, but the KV cache size scales with ContextSize × model-depth. Going from 32K to 131K quadruples KV memory.
preset.ContextParameters.ContextSize = 8192; // save memory for short conversations
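To see the scale of that trade-off, a back-of-the-envelope estimate (the layer count and per-layer KV width below are illustrative, not taken from any shipped preset):
int nLayers = 32;        // hypothetical model depth
int kvWidth = 1024;      // hypothetical per-layer KV width (heads × head dim)
int bytesPerValue = 2;   // F16 cache
long contextSize = 32768;
long kvBytes = 2L * nLayers * contextSize * kvWidth * bytesPerValue; // K + V ≈ 4 GiB
The same arithmetic at 131,072 tokens yields roughly 16 GiB, the 4× growth mentioned above.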
NBatch
Logical maximum batch size — the largest number of tokens submitted in one llama_decode call. Affects prompt-processing throughput.
Typical values: 512 to 4,096. Larger NBatch speeds up prompt processing but needs more temporary memory.
Built-in presets use NBatch between 2,048 and 4,096, depending on model and context size.
NUbatch
Physical maximum batch size — the largest chunk actually processed per kernel call. NUbatch should not exceed NBatch. On most deployments, set NUbatch equal to NBatch for simplicity; distinct values matter only in specific multi-sequence scenarios.
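A minimal sketch, assuming a prompt-heavy workload (the values are typical, not mandated):
preset.ContextParameters.NBatch = 2048;  // logical batch submitted to llama_decode
preset.ContextParameters.NUbatch = 2048; // physical chunk per kernel call; kept equal to NBatch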
NSeqMax
Maximum number of distinct sequences (for recurrent models) handled in parallel. For standard transformer models, leave at null or 1.
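If your deployment genuinely runs parallel sequences, a sketch (the count is illustrative):
preset.ContextParameters.NSeqMax = 4; // up to four sequences in parallel; leave null or 1 for ordinary chat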
Threading
NThreads
CPU threads used during generation (token-by-token decode). When null, the engine falls back to EngineParameters.DefaultThreads.
Override when:
- Running multiple concurrent inferences per process and you want to cap each.
- Benchmarking to find the sweet spot for your model and CPU.
preset.ContextParameters.NThreads = 8;
NThreadsBatch
Threads used for batch (prompt) processing. Often set differently from NThreads because prompt processing parallelizes better. A typical production split: NThreadsBatch ≈ Environment.ProcessorCount, with NThreads at roughly half that.
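A sketch of that split (Environment.ProcessorCount is a starting point, not a guarantee of the optimum):
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount;             // prompt processing scales with cores
preset.ContextParameters.NThreads = Math.Max(1, Environment.ProcessorCount / 2); // decode is often memory-bound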
RoPE scaling
RoPE (rotary position embedding) is how transformer models encode token positions. Scaling lets you use a larger context than the model was trained on, at some accuracy cost.
RopeScalingType
Scaling algorithm.
| Value | Behavior |
|---|---|
| Unspecified (-1) | Use the model default (usually what you want). |
| None | Disable RoPE scaling. |
| Linear | Linear interpolation scaling. |
| Yarn | YaRN scaling — better quality at long contexts. |
| LongRope | LongRoPE scaling for very extended contexts. |
RopeFreqBase and RopeFreqScale
Override the RoPE frequency base and scaling factor. null or 0 means “from model metadata”. Most users leave these alone; advanced users override when extending context beyond the trained window.
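If you do override them, a sketch (the value is illustrative; the right numbers depend on the model and the target context):
preset.ContextParameters.RopeScalingType = RopeScalingType.Linear;
preset.ContextParameters.RopeFreqScale = 0.5f; // illustrative: roughly doubles the usable context via linear interpolation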
YaRN
YaRN is a specific RoPE-scaling recipe. All five fields activate only when RopeScalingType = Yarn.
| Field | Role | Default behavior |
|---|---|---|
| YarnExtFactor | Extrapolation mix factor | Negative or null = from model |
| YarnAttnFactor | Magnitude scaling factor | Null = from model |
| YarnBetaFast | Low correction dim | Null = from model |
| YarnBetaSlow | High correction dim | Null = from model |
| YarnOrigCtx | Original training context length | Null = from model |
Typical usage: the preset author (or you, for a custom preset) picks Yarn and sets YarnOrigCtx to the model's training context; the other YaRN fields default from the model's metadata. The Extended context recipe at the end of this page shows the pattern.
Attention
AttentionType
Causal (autoregressive) versus non-causal (bidirectional) attention.
| Value | Behavior |
|---|---|
| Unspecified | Model default. |
| Causal | Standard chat/inference attention. |
| NonCausal | Bidirectional; used for embedding models. |
Leave Unspecified unless you know why you are changing it.
FlashAttentionMode
Flash attention is a fused-kernel optimization that reduces memory and speeds up long-context inference.
| Value | Behavior |
|---|---|
| Auto (-1) | Runtime picks based on support. Recommended. |
| Disabled | Never use flash attention. |
| Enabled | Force flash attention (fails on backends that do not support it). |
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
Flash attention is especially effective on long contexts (32K+). For short contexts, the speed-up is small.
FlashAttention
Legacy boolean flag. Prefer FlashAttentionMode. Kept for backwards compatibility with earlier SDK code.
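A migration sketch, assuming the legacy true maps to Enabled:
// Before (legacy): preset.ContextParameters.FlashAttention = true;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;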
Pooling and embeddings
PoolingType
Only relevant when generating embeddings (not normal chat).
| Value | Behavior |
|---|---|
| Unspecified / None | No pooling. |
| Mean | Average the token embeddings. |
| Cls | Use the CLS token’s embedding. |
| Last | Use the last token’s embedding. |
| Rank | Rank-based pooling (experimental). |
Embeddings
When true, the engine extracts embeddings alongside logits. Combined with a suitable PoolingType, this turns the model into an embedding generator.
Embedding workflows are not covered by the chat API (SendMessageAsync). A dedicated use case will be added in a future release.
KV cache
The KV cache stores the keys and values of every token the model has seen. Its size grows with context, and its precision affects both accuracy and memory.
TypeK and TypeV
Data type for the K and V tensors in the cache. Lower precision saves memory but can degrade output quality.
Common values (full enum has 30+ entries; see the API reference for the complete list):
| Type | Bits per value | Relative size | Notes |
|---|---|---|---|
| F32 | 32 | 1.0× | Full precision. Rarely worth the memory. |
| F16 | 16 | 0.5× | Default on many builds. Good balance. |
| BF16 | 16 | 0.5× | Alternative to F16; slightly different range. |
| Q8_0 | 8 | 0.25× | Very minor quality loss for substantial memory savings. |
| Q5_1 | 5 | ~0.16× | More compact; quality drop noticeable on long contexts. |
| Q4_0 | 4 | ~0.125× | Aggressive quantization; only for memory-tight deployments. |
Rule of thumb: quantize V more aggressively than K — V is less sensitive to precision.
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
OffloadKqv
When true, the KQV computation (attention) and the KV cache itself live on the GPU. When false, they stay on CPU even if layers are offloaded. Default varies by build; usually true on GPU-enabled builds.
Disable only when you are fighting for a sliver of VRAM and willing to trade throughput.
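A sketch for a VRAM-starved box that accepts slower attention:
preset.ContextParameters.OffloadKqv = false; // keep the KV cache and attention math on the CPU to free VRAM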
DefragThreshold
Threshold (fraction of holes) above which the engine defragments the KV cache. Negative = disabled (default).
Set it between 0.1 and 0.5 for long-running services that churn sessions; this keeps KV memory compact over time.
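For example:
preset.ContextParameters.DefragThreshold = 0.2f; // defragment once roughly 20% of KV cells are holes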
SwaFull
Relevant for models that use sliding-window attention (SWA). When true, the engine stores the full SWA cache instead of the compressed form, which costs more memory but can be faster for certain workloads.
KvUnified
When true, uses a unified buffer across input sequences for attention. Implementation detail; leave at the default.
Other
OpOffload
Offload host tensor operations to the device. Supplementary to GpuLayers in ModelInferenceParameters. Leave null unless you know why.
NoPerf
When true, the engine stops collecting performance timings. A micro-optimization for high-throughput production — shaves a small amount of overhead.
Typical recipes
Minimize memory on a tight GPU
Short context and aggressive KV quantization:
preset.ContextParameters.ContextSize = 4096;
preset.ContextParameters.TypeK = GgmlType.Q8_0;
preset.ContextParameters.TypeV = GgmlType.Q4_0;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
Extended context with YaRN scaling
preset.ContextParameters.ContextSize = 131072;
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 32768; // the model's original training context
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
High-throughput production
Lean on flash attention and optimized KV:
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
preset.ContextParameters.DefragThreshold = 0.3f;
preset.ContextParameters.NoPerf = true;
preset.ContextParameters.NThreads = Environment.ProcessorCount;
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount;
Benchmark-focused (reproducible)
preset.ContextParameters.ContextSize = 8192;
preset.ContextParameters.NBatch = 2048;
preset.ContextParameters.NUbatch = 2048;
preset.ContextParameters.NThreads = 8;
preset.ContextParameters.NThreadsBatch = 16;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.NoPerf = false; // collect timings
Embedding extraction
preset.ContextParameters.Embeddings = true;
preset.ContextParameters.PoolingType = PoolingType.Mean;
preset.ContextParameters.AttentionType = AttentionType.NonCausal;
Chat APIs are not the right surface for embedding workflows — this recipe is preparation for a dedicated embeddings API that will be covered in a future release.
What’s next
- Model inference parameters — GPU layers and tensor split that complement context KV settings.
- Chat parameters — per-session max tokens and cache cleanup strategy.
- Sampler parameters — how the engine picks tokens within the context.