Context parameters

ContextParameters mirrors llama_context_params in llama.cpp. It controls the shape of the runtime context — how many tokens the model can attend to, how batching is sized, how threads are split, how RoPE scaling stretches the context window, how flash attention is used, and how the KV cache is stored.

This is the largest parameter bag. Most fields are nullable; null means “use the native default” or “derive from the model”. Touch these values only when you know what you are changing.

Class reference

namespace Aspose.LLM.Abstractions.Models;

public partial class ContextParameters
{
    // Context size and batching
    public int? ContextSize { get; set; }
    public uint? NBatch { get; set; }
    public uint? NUbatch { get; set; }
    public uint? NSeqMax { get; set; }

    // Threading
    public int? NThreads { get; set; }
    public int? NThreadsBatch { get; set; }

    // RoPE scaling
    public RopeScalingType? RopeScalingType { get; set; }
    public float? RopeFreqBase { get; set; }
    public float? RopeFreqScale { get; set; }

    // YaRN
    public float? YarnExtFactor { get; set; }
    public float? YarnAttnFactor { get; set; }
    public float? YarnBetaFast { get; set; }
    public float? YarnBetaSlow { get; set; }
    public uint? YarnOrigCtx { get; set; }

    // Attention
    public AttentionType? AttentionType { get; set; }
    public FlashAttentionType? FlashAttentionMode { get; set; }
    public bool? FlashAttention { get; set; } // legacy

    // Pooling and embeddings
    public PoolingType? PoolingType { get; set; }
    public bool? Embeddings { get; set; }

    // KV cache
    public GgmlType? TypeK { get; set; }
    public GgmlType? TypeV { get; set; }
    public bool? OffloadKqv { get; set; }
    public float? DefragThreshold { get; set; }
    public bool? SwaFull { get; set; }
    public bool? KvUnified { get; set; }

    // Other
    public bool? OpOffload { get; set; }
    public bool? NoPerf { get; set; }
}

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions. The rest of this page is an inline overview of the same content; follow the links for the deeper treatment.

Context size and batching: ContextSize, NBatch, NUbatch, NSeqMax.

Threading: NThreads, NThreadsBatch.

RoPE and YaRN: RopeScalingType, RopeFreqBase, RopeFreqScale, YarnExtFactor, YarnAttnFactor, YarnBetaFast, YarnBetaSlow, YarnOrigCtx.

Attention: AttentionType, FlashAttentionMode, FlashAttention (legacy).

Pooling and embeddings: PoolingType, Embeddings.

KV cache: TypeK, TypeV, OffloadKqv, DefragThreshold, SwaFull, KvUnified.

Other: OpOffload, NoPerf.

Context size and batching

ContextSize

Length of the context window in tokens — the maximum number of tokens the model sees at once. Set to null (or 0) to use the model’s maximum from its GGUF metadata.

Built-in presets set this value: Qwen25Preset uses 32 768, Llama32Preset uses 131 072, and Oss20Preset uses 131 072. See Supported presets for each default.

Trade-off: a larger context allows longer conversations and documents, but the KV cache size scales linearly with ContextSize × model depth. Going from 32K to 131K quadruples KV memory.

preset.ContextParameters.ContextSize = 8192; // save memory for short conversations
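
To put numbers on the trade-off, here is a rough estimator for a standard transformer KV cache. It is a sketch, not part of the SDK: the class name KvCacheEstimate and the model dimensions in the comments (32 layers, 8 KV heads, head dimension 128) are hypothetical, and the formula ignores padding, sliding-window layers, and runtime overhead.

```csharp
using System;

public static class KvCacheEstimate
{
    // Approximate size in bytes:
    // 2 tensors (K and V) x layers x KV heads x head dimension x context tokens x bytes per value.
    public static long Bytes(int layers, int kvHeads, int headDim, long contextSize, double bytesPerValue)
    {
        double total = 2.0 * layers * kvHeads * headDim * contextSize * bytesPerValue;
        return (long)Math.Round(total);
    }

    // Convert bytes to GiB for display.
    public static double GiB(long bytes) => bytes / (1024.0 * 1024.0 * 1024.0);
}

// Hypothetical model, F16 cache (2 bytes per value):
//   KvCacheEstimate.Bytes(32, 8, 128, 32768, 2.0)   ->  4 GiB
//   KvCacheEstimate.Bytes(32, 8, 128, 131072, 2.0)  -> 16 GiB
```

Under these assumptions the cache grows from 4 GiB to 16 GiB when the context goes from 32K to 131K, which is the fourfold increase described above.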

NBatch

Logical maximum batch size — the largest number of tokens submitted in one llama_decode call. Affects prompt-processing throughput.

Typical values: 512 to 4096. A larger NBatch speeds up prompt processing but needs more temporary memory.

Built-in presets use NBatch between 2 048 and 4 096 depending on model and context size.

NUbatch

Physical maximum batch size: the largest chunk actually processed per kernel call. Normally NUbatch ≤ NBatch. Set NUbatch = NBatch for simplicity on most deployments; different values apply only to specific multi-sequence scenarios.
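
The interaction between the two limits can be illustrated with simple arithmetic. The sketch below assumes a simplified model in which the prompt is submitted in NBatch-sized decode calls and each call is split into NUbatch-sized physical chunks. BatchPlan and its methods are hypothetical helpers, not SDK types.

```csharp
using System;

public static class BatchPlan
{
    // Ceiling division for positive integers.
    private static int CeilDiv(int a, int b) => (a + b - 1) / b;

    // Number of decode calls needed to submit the whole prompt (limited by NBatch).
    public static int LogicalBatches(int promptTokens, int nBatch) => CeilDiv(promptTokens, nBatch);

    // Total number of physical chunks: each logical batch is split by NUbatch.
    public static int PhysicalChunks(int promptTokens, int nBatch, int nUbatch)
    {
        int chunks = 0;
        for (int remaining = promptTokens; remaining > 0; remaining -= nBatch)
        {
            int batch = Math.Min(remaining, nBatch);
            chunks += CeilDiv(batch, nUbatch);
        }
        return chunks;
    }
}

// A 5000-token prompt with NBatch = 2048:
//   LogicalBatches(5000, 2048)        -> 3  (2048 + 2048 + 904 tokens)
//   PhysicalChunks(5000, 2048, 512)   -> 10 (4 + 4 + 2 chunks)
//   PhysicalChunks(5000, 2048, 2048)  -> 3  (one chunk per decode call)
```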

NSeqMax

Maximum number of distinct sequences (for recurrent models) handled in parallel. For standard transformer models, leave at null or 1.

Threading

NThreads

CPU threads used during generation (token-by-token decode). When null, the engine falls back to EngineParameters.DefaultThreads.

Override when:

  • Running multiple concurrent inferences per process and you want to cap each.
  • Benchmarking to find the sweet spot for your model and CPU.

preset.ContextParameters.NThreads = 8;

NThreadsBatch

Threads used for batch (prompt) processing. Often different from NThreads because prompt processing parallelizes better. Typical production setting: NThreadsBatch = ProcessorCount and NThreads ≈ half of that.
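
One way to derive both values from the machine's core count is sketched below. ThreadPlan is a hypothetical helper, and the half-of-cores split is only the starting point suggested above; benchmark before relying on it.

```csharp
using System;

public static class ThreadPlan
{
    // Returns (generation threads, prompt-processing threads) for a given logical core count.
    public static (int NThreads, int NThreadsBatch) Suggest(int logicalCores)
    {
        if (logicalCores < 1) throw new ArgumentOutOfRangeException(nameof(logicalCores));
        int batchThreads = logicalCores;                    // prompt processing: all cores
        int decodeThreads = Math.Max(1, logicalCores / 2);  // generation: about half, never below 1
        return (decodeThreads, batchThreads);
    }
}

// Usage, assuming `preset` is an existing preset instance as in the other snippets:
//   var (gen, batch) = ThreadPlan.Suggest(Environment.ProcessorCount);
//   preset.ContextParameters.NThreads = gen;
//   preset.ContextParameters.NThreadsBatch = batch;
```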

RoPE scaling

RoPE (rotary position embedding) is how transformer models encode token positions. Scaling lets you use a larger context than the model was trained on, at some accuracy cost.

RopeScalingType

Scaling algorithm.

Value             Behavior
Unspecified (-1)  Use the model default (usually what you want).
None              Disable RoPE scaling.
Linear            Linear interpolation scaling.
Yarn              YaRN scaling; better quality at long contexts.
LongRope          LongRoPE scaling for very extended contexts.

RopeFreqBase and RopeFreqScale

Override the RoPE frequency base and scaling factor. null or 0 means “from model metadata”. Most users leave these alone; advanced users override when extending context beyond the trained window.
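
For linear scaling, the frequency scale is conventionally the reciprocal of the context extension factor. The sketch below assumes RopeFreqScale follows the llama.cpp rope_freq_scale convention (a value of 1/N expands the context by a factor of N); RopeScale is a hypothetical helper, not an SDK type.

```csharp
using System;

public static class RopeScale
{
    // Linear RoPE scaling: positions are compressed by trainedContext / targetContext.
    public static float LinearFreqScale(int trainedContext, int targetContext)
    {
        if (trainedContext <= 0) throw new ArgumentOutOfRangeException(nameof(trainedContext));
        if (targetContext <= 0) throw new ArgumentOutOfRangeException(nameof(targetContext));

        // No scaling is needed when the target fits inside the trained window.
        if (targetContext <= trainedContext) return 1.0f;

        return (float)trainedContext / targetContext;
    }
}

// Extending a model trained on 8192 tokens to 32768 tokens:
//   RopeScale.LinearFreqScale(8192, 32768) -> 0.25
// Usage, assuming `preset` exists as in the other snippets:
//   preset.ContextParameters.RopeScalingType = RopeScalingType.Linear;
//   preset.ContextParameters.RopeFreqScale = RopeScale.LinearFreqScale(8192, 32768);
```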

YaRN

YaRN is a specific RoPE-scaling recipe. All five fields activate only when RopeScalingType = Yarn.

Field           Role                              Default behavior
YarnExtFactor   Extrapolation mix factor          Negative or null = from model
YarnAttnFactor  Magnitude scaling factor          Null = from model
YarnBetaFast    Low correction dim                Null = from model
YarnBetaSlow    High correction dim               Null = from model
YarnOrigCtx     Original training context length  Null = from model

Typical usage: the preset author (or you, for a custom preset) picks Yarn and sets YarnOrigCtx to the model’s training context; other YaRN fields default from the model’s metadata.

Attention

AttentionType

Causal (autoregressive) versus non-causal (bidirectional) attention.

Value        Behavior
Unspecified  Model default.
Causal       Standard chat/inference attention.
NonCausal    Bidirectional; used for embedding models.

Leave Unspecified unless you know why you are changing it.

FlashAttentionMode

Flash attention is a fused-kernel optimization that reduces memory and speeds up long-context inference.

Value      Behavior
Auto (-1)  Runtime picks based on support. Recommended.
Disabled   Never use flash attention.
Enabled    Force flash attention (fails on backends that do not support it).

preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

Flash attention is especially effective on long contexts (32K+). For short contexts, the speed-up is small.

FlashAttention

Legacy boolean flag. Prefer FlashAttentionMode. Kept for backwards compatibility with earlier SDK code.

Pooling and embeddings

PoolingType

Only relevant when generating embeddings (not normal chat).

Value               Behavior
Unspecified / None  No pooling.
Mean                Average the token embeddings.
Cls                 Use the CLS token’s embedding.
Last                Use the last token’s embedding.
Rank                Rank-based pooling (experimental).
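
The three common strategies can be sketched over plain arrays. This is an illustration of the math only, not how the engine implements pooling: the Pooling class is hypothetical, and it assumes the CLS token is the first token of the sequence.

```csharp
using System;

public static class Pooling
{
    // Mean pooling: element-wise average over all token vectors.
    public static float[] Mean(float[][] tokens)
    {
        if (tokens.Length == 0)
            throw new ArgumentException("At least one token vector is required.", nameof(tokens));

        int dim = tokens[0].Length;
        var result = new float[dim];
        foreach (var token in tokens)
            for (int i = 0; i < dim; i++)
                result[i] += token[i];

        for (int i = 0; i < dim; i++)
            result[i] /= tokens.Length;
        return result;
    }

    // CLS pooling: the vector of the first token (assumed to be the CLS token).
    public static float[] Cls(float[][] tokens) => tokens[0];

    // Last pooling: the vector of the final token.
    public static float[] Last(float[][] tokens) => tokens[tokens.Length - 1];
}

// With token vectors {1, 2}, {3, 4}, {5, 6}:
//   Mean -> {3, 4}   Cls -> {1, 2}   Last -> {5, 6}
```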

Embeddings

When true, the engine extracts embeddings alongside logits. Combined with a suitable PoolingType, this turns the model into an embedding generator.

Embedding workflows are not covered by the chat API (SendMessageAsync). A dedicated use case will be added in a future release.

KV cache

The KV cache stores the keys and values of every token the model has seen. Its size grows with context, and its precision affects both accuracy and memory.

TypeK and TypeV

Data type for the K and V tensors in the cache. Lower precision saves memory but can degrade output quality.

Common values (full enum has 30+ entries; see the API reference for the complete list):

Type  Bits per value  Relative size  Notes
F32   32              1.0×           Full precision. Rarely worth the memory.
F16   16              0.5×           Default on many builds. Good balance.
BF16  16              0.5×           Alternative to F16; slightly different range.
Q8_0  8               0.25×          Very minor quality loss for substantial memory savings.
Q5_1  5               ~0.16×         More compact; quality drop noticeable on long contexts.
Q4_0  4               ~0.125×        Aggressive quantization; only for memory-tight deployments.

Rule of thumb: quantize V more aggressively than K — V is less sensitive to precision.

preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
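
The table above lists nominal bit widths. The ggml block formats also store per-block scale data, so the effective size is slightly higher. The sketch below computes effective bits per value from the block layouts as defined in ggml (32 values per block; 34 bytes for Q8_0, 24 for Q5_1, 18 for Q4_0). GgmlBlockMath is a hypothetical helper, not an SDK type.

```csharp
using System;

public static class GgmlBlockMath
{
    // Effective bits per value = bytes per block x 8 / values per block.
    public static double BitsPerValue(int blockBytes, int valuesPerBlock) =>
        blockBytes * 8.0 / valuesPerBlock;

    // Size relative to an F32 cache (32 bits per value).
    public static double RelativeToF32(double bitsPerValue) => bitsPerValue / 32.0;
}

// Block layouts with 32 values per block:
//   Q8_0: BitsPerValue(34, 32) -> 8.5 bits
//   Q5_1: BitsPerValue(24, 32) -> 6.0 bits
//   Q4_0: BitsPerValue(18, 32) -> 4.5 bits
```

When sizing a deployment close to the memory limit, budget with the effective figures rather than the nominal ones.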

OffloadKqv

When true, the KQV computation (attention) and the KV cache itself live on the GPU. When false, they stay on CPU even if layers are offloaded. Default varies by build; usually true on GPU-enabled builds.

Disable only when you are fighting for a sliver of VRAM and willing to trade throughput.

DefragThreshold

Threshold (fraction of holes) above which the engine defragments the KV cache. Negative = disabled (default).

Set a value between 0.1 and 0.5 for long-running services that churn sessions; this keeps KV memory compact over time.

SwaFull

Relevant for models using sliding-window attention (SWA). When true, stores the full SWA cache instead of the compressed form. This costs more memory but can be faster for certain workloads.

KvUnified

When true, uses a unified buffer across input sequences for attention. Implementation detail; leave at the default.

Other

OpOffload

Offload host tensor operations to the device. Supplementary to GpuLayers in ModelInferenceParameters. Leave null unless you have a specific reason to change it.

NoPerf

When true, the engine stops collecting performance timings. A micro-optimization for high-throughput production — shaves a small amount of overhead.

Typical recipes

Minimize memory on a tight GPU

Short context and aggressive KV quantization:

preset.ContextParameters.ContextSize = 4096;
preset.ContextParameters.TypeK = GgmlType.Q8_0;
preset.ContextParameters.TypeV = GgmlType.Q4_0;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

Extended context with YaRN scaling

preset.ContextParameters.ContextSize = 131072;
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 32768;   // the model's original training context
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

High-throughput production

Lean on flash attention and optimized KV:

preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
preset.ContextParameters.DefragThreshold = 0.3f;
preset.ContextParameters.NoPerf = true;
preset.ContextParameters.NThreads = Environment.ProcessorCount;
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount;

Benchmark-focused (reproducible)

preset.ContextParameters.ContextSize = 8192;
preset.ContextParameters.NBatch = 2048;
preset.ContextParameters.NUbatch = 2048;
preset.ContextParameters.NThreads = 8;
preset.ContextParameters.NThreadsBatch = 16;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.NoPerf = false; // collect timings

Embedding extraction

preset.ContextParameters.Embeddings = true;
preset.ContextParameters.PoolingType = PoolingType.Mean;
preset.ContextParameters.AttentionType = AttentionType.NonCausal;

Chat APIs are not the right surface for embedding workflows — this recipe is preparation for a dedicated embeddings API that will be covered in a future release.

What’s next