Documentation – Parameters

Net: Model source parameters

Thu, 23 Apr 2026 00:00:00 +0000

ModelSourceParameters tells the engine where to find the model file. The same class type is used in two places on every preset:

BaseModelSourceParameters — the main language model (text or vision).
MmprojSourceParameters — the vision projector (mmproj) for vision presets. Ignored for text-only presets.

Class reference

namespace Aspose.LLM.Abstractions.Parameters;

public class ModelSourceParameters
{
    public string? ModelFilePath { get; set; }          // priority 1 — local path
    public string? AsposeModelId { get; set; }          // priority 2 — Aspose catalog
    public string? HuggingFaceRepoId { get; set; }      // priority 3 — HF repo ID
    public string? HuggingFaceFileName { get; set; }    //             HF file within the repo
}

All four properties are nullable. Leave a property null to defer to the next priority source.

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions.

Resolution order

At AsposeLLMApi.Create, the engine walks the sources in priority order and uses the first one that resolves.

Priority	Field(s)	Behavior
1	`ModelFilePath`	If set to a non-null non-empty path, load directly from disk. No download.
2	`AsposeModelId`	If set, download the named model from Aspose’s internal catalog into the cache.
3	`HuggingFaceRepoId` + `HuggingFaceFileName`	If both set, download the named file from the Hugging Face repo into the cache.

The engine throws an exception if none of the three resolve.
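
The priority walk can be modelled in a few lines. This is a minimal sketch of the rule in the table above, not the engine's actual code: ResolveSource is a hypothetical helper, the returned strings are illustrative labels, and treating an empty AsposeModelId or Hugging Face field the same as null is an assumption.

```csharp
#nullable enable
using System;

// Hypothetical helper: mirrors the documented priority order only.
static string ResolveSource(string? modelFilePath, string? asposeModelId,
                            string? hfRepoId, string? hfFileName)
{
    if (!string.IsNullOrEmpty(modelFilePath))
        return "local:" + modelFilePath;                      // priority 1: no download
    if (!string.IsNullOrEmpty(asposeModelId))
        return "aspose:" + asposeModelId;                     // priority 2: Aspose catalog
    if (!string.IsNullOrEmpty(hfRepoId) && !string.IsNullOrEmpty(hfFileName))
        return "huggingface:" + hfRepoId + "/" + hfFileName;  // priority 3: needs both fields
    throw new InvalidOperationException("No model source resolves.");
}

// A local path wins even when Hugging Face values are also present.
Console.WriteLine(ResolveSource(@"C:\models\m.gguf", null, "org/repo", "m.gguf"));
// With no path and no catalog ID, the Hugging Face pair is used.
Console.WriteLine(ResolveSource(null, null, "org/repo", "m.gguf"));
```

Note that a repo ID without a file name does not resolve in this model, which matches the "if both set" wording for priority 3.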

Where models are cached

Downloaded models live in the folder set by EngineParameters.ModelCachePath. The default on Windows is %LOCALAPPDATA%\Aspose.LLM\models; on Linux and macOS it is the equivalent LocalApplicationData folder. To change the cache location for a preset, set EngineParameters.ModelCachePath.

When a model has already been downloaded, subsequent runs load from the cache. Delete the cache to force a re-download.

Typical recipes

Use a built-in preset’s model source

Built-in presets set HuggingFaceRepoId and HuggingFaceFileName on BaseModelSourceParameters in their constructor — no changes needed:

var preset = new Qwen25Preset();
// BaseModelSourceParameters.HuggingFaceRepoId = "bartowski/Qwen2.5-7B-Instruct-GGUF"
// BaseModelSourceParameters.HuggingFaceFileName = "Qwen2.5-7B-Instruct-Q4_K_M.gguf"

using var api = AsposeLLMApi.Create(preset);

See Supported presets for the exact values per preset.

Load a model from a local file

Override ModelFilePath to skip the download entirely. Useful for air-gapped environments or when you already have the GGUF on disk.

var preset = new Qwen25Preset();
preset.BaseModelSourceParameters.ModelFilePath = @"C:\models\Qwen2.5-7B-Instruct-Q4_K_M.gguf";
// ModelFilePath wins over the preset's Hugging Face values.

using var api = AsposeLLMApi.Create(preset);

Switch to a different Hugging Face quantization

Keep the preset but change the file name to a different quantization available in the same repo:

var preset = new Qwen25Preset();
preset.BaseModelSourceParameters.HuggingFaceFileName = "Qwen2.5-7B-Instruct-Q8_0.gguf";

using var api = AsposeLLMApi.Create(preset);

Q8_0 is larger than Q4_K_M (about 2× the file size) but retains more of the original model’s quality.

Bring a custom model from Hugging Face

For a model that does not have a built-in preset, create a custom preset and set BaseModelSourceParameters from scratch:

public class MyCustomPreset : PresetCoreBase
{
    public MyCustomPreset()
    {
        BaseModelSourceParameters.HuggingFaceRepoId = "your-org/your-model-GGUF";
        BaseModelSourceParameters.HuggingFaceFileName = "your-model.Q4_K_M.gguf";
    }
}

See Custom preset for the full pattern.

Configure a vision projector

Vision presets also set MmprojSourceParameters — the projector follows the same resolution rules as the base model:

var preset = new Qwen3VL2BPreset();
// BaseModelSourceParameters.HuggingFaceRepoId     = "Qwen/Qwen3-VL-2B-Instruct-GGUF"
// BaseModelSourceParameters.HuggingFaceFileName   = "Qwen3VL-2B-Instruct-Q4_K_M.gguf"
// MmprojSourceParameters.HuggingFaceRepoId        = "Qwen/Qwen3-VL-2B-Instruct-GGUF"
// MmprojSourceParameters.HuggingFaceFileName      = "mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf"

// Override the projector to a local file:
preset.MmprojSourceParameters.ModelFilePath = @"C:\models\custom-mmproj.gguf";
preset.MmprojSourceParameters.HuggingFaceRepoId = null;
preset.MmprojSourceParameters.HuggingFaceFileName = null;

using var api = AsposeLLMApi.Create(preset);

Leaving the Hugging Face fields on an override is harmless — ModelFilePath wins — but clearing them keeps the intent obvious.

What’s next

Engine parameters — change the cache location.
Binary manager parameters — pre-populate native binaries for offline scenarios.
Custom preset — full preset-customization patterns.

Net: Model inference parameters

Thu, 23 Apr 2026 00:00:00 +0000

ModelInferenceParameters controls how the engine loads a model into memory: how many layers to offload to GPU, how to split across multiple GPUs, whether to use memory mapping, and how to override GGUF metadata at runtime.

Most fields are nullable — a null value means “use the native default”. Set an explicit value only when you need to override.

Class reference

namespace Aspose.LLM.Abstractions.Parameters;

public class ModelInferenceParameters
{
    public int? GpuLayers { get; set; }
    public bool? UseMemoryMapping { get; set; }
    public bool? UseMemoryLocking { get; set; }
    public int? MainGpu { get; set; }
    public LlamaSplitMode? SplitMode { get; set; }
    public bool? VocabOnly { get; set; }
    public bool? CheckTensors { get; set; }
    public bool? UseExtraBuffers { get; set; }
    public float[]? TensorSplit { get; set; }
    public ModelKeyValueOverride[]? KvOverrides { get; set; }
}

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions. The rest of this page is an inline overview of the same content; follow the links for the deeper treatment.

Load knobs: GpuLayers, UseMemoryMapping, UseMemoryLocking, MainGpu, SplitMode.

Other knobs: VocabOnly, CheckTensors, UseExtraBuffers, TensorSplit, KvOverrides.

Fields

Field	Type	Default	Purpose
`GpuLayers`	`int?`	native default	Number of model layers to offload to GPU. `0` = CPU only, `999` = full offload.
`UseMemoryMapping`	`bool?`	native default (`true`)	Map the GGUF file instead of reading it in. Reduces startup time and memory copying.
`UseMemoryLocking`	`bool?`	native default (`false`)	Lock model memory to prevent OS paging. Needs `mlock` / `VirtualLock` privileges.
`MainGpu`	`int?`	`0`	Index of the GPU used when `SplitMode` is `None`.
`SplitMode`	`LlamaSplitMode?`	native default	How to split the model across multiple GPUs.
`VocabOnly`	`bool?`	native default (`false`)	Load only the vocabulary without weights. Used for tokenizer-only scenarios.
`CheckTensors`	`bool?`	native default (`false`)	Validate tensor data on load. Adds startup time; helpful for diagnosing corrupted GGUF files.
`UseExtraBuffers`	`bool?`	native default	Use extra buffer types for weight repacking. Advanced.
`TensorSplit`	`float[]?`	equal split	Proportions per GPU when `SplitMode` splits across devices.
`KvOverrides`	`ModelKeyValueOverride[]?`	none	Runtime overrides for GGUF metadata keys.

`GpuLayers`

Controls GPU offload. Each transformer layer lives either in system RAM (CPU inference) or GPU VRAM (GPU inference). Partial offload is supported: you can put the first N layers on the GPU and keep the rest on the CPU.

Value	Behavior
`0`	CPU only. No GPU memory allocated for model weights.
A layer count	Offload the first N layers to GPU, keep the remaining on CPU.
`999` (or any value ≥ the model’s layer count)	Full GPU offload. Idiomatic way to request “put everything on the GPU”.
`null`	Use the native default.

Pair with a GPU-capable BinaryManagerParameters.PreferredAcceleration — setting GpuLayers = 999 on a CPU-only binary silently keeps the model on the CPU.

preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;

`UseMemoryMapping`

When true (default on most platforms), the engine memory-maps the GGUF file so the OS streams it in on demand. This reduces startup time and avoids copying the full model into RAM before inference.

Set to false only for rare scenarios: network file systems that do not support mmap, or environments where you want the file fully loaded before the first token. Disabling mmap can raise peak memory during load, because the weights are read into allocated buffers instead of being paged in from the file on demand.

`UseMemoryLocking`

When true, the engine calls mlock (Linux/macOS) or VirtualLock (Windows) on the model’s memory so the OS does not page it out. Requires elevated privileges or raised ulimits.

Leave null or false in most deployments. Enable only when the model is swapped out under memory pressure and you have the system tuning to support locking.

`MainGpu` and `SplitMode`

On multi-GPU hosts, SplitMode decides how to place the model and MainGpu selects the primary device when the mode is not splitting.

`SplitMode`	Behavior
`LLAMA_SPLIT_MODE_NONE`	Single GPU. Whole model on device `MainGpu`.
`LLAMA_SPLIT_MODE_LAYER`	Split layers across GPUs. KV cache follows layers.
`LLAMA_SPLIT_MODE_ROW`	Split layers and rows across GPUs. Uses tensor parallelism where supported.

using Aspose.LLM.Abstractions.Parameters;

// Single GPU, use device 1:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_NONE;
preset.BaseModelInferenceParameters.MainGpu = 1;

// Split across all GPUs, layer mode:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = null; // equal distribution

`TensorSplit`

Proportion of the model placed on each GPU. The array length should match the number of available GPUs.

null — equal distribution.
Explicit array — values are normalized to sum to 1. For example, [2, 1] on two GPUs places 67 % on GPU 0 and 33 % on GPU 1.

Useful when GPUs have different memory sizes (for example, a 24 GB card paired with a 12 GB card):

preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
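
The normalization described above is plain arithmetic. The sketch below reproduces the [2, 1] example; NormalizeSplit is a hypothetical helper written for illustration and is not part of the SDK.

```csharp
using System;
using System.Linq;

// Hypothetical helper: scales the weights so they sum to 1.
static float[] NormalizeSplit(float[] weights)
{
    float total = weights.Sum();
    if (total <= 0f)
        throw new ArgumentException("Weights must sum to a positive value.");
    return weights.Select(w => w / total).ToArray();
}

float[] shares = NormalizeSplit(new float[] { 2.0f, 1.0f });
for (int gpu = 0; gpu < shares.Length; gpu++)
    Console.WriteLine($"GPU {gpu}: {shares[gpu] * 100:F0}%");
// GPU 0: 67%
// GPU 1: 33%
```

Because only the ratios matter, { 2.0f, 1.0f } and { 24f, 12f } describe the same split.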

`VocabOnly`

When true, the engine loads only the model’s vocabulary, skipping the weights. The model is not usable for generation in this state — it is a tokenizer-only configuration for rare tooling scenarios.

Leave null (or false) for any normal inference use.

`CheckTensors`

When true, the engine validates every tensor’s data during load. Adds significant startup time but catches corrupted GGUF files early. Useful when testing a new download or a model from an untrusted source; leave null in production.

`UseExtraBuffers`

Enables additional buffer types used by weight repacking paths in llama.cpp. Advanced; most users should leave it null.

`KvOverrides`

Overrides specific keys in the GGUF metadata at load time. Each override targets a single metadata key and provides the new typed value.

preset.BaseModelInferenceParameters.KvOverrides = new[]
{
    new ModelKeyValueOverride
    {
        Key = "llama.rope.scaling.type",
        Type = ModelKvOverrideType.String,
        StringValue = "yarn",
    },
    new ModelKeyValueOverride
    {
        Key = "llama.context_length",
        Type = ModelKvOverrideType.Int,
        IntValue = 131072,
    },
};

The ModelKeyValueOverride class carries one of four typed values depending on Type:

`Type`	Value field	Example keys
`Int`	`IntValue`	`llama.context_length`, `llama.embedding_length`
`Float`	`FloatValue`	`llama.rope.freq_base`, `llama.rope.scaling.factor`
`Bool`	`BoolValue`	Model-specific boolean flags
`String`	`StringValue`	`general.architecture`, `llama.rope.scaling.type`

Use overrides with care — wrong metadata makes the model load incorrectly or silently produce garbage.

Typical recipes

CPU-only inference

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 0;

Full GPU offload

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;

Partial offload on a memory-constrained GPU

A 12 GB GPU may not fit a full 8B Q4_K_M model plus the KV cache for a long context. In that case, offload the first 28 layers and keep the rest on the CPU:

var preset = new Qwen3Preset(); // 8B, 32 layers
preset.BaseModelInferenceParameters.GpuLayers = 28;
preset.BaseModelInferenceParameters.UseMemoryMapping = true;

Benchmark to find the right split for your hardware. A common starting point is to offload layers until VRAM is about 1-2 GB short of full.
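
A rough first guess for GpuLayers can be computed before benchmarking. The sketch below assumes equal-sized layers and uses made-up sizes (5 GB of weights, 6 GB of KV cache, 1.5 GB kept free); LayersThatFit is a hypothetical helper, and real numbers depend on the model, context size, and backend.

```csharp
using System;

// Hypothetical estimate: how many layers fit once the KV cache and a safety reserve are set aside.
static int LayersThatFit(double vramGb, double reserveGb, double kvCacheGb,
                         double modelGb, int layerCount)
{
    double perLayerGb = modelGb / layerCount;          // assumes equal-sized layers
    double budgetGb = vramGb - reserveGb - kvCacheGb;  // VRAM left for weights
    if (budgetGb <= 0) return 0;
    return Math.Min(layerCount, (int)Math.Floor(budgetGb / perLayerGb));
}

// Illustrative numbers only: 12 GB card, 1.5 GB reserve, 6 GB KV cache, 5 GB model, 32 layers.
int layers = LayersThatFit(vramGb: 12, reserveGb: 1.5, kvCacheGb: 6, modelGb: 5, layerCount: 32);
Console.WriteLine(layers); // 28
```

Treat the result as a starting value for GpuLayers and adjust it after measuring actual VRAM use.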

Two unequal GPUs

A 24 GB GPU paired with a 12 GB GPU:

preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
preset.BaseModelInferenceParameters.GpuLayers = 999;

Validate a suspect GGUF

preset.BaseModelInferenceParameters.CheckTensors = true;
// After a clean load, switch back to the default for production.

What’s next

Binary manager parameters — pair with PreferredAcceleration to select the right native binary.
Context parameters — KV cache configuration and batch sizes that interact with GPU offload.
System requirements — GPU backends and their driver / runtime requirements.

Net: Context parameters

Thu, 23 Apr 2026 00:00:00 +0000

ContextParameters mirrors llama_context_params in llama.cpp. It controls the shape of the runtime context — how many tokens the model can attend to, how batching is sized, how threads are split, how RoPE scaling stretches the context window, how flash attention is used, and how the KV cache is stored.

This is the largest parameter bag. Most fields are nullable; null means “use the native default” or “derive from the model”. Touch these values only when you know what you are changing.

Class reference

namespace Aspose.LLM.Abstractions.Models;

public partial class ContextParameters
{
    // Context size and batching
    public int? ContextSize { get; set; }
    public uint? NBatch { get; set; }
    public uint? NUbatch { get; set; }
    public uint? NSeqMax { get; set; }

    // Threading
    public int? NThreads { get; set; }
    public int? NThreadsBatch { get; set; }

    // RoPE scaling
    public RopeScalingType? RopeScalingType { get; set; }
    public float? RopeFreqBase { get; set; }
    public float? RopeFreqScale { get; set; }

    // YaRN
    public float? YarnExtFactor { get; set; }
    public float? YarnAttnFactor { get; set; }
    public float? YarnBetaFast { get; set; }
    public float? YarnBetaSlow { get; set; }
    public uint? YarnOrigCtx { get; set; }

    // Attention
    public AttentionType? AttentionType { get; set; }
    public FlashAttentionType? FlashAttentionMode { get; set; }
    public bool? FlashAttention { get; set; } // legacy

    // Pooling and embeddings
    public PoolingType? PoolingType { get; set; }
    public bool? Embeddings { get; set; }

    // KV cache
    public GgmlType? TypeK { get; set; }
    public GgmlType? TypeV { get; set; }
    public bool? OffloadKqv { get; set; }
    public float? DefragThreshold { get; set; }
    public bool? SwaFull { get; set; }
    public bool? KvUnified { get; set; }

    // Other
    public bool? OpOffload { get; set; }
    public bool? NoPerf { get; set; }
}

Detailed field reference

Context size and batching: ContextSize, NBatch, NUbatch, NSeqMax.

Threading: NThreads, NThreadsBatch.

RoPE and YaRN: RopeScalingType, RopeFreqBase, RopeFreqScale, YarnExtFactor, YarnAttnFactor, YarnBetaFast, YarnBetaSlow, YarnOrigCtx.

Attention: AttentionType, FlashAttentionMode, FlashAttention (legacy).

Pooling and embeddings: PoolingType, Embeddings.

KV cache: TypeK, TypeV, OffloadKqv, DefragThreshold, SwaFull, KvUnified.

Other: OpOffload, NoPerf.

Context size and batching

`ContextSize`

Length of the context window in tokens — the maximum number of tokens the model sees at once. Set to null (or 0) to use the model’s maximum from its GGUF metadata.

Built-in presets pre-set this: Qwen25Preset uses 32 768, Llama32Preset uses 131 072, Oss20Preset uses 131 072. See Supported presets for each default.

Trade-off: larger context allows longer conversations and documents, but the KV cache size scales with ContextSize × model-depth. Going from 32K to 131K quadruples KV memory.

preset.ContextParameters.ContextSize = 8192; // save memory for short conversations
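
The memory trade-off can be estimated with the usual KV-cache formula: two tensors (K and V) per layer, each holding one vector per KV head per token. The sketch below is an estimate under stated assumptions, not a measurement: KvCacheGiB is a hypothetical helper, and the model shape (28 layers, 4 KV heads, head dimension 128) and the F16 cache type are illustrative values.

```csharp
using System;

// Hypothetical helper: 2 (K and V) x layers x tokens x KV heads x head dimension x bytes per value.
static double KvCacheGiB(int contextSize, int layers, int kvHeads, int headDim, double bytesPerValue)
{
    double bytes = 2.0 * layers * contextSize * kvHeads * headDim * bytesPerValue;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

double at32K = KvCacheGiB(32768, 28, 4, 128, 2.0);   // = 1.75 GiB
double at131K = KvCacheGiB(131072, 28, 4, 128, 2.0); // = 7 GiB
Console.WriteLine($"32K: {at32K} GiB, 131K: {at131K} GiB, ratio: {at131K / at32K}");
```

The ratio is exactly 4 because the estimate is linear in ContextSize; the cache data types (TypeK, TypeV) scale it further.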

`NBatch`

Logical maximum batch size — the largest number of tokens submitted in one llama_decode call. Affects prompt-processing throughput.

Typical values: 512 - 4096. Larger NBatch speeds up prompt processing but needs more temporary memory.

Built-in presets use NBatch between 2 048 and 4 096 depending on model and context size.

`NUbatch`

Physical maximum batch size: the largest chunk actually processed per kernel call. NUbatch should not exceed NBatch. Setting NUbatch equal to NBatch is the simplest choice for most deployments; different values apply only to specific multi-sequence scenarios.

`NSeqMax`

Maximum number of sequences handled in parallel (for recurrent models, the number of distinct states). For standard transformer models, leave at null or 1.

Threading

`NThreads`

CPU threads used during generation (token-by-token decode). When null, the engine falls back to EngineParameters.DefaultThreads.

Override when:

Running multiple concurrent inferences per process and you want to cap each.
Benchmarking to find the sweet spot for your model and CPU.

preset.ContextParameters.NThreads = 8;

`NThreadsBatch`

Threads used for batch (prompt) processing. Often different from NThreads because prompt processing parallelizes better. Typical production setting: NThreadsBatch ≈ Environment.ProcessorCount and NThreads ≈ half of that.
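
The rule of thumb above can be derived from the host at startup. This is a small sketch using only Environment.ProcessorCount; the variable names are illustrative, and the halving is the heuristic from this page rather than a measured optimum.

```csharp
using System;

int logicalCores = Environment.ProcessorCount;
int batchThreads = logicalCores;                    // prompt processing parallelizes well
int decodeThreads = Math.Max(1, logicalCores / 2);  // token-by-token decode, never below 1

Console.WriteLine($"NThreadsBatch = {batchThreads}, NThreads = {decodeThreads}");
```

Assign the two values to ContextParameters.NThreadsBatch and ContextParameters.NThreads, then benchmark, since the best split depends on the CPU and the model.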

RoPE scaling

RoPE (rotary position embedding) is how transformer models encode token positions. Scaling lets you use a larger context than the model was trained on, at some accuracy cost.

`RopeScalingType`

Scaling algorithm.

Value	Behavior
`Unspecified` (`-1`)	Use the model default (usually what you want).
`None`	Disable RoPE scaling.
`Linear`	Linear interpolation scaling.
`Yarn`	YaRN scaling — better quality at long contexts.
`LongRope`	LongRoPE scaling for very extended contexts.

`RopeFreqBase` and `RopeFreqScale`

Override the RoPE frequency base and scaling factor. null or 0 means “from model metadata”. Most users leave these alone; advanced users override when extending context beyond the trained window.

YaRN

YaRN is a specific RoPE-scaling recipe. All five fields activate only when RopeScalingType = Yarn.

Field	Role	Default behavior
`YarnExtFactor`	Extrapolation mix factor	Negative or null = from model
`YarnAttnFactor`	Magnitude scaling factor	Null = from model
`YarnBetaFast`	Low correction dim	Null = from model
`YarnBetaSlow`	High correction dim	Null = from model
`YarnOrigCtx`	Original training context length	Null = from model

Typical usage: the preset author (or you, for a custom preset) picks Yarn and sets YarnOrigCtx to the model’s training context; other YaRN fields default from the model’s metadata.

Attention

`AttentionType`

Causal (autoregressive) versus non-causal (bidirectional) attention.

Value	Behavior
`Unspecified`	Model default.
`Causal`	Standard chat/inference attention.
`NonCausal`	Bidirectional; used for embedding models.

Leave Unspecified unless you know why you are changing it.

`FlashAttentionMode`

Flash attention is a fused-kernel optimization that reduces memory and speeds up long-context inference.

Value	Behavior
`Auto` (`-1`)	Runtime picks based on support. Recommended.
`Disabled`	Never use flash attention.
`Enabled`	Force flash attention (fails on backends that do not support it).

preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

Flash attention is especially effective on long contexts (32K+). For short contexts, the speed-up is small.

`FlashAttention`

Legacy boolean flag. Prefer FlashAttentionMode. Kept for backwards compatibility with earlier SDK code.

Pooling and embeddings

`PoolingType`

Only relevant when generating embeddings (not normal chat).

Value	Behavior
`Unspecified` / `None`	No pooling.
`Mean`	Average the token embeddings.
`Cls`	Use the CLS token’s embedding.
`Last`	Use the last token’s embedding.
`Rank`	Rank-based pooling (experimental).

`Embeddings`

When true, the engine extracts embeddings alongside logits. Combined with a suitable PoolingType, this turns the model into an embedding generator.

Embedding workflows are not covered by the chat API (SendMessageAsync). A dedicated use case will be added in a future release.

KV cache

The KV cache stores the keys and values of every token the model has seen. Its size grows with context, and its precision affects both accuracy and memory.

`TypeK` and `TypeV`

Data type for the K and V tensors in the cache. Lower precision saves memory but can degrade output quality.

Common values (full enum has 30+ entries; see the API reference for the complete list):

Type	Nominal bits per value	Relative size	Notes
`F32`	32	1.0×	Full precision. Rarely worth the memory.
`F16`	16	0.5×	Default on many builds. Good balance.
`BF16`	16	0.5×	Alternative to F16; slightly different range.
`Q8_0`	8	~0.27×	Very minor quality loss for substantial memory savings.
`Q5_1`	5	~0.19×	More compact; quality drop noticeable on long contexts.
`Q4_0`	4	~0.14×	Aggressive quantization; only for memory-tight deployments.

Relative sizes for the quantized types include the per-block scale data that ggml stores alongside the values.

Rule of thumb: quantize V more aggressively than K — V is less sensitive to precision.

preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
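
Quantized ggml types store values in blocks of 32 together with scale data, so the effective cost per value is slightly above the nominal bit width. The sketch below works that out from the block sizes in bytes; EffectiveBits is a hypothetical helper, and the block layouts are stated from the ggml format as an assumption about what the engine uses underneath.

```csharp
using System;

// Hypothetical helper: bits actually stored per value for a block-quantized type.
static double EffectiveBits(int blockBytes, int valuesPerBlock = 32)
    => blockBytes * 8.0 / valuesPerBlock;

double q8_0 = EffectiveBits(34); // 2-byte scale + 32 one-byte values
double q5_1 = EffectiveBits(24); // 2-byte scale + 2-byte minimum + 4 bytes of high bits + 16 bytes of low bits
double q4_0 = EffectiveBits(18); // 2-byte scale + 16 bytes of packed 4-bit values

Console.WriteLine($"Q8_0: {q8_0} bits per value, {q8_0 / 32:F3} of F32");
Console.WriteLine($"Q5_1: {q5_1} bits per value, {q5_1 / 32:F3} of F32");
Console.WriteLine($"Q4_0: {q4_0} bits per value, {q4_0 / 32:F3} of F32");
```

Multiplying a KV-cache estimate by these ratios gives a closer memory figure than the nominal bit widths alone.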

`OffloadKqv`

When true, the KQV computation (attention) and the KV cache itself live on the GPU. When false, they stay on CPU even if layers are offloaded. Default varies by build; usually true on GPU-enabled builds.

Disable only when you are fighting for a sliver of VRAM and willing to trade throughput.

`DefragThreshold`

Threshold (fraction of holes) above which the engine defragments the KV cache. Negative = disabled (default).

Set to 0.1 - 0.5 for long-running services that churn sessions — keeps KV memory compact over time.

`SwaFull`

Relevant for models using sliding-window attention (SWA). When true, stores the full SWA cache instead of the compressed form. Costs more memory, can be faster for certain workloads.

`KvUnified`

When true, uses a unified buffer across input sequences for attention. Implementation detail; leave at the default.

Other

`OpOffload`

Offload host tensor operations to the device. Supplementary to GpuLayers in ModelInferenceParameters. Leave null unless you know why.

`NoPerf`

When true, the engine stops collecting performance timings. A micro-optimization for high-throughput production — shaves a small amount of overhead.

Typical recipes

Minimize memory on a tight GPU

Short context and aggressive KV quantization:

preset.ContextParameters.ContextSize = 4096;
preset.ContextParameters.TypeK = GgmlType.Q8_0;
preset.ContextParameters.TypeV = GgmlType.Q4_0;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

Extended context with YaRN scaling

preset.ContextParameters.ContextSize = 131072;
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 32768;   // the model's original training context
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
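
The recipe implies a fourfold context extension. The arithmetic is sketched below; llama.cpp expresses the RoPE frequency scale as the reciprocal of the extension factor. Whether the engine derives that scale from model metadata or expects RopeFreqScale to be set is not stated on this page, so treat this as arithmetic only, with illustrative variable names.

```csharp
using System;

int targetContext = 131072;   // ContextSize from the recipe
int originalContext = 32768;  // YarnOrigCtx from the recipe

double extensionFactor = (double)targetContext / originalContext; // = 4
double frequencyScale = 1.0 / extensionFactor;                    // = 0.25

Console.WriteLine($"extension factor = {extensionFactor}, frequency scale = {frequencyScale}");
```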

High-throughput production

Lean on flash attention and optimized KV:

preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
preset.ContextParameters.DefragThreshold = 0.3f;
preset.ContextParameters.NoPerf = true;
preset.ContextParameters.NThreads = Environment.ProcessorCount;
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount;

Benchmark-focused (reproducible)

preset.ContextParameters.ContextSize = 8192;
preset.ContextParameters.NBatch = 2048;
preset.ContextParameters.NUbatch = 2048;
preset.ContextParameters.NThreads = 8;
preset.ContextParameters.NThreadsBatch = 16;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.NoPerf = false; // collect timings

Embedding extraction

preset.ContextParameters.Embeddings = true;
preset.ContextParameters.PoolingType = PoolingType.Mean;
preset.ContextParameters.AttentionType = AttentionType.NonCausal;

Chat APIs are not the right surface for embedding workflows — this recipe is preparation for a dedicated embeddings API that will be covered in a future release.

What’s next

Model inference parameters — GPU layers and tensor split that complement context KV settings.
Chat parameters — per-session max tokens and cache cleanup strategy.
Sampler parameters — how the engine picks tokens within the context.

Net: Chat parameters

Thu, 23 Apr 2026 00:00:00 +0000

ChatParameters holds the settings that apply to every chat session created from a preset: the system prompt, the initial conversation history, the per-response token budget, and how the KV cache is pruned when it overflows.

Class reference

namespace Aspose.LLM.Abstractions.Models;

public class ChatParameters
{
    public string SystemPrompt { get; set; } = "";
    public List<ChatMessage>? History { get; set; }
    public int MaxTokens { get; set; } = 2048;
    public CacheCleanupStrategy CacheCleanupStrategy { get; set; }
        = CacheCleanupStrategy.RemoveOldestMessages;
}

public enum CacheCleanupStrategy
{
    KeepSystemPromptOnly,
    KeepSystemPromptAndHalf,
    KeepSystemPromptAndFirstUserMessage,
    KeepSystemPromptAndLastUserMessage,
    RemoveOldestMessages,
}

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions.

Fields

Field	Type	Default	Purpose
`SystemPrompt`	`string`	`""`	Default system prompt applied to every new session created from this preset.
`History`	`List<ChatMessage>?`	`null`	Initial chat history injected into newly created sessions.
`MaxTokens`	`int`	`2048`	Hard cap on tokens generated per single assistant response.
`CacheCleanupStrategy`	enum	`RemoveOldestMessages`	Policy for trimming the KV cache when context fills up.

`SystemPrompt`

The default system prompt for sessions created from this preset. Applied at session-creation time — both when you call StartNewChatAsync and when SendMessageAsync creates the current session implicitly.

Empty string is a valid choice for presets whose chat template does not want a system turn (some Gemma variants, for example). To disable the system turn entirely, keep this empty:

preset.ChatParameters.SystemPrompt = "";

For consistent behavior across sessions, set it in your preset subclass or before calling AsposeLLMApi.Create:

preset.ChatParameters.SystemPrompt =
    "You are a precise technical assistant. Cite sources and say when you do not know.";

`History`

Optional pre-filled conversation history. When set, new sessions start with these turns already in the KV cache — useful for few-shot priming or for restoring a context that you assembled in your application.

Leave null for a blank session:

preset.ChatParameters.History = null;

Or inject turns:

using Aspose.LLM.Abstractions.Models;

preset.ChatParameters.History = new List<ChatMessage>
{
    ChatMessage.CreateUserMessage("What is the capital of France?"),
    ChatMessage.CreateAssistantMessage("Paris."),
};

The history is applied to every new session created from this preset. If you only want to prime one session, construct it and send the priming turns manually instead of setting History.

`MaxTokens`

Upper bound on tokens the engine generates for a single assistant response. The default 2048 fits most general-purpose models and prompts.

Reasoning-model budget. Qwen3, DeepSeek-R1, and other chain-of-thought models emit a hidden <think>…</think> block before the actual answer. That block alone routinely uses 300-500 tokens for even trivial questions. Set MaxTokens to at least 512 — ideally 1024-2048 — when using such models, or the response is truncated mid-reasoning and you get no visible answer. The /no_think directive (Qwen3) is unreliable across versions; raise the token budget instead.

Pick based on the task:

Scenario	Recommended `MaxTokens`
Short answers, classifications	128 - 256
Conversational replies	512 - 1024
Essays, long summaries	2048 - 4096
Reasoning-model output (Qwen3, DeepSeek-R1)	1024 - 4096
Code generation	1024 - 4096

Raising MaxTokens does not cost anything upfront — it is a cap, not an allocation. The engine generates as many tokens as needed up to this limit.

`CacheCleanupStrategy`

Policy the engine applies when the current session’s KV cache would overflow ContextParameters.ContextSize. The engine trims automatically as generation proceeds; you can also trigger explicit trimming with AsposeLLMApi.ForceCacheCleanup(strategy).

Strategy	Keeps	When to use
`RemoveOldestMessages` (default)	System prompt + the most recent turns	General-purpose. Preserves recency, drops the earliest history.
`KeepSystemPromptOnly`	System prompt only	Hard reset of session context. Useful between topics in a long-running session.
`KeepSystemPromptAndHalf`	System prompt + the second half of history	Balanced recall and room for new turns.
`KeepSystemPromptAndFirstUserMessage`	System prompt + the first user turn	Recall-heavy tasks where the original ask matters (long analyses, debugging sessions).
`KeepSystemPromptAndLastUserMessage`	System prompt + the most recent user turn	Focus on the current question, drop the middle.

See Chat sessions — Manage the KV cache for the call-site details and runtime semantics.
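
To make the table concrete, the strategies can be modelled as operations on a message list. This is a simplified sketch, not the engine's implementation: the real trimming works on the KV cache, Trim is a hypothetical helper, messages are plain (Role, Text) tuples, and the keepRecent count used for RemoveOldestMessages is an assumption standing in for "as many recent turns as fit".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of the strategies in the table above.
static List<(string Role, string Text)> Trim(
    List<(string Role, string Text)> messages, string strategy, int keepRecent = 2)
{
    var system = messages.Where(m => m.Role == "system").ToList();
    var history = messages.Where(m => m.Role != "system").ToList();

    IEnumerable<(string Role, string Text)> kept = strategy switch
    {
        "RemoveOldestMessages" => history.TakeLast(keepRecent),
        "KeepSystemPromptOnly" => Enumerable.Empty<(string Role, string Text)>(),
        "KeepSystemPromptAndHalf" => history.Skip(history.Count / 2),
        "KeepSystemPromptAndFirstUserMessage" => history.Where(m => m.Role == "user").Take(1),
        "KeepSystemPromptAndLastUserMessage" => history.Where(m => m.Role == "user").TakeLast(1),
        _ => throw new ArgumentException("Unknown strategy: " + strategy),
    };
    return system.Concat(kept).ToList();
}

var chat = new List<(string Role, string Text)>
{
    ("system", "Be brief."),
    ("user", "Q1"), ("assistant", "A1"),
    ("user", "Q2"), ("assistant", "A2"),
};

foreach (var message in Trim(chat, "KeepSystemPromptAndHalf"))
    Console.WriteLine($"{message.Role}: {message.Text}");
// system: Be brief.
// user: Q2
// assistant: A2
```

Every strategy keeps the system prompt; they differ only in which part of the remaining history survives.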

Typical recipes

Concise assistant with moderate output

var preset = new Qwen25Preset();
preset.ChatParameters.SystemPrompt =
    "You are a concise assistant. Each answer fits in two short sentences.";
preset.ChatParameters.MaxTokens = 256;

Long-form technical writer

var preset = new Qwen25Preset();
preset.ChatParameters.SystemPrompt =
    "You are a technical writer. Produce detailed, well-structured paragraphs.";
preset.ChatParameters.MaxTokens = 4096;
preset.ChatParameters.CacheCleanupStrategy = CacheCleanupStrategy.KeepSystemPromptAndFirstUserMessage;

Reasoning model with enough budget

var preset = new DeepseekR1Qwen3Preset();
preset.ChatParameters.SystemPrompt = "You are a careful reasoning assistant.";
preset.ChatParameters.MaxTokens = 2048; // leaves room for <think> plus final answer

Few-shot priming via history

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ChatParameters.History = new List<ChatMessage>
{
    ChatMessage.CreateUserMessage("Translate to French: The cat sits on the mat."),
    ChatMessage.CreateAssistantMessage("Le chat est assis sur le tapis."),
    ChatMessage.CreateUserMessage("Translate to French: The dog runs in the park."),
    ChatMessage.CreateAssistantMessage("Le chien court dans le parc."),
};
// Now every new session starts with these examples.

What’s next

Chat sessions — how the engine uses these parameters at runtime.
Context parameters — the ContextSize field that drives cache cleanup timing.
Multi-turn chat — runnable example showing cache management in practice.

Net: Sampler parameters

Thu, 23 Apr 2026 00:00:00 +0000

SamplerParameters controls how the engine picks each next token during generation. It exposes the full sampler surface of llama.cpp common parameters: the classic temperature and nucleus sampling knobs, repetition penalties, specialized filters (DRY, XTC, dynatemp, Mirostat), reproducibility controls, and fine-grained logit bias.

Every built-in preset sets sensible defaults for its model family. Override specific fields before AsposeLLMApi.Create to tune behavior.

Class reference

namespace Aspose.LLM.Abstractions.Parameters;

public class SamplerParameters
{
    // Core sampling
    public float Temperature { get; set; } = 0.7f;
    public float TopP { get; set; } = 0.9f;
    public float TopK { get; set; } = 40;
    public float MinP { get; set; } = 0.05f;

    // Reproducibility
    public int Seed { get; set; } = unchecked((int)0xFFFFFFFF); // time-based
    public int MinKeep { get; set; } = 1;

    // Repetition controls
    public int PenaltyContextSize { get; set; } = -1; // -1 = full context
    public float RepetitionPenalty { get; set; } = 1.1f;
    public float PresencePenalty { get; set; } = 0f;
    public float FrequencyPenalty { get; set; } = 0f;

    // Advanced filters
    public float TypicalP { get; set; } = -1.0f;   // disabled if <= 0
    public float TopNSigma { get; set; } = -1.0f;  // disabled if <= 0

    // Dynamic temperature
    public float DynatempRange { get; set; } = 0.0f; // 0 disables
    public float DynatempExponent { get; set; } = 1.0f;

    // XTC (Exclude Top Choices)
    public float XtcProbability { get; set; } = -1.0f; // disabled if <= 0
    public float XtcThreshold { get; set; } = 0.0f;

    // DRY (Don't Repeat Yourself)
    public float DryMultiplier { get; set; } = -1.0f; // disabled if <= 0
    public float DryBase { get; set; } = 1.75f;
    public int DryAllowedLength { get; set; } = 2;
    public int DryPenaltyLastN { get; set; } = 0;
    public List<string> DrySequenceBreakers { get; set; } = new() { "\n", ":", "\"", "*" };

    // Mirostat
    public int Mirostat { get; set; } = 0; // 0=off, 1=Mirostat 1.0, 2=Mirostat 2.0
    public float MirostatTau { get; set; } = 5.0f;
    public float MirostatEta { get; set; } = 0.1f;

    // Fine-grained controls
    public Dictionary<int, float> LogitBias { get; set; } = new();
    public bool EnableInfill { get; set; } = false;
}

Detailed field reference

Core sampling: Temperature, TopP, TopK, MinP.

Reproducibility: Seed, MinKeep.

Repetition controls: PenaltyContextSize, RepetitionPenalty, PresencePenalty, FrequencyPenalty.

Advanced filters: TypicalP, TopNSigma.

Dynamic temperature: DynatempRange, DynatempExponent.

XTC (Exclude Top Choices): XtcProbability, XtcThreshold.

DRY (Don’t Repeat Yourself): DryMultiplier, DryBase, DryAllowedLength, DryPenaltyLastN, DrySequenceBreakers.

Mirostat: Mirostat, MirostatTau, MirostatEta.

Fine-grained controls: LogitBias, EnableInfill.

Core sampling

These four fields govern the baseline sampling pipeline. Start with these; the advanced filters are rarely needed.

`Temperature`

Default 0.7. Scales the logits before sampling. Higher values flatten the probability distribution (more random output); lower values sharpen it (more deterministic).

Value	Effect
`0.0`	Greedy sampling — always pick the most likely token. Fully deterministic.
`0.1 - 0.3`	Precise, low-creativity output. Good for code, structured data, classification.
`0.7` (default)	General-purpose balance of accuracy and variety.
`0.8 - 1.0`	More creative, more varied output.
`> 1.0`	High randomness. Useful for brainstorming; risks incoherent output.
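
To make the table concrete, here is a minimal, self-contained sketch of temperature-scaled softmax. It illustrates the general technique and is not the SDK's internal code: the helper name SoftmaxWithTemperature and the sample logits are hypothetical, and treating a temperature of 0 as plain argmax is an assumption made for the demo.

```csharp
using System;
using System.Linq;

// Illustrative only: converts raw logits to probabilities at a given temperature.
static double[] SoftmaxWithTemperature(double[] logits, double temperature)
{
    if (temperature <= 0.0)
    {
        // Assumed greedy behavior: all probability mass on the most likely token.
        int best = Array.IndexOf(logits, logits.Max());
        return logits.Select((_, i) => i == best ? 1.0 : 0.0).ToArray();
    }
    double max = logits.Max(); // subtract the max for numerical stability
    double[] exps = logits.Select(l => Math.Exp((l - max) / temperature)).ToArray();
    double sum = exps.Sum();
    return exps.Select(e => e / sum).ToArray();
}

double[] demoLogits = { 2.0, 1.0, 0.0 };
foreach (double t in new[] { 0.0, 0.2, 0.7, 1.5 })
{
    double[] p = SoftmaxWithTemperature(demoLogits, t);
    Console.WriteLine($"T={t}: " + string.Join(", ", p.Select(x => x.ToString("F3"))));
}
```

Lower temperatures concentrate probability on the top token; higher ones spread it across the alternatives.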

`TopP`

Default 0.9. Nucleus sampling threshold. Only the smallest set of tokens whose cumulative probability exceeds TopP is considered for sampling. Lower TopP is more conservative.

0.9 (default) — balanced; a small tail of unlikely tokens is kept.
0.7 - 0.8 — more conservative; drops more of the tail.
1.0 — disabled; all tokens are candidates.
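
The nucleus rule can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's implementation: TopPFilter is a hypothetical helper, it works on an already normalized distribution, and the minKeep argument mirrors the MinKeep field only in spirit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative nucleus (top-p) filter: returns the indices that stay candidates.
static List<int> TopPFilter(double[] probs, double topP, int minKeep = 1)
{
    var kept = new List<int>();
    double cumulative = 0.0;
    foreach (int i in Enumerable.Range(0, probs.Length).OrderByDescending(j => probs[j]))
    {
        kept.Add(i);
        cumulative += probs[i];
        // Stop once the kept set reaches the threshold and satisfies minKeep.
        if (cumulative >= topP && kept.Count >= minKeep) break;
    }
    return kept;
}

double[] demoProbs = { 0.5, 0.3, 0.15, 0.05 };
Console.WriteLine("TopP 0.9: " + string.Join(", ", TopPFilter(demoProbs, 0.9)));
Console.WriteLine("TopP 0.7: " + string.Join(", ", TopPFilter(demoProbs, 0.7)));
```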

`TopK`

Default 40. Consider only the top K most likely tokens per step. Works alongside TopP.

40 (default) — reasonable upper bound for most models.
20 - 30 — more conservative.
0 or a very large number — disabled; TopP alone filters.
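
A minimal sketch of the top-K cut, assuming a hypothetical TopKFilter helper that ranks token indices by logit. It is an illustration of the rule described above, not SDK code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative top-K filter: keeps the indices of the k highest logits.
static int[] TopKFilter(double[] logits, int k)
{
    IEnumerable<int> ranked = Enumerable.Range(0, logits.Length).OrderByDescending(j => logits[j]);
    // k <= 0 is treated as "disabled", so every token stays a candidate.
    return (k <= 0 ? ranked : ranked.Take(k)).ToArray();
}

double[] demoLogits = { 0.1, 2.5, 1.7, -0.3, 0.9 };
Console.WriteLine("K=2: " + string.Join(", ", TopKFilter(demoLogits, 2)));
Console.WriteLine("K=0: " + string.Join(", ", TopKFilter(demoLogits, 0)));
```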

`MinP`

Default 0.05. Minimum probability relative to the top token. A token is kept only if its probability is at least MinP × p(top). Useful when the tail distribution has very low-probability tokens you never want to sample.

0.05 (default) — reasonable.
0.1 — stricter; drops more tail.
0.0 — disabled.
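
The MinP × p(top) rule is easy to check by hand. The sketch below uses a hypothetical MinPFilter helper on a made-up distribution; it shows the arithmetic only and is not the engine's code.

```csharp
using System;
using System.Linq;

// Illustrative min-p filter: keep tokens with probability >= minP * p(top).
static int[] MinPFilter(double[] probs, double minP)
{
    double cutoff = minP * probs.Max();
    return Enumerable.Range(0, probs.Length).Where(j => probs[j] >= cutoff).ToArray();
}

double[] demoProbs = { 0.6, 0.3, 0.08, 0.02 };
// With MinP = 0.05 the cutoff is 0.05 * 0.6 = 0.03, so only the 0.02 token is dropped.
Console.WriteLine("MinP 0.05: " + string.Join(", ", MinPFilter(demoProbs, 0.05)));
Console.WriteLine("MinP 0.2:  " + string.Join(", ", MinPFilter(demoProbs, 0.2)));
```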

Reproducibility

`Seed`

Default 0xFFFFFFFF — a sentinel that llama.cpp maps to a time-based (non-deterministic) seed. For reproducible output, set a specific integer.

preset.SamplerParameters.Seed = 42;

Two runs with the same seed, the same prompt, and identical parameters produce the same output, provided they also use the same model file, hardware backend, and SDK version.
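
Why a fixed seed gives repeatable output can be shown without the engine at all. The sketch below uses System.Random as a stand-in for the sampler's random number generator (an assumption for illustration; the native RNG is different) and a hypothetical SampleTokens helper.

```csharp
using System;
using System.Linq;

// Illustrative only: draws `count` token indices from a fixed distribution.
static int[] SampleTokens(double[] probs, int seed, int count)
{
    var rng = new Random(seed); // same seed, same sequence of draws
    var result = new int[count];
    for (int n = 0; n < count; n++)
    {
        double r = rng.NextDouble(), cumulative = 0.0;
        int pick = probs.Length - 1; // fallback guards against rounding at the tail
        for (int i = 0; i < probs.Length; i++)
        {
            cumulative += probs[i];
            if (r < cumulative) { pick = i; break; }
        }
        result[n] = pick;
    }
    return result;
}

double[] demoProbs = { 0.5, 0.3, 0.2 };
int[] first = SampleTokens(demoProbs, seed: 42, count: 8);
int[] second = SampleTokens(demoProbs, seed: 42, count: 8);
Console.WriteLine("identical runs: " + first.SequenceEqual(second));
```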

`MinKeep`

Default 1. Minimum number of candidate tokens kept after the filters (TopK, TopP, MinP, TypicalP) are applied. Guarantees the sampler always has at least MinKeep tokens to choose from, preventing empty-candidate edge cases.

Leave at 1 unless you have a specific reason.

Repetition controls

The engine penalizes tokens that appeared recently to avoid loops and verbatim repetition.

`PenaltyContextSize`

Default -1 (= full context). Number of recent tokens considered for repetition penalties.

-1 — use the model’s full context size.
A positive integer — only the last N tokens contribute.

Smaller windows make penalties more local; larger windows spread them across the whole conversation.

`RepetitionPenalty`

Default 1.1. Multiplicative penalty applied to recently seen tokens. Values > 1 make repeats less likely; 1.0 disables the repetition penalty.

1.0 — no penalty.
1.05 - 1.15 (default range) — gentle anti-repetition.
1.2 - 1.3 — aggressive; risks under-generating common words like “the” or “and”.
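
A sketch of how a multiplicative penalty is commonly applied in llama.cpp-style samplers: positive logits are divided by the penalty and negative logits are multiplied by it, so a repeated token always becomes less likely. ApplyRepetitionPenalty is a hypothetical helper and the logits are made up; treat this as an illustration rather than the engine's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: penalizes each token seen in the recent window once.
static void ApplyRepetitionPenalty(float[] logits, IEnumerable<int> recentTokens, float penalty)
{
    foreach (int token in recentTokens.Distinct())
    {
        logits[token] = logits[token] > 0
            ? logits[token] / penalty   // shrink a positive logit
            : logits[token] * penalty;  // push a negative logit further down
    }
}

float[] demoLogits = { 2.2f, -1.0f, 0.5f };
ApplyRepetitionPenalty(demoLogits, new[] { 0, 1, 0 }, 1.1f);
Console.WriteLine(string.Join(", ", demoLogits.Select(x => x.ToString("F2"))));
```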

`PresencePenalty`

Default 0 (disabled). Additive penalty for tokens that have appeared at least once in the penalty window. Biases the model toward new tokens not yet used.

preset.SamplerParameters.PresencePenalty = 0.6f;

`FrequencyPenalty`

Default 0 (disabled). Additive penalty proportional to how many times a token has appeared in the penalty window. Unlike PresencePenalty, it keeps growing with each additional occurrence.

preset.SamplerParameters.FrequencyPenalty = 0.3f;

Typical ranges for both penalties: 0.0 - 1.0. Combine with RepetitionPenalty to shape repetition behavior; tune one at a time.
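
The difference between the two additive penalties shows up clearly in a small example. The sketch below assumes the common formulation (subtract the presence penalty once, plus the frequency penalty times the occurrence count); ApplyAdditivePenalties is a hypothetical helper, not an SDK method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: logit -= presence + frequency * count, for tokens in the window.
static void ApplyAdditivePenalties(float[] logits, IEnumerable<int> window, float presence, float frequency)
{
    foreach (var group in window.GroupBy(t => t))
    {
        int count = group.Count();
        logits[group.Key] -= presence + frequency * count;
    }
}

float[] demoLogits = { 1.0f, 1.0f, 1.0f };
// Token 0 appeared three times, token 1 once, token 2 never.
ApplyAdditivePenalties(demoLogits, new[] { 0, 0, 0, 1 }, presence: 0.6f, frequency: 0.3f);
Console.WriteLine(string.Join(", ", demoLogits.Select(x => x.ToString("F2"))));
```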

Advanced filters

`TypicalP`

Default -1 (disabled). Locally typical sampling keeps tokens whose information content (negative log-probability) is close to the entropy of the predicted distribution. Alternative to nucleus sampling; rarely needed when TopP is set.

Enable with a value in (0, 1], e.g. 0.95.

`TopNSigma`

Default -1 (disabled). Keeps tokens within N standard deviations of the top logit. Experimental filter from recent llama.cpp versions.

Dynamic temperature (`DynatempRange`, `DynatempExponent`)

Dynamically adjusts temperature per step based on token entropy. When entropy is low (the model is confident), temperature drops; when entropy is high (the model is uncertain), temperature rises.

DynatempRange default 0 — dynatemp disabled.
Set DynatempRange > 0 to enable. Typical values 0.2 - 0.5.
DynatempExponent default 1.0 controls the shape of the entropy-to-temperature curve.

preset.SamplerParameters.Temperature = 0.8f;
preset.SamplerParameters.DynatempRange = 0.3f;
// Effective temperature varies in [0.5, 1.1] based on per-step entropy.
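
The [0.5, 1.1] interval above follows from a simple mapping. The sketch below assumes the entropy-based formula used by llama.cpp-style dynamic temperature (normalized entropy raised to the exponent, scaled into [Temperature - Range, Temperature + Range]); DynamicTemperature is a hypothetical helper and the distributions are made up.

```csharp
using System;
using System.Linq;

// Illustrative only: maps the entropy of a distribution to an effective temperature.
static double DynamicTemperature(double[] probs, double temperature, double range, double exponent)
{
    if (range <= 0.0 || probs.Length < 2) return temperature; // dynatemp disabled
    double minTemp = Math.Max(0.0, temperature - range);
    double maxTemp = temperature + range;
    double entropy = -probs.Where(p => p > 0).Sum(p => p * Math.Log(p));
    double maxEntropy = Math.Log(probs.Length); // entropy of a uniform distribution
    double normalized = entropy / maxEntropy;   // 0 = fully confident, 1 = fully uncertain
    return minTemp + (maxTemp - minTemp) * Math.Pow(normalized, exponent);
}

double[] confident = { 1.0, 0.0, 0.0, 0.0 };
double[] uniform = { 0.25, 0.25, 0.25, 0.25 };
Console.WriteLine($"confident: {DynamicTemperature(confident, 0.8, 0.3, 1.0):F2}");
Console.WriteLine($"uncertain: {DynamicTemperature(uniform, 0.8, 0.3, 1.0):F2}");
```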

XTC — Exclude Top Choices

With a configurable probability at each step, XTC removes the most likely tokens from the candidate set, which pushes the sampler toward less obvious choices. Useful for injecting diversity without raising overall temperature.

XtcProbability default -1 (disabled). Probability of applying XTC at each step.
XtcThreshold default 0.0. Probability cutoff that defines a top choice: tokens below it are never excluded.
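
A sketch of the XTC idea as implemented in llama.cpp-style samplers: when the filter fires, every token at or above the threshold is removed except the least likely of them, so at least one viable choice always remains. XtcFilter is a hypothetical helper, System.Random stands in for the sampler's RNG, and the exact behavior inside the SDK is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: returns candidate indices, most likely first.
static List<int> XtcFilter(double[] probs, double probability, double threshold, Random rng)
{
    var ranked = Enumerable.Range(0, probs.Length).OrderByDescending(j => probs[j]).ToList();
    // Disabled, or the per-step coin flip says "skip XTC this time".
    if (probability <= 0.0 || rng.NextDouble() >= probability) return ranked;
    int topChoices = ranked.TakeWhile(j => probs[j] >= threshold).Count();
    if (topChoices < 2) return ranked; // nothing to exclude
    // Drop every top choice except the least likely one.
    return ranked.Skip(topChoices - 1).ToList();
}

double[] demoProbs = { 0.5, 0.3, 0.15, 0.05 };
var demoRng = new Random(1);
List<int> kept = XtcFilter(demoProbs, probability: 1.0, threshold: 0.1, demoRng);
Console.WriteLine("kept: " + string.Join(", ", kept));
```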

DRY — Don’t Repeat Yourself

DRY detects and penalizes verbatim string repetition (word-for-word copies from earlier in the context). Useful for creative writing; often too aggressive for code.

Field	Default	Purpose
`DryMultiplier`	`-1` (disabled)	Strength of the penalty. Enable with a value like `0.8`.
`DryBase`	`1.75`	Base of the exponential penalty for consecutive repeated tokens.
`DryAllowedLength`	`2`	Minimum repeat length before DRY engages.
`DryPenaltyLastN`	`0`	Number of trailing tokens checked (0 = all).
`DrySequenceBreakers`	`["\n", ":", "\"", "*"]`	Tokens that reset the repeat detector.

preset.SamplerParameters.DryMultiplier = 0.8f;
preset.SamplerParameters.DryBase = 1.75f;
preset.SamplerParameters.DryAllowedLength = 3;
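
How the three numeric fields interact can be sketched with the usual DRY formula, penalty = DryMultiplier × DryBase^(repeat length − DryAllowedLength), applied to the token that would extend a repeated sequence. DryPenalty is a hypothetical helper, and the assumption is that the SDK follows the llama.cpp formulation.

```csharp
using System;

// Illustrative only: logit penalty for a token that would extend a repeat of the given length.
static double DryPenalty(int repeatLength, double multiplier, double dryBase, int allowedLength)
{
    // Disabled, or the repeat is still shorter than the allowed length.
    if (multiplier <= 0.0 || repeatLength < allowedLength) return 0.0;
    return multiplier * Math.Pow(dryBase, repeatLength - allowedLength);
}

// Same settings as the snippet above: multiplier 0.8, base 1.75, allowed length 3.
foreach (int len in new[] { 2, 3, 5, 8 })
{
    Console.WriteLine($"repeat length {len}: penalty {DryPenalty(len, 0.8, 1.75, 3):F2}");
}
```

The penalty grows exponentially with the repeat length, which is why long verbatim copies are suppressed much harder than short ones.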

Mirostat

Adaptive sampler that targets a specific output entropy (perplexity). Alternative to temperature + nucleus sampling.

Mirostat = 0 (default) — disabled.
Mirostat = 1 — Mirostat 1.0.
Mirostat = 2 — Mirostat 2.0 (usually preferred).
MirostatTau = 5.0 — target entropy. Lower = more deterministic.
MirostatEta = 0.1 — learning rate.

When Mirostat is on, TopP, TopK, and MinP are effectively bypassed — Mirostat manages the full distribution itself.

preset.SamplerParameters.Mirostat = 2;
preset.SamplerParameters.MirostatTau = 4.5f;
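
The roles of MirostatTau and MirostatEta are easier to see in a sketch of one Mirostat 2.0 step. The helpers MirostatCandidates and UpdateMu are hypothetical, the starting value of 2 × tau follows the published algorithm, and the code illustrates the technique rather than the engine's implementation.

```csharp
using System;
using System.Linq;

// Tokens whose surprise (-log2 p) exceeds mu are dropped before sampling.
static int[] MirostatCandidates(double[] probs, double mu)
{
    int[] kept = Enumerable.Range(0, probs.Length)
        .Where(j => probs[j] > 0 && -Math.Log2(probs[j]) <= mu)
        .ToArray();
    // Always keep at least the most likely token.
    return kept.Length > 0 ? kept : new[] { Array.IndexOf(probs, probs.Max()) };
}

// After sampling, mu is nudged so that the observed surprise drifts toward tau.
static double UpdateMu(double mu, double sampledProb, double tau, double eta) =>
    mu - eta * (-Math.Log2(sampledProb) - tau);

double targetTau = 5.0, learningRate = 0.1;
double currentMu = 2.0 * targetTau; // conventional starting point
double[] demoProbs = { 0.5, 0.25, 0.125, 0.125 };

int[] candidates = MirostatCandidates(demoProbs, currentMu);
// Pretend token 0 was sampled: probability 0.5 is exactly 1 bit of surprise.
currentMu = UpdateMu(currentMu, demoProbs[0], targetTau, learningRate);
Console.WriteLine($"candidates: {candidates.Length}, updated mu: {currentMu:F2}");
```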

Fine-grained controls

`LogitBias`

Per-token logit adjustment. Keys are token IDs, values are additive biases applied before sampling.

// Strongly bias against token 1234:
preset.SamplerParameters.LogitBias[1234] = -100f;
// Slightly favor token 5678:
preset.SamplerParameters.LogitBias[5678] = +2f;

A bias of -100 (or lower) effectively bans a token. A bias of +2 to +5 noticeably favors it. Obtaining token IDs requires the model’s tokenizer — not exposed in this parameter bag but available via the API reference.
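
The effect of these magnitudes can be checked numerically. The sketch below adds a bias map to made-up logits and converts the result to probabilities; BiasedProbabilities is a hypothetical helper, and the token IDs are placeholders rather than real vocabulary entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: adds per-token biases to logits, then applies softmax.
static double[] BiasedProbabilities(float[] logits, Dictionary<int, float> bias)
{
    double[] adjusted = logits.Select((value, id) =>
        (double)value + (bias.TryGetValue(id, out float b) ? b : 0f)).ToArray();
    double max = adjusted.Max(); // subtract the max for numerical stability
    double[] exps = adjusted.Select(a => Math.Exp(a - max)).ToArray();
    double sum = exps.Sum();
    return exps.Select(e => e / sum).ToArray();
}

float[] demoLogits = { 1.0f, 1.0f, 1.0f };                             // three equally likely tokens
var demoBias = new Dictionary<int, float> { [1] = -100f, [2] = 2f };   // ban token 1, favor token 2
double[] demoProbs = BiasedProbabilities(demoLogits, demoBias);
Console.WriteLine(string.Join(", ", demoProbs.Select(p => p.ToString("E2"))));
```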

`EnableInfill`

Default false. Enables the INFILL sampler variant used for fill-in-the-middle code completion (models that support it). Leave off for normal chat.

Typical recipes

Deterministic output (for tests and reproducible demos)

var preset = new Qwen25Preset();
preset.SamplerParameters.Temperature = 0.0f; // greedy
preset.SamplerParameters.Seed = 42;

Balanced chat assistant

var preset = new Qwen25Preset();
// Defaults are already tuned:
// Temperature 0.7, TopP 0.9, TopK 40, MinP 0.05, RepetitionPenalty 1.1

Creative writing

var preset = new Qwen25Preset();
preset.SamplerParameters.Temperature = 0.9f;
preset.SamplerParameters.TopP = 0.95f;
preset.SamplerParameters.MinP = 0.02f;
preset.SamplerParameters.DryMultiplier = 0.8f; // discourage verbatim repetition

Tight technical output (code, structured data)

var preset = new Qwen25Preset();
preset.SamplerParameters.Temperature = 0.2f;
preset.SamplerParameters.TopP = 0.95f;
preset.SamplerParameters.RepetitionPenalty = 1.05f; // gentler for code patterns

Precision via Mirostat

var preset = new Qwen25Preset();
preset.SamplerParameters.Mirostat = 2;
preset.SamplerParameters.MirostatTau = 5.0f;
preset.SamplerParameters.MirostatEta = 0.1f;
// TopP, TopK, and MinP are now bypassed; Mirostat manages the distribution.

What’s next

Chat parameters — max tokens and system prompt shape what the sampler generates into.
Context parameters — the context window the sampler reads penalties from.
Custom preset — full customization patterns.

Net: Engine parameters

Thu, 23 Apr 2026 00:00:00 +0000

EngineParameters holds engine-wide settings that apply across the entire AsposeLLMApi lifecycle. Unlike the per-session parameter bags, these values are read once during engine construction and not re-evaluated per request.

Class reference

namespace Aspose.LLM.Core.DependencyInjection;

public class EngineParameters
{
    public string? ModelCachePath { get; set; }    // default: <LocalAppData>/Aspose.LLM/models
    public bool EnableDebugLogging { get; set; }   // default: false
    public string LogDirectoryPath { get; set; }   // default: "logs/log.txt"
    public int DefaultThreads { get; set; }        // default: Environment.ProcessorCount - 1
}

All four properties have working defaults. Override only when the default does not match your deployment.

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions.

Fields

Field	Type	Default	Purpose
`ModelCachePath`	`string?`	`<LocalAppData>/Aspose.LLM/models`	Folder where downloaded model files are stored.
`EnableDebugLogging`	`bool`	`false`	Enables native-level debug logs from `llama.cpp`. Verbose — use for diagnosis, not in production.
`LogDirectoryPath`	`string`	`"logs/log.txt"`	File path for native log output. Despite the name, this is a full file path, not a directory.
`DefaultThreads`	`int`	`ProcessorCount - 1`	Default thread count used when a parameter bag does not specify its own threading.

`ModelCachePath`

Points to the folder where downloaded GGUF files are cached. On first run, the engine:

Checks whether the model file already exists under this folder.
If not, downloads it from the source defined in ModelSourceParameters.
Loads the cached file on subsequent runs.
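
The three steps above amount to a check-then-download cache. The sketch below reproduces that flow with a stub in place of a real download; ResolveModel, the file name, and the temporary folder are all hypothetical and do not reflect the SDK's internal layout.

```csharp
using System;
using System.IO;

// Hypothetical helper, not an SDK API: resolve a model file through a cache folder.
static string ResolveModel(string cacheFolder, string fileName, Action<string> download)
{
    string cachedPath = Path.Combine(cacheFolder, fileName);
    if (!File.Exists(cachedPath))
    {
        Directory.CreateDirectory(cacheFolder);
        download(cachedPath); // only runs on a cache miss
    }
    return cachedPath;
}

string cache = Path.Combine(Path.GetTempPath(), "demo-model-cache-" + Guid.NewGuid().ToString("N"));
int downloads = 0;
Action<string> fakeDownload = path => { File.WriteAllText(path, "stub"); downloads++; };

ResolveModel(cache, "model.gguf", fakeDownload); // first run: cache miss, "downloads" the file
ResolveModel(cache, "model.gguf", fakeDownload); // later runs: served from the cache
Console.WriteLine($"downloads performed: {downloads}");

Directory.Delete(cache, recursive: true); // clean up the demo folder
```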

Typical reasons to override:

Shared model cache across multiple applications or users.
Faster disk — point to an SSD when your LocalApplicationData is on an HDD.
Constrained disk layout — put models on a data drive, code on the system drive.
Docker / container scenarios — mount the cache as a volume so models survive restarts.

Example — use a shared cache under D:\models:

var preset = new Qwen25Preset();
preset.EngineParameters.ModelCachePath = @"D:\models";

using var api = AsposeLLMApi.Create(preset);

`EnableDebugLogging`

When true, the native llama.cpp layer emits verbose logs — useful when diagnosing inference errors, template mismatches, or KV cache issues. Combine with an ILogger passed to AsposeLLMApi.Create(preset, logger) to capture the output:

using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Debug));
var logger = loggerFactory.CreateLogger<Program>();

var preset = new Qwen25Preset();
preset.EngineParameters.EnableDebugLogging = true;

using var api = AsposeLLMApi.Create(preset, logger);

Leave this false in production; debug logging materially reduces throughput.

`LogDirectoryPath`

The path used by the native log writer. Defaults to logs/log.txt relative to the working directory. The name suggests a directory, but the value is a full file path.

Change this when:

Your app has a specific logging convention (for example, centralized log folder).
You need per-environment log files (logs/dev.log, logs/prod.log).
The working directory is not writable in your deployment.

preset.EngineParameters.LogDirectoryPath = @"C:\logs\aspose-llm\app.log";

`DefaultThreads`

Thread count used when ContextParameters.NThreads is not set explicitly. The default is Environment.ProcessorCount - 1 — one fewer than the total logical cores — to leave one core for the rest of your application.

Override in two situations:

Dedicated inference machine — use Environment.ProcessorCount for maximum throughput.
Tight envelope (containers, IaaS) — use a fixed smaller number to stay inside a CPU quota.

preset.EngineParameters.DefaultThreads = 4;

For finer control over threading during generation, set ContextParameters.NThreads and NThreadsBatch directly — those override DefaultThreads when set.
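
The precedence described here is a plain null-coalescing fallback. The sketch below uses a hypothetical ResolveThreads helper; clamping the default to at least 1 is an assumption added for single-core hosts, not documented SDK behavior.

```csharp
using System;

// Hypothetical helper: an explicit per-context value wins, otherwise the engine default applies.
static int ResolveThreads(int? nThreads, int defaultThreads) => nThreads ?? defaultThreads;

// Assumption for the sketch: never go below one thread.
int engineDefault = Math.Max(1, Environment.ProcessorCount - 1);

Console.WriteLine($"NThreads unset: {ResolveThreads(null, engineDefault)}"); // falls back to the engine default
Console.WriteLine($"NThreads = 4:   {ResolveThreads(4, engineDefault)}");    // explicit value wins
```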

Typical recipes

Development with debug logs

var preset = new Qwen25Preset();
preset.EngineParameters.EnableDebugLogging = true;
preset.EngineParameters.LogDirectoryPath = "logs/debug.log";

Production on a dedicated server

var preset = new Qwen25Preset();
preset.EngineParameters.ModelCachePath = @"/var/lib/aspose-llm/models";
preset.EngineParameters.EnableDebugLogging = false;
preset.EngineParameters.LogDirectoryPath = @"/var/log/aspose-llm/app.log";
preset.EngineParameters.DefaultThreads = Environment.ProcessorCount;

Container with mounted model volume

var preset = new Qwen25Preset();
preset.EngineParameters.ModelCachePath = "/models";           // volume mount
preset.EngineParameters.LogDirectoryPath = "/var/log/app.log"; // volume mount
preset.EngineParameters.DefaultThreads = 4;                   // CPU quota

What’s next

Context parameters — threads per inference call, batch sizes.
Binary manager parameters — where native llama.cpp binaries are cached.
Logging and diagnostics — ILogger integration and native log tags (planned reference page in a future release).

Net: Binary manager parameters

Thu, 23 Apr 2026 00:00:00 +0000

BinaryManagerParameters controls how the SDK obtains the native llama.cpp binaries (libllama, libmtmd, libggml-*). On first use, the engine downloads the matching build from GitHub for your platform and acceleration backend, caches it, and reuses the cache on subsequent runs.

Class reference

namespace Aspose.LLM.Abstractions.Parameters;

public class BinaryManagerParameters
{
    public string Owner { get; set; } = "ggml-org";
    public string Repo { get; set; } = "llama.cpp";
    public string ReleaseTag { get; set; } = "b8816";
    public string BinaryPath { get; set; }              // <LocalAppData>/Aspose.LLM/runtimes
    public SystemSpec? SystemSpecification { get; set; }
    public AccelerationType? PreferredAcceleration { get; set; }
}

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions.

Fields

Field	Type	Default	Purpose
`Owner`	`string`	`"ggml-org"`	GitHub repository owner for `llama.cpp` releases.
`Repo`	`string`	`"llama.cpp"`	GitHub repository name.
`ReleaseTag`	`string`	`"b8816"` (as of SDK v26.5.0)	Specific `llama.cpp` release to pin.
`BinaryPath`	`string`	`<LocalAppData>/Aspose.LLM/runtimes`	Local cache for downloaded native binaries.
`SystemSpecification`	`SystemSpec?`	`null` (auto-detect)	Override the detected OS / architecture / acceleration capabilities.
`PreferredAcceleration`	`AccelerationType?`	`null` (auto-select)	Force a specific acceleration backend (CUDA, HIP, Metal, Vulkan, CPU variants).

`Owner` and `Repo`

Together they form github.com/<Owner>/<Repo>/releases/.... The defaults target the upstream llama.cpp repository. Change them only if you mirror releases to a fork that stays byte-compatible with upstream — for example, in an air-gapped enterprise setup that syncs selected releases into a private GitHub Enterprise instance.

`ReleaseTag`

Pins a specific upstream release. As of SDK v26.5.0, the default is b8816. Each release tag corresponds to native binaries with matching P/Invoke signatures; changing the tag without a matching SDK version is unsupported.

Override only when:

You are testing a newer upstream release against the current SDK (development only).
You are locked to an older release because of a validated deployment.

preset.BinaryManagerParameters.ReleaseTag = "b8816";

The SDK’s P/Invoke layer is validated against the default ReleaseTag. Pinning a different tag can produce runtime errors if upstream changed a struct layout or function signature. Do not ship custom tags to production without a migration pass — see the llama-cpp-migration workflow used by the Aspose team.

`BinaryPath`

Folder where downloaded binaries live. The default is <LocalAppData>/Aspose.LLM/runtimes — %LOCALAPPDATA%\Aspose.LLM\runtimes on Windows and the equivalent LocalApplicationData folder elsewhere.

Override when:

Shared cache across multiple applications or services on the same host.
Read-only root filesystem — point the cache at a writable volume.
Pre-populated deployment — bundle the binaries with your application and point BinaryPath at them to skip the download on first run.

preset.BinaryManagerParameters.BinaryPath = @"/var/lib/aspose-llm/runtimes";

`SystemSpecification`

When null (the default), the SDK detects the host’s OS, architecture, and available accelerations at engine construction. Override with an explicit SystemSpec only for diagnostics or cross-platform binary preparation — leaving this null is correct for normal deployments.

`PreferredAcceleration`

When null, the SDK picks the best available acceleration for the host in this order: CUDA → HIP → Metal → Vulkan → CPU (AVX level best to worst). Set an explicit value to override the selection.

Supported values (see Aspose.LLM.Abstractions.Acceleration.AccelerationType):

Value	Platform	Notes
`CUDA`	Windows, Linux	NVIDIA GPUs.
`HIP`	Linux	AMD GPUs via ROCm.
`Metal`	macOS (Apple Silicon)	M-series chips.
`Vulkan`	Windows, Linux	Cross-platform GPU.
`AVX512`	Any x64 with AVX-512	Fastest CPU path.
`AVX2`	Any x64 with AVX2	Default CPU fallback.
`AVX`	Older x64	Slower.
`NoAVX`	Very old CPUs	Last-resort compatibility.
`Kompute`, `OpenCL`, `SYCL`, `OpenBLAS`	Platform-dependent	Less common; verify availability for your target.

The enum has additional values (None) used internally — avoid setting them explicitly.
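
The auto-selection order can be sketched as a walk over a priority list. The code below uses plain strings instead of the real AccelerationType enum, and PickAcceleration plus the detected set are hypothetical; it only illustrates the documented order and the effect of an explicit override.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: an explicit preference wins, otherwise the first available entry in priority order.
static string PickAcceleration(IReadOnlyList<string> priority, ISet<string> available, string? preferred)
{
    if (preferred != null) return preferred;               // explicit override
    return priority.FirstOrDefault(a => available.Contains(a))
           ?? priority[priority.Count - 1];                // last-resort CPU path
}

string[] demoPriority = { "CUDA", "HIP", "Metal", "Vulkan", "AVX512", "AVX2", "AVX", "NoAVX" };
var detected = new HashSet<string> { "Vulkan", "AVX2", "AVX", "NoAVX" }; // made-up host capabilities

Console.WriteLine("auto:     " + PickAcceleration(demoPriority, detected, null));
Console.WriteLine("override: " + PickAcceleration(demoPriority, detected, "AVX2"));
```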

Typical recipes

Force CPU-only execution

using Aspose.LLM.Abstractions.Acceleration;

var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX2;
preset.BaseModelInferenceParameters.GpuLayers = 0; // complement on the inference side

Force CUDA on a multi-GPU box

var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 999;
// Pick a specific GPU via BaseModelInferenceParameters.MainGpu = N;

Pre-populated offline deployment

Download the binaries on a connected machine, copy the cache into your deployment, and point BinaryPath at it on the offline host:

var preset = new Qwen25Preset();
preset.BinaryManagerParameters.BinaryPath = @"/opt/aspose-llm/runtimes";
// Also point EngineParameters.ModelCachePath at a pre-populated model cache.

Shared cache across services

preset.BinaryManagerParameters.BinaryPath = @"/srv/shared/aspose-llm/runtimes";

Make sure every service using this cache runs the same SDK version — version mismatches produce binary incompatibilities.

What’s next

System requirements — what runtimes and hardware the binaries support.
Model inference parameters — complement PreferredAcceleration with GpuLayers and split settings.
Architecture — what happens during first-run binary deployment.

Net: Multimodal context parameters

Thu, 23 Apr 2026 00:00:00 +0000

MultimodalContextParameters — exposed on the preset as MtmdContextParameters — configures the mtmd context used by vision presets to evaluate image tokens. The base text model is configured by ContextParameters; this bag covers only the multimodal layer.

Only vision presets use these settings. On text-only presets the bag is instantiated but has no effect.

Class reference

namespace Aspose.LLM.Abstractions.Parameters;

public class MultimodalContextParameters
{
    public bool? UseGpu { get; set; }
    public bool? PrintTimings { get; set; }
    public int? ThreadCount { get; set; }
    public int? Verbosity { get; set; }
    public string? MediaMarker { get; set; }
}

Every field is nullable. A null value means “use the native mtmd default” — override only when you have a specific reason.

Detailed field reference

Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions.

Fields

Field	Type	Default	Purpose
`UseGpu`	`bool?`	native default	Whether to offload the vision projector to GPU.
`PrintTimings`	`bool?`	native default	Emit per-step timing diagnostics from the `mtmd` layer.
`ThreadCount`	`int?`	native default	Threads used by `mtmd` processing.
`Verbosity`	`int?`	native default	Log level for the `mtmd` layer.
`MediaMarker`	`string?`	native default	Placeholder token text that marks image positions in the prompt.

`UseGpu`

Controls whether the vision projector (mmproj) runs on the GPU alongside the base model. The mmproj is typically small (200 MB - 2 GB), so GPU offload is fast even on modest hardware.

null — delegate to mtmd’s auto-detection (currently: GPU if available).
true — force GPU.
false — force CPU. Use when you have limited GPU memory and want to spend it entirely on the base model.

preset.MtmdContextParameters.UseGpu = false; // keep GPU memory for the base model

`PrintTimings`

Enables mtmd’s built-in per-step timing logs — the time spent tokenizing images, running the projector, and evaluating chunks. Useful for diagnosing slow first-response latency on vision queries.

preset.MtmdContextParameters.PrintTimings = true;

Leave this null (off) in production. Timing logs add overhead and flood the output.

`ThreadCount`

Threads used for CPU-side mtmd work (image preprocessing, CPU portions of the projector). When null, mtmd follows its own heuristic — usually half the logical cores.

Override when:

The rest of your application needs more cores and mtmd is single-shot work.
You run multiple vision requests concurrently and want to cap each one’s CPU footprint.

preset.MtmdContextParameters.ThreadCount = 2;

`Verbosity`

Log verbosity for the mtmd layer. The native layer accepts an integer; the typical mapping is:

Value	Level
`0`	Error
`1`	Warn
`2`	Info
`3`	Debug

preset.MtmdContextParameters.Verbosity = 3; // debug — useful when images are tokenized unexpectedly

Keep verbosity low in production (0 or 1). Higher levels emit tagged lines that need post-processing to be useful — see the parse_mm_logs.zsh helper script in the Aspose.LLM SDK repository.

`MediaMarker`

Placeholder text used in the chat template to mark where images are inserted. The default is the chat-template-specific marker (different per model family — LLaVA, Qwen-VL, Gemma-Vision, and others have different tokens). Override only if you understand the model’s prompt format and need a non-standard marker.

preset.MtmdContextParameters.MediaMarker = "<|image|>";

In nearly all cases, leave this null. The correct marker is selected automatically from the model’s metadata.

Typical recipes

Default vision configuration

var preset = new Qwen3VL2BPreset();
// MtmdContextParameters stays at defaults — all fields null.

using var api = AsposeLLMApi.Create(preset);

Debug slow image processing

var preset = new Qwen3VL2BPreset();
preset.MtmdContextParameters.PrintTimings = true;
preset.MtmdContextParameters.Verbosity = 3;

using var api = AsposeLLMApi.Create(preset, logger);
// Inspect logs for per-stage mtmd timings.

CPU-only projector to save GPU memory

var preset = new Qwen3VL2BPreset();
preset.MtmdContextParameters.UseGpu = false;                  // projector on CPU
preset.BaseModelInferenceParameters.GpuLayers = 999;          // base model fully on GPU

On a GPU tight for memory, keeping the projector on CPU trades some first-token latency for more headroom for the base model and KV cache.

What’s next

Supported presets — vision — built-in vision presets and their mmproj sources.
Model source parameters — configure the vision projector’s download source.
Attaching images — vision use case (planned in a future release).
