Documentation – Model inference parameters

Net: GpuLayers

Thu, 23 Apr 2026 00:00:00 +0000

GpuLayers sets how many transformer layers run on the GPU. Each offloaded layer lives in VRAM; the rest stay in system RAM. It is the primary knob for GPU acceleration.

Quick reference


Type	`int?`
Default	`null` (use native default)
Range	`0` = CPU only; `999` = full offload; partial values offload first N layers
Category	Load / offload
Field on	`ModelInferenceParameters.GpuLayers`

What it does

Each transformer layer either lives in VRAM (GPU inference for that layer) or in system RAM (CPU inference). The engine processes each token through all layers in sequence, using the respective backend at each step.

Value	Behavior
`0`	CPU only. No VRAM allocated for weights.
`N` (where 1 ≤ N < model’s layer count)	Partial offload. First N layers on GPU.
`≥ model's layer count` (idiomatic `999`)	Full offload. All layers on GPU.
`null`	Native default; usually matches “all layers” on GPU-capable builds.

Partial offload is useful when the model doesn’t fit entirely in VRAM. The transition between GPU and CPU layers costs some throughput but lets you run larger models than would otherwise fit.
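
A rough starting value for partial offload can be estimated by dividing the VRAM you can spare by the approximate size of one layer. The sketch below is only an illustration: EstimateGpuLayers is a hypothetical helper (not part of the SDK), it assumes all layers are the same size, and the numbers are made up. Treat the result as a first guess and verify it against actual memory use.

```csharp
using System;

// Hypothetical helper (not part of the SDK): estimate how many layers fit in VRAM.
// Assumes equal-sized layers and ignores non-layer tensors (embeddings, output head).
static int EstimateGpuLayers(double modelFileGiB, int layerCount, double vramGiB, double reservedGiB)
{
    double perLayerGiB = modelFileGiB / layerCount;
    double budgetGiB = Math.Max(0.0, vramGiB - reservedGiB);
    int fit = (int)Math.Floor(budgetGiB / perLayerGiB);
    return Math.Clamp(fit, 0, layerCount);
}

// Illustrative numbers only: a 40 GiB file with 80 layers on a 24 GiB GPU,
// keeping 4 GiB free for the KV cache and scratch buffers.
int layers = EstimateGpuLayers(modelFileGiB: 40.0, layerCount: 80, vramGiB: 24.0, reservedGiB: 4.0);
Console.WriteLine(layers); // 40
```

The estimate would then be assigned to GpuLayers; if loading runs out of VRAM, lower the value and retry.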

When to change it

Scenario	Value
CPU-only inference	`0`
Full GPU offload (idiomatic)	`999`
8B model on 12 GB GPU	Typically `24` – `32` (verify per model)
70B model on 24 GB GPU	Partial; pair with quantization
Apple Silicon Metal	`999` (unified memory — no separate budget)

Pair with a GPU-capable binary via BinaryManagerParameters.PreferredAcceleration.

Example

using Aspose.LLM.Abstractions.Acceleration;

var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 999;  // full offload

using var api = AsposeLLMApi.Create(preset);

Partial offload for a memory-tight GPU:

preset.BaseModelInferenceParameters.GpuLayers = 24;
preset.ContextParameters.OffloadKqv = false;  // keep KV on CPU to save more VRAM

Interactions

BinaryManagerParameters.PreferredAcceleration — must point at a GPU-capable backend.
MainGpu — which GPU (for single-GPU mode).
SplitMode — how to split across multiple GPUs.
TensorSplit — per-GPU allocation ratios.
OffloadKqv — related, but for KV cache not weights.

What’s next

MainGpu — single-GPU selector.
SplitMode — multi-GPU.
Acceleration overview — backend-specific setup.
GPU deployment use case — runnable example.

Net: UseMemoryMapping

UseMemoryMapping controls whether the engine memory-maps the GGUF file instead of reading it into RAM. Memory mapping lets the OS stream the model on demand, which cuts startup time and peak memory.

Quick reference


Type	`bool?`
Default	`null` (native default — usually `true`)
Category	Model loading
Field on	`ModelInferenceParameters.UseMemoryMapping`

What it does

true (the usual native default): the OS maps the GGUF file into the process address space. Pages are brought into memory on first access. Startup is fast; peak memory is bounded by the working set.
false: the engine reads the full file into RAM before model init. Startup is slower, and peak memory can approach double the model size during load (read buffer + allocation).
null: native default; behaves as true on most platforms.

Memory mapping is preferred unless your filesystem does not support mmap (some network filesystems, container volume drivers).

When to change it

Scenario	Value
Default (recommended)	`null`
Network filesystem without mmap support	`false`
Need full model loaded into RAM upfront	`false`
Debugging mmap-specific issues	`false`

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.UseMemoryMapping = true;  // default, shown explicitly

For an NFS-mounted model:

preset.BaseModelInferenceParameters.UseMemoryMapping = false;
// Loading takes longer, but works on filesystems where mmap fails.

Interactions

UseMemoryLocking — lock working set to prevent paging.
GpuLayers — offloaded layers are copied from the mapped file to GPU memory.

What’s next

UseMemoryLocking — prevent paging.
GpuLayers — GPU offload.
Model inference hub — all inference knobs.

Net: UseMemoryLocking

UseMemoryLocking asks the OS to lock model memory pages so they cannot be swapped out. It requires elevated privileges or raised ulimits.

Quick reference


Type	`bool?`
Default	`null` (native default — usually `false`)
Category	Model loading
Field on	`ModelInferenceParameters.UseMemoryLocking`

What it does

true — the engine calls mlock (Linux/macOS) or VirtualLock (Windows) on the model memory. The OS will not page it out.
false or null — no locking. OS may page model memory under pressure.

Paging model memory is catastrophic for inference performance: generation can stall for seconds while the kernel pages weights back in from disk. Setting UseMemoryLocking = true prevents that.

Cost: requires appropriate privileges. On Linux, the user must have sufficient RLIMIT_MEMLOCK (raise via ulimit -l or /etc/security/limits.conf). On Windows, the process needs “Lock Pages in Memory” permission.
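
On Linux it can help to compare the model size with the memlock limit before turning locking on. The sketch below assumes you pass in the text printed by `ulimit -l` in bash (a size in KiB, or "unlimited"); FitsMemlockLimit is a hypothetical helper, not an SDK call, and it treats the model file size as a rough proxy for the memory that would be locked.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: does a model of the given size fit under the memlock limit?
// `ulimitOutput` is the text printed by `ulimit -l` in bash: a size in KiB, or "unlimited".
static bool FitsMemlockLimit(long modelBytes, string ulimitOutput)
{
    string text = ulimitOutput.Trim();
    if (text.Equals("unlimited", StringComparison.OrdinalIgnoreCase))
        return true;
    if (!long.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out long limitKiB))
        throw new FormatException($"Unrecognized ulimit output: '{ulimitOutput}'");
    return modelBytes <= limitKiB * 1024L;
}

long fiveGiB = 5L * 1024 * 1024 * 1024; // illustrative model file size
Console.WriteLine(FitsMemlockLimit(fiveGiB, "8192"));      // False: the limit is only 8 MiB
Console.WriteLine(FitsMemlockLimit(fiveGiB, "unlimited")); // True
```

If the check fails, either raise the limit or leave UseMemoryLocking at null.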

When to change it

Scenario	Value
Default	`null` (disabled)
Shared host under memory pressure	`true` (requires privilege)
Container without memlock capability	`null` (do not attempt)
Dedicated inference machine with ample RAM	`null` (unnecessary)

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.UseMemoryLocking = true;
// Requires the process to have the required OS-level privilege.

Linux ulimit bump (at the shell, before running):

ulimit -l unlimited
dotnet run

Interactions

UseMemoryMapping — with mmap on, mlock locks the mapped pages as they fault in.
System-level configuration — mlock availability depends on OS limits.

What’s next

UseMemoryMapping — companion load-time knob.
Model inference hub — all inference knobs.

Net: MainGpu

MainGpu selects the GPU device index used when SplitMode is None. On multi-GPU hosts, this picks which device holds the entire model.

Quick reference


Type	`int?`
Default	`null` (use device 0)
Range	`0` to (GPU count - 1)
Category	GPU configuration
Field on	`ModelInferenceParameters.MainGpu`

What it does

null or 0 — use GPU 0.
1, 2, etc. — use that GPU.

MainGpu is ignored when SplitMode is LAYER or ROW — those modes distribute the model across multiple GPUs without a single “main” device.

On single-GPU hosts, the field is effectively always 0.

When to change it

Scenario	Value
Single GPU or default	`null` or `0`
Multi-GPU, pin model to specific device	That device index

Use the standard environment variables to constrain visibility globally:

CUDA: CUDA_VISIBLE_DEVICES=1 makes device 1 appear as device 0 to the process.
HIP: ROCR_VISIBLE_DEVICES=1 has the same effect on ROCm devices.
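
Because these variables renumber devices inside the process, the index you pass as MainGpu is process-local. The sketch below illustrates the renumbering; PhysicalDeviceIndex is a hypothetical helper and assumes the variable holds a comma-separated list of integer indices (it can also hold device UUIDs, which are not handled here).

```csharp
using System;
using System.Linq;

// Hypothetical helper: translate a process-local device index into the host-level index,
// given a CUDA_VISIBLE_DEVICES-style value made of integer indices.
static int PhysicalDeviceIndex(string visibleDevices, int localIndex)
{
    int[] visible = visibleDevices
        .Split(',', StringSplitOptions.RemoveEmptyEntries)
        .Select(part => int.Parse(part.Trim()))
        .ToArray();
    if (localIndex < 0 || localIndex >= visible.Length)
        throw new ArgumentOutOfRangeException(nameof(localIndex));
    return visible[localIndex];
}

Console.WriteLine(PhysicalDeviceIndex("1", 0));   // 1: host GPU 1 appears as device 0
Console.WriteLine(PhysicalDeviceIndex("2,0", 1)); // 0: MainGpu = 1 would select host GPU 0
```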

Example

using Aspose.LLM.Abstractions.Parameters;

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_NONE;
preset.BaseModelInferenceParameters.MainGpu = 1;  // use GPU 1
preset.BaseModelInferenceParameters.GpuLayers = 999;

using var api = AsposeLLMApi.Create(preset);

Interactions

SplitMode — MainGpu only applies when mode is None.
GpuLayers — layers go to MainGpu when split is None.

What’s next

SplitMode — multi-GPU distribution.
GpuLayers — primary offload control.
CUDA acceleration — multi-GPU NVIDIA setup.

Net: SplitMode

SplitMode determines how the model is distributed across multiple GPUs. It matters only on multi-GPU hosts; on single-GPU systems, None is correct.

Quick reference


Type	`LlamaSplitMode?` enum
Default	`null` (use native default)
Values	`LLAMA_SPLIT_MODE_NONE`, `LLAMA_SPLIT_MODE_LAYER`, `LLAMA_SPLIT_MODE_ROW`
Category	GPU distribution
Field on	`ModelInferenceParameters.SplitMode`

What it does

Value	Behavior
`LLAMA_SPLIT_MODE_NONE` (`0`)	Single GPU. Whole model on `MainGpu`.
`LLAMA_SPLIT_MODE_LAYER` (`1`)	Split layers across GPUs. KV cache follows layers. Good default for multi-GPU.
`LLAMA_SPLIT_MODE_ROW` (`2`)	Split layers and rows across GPUs. Uses tensor parallelism when supported. Fastest on high-bandwidth GPU interconnects (NVLink).

On multi-GPU setups with a fast interconnect (NVLink, or PCIe with a favorable topology), ROW is often fastest. On typical PCIe-only consumer setups, LAYER is the safer choice.

When to change it

Scenario	Value
Single GPU	`LLAMA_SPLIT_MODE_NONE` (or `null`)
Multi-GPU default	`LLAMA_SPLIT_MODE_LAYER`
High-bandwidth multi-GPU (NVLink)	`LLAMA_SPLIT_MODE_ROW`
Testing multi-GPU setup	Start with `LAYER`, try `ROW` if stable

Example

using Aspose.LLM.Abstractions.Parameters;

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
// 2:1 split for unequal-VRAM GPUs (e.g., 24 GB + 12 GB).

using var api = AsposeLLMApi.Create(preset);

Interactions

MainGpu — only applies when SplitMode = None.
TensorSplit — per-device allocation; applies to LAYER / ROW.
GpuLayers — total layers on GPUs; distributed per split mode.
HIP / Vulkan — support both split modes with varying driver maturity; test your specific setup.

What’s next

TensorSplit — fine-grained per-GPU ratios.
MainGpu — single-GPU device index.
CUDA acceleration — multi-GPU NVIDIA setup.

Net: VocabOnly

VocabOnly loads just the model’s vocabulary and tokenizer without loading the weights. The resulting model cannot generate output — it is a tokenizer-only configuration.

Quick reference


Type	`bool?`
Default	`null` (false — load full model)
Category	Model loading
Field on	`ModelInferenceParameters.VocabOnly`

What it does

null or false — load the full model (vocabulary + weights). Required for inference.
true — load only vocabulary data. The model is loaded with no weights; chat methods are not meaningful.

Use VocabOnly = true only for tokenizer-level operations — for example, probing token IDs to populate LogitBias without paying the cost of loading weights.

When to change it

Scenario	Value
Normal chat inference	`null` or `false`
Tokenizer-only probing	`true`

This is rare in practice; most applications load the full model.

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.VocabOnly = true;
// Tokenizer-only mode. Chat methods will not function; use only for
// token-ID discovery.

Interactions

Other ModelInferenceParameters fields (GpuLayers, TensorSplit, etc.) are largely irrelevant in VocabOnly mode.

What’s next

LogitBias — use case for token-ID probing.
Model inference hub — all inference knobs.

Net: CheckTensors

CheckTensors validates every tensor in the GGUF file during load. It adds noticeable startup time but catches corrupted or truncated models early, which is cleaner than a mysterious runtime error later.

Quick reference


Type	`bool?`
Default	`null` (false — no validation)
Category	Model loading
Field on	`ModelInferenceParameters.CheckTensors`

What it does

null or false — default. Skip validation; trust the GGUF file.
true — walk every tensor, verify shape and data consistency. Fails early with a clear error if the file is corrupted.

Validation can add 10-30 seconds to startup on a 7B model, and longer on larger models. It is not intended for routine production use; enable it when you suspect a problem with file integrity.

When to change it

Scenario	Value
Default	`null`
First-run validation after download from an untrusted source	`true`
Debugging a suspected corrupt GGUF	`true`
Production (after validation passed once)	`null`
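
The "validate once, then skip" pattern in the table could be sketched with a sidecar marker file. The `.validated` suffix and the helper names below are hypothetical conventions, not SDK features; the helpers only decide which value to assign to CheckTensors.

```csharp
using System;
using System.IO;

// Hypothetical convention: a "<model>.validated" file records that a
// CheckTensors run already succeeded for this model file.
static string MarkerPath(string modelPath) => modelPath + ".validated";

// Returns true on the first run (validate), null afterwards (native default, no validation).
static bool? CheckTensorsFor(string modelPath) =>
    File.Exists(MarkerPath(modelPath)) ? (bool?)null : true;

// Call after the model has loaded successfully with CheckTensors = true.
static void MarkValidated(string modelPath) =>
    File.WriteAllText(MarkerPath(modelPath), DateTime.UtcNow.ToString("O"));

string examplePath = Path.Combine(Path.GetTempPath(), "example-model.gguf"); // illustrative path
Console.WriteLine(CheckTensorsFor(examplePath) == true ? "validate" : "skip");
```

The marker goes stale if the model file is replaced, so delete it (or key it on a file hash) whenever the GGUF changes.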

Example

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.CheckTensors = true;
// Startup takes longer; validates every tensor against shape + data consistency.

using var api = AsposeLLMApi.Create(preset);

Interactions

UseMemoryMapping — validation walks mapped or read-in data; either mode works.
GpuLayers — validation happens on host memory before offload.

What’s next

Model not loading troubleshooting — when CheckTensors helps.
Model inference hub — all inference knobs.

Net: UseExtraBuffers

UseExtraBuffers is an advanced llama.cpp flag that enables extra buffer types used by the weight-repacking path. Rarely tuned in practice; leave at default unless specifically instructed.

Quick reference


Type	`bool?`
Default	`null` (use native default)
Category	Advanced
Field on	`ModelInferenceParameters.UseExtraBuffers`

What it does

Internal to llama.cpp. Controls whether the engine uses additional buffer types during weight repacking for specific hardware paths. The exact behavior depends on the backend and release tag.

null — native default. Correct for almost all users.
true / false — override. Not useful without specific backend expertise.

When to change it

Scenario	Value
Default	`null`
Backend-specific advice from SDK docs	As instructed

Do not speculate. If you are not sure whether you need this flag, you do not need it.

Example

var preset = new Qwen25Preset();
// preset.BaseModelInferenceParameters.UseExtraBuffers = null; // default

Interactions

Backend-specific. Effects vary by acceleration variant.

What’s next

Model inference hub — all inference knobs.

Net: TensorSplit

TensorSplit is an array of floats, one per GPU, that controls the proportion of the model placed on each GPU during multi-GPU split. Values are normalized — [2.0, 1.0] means 2/3 on GPU 0 and 1/3 on GPU 1.

Quick reference


Type	`float[]?`
Default	`null` (equal distribution across GPUs)
Range	Array length = GPU count; values positive
Category	Multi-GPU configuration
Field on	`ModelInferenceParameters.TensorSplit`

What it does

When SplitMode is LAYER or ROW, the engine distributes layers (or row blocks) across GPUs according to TensorSplit. Each GPU gets a share proportional to its entry in the array.

null — equal distribution. Splits evenly regardless of VRAM.
[2.0, 1.0] — 2:1 split. First GPU gets twice the share.
[1.0, 1.0, 1.0] — explicit equal across 3 GPUs.
[3.0, 2.0, 1.0] — 3:2:1 split across 3 GPUs (50 %, 33 %, 17 %).

The array length should match the number of GPUs visible to the process (after any CUDA_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES filtering).
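
The normalization can be checked with a few lines of arithmetic. NormalizeSplit below is a hypothetical helper that only reproduces the proportions described above; how the engine rounds those shares into whole layers or row blocks is not covered here and may differ.

```csharp
using System;
using System.Linq;

// Normalize TensorSplit-style ratios into fractions that sum to 1.
// Illustrates the arithmetic only; the engine's exact assignment may differ.
static double[] NormalizeSplit(float[] ratios)
{
    double total = ratios.Sum(r => (double)r);
    if (total <= 0)
        throw new ArgumentException("Ratios must sum to a positive value.", nameof(ratios));
    return ratios.Select(r => r / total).ToArray();
}

double[] shares = NormalizeSplit(new float[] { 3.0f, 2.0f, 1.0f });
Console.WriteLine(string.Join(" / ", shares.Select(share => Math.Round(share * 100)))); // 50 / 33 / 17
```

A common starting point for unequal GPUs is to use each device's VRAM in GB as its ratio, for example { 24f, 12f }, which normalizes to the same 2:1 split as { 2.0f, 1.0f }.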

When to change it

Scenario	Value
Single GPU	Not applicable
Multi-GPU, equal VRAM	`null` (equal default is correct)
Multi-GPU, unequal VRAM (24 GB + 12 GB)	`[2.0, 1.0]`
Multi-GPU with one GPU reserved for other work	Smaller share for that GPU

Example

using Aspose.LLM.Abstractions.Parameters;

var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
// 2/3 of layers on GPU 0 (larger VRAM), 1/3 on GPU 1.

using var api = AsposeLLMApi.Create(preset);

Interactions

SplitMode — TensorSplit applies only when mode is LAYER or ROW.
GpuLayers — total layers on GPUs are distributed per TensorSplit.
MainGpu — ignored when TensorSplit is active.

What’s next

SplitMode — split strategy selector.
CUDA multi-GPU — NVIDIA multi-GPU setup.
GPU deployment use case — runnable example.

Net: KvOverrides

KvOverrides lets you patch specific keys in the GGUF metadata at load time. Each override targets one metadata key and provides a typed replacement value. Use it to fix missing or incorrect metadata on a custom GGUF without rebuilding the file.

Quick reference


Type	`ModelKeyValueOverride[]?`
Default	`null` (no overrides)
Category	Model metadata
Field on	`ModelInferenceParameters.KvOverrides`

What it does

The engine reads model configuration from GGUF metadata at load. KvOverrides intercepts specific keys and substitutes your values. Common targets: context length, RoPE frequency base, RoPE scaling type.

Each override has:

Field	Type
`Key`	`string` — metadata key (e.g., `llama.context_length`)
`Type`	`ModelKvOverrideType` — `Int`, `Float`, `Bool`, `String`
`IntValue`, `FloatValue`, `BoolValue`, `StringValue`	typed value slots

Only the slot matching Type is read.

When to change it

Scenario	Value
Default — trust GGUF metadata	`null`
GGUF missing expected metadata	Single override for each missing key
Force a specific YaRN/RoPE recipe	Overrides for `llama.rope.*` keys
Diagnostic — test different metadata	Temporary overrides

Wrong overrides silently break the model. Only patch metadata you have a clear reason to change.

Example

using Aspose.LLM.Abstractions.Parameters;

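var preset = new Qwen25Preset();
// The llama.* key names below are illustrative; the prefix must match the model's
// architecture, so check the actual keys in the GGUF metadata before overriding.
preset.BaseModelInferenceParameters.KvOverrides = new[]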
{
    new ModelKeyValueOverride
    {
        Key = "llama.rope.scaling.type",
        Type = ModelKvOverrideType.String,
        StringValue = "yarn",
    },
    new ModelKeyValueOverride
    {
        Key = "llama.context_length",
        Type = ModelKvOverrideType.Int,
        IntValue = 131072,
    },
};

using var api = AsposeLLMApi.Create(preset);

Common override keys

Key	Type	Notes
`llama.context_length`	`Int`	Declared training context length
`llama.embedding_length`	`Int`	Hidden size
`llama.rope.freq_base`	`Float`	RoPE theta
`llama.rope.scaling.type`	`String`	`"none"`, `"linear"`, `"yarn"`, `"longrope"`
`llama.rope.scaling.factor`	`Float`	Scaling multiplier
`general.architecture`	`String`	Model family name

Exact key names vary by architecture. Inspect the model’s metadata with a tool like gguf-dump from llama.cpp before overriding.

Interactions

ContextParameters.RopeScalingType: overriding llama.rope.scaling.type via KvOverrides has a similar effect.
ContextParameters.ContextSize: an override of llama.context_length defines what the runtime treats as the trained context window at load time.

What’s next

RopeScalingType — alternative way to control scaling.
Long context tuning — when KvOverrides helps.
Bring your own GGUF — custom-model workflows.