Model inference parameters
ModelInferenceParameters controls how the engine loads a model into memory: how many layers to offload to GPU, how to split across multiple GPUs, whether to use memory mapping, and how to override GGUF metadata at runtime.
Most fields are nullable — a null value means “use the native default”. Set an explicit value only when you need to override.
Class reference
namespace Aspose.LLM.Abstractions.Parameters;

public class ModelInferenceParameters
{
    public int? GpuLayers { get; set; }
    public bool? UseMemoryMapping { get; set; }
    public bool? UseMemoryLocking { get; set; }
    public int? MainGpu { get; set; }
    public LlamaSplitMode? SplitMode { get; set; }
    public bool? VocabOnly { get; set; }
    public bool? CheckTensors { get; set; }
    public bool? UseExtraBuffers { get; set; }
    public float[]? TensorSplit { get; set; }
    public ModelKeyValueOverride[]? KvOverrides { get; set; }
}
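A minimal sketch of the nullable convention, using the Qwen25Preset that also appears in the recipes below:
// Only the fields you set are overridden; everything left null keeps
// the engine's native default.
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999; // explicit override
// UseMemoryMapping stays null -> native default (true on most platforms)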
Detailed field reference
Each field has a dedicated page with full defaults, scenario tables, code examples, and interactions. The rest of this page is an inline overview of the same content; follow the links for the deeper treatment.
- Load knobs: GpuLayers, UseMemoryMapping, UseMemoryLocking, MainGpu, SplitMode.
- Other knobs: VocabOnly, CheckTensors, UseExtraBuffers, TensorSplit, KvOverrides.
Fields
| Field | Type | Default | Purpose |
|---|---|---|---|
| GpuLayers | int? | native default | Number of model layers to offload to GPU. 0 = CPU only, 999 = full offload. |
| UseMemoryMapping | bool? | native default (true) | Map the GGUF file instead of reading it in. Reduces startup time and memory copying. |
| UseMemoryLocking | bool? | native default (false) | Lock model memory to prevent OS paging. Needs mlock / VirtualLock privileges. |
| MainGpu | int? | 0 | Index of the GPU used when SplitMode is None. |
| SplitMode | LlamaSplitMode? | native default | How to split the model across multiple GPUs. |
| VocabOnly | bool? | native default (false) | Load only the vocabulary without weights. Used for tokenizer-only scenarios. |
| CheckTensors | bool? | native default (false) | Validate tensor data on load. Adds startup time; helpful for diagnosing corrupted GGUF files. |
| UseExtraBuffers | bool? | native default | Use extra buffer types for weight repacking. Advanced. |
| TensorSplit | float[]? | equal split | Proportions per GPU when SplitMode splits across devices. |
| KvOverrides | ModelKeyValueOverride[]? | none | Runtime overrides for GGUF metadata keys. |
GpuLayers
Controls GPU offload. Each transformer layer lives either in system RAM (CPU inference) or GPU VRAM (GPU inference). Partial offload is supported: you can put the first N layers on the GPU and keep the rest on the CPU.
| Value | Behavior |
|---|---|
| 0 | CPU only. No GPU memory allocated for model weights. |
| A layer count | Offload the first N layers to GPU, keep the remaining layers on the CPU. |
| 999 (or any value ≥ the model’s layer count) | Full GPU offload. Idiomatic way to request “put everything on the GPU”. |
| null | Use the native default. |
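For example, a partial offload (the layer count here is illustrative; see the recipes below for a sized example):
preset.BaseModelInferenceParameters.GpuLayers = 20; // first 20 layers on GPU, rest on CPU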
Pair with a GPU-capable BinaryManagerParameters.PreferredAcceleration — setting GpuLayers = 999 on a CPU-only binary silently keeps the model on the CPU.
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
UseMemoryMapping
When true (default on most platforms), the engine memory-maps the GGUF file so the OS streams it in on demand. This reduces startup time and avoids copying the full model into RAM before inference.
Set it to false only in rare scenarios — network file systems that do not support mmap, or environments where you want the file fully loaded before the first token. Disabling mmap can increase load time and peak memory during startup, because the whole file is read into an allocated buffer up front.
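A sketch of the opt-out, assuming the same preset object as above:
// Read the full GGUF file into memory at load time instead of mapping it,
// e.g. on a network file system without mmap support.
preset.BaseModelInferenceParameters.UseMemoryMapping = false;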
UseMemoryLocking
When true, the engine calls mlock (Linux/macOS) or VirtualLock (Windows) on the model’s memory so the OS does not page it out. Requires elevated privileges or raised ulimits.
Leave null or false in most deployments. Enable only when the model is swapped out under memory pressure and you have the system tuning to support locking.
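If you do enable it (same preset object; the system tuning described above is assumed to be in place):
// Pin model memory so the OS cannot page it out. On Linux this needs a
// sufficient memlock limit (ulimit -l); on Windows, the process
// working-set quota must be large enough for VirtualLock to succeed.
preset.BaseModelInferenceParameters.UseMemoryLocking = true;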
MainGpu and SplitMode
On multi-GPU hosts, SplitMode decides how the model is placed across devices, and MainGpu selects the primary device when the mode does not split.
| SplitMode | Behavior |
|---|---|
| LLAMA_SPLIT_MODE_NONE | Single GPU. Whole model on device MainGpu. |
| LLAMA_SPLIT_MODE_LAYER | Split layers across GPUs. KV cache follows layers. |
| LLAMA_SPLIT_MODE_ROW | Split layers and rows across GPUs. Uses tensor parallelism where supported. |
using Aspose.LLM.Abstractions.Parameters;
// Single GPU, use device 1:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_NONE;
preset.BaseModelInferenceParameters.MainGpu = 1;
// Split across all GPUs, layer mode:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = null; // equal distribution
TensorSplit
Proportion of the model placed on each GPU. The array length should match the number of available GPUs.
- null — equal distribution.
- Explicit array — values are normalized to sum to 1. For example, [2, 1] on two GPUs places 67 % on GPU 0 and 33 % on GPU 1.
Useful when GPUs have different memory sizes (for example, a 24 GB card paired with a 12 GB card):
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
VocabOnly
When true, the engine loads only the model’s vocabulary, skipping the weights. The model is not usable for generation in this state — it is a tokenizer-only configuration for rare tooling scenarios.
Leave null (or false) for any normal inference use.
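A tokenizer-only sketch (the load and tokenize calls around it are omitted):
// Load the vocabulary without weights; the model cannot generate
// tokens in this state.
preset.BaseModelInferenceParameters.VocabOnly = true;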
CheckTensors
When true, the engine validates every tensor’s data during load. Adds significant startup time but catches corrupted GGUF files early. Useful when testing a new download or a model from an untrusted source; leave null in production.
UseExtraBuffers
Enables additional buffer types used by weight repacking paths in llama.cpp. Advanced; most users should leave it null.
KvOverrides
Overrides specific keys in the GGUF metadata at load time. Each override targets a single metadata key and provides the new typed value.
preset.BaseModelInferenceParameters.KvOverrides = new[]
{
new ModelKeyValueOverride
{
Key = "llama.rope.scaling.type",
Type = ModelKvOverrideType.String,
StringValue = "yarn",
},
new ModelKeyValueOverride
{
Key = "llama.context_length",
Type = ModelKvOverrideType.Int,
IntValue = 131072,
},
};
The ModelKeyValueOverride class carries one of four typed values depending on Type:
| Type | Value field | Example keys |
|---|---|---|
| Int | IntValue | llama.context_length, llama.embedding_length |
| Float | FloatValue | llama.rope.freq_base, llama.rope.scaling.factor |
| Bool | BoolValue | Model-specific boolean flags |
| String | StringValue | general.architecture, llama.rope.scaling.type |
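A Float override follows the same pattern as the example above (the value shown is illustrative, not a recommendation):
preset.BaseModelInferenceParameters.KvOverrides = new[]
{
    new ModelKeyValueOverride
    {
        Key = "llama.rope.freq_base",
        Type = ModelKvOverrideType.Float,
        FloatValue = 1000000.0f, // illustrative value
    },
};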
Use overrides with care — wrong metadata makes the model load incorrectly or silently produce garbage.
Typical recipes
CPU-only inference
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 0;
Full GPU offload
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
Partial offload on a memory-constrained GPU
A 12 GB GPU may not fit a full 8B Q4_K_M model plus its KV cache. Offload the first 28 layers and keep the rest on the CPU:
var preset = new Qwen3Preset(); // 8B, 32 layers
preset.BaseModelInferenceParameters.GpuLayers = 28;
preset.BaseModelInferenceParameters.UseMemoryMapping = true;
Benchmark to find the right split for your hardware; a common rule of thumb is to offload layers until VRAM usage is about 1-2 GB short of full.
Two unequal GPUs
A 24 GB GPU paired with a 12 GB GPU:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
preset.BaseModelInferenceParameters.GpuLayers = 999;
Validate a suspect GGUF
preset.BaseModelInferenceParameters.CheckTensors = true;
// After a clean load, switch back to the default for production.
What’s next
- Binary manager parameters — pair with PreferredAcceleration to select the right native binary.
- Context parameters — KV cache configuration and batch sizes that interact with GPU offload.
- System requirements — GPU backends and their driver / runtime requirements.