GpuLayers

GpuLayers sets how many transformer layers run on the GPU. Each offloaded layer lives in VRAM; the rest stay in system RAM. It is the primary knob for GPU acceleration.

Quick reference

Type int?
Default null (use native default)
Range 0 = CPU only; 999 = full offload; partial values offload first N layers
Category Load / offload
Field on ModelInferenceParameters.GpuLayers

What it does

Each transformer layer either lives in VRAM (GPU inference for that layer) or in system RAM (CPU inference). The engine processes each token through all layers in sequence, using the respective backend at each step.

Value Behavior
0 CPU only. No VRAM allocated for weights.
N (where 1 ≤ N < model’s layer count) Partial offload. First N layers on GPU.
≥ model’s layer count (idiomatic 999) Full offload. All layers on GPU.
null Native default; usually matches “all layers” on GPU-capable builds.

Partial offload is useful when the model doesn’t fit entirely in VRAM. The transition between GPU and CPU layers costs some throughput but lets you run larger models than would otherwise fit.
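The partial-offload trade-off can be estimated with simple arithmetic: divide the free VRAM budget by the approximate per-layer weight size. A minimal sketch (the helper and all numbers are illustrative, not part of the library; real per-layer sizes vary by architecture and quantization, so verify against the engine's load logs):

```csharp
using System;

static class GpuLayerEstimate
{
    // Rough estimate of how many layers fit in a VRAM budget,
    // assuming all layers are roughly equal-sized.
    static int EstimateGpuLayers(double modelGiB, int layerCount, double vramBudgetGiB)
    {
        double perLayerGiB = modelGiB / layerCount;
        int fit = (int)Math.Floor(vramBudgetGiB / perLayerGiB);
        return Math.Min(fit, layerCount);   // cap at full offload
    }

    static void Main()
    {
        // Example: ~4.7 GiB quantized 8B model with 32 layers,
        // 3 GiB of VRAM left after the KV cache and compute buffers.
        Console.WriteLine(EstimateGpuLayers(4.7, 32, 3.0));  // 20 layers on GPU, 12 on CPU
    }
}
```

Remember to leave headroom for the KV cache and compute buffers; the weights are not the only thing occupying VRAM.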

When to change it

Scenario Value
CPU-only inference 0
Full GPU offload (idiomatic) 999
8B model on 12 GB GPU Typically 24–32 (verify per model)

70B model on 24 GB GPU Partial; pair with quantization
Apple Silicon Metal 999 (unified memory — no separate budget)
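Each row in the table above is a one-line change against a preset. For instance, the CPU-only scenario (a sketch reusing the same preset API that appears in the full example below):

```csharp
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.GpuLayers = 0;  // CPU only: no VRAM allocated for weights
// No PreferredAcceleration needed here -- the CPU binary suffices.

using var api = AsposeLLMApi.Create(preset);
```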

Pair with a GPU-capable binary via BinaryManagerParameters.PreferredAcceleration.

Example

using Aspose.LLM.Abstractions.Acceleration;

var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 999;  // full offload

using var api = AsposeLLMApi.Create(preset);

Partial offload for a memory-tight GPU:

preset.BaseModelInferenceParameters.GpuLayers = 24;
preset.ContextParameters.OffloadKqv = false;  // keep KV on CPU to save more VRAM

Interactions

What’s next