Acceleration

Aspose.LLM for .NET wraps llama.cpp and runs all inference through its native binaries. Those binaries ship in per-platform, per-acceleration variants: one for each GPU backend, plus CPU variants at different AVX levels. The SDK downloads the matching variant on first use and caches it locally.

You do not configure backends at compile time — the choice is made at runtime by BinaryManager. You can let the SDK auto-detect the best option for your host, or force a specific backend via BinaryManagerParameters.PreferredAcceleration.

Supported backends

Backend	Platforms	GPU vendor	Typical priority
CUDA	Windows, Linux	NVIDIA	Highest on NVIDIA hosts
HIP / ROCm	Linux	AMD	High on AMD hosts
Metal	macOS (Apple Silicon)	Apple	Highest on Apple Silicon
Vulkan	Windows, Linux	Any Vulkan-capable GPU	Fallback GPU; cross-vendor
CPU	All	—	Fallback when no GPU is available

Auto-detection

When BinaryManagerParameters.PreferredAcceleration is null (the default), the SDK picks the best available backend for your host in this priority order:

1. CUDA: if an NVIDIA GPU with a recent driver is present.
2. HIP: if a ROCm-capable AMD GPU is present.
3. Metal: on Apple Silicon.
4. Vulkan: if a Vulkan-capable GPU is present.
5. CPU: with the highest AVX level available (AVX512 > AVX2 > AVX > NoAVX).

The detection runs during AsposeLLMApi.Create, before the native binary download. The result is reflected in the downloaded asset’s name.
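
The priority order above can be sketched as a simple first-match chain. This is an illustration only, not the SDK's implementation: `HostCapabilities`, `Backend`, and `BackendPriority.Pick` are hypothetical names, and the boolean flags stand in for whatever probing BinaryManager actually performs.

```csharp
using System;

// Hypothetical probe results; the real SDK detects these itself.
public sealed record HostCapabilities(
    bool Cuda, bool Hip, bool Metal, bool Vulkan,
    bool Avx512, bool Avx2, bool Avx);

// Local enum for the sketch; it mirrors the backends listed above.
public enum Backend { CUDA, HIP, Metal, Vulkan, AVX512, AVX2, AVX, NoAVX }

public static class BackendPriority
{
    // Returns the first backend in priority order that the host supports.
    public static Backend Pick(HostCapabilities host)
    {
        if (host.Cuda) return Backend.CUDA;
        if (host.Hip) return Backend.HIP;
        if (host.Metal) return Backend.Metal;
        if (host.Vulkan) return Backend.Vulkan;

        // No usable GPU: fall back to the CPU at the highest AVX level.
        if (host.Avx512) return Backend.AVX512;
        if (host.Avx2) return Backend.AVX2;
        if (host.Avx) return Backend.AVX;
        return Backend.NoAVX;
    }
}
```

Under this sketch, a host that exposes both an NVIDIA GPU and Vulkan resolves to CUDA, because CUDA is checked first.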

Forcing a backend

Set BinaryManagerParameters.PreferredAcceleration to an AccelerationType value:

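using Aspose.LLM.Abstractions.Acceleration;

// Start from a built-in preset, then override the acceleration settings.
var preset = new Qwen25Preset();
// Force the CUDA binary instead of relying on auto-detection.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
// 999 requests full offload: every layer runs on the GPU.
preset.BaseModelInferenceParameters.GpuLayers = 999;

// Backend selection and the native binary download happen during Create.
using var api = AsposeLLMApi.Create(preset);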

The full enum:

public enum AccelerationType
{
    None,
    CUDA,
    HIP,
    Metal,
    Vulkan,
    Kompute,
    OpenCL,
    SYCL,
    AVX512,
    AVX2,
    AVX,
    OpenBLAS,
    NoAVX,
}

Kompute, OpenCL, SYCL, and OpenBLAS are included for completeness. They are less common, so verify that a binary is available for your target before forcing one of them.

Matching GPU offload

PreferredAcceleration chooses the binary. BaseModelInferenceParameters.GpuLayers chooses how many layers run on the GPU. Both settings need to agree:

PreferredAcceleration	Recommended GpuLayers
CUDA, HIP, Metal, Vulkan	`999` (full offload) or a partial count fitting VRAM
AVX512, AVX2, AVX, NoAVX	`0` (CPU only)

Selecting a GPU binary with GpuLayers = 0 works but wastes the GPU: no layers are offloaded, so the model runs on the CPU anyway. Selecting a CPU binary with GpuLayers = 999 silently keeps the model on the CPU, since no GPU runtime is available in that binary.
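
One way to keep the two settings in agreement is to derive GpuLayers from the acceleration type. The helper below is a sketch that encodes the table above. `OffloadDefaults.RecommendedGpuLayers` is a hypothetical name, and the enum is redeclared locally only so the snippet compiles without the SDK; in a real project, use the SDK's AccelerationType instead.

```csharp
// Local copy of the enum so the sketch is self-contained.
public enum AccelerationType
{
    None, CUDA, HIP, Metal, Vulkan, Kompute, OpenCL, SYCL,
    AVX512, AVX2, AVX, OpenBLAS, NoAVX,
}

public static class OffloadDefaults
{
    // Returns the GpuLayers value recommended in the table above,
    // or null when the table gives no recommendation for the backend.
    public static int? RecommendedGpuLayers(AccelerationType acceleration)
    {
        switch (acceleration)
        {
            case AccelerationType.CUDA:
            case AccelerationType.HIP:
            case AccelerationType.Metal:
            case AccelerationType.Vulkan:
                return 999; // full offload
            case AccelerationType.AVX512:
            case AccelerationType.AVX2:
            case AccelerationType.AVX:
            case AccelerationType.NoAVX:
                return 0;   // CPU only
            default:
                return null; // not covered by the table
        }
    }
}
```

A partial layer count that fits the available VRAM is still a valid choice for the GPU backends; the helper only returns the full-offload default.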

Memory by backend

GPU inference needs VRAM for model weights (proportional to GpuLayers), plus the KV cache (proportional to ContextSize × layers on GPU), plus intermediate buffers.
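
As a rough worked example of the first two terms, the sketch below estimates GPU memory from the number of offloaded layers and the context size. It assumes a 16-bit KV cache with the usual transformer layout (one key and one value vector per token, per layer, per KV head) and treats the weights as evenly spread across layers, so it ignores embeddings and intermediate buffers. `VramEstimate` and its parameters are hypothetical, and the model dimensions in the usage note are illustrative; take the real values from your model's metadata.

```csharp
using System;

public static class VramEstimate
{
    // KV cache bytes for the layers that run on the GPU:
    // 2 (key + value) x contextSize x kvHeads x headDim x bytesPerElement, per layer.
    public static long KvCacheBytes(
        int contextSize, int gpuLayers, int totalLayers,
        int kvHeads, int headDim, int bytesPerElement = 2)
    {
        // A GpuLayers value such as 999 means "all layers", so clamp it.
        int onGpu = Math.Min(gpuLayers, totalLayers);
        return 2L * contextSize * kvHeads * headDim * bytesPerElement * onGpu;
    }

    // Share of the weight file that lands in VRAM, assuming equal-sized layers.
    public static long WeightBytes(long modelFileBytes, int gpuLayers, int totalLayers)
    {
        int onGpu = Math.Min(gpuLayers, totalLayers);
        return modelFileBytes * onGpu / totalLayers;
    }
}
```

With a context of 8192, 28 layers, 4 KV heads, and a head dimension of 128, `VramEstimate.KvCacheBytes(8192, 999, 28, 4, 128)` gives 469,762,048 bytes (448 MiB). Halving the context size halves that figure, which is why reducing ContextSize is often the quickest way to fit a model into limited VRAM.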

Backend	Memory location	Constraint
CUDA	VRAM on the chosen NVIDIA GPU (`MainGpu` + `TensorSplit`)	GPU memory
HIP	VRAM on the chosen AMD GPU	GPU memory
Metal	Unified memory (RAM/VRAM shared on Apple Silicon)	System RAM
Vulkan	VRAM on chosen GPU	GPU memory
CPU	System RAM	System RAM

See System requirements for per-preset memory ranges.

First-run download size

Each acceleration variant is a different archive on the GitHub release. Sizes are approximate and vary by release:

Backend	Typical download
CUDA	400-800 MB
HIP	300-500 MB
Metal	100-200 MB
Vulkan	200-400 MB
CPU (AVX2/AVX512)	80-150 MB

Once downloaded, the variant is cached at BinaryManagerParameters.BinaryPath. Changing PreferredAcceleration triggers a new download on the next call to AsposeLLMApi.Create.
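
If you want the cache in a predictable location, for example a directory that is provisioned ahead of time, the settings might be combined as in the fragment below. This is a sketch that needs the SDK to compile. It assumes BinaryPath is a settable string holding a directory path, which the text above does not confirm, and the path itself is a placeholder.

```csharp
using Aspose.LLM.Abstractions.Acceleration;

var preset = new Qwen25Preset();

// Assumption: BinaryPath accepts a directory path as a string.
// The path below is a placeholder, not an SDK default.
preset.BinaryManagerParameters.BinaryPath = "/opt/myapp/llm-binaries";

// Switching this value later selects a different archive, so expect
// another download on the next Create call.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.Vulkan;
preset.BaseModelInferenceParameters.GpuLayers = 999;

using var api = AsposeLLMApi.Create(preset);
```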

What’s next

CUDA — NVIDIA GPUs.
HIP / ROCm — AMD GPUs.
Metal — Apple Silicon.
Vulkan — cross-platform GPU.
CPU — when no GPU is available.
Binary manager parameters — PreferredAcceleration and friends.
