Acceleration

Aspose.LLM for .NET wraps llama.cpp and uses its native binaries for every inference. Those binaries ship per-platform, per-acceleration variants — one for each GPU backend, plus CPU variants at different AVX levels. The SDK downloads the matching variant on first use and caches it locally.

You do not configure backends at compile time — the choice is made at runtime by BinaryManager. You can let the SDK auto-detect the best option for your host, or force a specific backend via BinaryManagerParameters.PreferredAcceleration.

Supported backends

| Backend | Platforms | GPU vendor | Typical priority |
|---|---|---|---|
| CUDA | Windows, Linux | NVIDIA | Highest on NVIDIA hosts |
| HIP / ROCm | Linux | AMD | High on AMD hosts |
| Metal | macOS (Apple Silicon) | Apple | Highest on Apple Silicon |
| Vulkan | Windows, Linux | Any Vulkan-capable GPU | Fallback GPU; cross-vendor |
| CPU | All | n/a | Fallback when no GPU is available |

Auto-detection

When BinaryManagerParameters.PreferredAcceleration is null (the default), the SDK picks the best available backend for your host in this priority order:

  1. CUDA — if an NVIDIA GPU with a recent driver is present.
  2. HIP — if a ROCm-capable AMD GPU is present.
  3. Metal — on Apple Silicon.
  4. Vulkan — if a Vulkan-capable GPU is present.
  5. CPU — with the highest AVX level available (AVX512 > AVX2 > AVX > NoAVX).

The detection runs during AsposeLLMApi.Create, before the native binary download. The result is reflected in the downloaded asset’s name.
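The priority order above can be read as a chain of checks that stops at the first match. The following is a minimal sketch of that ordering only, not the SDK's detection code: `HostCapabilities` and `BackendSelector` are hypothetical names, and the enum is a local copy of the `AccelerationType` values listed on this page so that the sample compiles without the SDK.

```csharp
using System;

// Usage: a Linux host with a ROCm-capable AMD GPU that also exposes Vulkan.
var host = new HostCapabilities(
    HasNvidiaGpu: false, HasRocmGpu: true, IsAppleSilicon: false,
    HasVulkanGpu: true, BestCpuLevel: AccelerationType.AVX2);
Console.WriteLine(BackendSelector.Pick(host)); // prints "HIP"

// Local copy of the enum so the sketch is self-contained.
// In real code, use the type from Aspose.LLM.Abstractions.Acceleration.
public enum AccelerationType
{
    None, CUDA, HIP, Metal, Vulkan, Kompute, OpenCL, SYCL,
    AVX512, AVX2, AVX, OpenBLAS, NoAVX,
}

// Hypothetical summary of what a host probe found; not an SDK type.
public record HostCapabilities(
    bool HasNvidiaGpu,
    bool HasRocmGpu,
    bool IsAppleSilicon,
    bool HasVulkanGpu,
    AccelerationType BestCpuLevel);

// Hypothetical helper that mirrors the documented priority order.
public static class BackendSelector
{
    public static AccelerationType Pick(HostCapabilities host)
    {
        if (host.HasNvidiaGpu) return AccelerationType.CUDA;     // 1. NVIDIA GPU
        if (host.HasRocmGpu) return AccelerationType.HIP;        // 2. ROCm-capable AMD GPU
        if (host.IsAppleSilicon) return AccelerationType.Metal;  // 3. Apple Silicon
        if (host.HasVulkanGpu) return AccelerationType.Vulkan;   // 4. any Vulkan-capable GPU
        return host.BestCpuLevel;                                // 5. CPU at the best AVX level
    }
}
```

Because Vulkan sits below the vendor-specific backends, a host that supports both HIP and Vulkan resolves to HIP in this sketch, which matches the "fallback GPU" role described in the table of supported backends.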

Forcing a backend

Set BinaryManagerParameters.PreferredAcceleration to an AccelerationType value:

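using Aspose.LLM.Abstractions.Acceleration;

var preset = new Qwen25Preset();
// Force the CUDA binary instead of relying on auto-detection.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
// 999 requests full offload: every layer runs on the GPU.
preset.BaseModelInferenceParameters.GpuLayers = 999;

// The matching binary is downloaded, if not already cached, during Create.
using var api = AsposeLLMApi.Create(preset);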

The full enum:

public enum AccelerationType
{
    None,
    CUDA,
    HIP,
    Metal,
    Vulkan,
    Kompute,
    OpenCL,
    SYCL,
    AVX512,
    AVX2,
    AVX,
    OpenBLAS,
    NoAVX,
}

Kompute, OpenCL, SYCL, and OpenBLAS are less common and are listed for completeness. Verify that a variant is available for your target platform before selecting one of them.

Matching GPU offload

PreferredAcceleration chooses the binary. BaseModelInferenceParameters.GpuLayers chooses how many layers run on the GPU. Both settings need to agree:

| PreferredAcceleration | Recommended GpuLayers |
|---|---|
| CUDA, HIP, Metal, Vulkan | 999 (full offload) or a partial count fitting VRAM |
| AVX512, AVX2, AVX, NoAVX | 0 (CPU only) |

Setting a GPU binary with GpuLayers = 0 works but wastes the GPU — the model runs on CPU anyway. Setting a CPU binary with GpuLayers = 999 silently keeps the model on CPU since there is no GPU runtime available.
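Since neither mismatch raises an error, it can help to check the pair before calling Create. The sketch below is one way to do that under stated assumptions: `OffloadCheck` is a hypothetical helper, not part of the SDK, and the enum is a local copy of the values listed on this page so the sample compiles on its own. Kompute, OpenCL, SYCL, OpenBLAS, and None are not classified on this page, so the helper leaves them alone.

```csharp
using System;

Console.WriteLine(OffloadCheck.Describe(AccelerationType.CUDA, 999)); // prints "consistent"
Console.WriteLine(OffloadCheck.Describe(AccelerationType.CUDA, 0));   // GPU binary, nothing offloaded
Console.WriteLine(OffloadCheck.Describe(AccelerationType.AVX2, 999)); // CPU binary, offload requested

// Local copy of the enum so the sketch is self-contained.
// In real code, use the type from Aspose.LLM.Abstractions.Acceleration.
public enum AccelerationType
{
    None, CUDA, HIP, Metal, Vulkan, Kompute, OpenCL, SYCL,
    AVX512, AVX2, AVX, OpenBLAS, NoAVX,
}

// Hypothetical helper; it only encodes the two mismatches described above.
public static class OffloadCheck
{
    public static bool IsGpuBackend(AccelerationType a) =>
        a is AccelerationType.CUDA or AccelerationType.HIP
          or AccelerationType.Metal or AccelerationType.Vulkan;

    public static bool IsCpuBackend(AccelerationType a) =>
        a is AccelerationType.AVX512 or AccelerationType.AVX2
          or AccelerationType.AVX or AccelerationType.NoAVX;

    public static string Describe(AccelerationType backend, int gpuLayers)
    {
        if (IsGpuBackend(backend) && gpuLayers == 0)
            return "GPU binary with GpuLayers = 0: the model runs on CPU";
        if (IsCpuBackend(backend) && gpuLayers > 0)
            return "CPU binary with GpuLayers > 0: the model stays on CPU";
        return "consistent"; // also returned for backends this page does not classify
    }
}
```

In an application, the same check could run against `preset.BinaryManagerParameters.PreferredAcceleration` and `preset.BaseModelInferenceParameters.GpuLayers` and log a warning rather than print.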

Memory by backend

GPU inference needs VRAM for model weights (proportional to GpuLayers), plus the KV cache (proportional to ContextSize × layers on GPU), plus intermediate buffers.
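As a rough worked example of the first two terms, the sketch below uses the standard transformer KV cache size (keys plus values, per layer, per token) and a linear split of the model file across layers. `MemoryEstimate` is a hypothetical helper, and the model dimensions (32 layers, 8 KV heads, head size 128, 16-bit cache, 4 GiB file) are illustrative assumptions rather than values for any preset. Intermediate buffers are ignored, and real allocations by the native runtime will differ.

```csharp
using System;

// KV cache for 32 offloaded layers at ContextSize = 8192 with a 16-bit cache.
long kv = MemoryEstimate.KvCacheBytes(
    layersOnGpu: 32, contextSize: 8192, kvHeads: 8, headDim: 128, bytesPerElement: 2);
Console.WriteLine(kv); // prints 1073741824 (1 GiB)

// Weights for a partial offload: 16 of 32 layers of a 4 GiB model file.
long weights = MemoryEstimate.WeightBytesOnGpu(
    modelFileBytes: 4L * 1024 * 1024 * 1024, gpuLayers: 16, totalLayers: 32);
Console.WriteLine(weights); // prints 2147483648 (2 GiB)

// Hypothetical helper; estimates only, not an SDK API.
public static class MemoryEstimate
{
    // 2 tensors (K and V) x layers x tokens x KV heads x head size x element size.
    public static long KvCacheBytes(
        int layersOnGpu, int contextSize, int kvHeads, int headDim, int bytesPerElement)
        => 2L * layersOnGpu * contextSize * kvHeads * headDim * bytesPerElement;

    // Assumes weights are spread evenly across layers, which ignores the
    // embedding and output tensors. Values such as 999 clamp to the layer count.
    public static long WeightBytesOnGpu(long modelFileBytes, int gpuLayers, int totalLayers)
    {
        int onGpu = Math.Clamp(gpuLayers, 0, totalLayers);
        return modelFileBytes * onGpu / totalLayers;
    }
}
```

Both terms scale linearly, so halving ContextSize halves the KV cache estimate, and halving the offloaded layer count halves both estimates.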

| Backend | Memory location | Constraint |
|---|---|---|
| CUDA | VRAM on the chosen NVIDIA GPU (MainGpu + TensorSplit) | GPU memory |
| HIP | VRAM on the chosen AMD GPU | GPU memory |
| Metal | Unified memory (RAM/VRAM shared on Apple Silicon) | System RAM |
| Vulkan | VRAM on chosen GPU | GPU memory |
| CPU | System RAM | System RAM |

See System requirements for per-preset memory ranges.

First-run download size

Each acceleration variant is a different archive on the GitHub release. Sizes are approximate and vary by release:

| Backend | Typical download |
|---|---|
| CUDA | 400-800 MB |
| HIP | 300-500 MB |
| Metal | 100-200 MB |
| Vulkan | 200-400 MB |
| CPU (AVX2/AVX512) | 80-150 MB |

Once downloaded, the variant is cached at BinaryManagerParameters.BinaryPath. Changing PreferredAcceleration triggers a new download on next Create.

What’s next