<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation – Acceleration</title>
    <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/</link>
    <description>Recent content in Acceleration on Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://docs.aspose.com/llm/net/developer-reference/acceleration/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Net: CUDA</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/</guid>
      <description>
        
        
        &lt;p&gt;CUDA is the fastest backend for Aspose.LLM for .NET on NVIDIA GPUs. It supports single-GPU and multi-GPU setups, aggressive memory offload, and the full &lt;code&gt;llama.cpp&lt;/code&gt; feature set.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA with compute capability 5.0 or higher.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;: version 525 or later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CUDA runtime&lt;/strong&gt;: the one bundled with the downloaded native binary (typically CUDA 11.7 or 12.x). You do &lt;strong&gt;not&lt;/strong&gt; need to install CUDA separately; the SDK&amp;rsquo;s binary ships with the runtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Windows 10+ or Linux (glibc 2.28+). Not supported on macOS.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Verify driver and GPU with &lt;code&gt;nvidia-smi&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Check the &lt;code&gt;Driver Version&lt;/code&gt; (≥ 525) and the GPU model. Compute capability 5.0 covers every GeForce from the GTX 900 series (Maxwell) onward, plus recent Quadro, Tesla, and data-center GPUs.&lt;/p&gt;
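&lt;p&gt;The OS constraint can also be checked from managed code before committing to CUDA. A minimal sketch using standard .NET platform detection:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System.Runtime.InteropServices;

// CUDA binaries target Windows and Linux only (see the requirements above).
bool cudaEligible = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
                 || RuntimeInformation.IsOSPlatform(OSPlatform.Linux);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;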
&lt;h2 id=&#34;select-cuda&#34;&gt;Select CUDA&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// full offload
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On the first run, the SDK downloads the CUDA variant of &lt;code&gt;llama.cpp&lt;/code&gt; binaries (typically 400-800 MB) and caches it at &lt;code&gt;BinaryManagerParameters.BinaryPath&lt;/code&gt;.&lt;/p&gt;
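&lt;p&gt;For CI agents or offline hosts it can help to pin the cache to a known location. A sketch, assuming &lt;code&gt;BinaryPath&lt;/code&gt; accepts a writable directory (see the binary manager parameters page for its exact semantics):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Assumption: BinaryPath takes a writable cache directory.
preset.BinaryManagerParameters.BinaryPath = &amp;#34;./llm-binaries&amp;#34;;

using var api = AsposeLLMApi.Create(preset); // first run downloads here; later runs reuse the cache
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;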
&lt;h2 id=&#34;single-gpu&#34;&gt;Single GPU&lt;/h2&gt;
&lt;p&gt;With one GPU, the defaults work. The engine places the offloaded layers on GPU 0.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;specific-gpu-selection&#34;&gt;Specific GPU selection&lt;/h2&gt;
&lt;p&gt;On hosts with multiple GPUs, &lt;code&gt;MainGpu&lt;/code&gt; picks which one gets the entire model when split mode is &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Parameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_NONE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MainGpu&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;// use GPU index 1
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Use the &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; environment variable to constrain which GPUs the process sees — standard NVIDIA tooling.&lt;/p&gt;
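&lt;p&gt;The variable can also be set from managed code, as long as it happens before the native CUDA backend loads. A sketch using standard .NET APIs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

// Must run before AsposeLLMApi.Create loads the native CUDA backend.
// With &amp;#34;1&amp;#34;, only physical GPU 1 is visible and it becomes device index 0.
Environment.SetEnvironmentVariable(&amp;#34;CUDA_VISIBLE_DEVICES&amp;#34;, &amp;#34;1&amp;#34;);

using var api = AsposeLLMApi.Create(preset);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;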
&lt;h2 id=&#34;multi-gpu-split&#34;&gt;Multi-GPU split&lt;/h2&gt;
&lt;p&gt;Distribute the model across multiple GPUs.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_LAYER&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Split modes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLAMA_SPLIT_MODE_NONE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Single GPU only. Whole model on &lt;code&gt;MainGpu&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLAMA_SPLIT_MODE_LAYER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Split layers across GPUs. Good default for multi-GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLAMA_SPLIT_MODE_ROW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Split rows — enables tensor parallelism where supported. Fastest on setups with high-bandwidth GPU interconnects (NVLink).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For unequal GPU memory sizes, set &lt;code&gt;TensorSplit&lt;/code&gt; to balance the load:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// 24 GB GPU + 12 GB GPU: 2:1 split.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TensorSplit&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;float&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.0f&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Values are normalized to sum to 1, so &lt;code&gt;{ 2.0f, 1.0f }&lt;/code&gt; places roughly two thirds of the layers on GPU 0 and one third on GPU 1.&lt;/p&gt;
&lt;h2 id=&#34;partial-offload-on-memory-tight-gpus&#34;&gt;Partial offload on memory-tight GPUs&lt;/h2&gt;
&lt;p&gt;If the model does not fit entirely in VRAM, offload the first N layers and leave the rest on CPU.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;28&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;// first 28 layers on GPU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Benchmark to find the right split. A good rule of thumb is to offload layers until VRAM usage sits about 1-2 GB short of full, because the KV cache also claims VRAM in proportion to the number of offloaded layers.&lt;/p&gt;
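&lt;p&gt;One way to benchmark is to time a fixed prompt at several &lt;code&gt;GpuLayers&lt;/code&gt; values. The sketch below is illustrative only: &lt;code&gt;api.Generate&lt;/code&gt; is a hypothetical stand-in for your actual inference call, which this page does not show.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;
using System.Diagnostics;
using Aspose.LLM.Abstractions.Acceleration;

foreach (var layers in new[] { 20, 24, 28, 32 })
{
    var p = new Qwen25Preset();
    p.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
    p.BaseModelInferenceParameters.GpuLayers = layers;

    using var api = AsposeLLMApi.Create(p);
    var sw = Stopwatch.StartNew();
    api.Generate(&amp;#34;Summarize this paragraph ...&amp;#34;); // hypothetical call; substitute your inference entry point
    Console.WriteLine($&amp;#34;{layers} layers: {sw.Elapsed.TotalSeconds:F1} s&amp;#34;);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;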
&lt;h2 id=&#34;memory-tips&#34;&gt;Memory tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;KV cache quantization&lt;/strong&gt; — set &lt;code&gt;ContextParameters.TypeV = GgmlType.Q8_0&lt;/code&gt; to halve V-cache memory with minor quality impact.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt; reduces memory at long contexts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shorter context&lt;/strong&gt; — drop &lt;code&gt;ContextParameters.ContextSize&lt;/code&gt; if you do not need the preset&amp;rsquo;s default length.&lt;/li&gt;
&lt;/ul&gt;
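&lt;p&gt;Applied together, the three tips look like this (the &lt;code&gt;ContextSize&lt;/code&gt; value is an example; defaults are preset-specific):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Quantize the V cache, enable flash attention, and trim the context window.
preset.ContextParameters.TypeV = GgmlType.Q8_0;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.ContextSize = 8192;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;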
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaErrorInsufficientDriver&lt;/code&gt; in logs&lt;/td&gt;
&lt;td&gt;Driver too old.&lt;/td&gt;
&lt;td&gt;Upgrade NVIDIA driver to 525+.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA binary downloaded but inference is on CPU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GpuLayers = 0&lt;/code&gt; or not set.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;GpuLayers = 999&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-memory on model load&lt;/td&gt;
&lt;td&gt;VRAM too small for full offload.&lt;/td&gt;
&lt;td&gt;Lower &lt;code&gt;GpuLayers&lt;/code&gt;, enable flash attention, quantize KV.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slower than expected on multi-GPU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SplitMode = None&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;Switch to &lt;code&gt;LAYER&lt;/code&gt; or &lt;code&gt;ROW&lt;/code&gt; depending on interconnect.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers&lt;/code&gt;, &lt;code&gt;SplitMode&lt;/code&gt;, &lt;code&gt;TensorSplit&lt;/code&gt;, &lt;code&gt;MainGpu&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; — cross-vendor GPU alternative.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Metal</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/</guid>
      <description>
        
        
        &lt;p&gt;Metal is Aspose.LLM for .NET&amp;rsquo;s preferred backend on Apple Silicon Macs (M1, M2, M3, M4). It uses the unified memory architecture — the CPU and GPU share the same physical RAM — so there is no separate VRAM budget to worry about.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Apple Silicon Mac (M-series chip). Intel Macs are not supported.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: macOS 11 (Big Sur) or later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No driver install&lt;/strong&gt;: Metal ships with macOS; nothing to configure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;select-metal&#34;&gt;Select Metal&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Metal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On first run, the SDK downloads the Metal variant of &lt;code&gt;llama.cpp&lt;/code&gt; binaries (typically 100-200 MB). Subsequent runs use the cache.&lt;/p&gt;
&lt;p&gt;On Apple Silicon with &lt;code&gt;PreferredAcceleration = null&lt;/code&gt;, auto-detection picks Metal by default — the explicit setting is redundant but harmless.&lt;/p&gt;
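&lt;p&gt;Relying on auto-detection therefore works just as well:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;var preset = new Qwen25Preset();
// null = auto-detect; on Apple Silicon this resolves to Metal.
preset.BinaryManagerParameters.PreferredAcceleration = null;
preset.BaseModelInferenceParameters.GpuLayers = 999;

using var api = AsposeLLMApi.Create(preset);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;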
&lt;h2 id=&#34;unified-memory&#34;&gt;Unified memory&lt;/h2&gt;
&lt;p&gt;Because RAM and GPU memory are the same physical chip, &lt;code&gt;GpuLayers = 999&lt;/code&gt; does not create a separate VRAM claim — it tells the Metal backend to run the layers on the GPU compute units. The total memory footprint is the same whether you run on CPU or Metal, but Metal is significantly faster for matrix math.&lt;/p&gt;
&lt;p&gt;You do not need to worry about &amp;ldquo;VRAM fits&amp;rdquo; calculations on Apple Silicon. If the model plus KV cache fit in system RAM, they fit for Metal too.&lt;/p&gt;
&lt;p&gt;Typical memory ceilings per Mac:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mac model&lt;/th&gt;
&lt;th&gt;Unified memory options&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M1 / M2 / M3 base&lt;/td&gt;
&lt;td&gt;8 GB, 16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M1 / M2 / M3 Pro&lt;/td&gt;
&lt;td&gt;16 GB, 32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M1 / M2 / M3 Max&lt;/td&gt;
&lt;td&gt;32 GB, 64 GB, 96 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M2 / M3 Ultra&lt;/td&gt;
&lt;td&gt;64 GB, 128 GB, 192 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M4 / M4 Pro / M4 Max&lt;/td&gt;
&lt;td&gt;16 GB up to 128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A 7B Q4_K_M model needs ~8-12 GB including KV cache — comfortable on any 16 GB Mac. A 70B Q4 needs 40+ GB and realistically wants an M2/M3 Ultra or M4 Max with 64 GB or more.&lt;/p&gt;
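&lt;p&gt;The &amp;ldquo;8-12 GB&amp;rdquo; figure can be sanity-checked with standard &lt;code&gt;llama.cpp&lt;/code&gt; sizing arithmetic. The sketch below assumes a typical 7-8B architecture (32 layers, 8 KV heads, head dimension 128, f16 KV cache); adjust the constants for your model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

// Weights: Q4_K_M averages roughly 0.6 bytes per parameter.
double weightsGb = 7e9 * 0.6 / 1e9;                  // ~4.2 GB

// KV cache: 2 (K and V) * layers * context * kvHeads * headDim * bytesPerElement.
double kvGb = 2.0 * 32 * 32768 * 8 * 128 * 2 / 1e9;  // ~4.3 GB at 32K context

Console.WriteLine($&amp;#34;~{weightsGb + kvGb:F1} GB total&amp;#34;); // ~8.5 GB, inside the 8-12 GB band
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;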
&lt;h2 id=&#34;single-chip-only&#34;&gt;Single-chip only&lt;/h2&gt;
&lt;p&gt;Multi-GPU is not applicable on Apple Silicon. &lt;code&gt;MainGpu&lt;/code&gt;, &lt;code&gt;SplitMode&lt;/code&gt;, and &lt;code&gt;TensorSplit&lt;/code&gt; have no effect — leave them at defaults.&lt;/p&gt;
&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full offload&lt;/strong&gt; — &lt;code&gt;GpuLayers = 999&lt;/code&gt; always. Partial offload is rarely useful because there is no separate VRAM budget.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt;. Metal implements flash attention kernels efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer smaller quantizations&lt;/strong&gt; — Apple Silicon&amp;rsquo;s memory bandwidth, not compute, is often the bottleneck. Q4 models run noticeably faster than Q8 on the same hardware for this reason.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;power-management&#34;&gt;Power management&lt;/h2&gt;
&lt;p&gt;Metal respects macOS power policy. On battery, macOS may throttle GPU clocks and reduce inference speed. Plug into AC for sustained throughput when benchmarking or running long jobs.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;This backend is not supported&amp;rdquo; on Intel Mac&lt;/td&gt;
&lt;td&gt;Intel hardware.&lt;/td&gt;
&lt;td&gt;Use CPU acceleration instead; upgrade to Apple Silicon for GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slower than expected&lt;/td&gt;
&lt;td&gt;macOS throttling on battery.&lt;/td&gt;
&lt;td&gt;Connect to AC power; close GPU-heavy apps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-memory&lt;/td&gt;
&lt;td&gt;Combined model + KV cache exceeds system RAM.&lt;/td&gt;
&lt;td&gt;Smaller preset, lower context size, quantize KV cache.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers&lt;/code&gt; configuration.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — flash attention and KV cache settings.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Vulkan</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/</guid>
      <description>
        
        
        &lt;p&gt;Vulkan is a cross-vendor GPU backend that works on most modern discrete and integrated GPUs. It does not need CUDA or ROCm installed — only a recent graphics driver with Vulkan support. Use it when CUDA or HIP are unavailable, or when you want one codepath that works across NVIDIA, AMD, and Intel hardware.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: any GPU with Vulkan 1.2 or later support. This covers most discrete GPUs from 2018 onward, NVIDIA cards back to Maxwell, and integrated GPUs from Intel (Xe, UHD) and AMD (RDNA, Vega).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;NVIDIA: 470 or later.&lt;/li&gt;
&lt;li&gt;AMD: modern Adrenalin or Mesa RADV on Linux.&lt;/li&gt;
&lt;li&gt;Intel: current Arc / Xe driver.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Windows 10+ or Linux (glibc 2.28+). Not supported on macOS — Metal is the macOS equivalent.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Verify Vulkan support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Windows&lt;/strong&gt;: install the Vulkan SDK and run &lt;code&gt;vulkaninfoSDK.exe&lt;/code&gt;, or run any Vulkan demo app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linux&lt;/strong&gt;: &lt;code&gt;vulkaninfo&lt;/code&gt; from the &lt;code&gt;vulkan-tools&lt;/code&gt; package.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;select-vulkan&#34;&gt;Select Vulkan&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Vulkan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The SDK downloads the Vulkan variant (typically 200-400 MB) on first run.&lt;/p&gt;
&lt;h2 id=&#34;when-to-prefer-vulkan&#34;&gt;When to prefer Vulkan&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cross-vendor deployments&lt;/strong&gt; — the same binary works on NVIDIA, AMD, and Intel GPUs without swapping backends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrated GPUs&lt;/strong&gt; — Intel Xe and AMD integrated Radeon work via Vulkan with no vendor-specific runtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Containers without CUDA/ROCm&lt;/strong&gt; — Vulkan runs without NVIDIA Container Toolkit or ROCm Docker setup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Older hardware&lt;/strong&gt; — many GPUs that no longer get CUDA updates still have Vulkan drivers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-to-prefer-cuda-or-hip-instead&#34;&gt;When to prefer CUDA or HIP instead&lt;/h2&gt;
&lt;p&gt;For dedicated NVIDIA or AMD workloads, vendor-specific backends are usually faster:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/&#34;&gt;CUDA&lt;/a&gt; is typically 20-40 % faster than Vulkan on NVIDIA.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/&#34;&gt;HIP&lt;/a&gt; is faster than Vulkan on AMD for most workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use Vulkan when portability trumps raw speed.&lt;/p&gt;
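&lt;p&gt;A small helper can encode that trade-off, using only the platform facts from this page (Metal on macOS, Vulkan elsewhere):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System.Runtime.InteropServices;
using Aspose.LLM.Abstractions.Acceleration;

static AccelerationType PortableBackend()
{
    // Vulkan is not available on macOS; Metal is the equivalent there.
    if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX))
        return AccelerationType.Metal;
    return AccelerationType.Vulkan;
}

preset.BinaryManagerParameters.PreferredAcceleration = PortableBackend();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;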
&lt;h2 id=&#34;multi-gpu&#34;&gt;Multi-GPU&lt;/h2&gt;
&lt;p&gt;Vulkan supports multi-GPU setups via &lt;code&gt;SplitMode&lt;/code&gt; and &lt;code&gt;TensorSplit&lt;/code&gt; like CUDA, but driver support for multi-device Vulkan is less mature. Test on your specific hardware before committing — single-GPU Vulkan is well-trodden; multi-GPU Vulkan is hit-or-miss.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_LAYER&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Optional TensorSplit for unequal GPU sizes.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — supported on most Vulkan drivers; enable via &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrated GPU memory&lt;/strong&gt; — iGPUs use system RAM for GPU memory. Short contexts and heavy KV quantization help fit within the shared pool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver version matters&lt;/strong&gt; — recent Vulkan drivers have significant perf improvements for LLM inference. Keep drivers current.&lt;/li&gt;
&lt;/ul&gt;
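&lt;p&gt;For an integrated GPU drawing on the shared system pool, the tips above might combine like this (the context size is an example value):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.Vulkan;
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.ContextParameters.ContextSize = 4096;    // keep the shared pool small
preset.ContextParameters.TypeV = GgmlType.Q8_0; // heavy KV quantization
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;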
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vulkan binary downloaded but runs on CPU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GpuLayers = 0&lt;/code&gt;, or Vulkan driver missing.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;GpuLayers = 999&lt;/code&gt;; verify Vulkan with &lt;code&gt;vulkaninfo&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crash at model load&lt;/td&gt;
&lt;td&gt;Very old driver or unsupported GPU.&lt;/td&gt;
&lt;td&gt;Update driver; fall back to CPU if hardware is pre-Vulkan-1.2.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significantly slower than CUDA on NVIDIA&lt;/td&gt;
&lt;td&gt;Expected — use CUDA instead for best performance.&lt;/td&gt;
&lt;td&gt;Switch &lt;code&gt;PreferredAcceleration&lt;/code&gt; to &lt;code&gt;CUDA&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/&#34;&gt;CUDA&lt;/a&gt; — faster on NVIDIA.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/&#34;&gt;HIP / ROCm&lt;/a&gt; — faster on AMD.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/&#34;&gt;CPU&lt;/a&gt; — last-resort fallback.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — GPU offload configuration.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: HIP / ROCm</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/</guid>
      <description>
        
        
        &lt;p&gt;HIP (and its underlying ROCm stack) is the AMD-specific GPU backend for Aspose.LLM for .NET. It is Linux-only and targeted at AMD Instinct data-center GPUs and recent Radeon consumer cards.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: ROCm-supported AMD GPU.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instinct&lt;/strong&gt;: MI100, MI210, MI250, MI300 series.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Radeon&lt;/strong&gt;: RDNA 3 (RX 7900 series) and newer are officially supported. Some RDNA 2 cards (RX 6800/6900) work with &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; workarounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Linux with ROCm 6.x. Ubuntu 22.04 LTS and RHEL 9 are the commonly tested hosts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;: install the ROCm stack via the official AMD packages. Verify with &lt;code&gt;rocminfo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No Windows support&lt;/strong&gt;: AMD has Windows ROCm in preview, but Aspose.LLM&amp;rsquo;s HIP binaries currently target Linux only. On Windows with AMD GPUs, use &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Verify ROCm:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;rocminfo &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; grep &lt;span class=&#34;s2&#34;&gt;&amp;#34;Name:&amp;#34;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output should list your GPU by name (e.g., &lt;code&gt;gfx1100&lt;/code&gt; for RX 7900 XTX).&lt;/p&gt;
&lt;h2 id=&#34;select-hip&#34;&gt;Select HIP&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HIP&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The SDK downloads the HIP variant (typically 300-500 MB) on first run.&lt;/p&gt;
&lt;h2 id=&#34;multi-gpu&#34;&gt;Multi-GPU&lt;/h2&gt;
&lt;p&gt;HIP supports multi-GPU across AMD cards of the same ROCm generation.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_LAYER&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Optionally tune TensorSplit per-GPU VRAM.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Use &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; to control which GPUs the process sees.&lt;/p&gt;
&lt;h2 id=&#34;rdna-2-workaround&#34;&gt;RDNA 2 workaround&lt;/h2&gt;
&lt;p&gt;Some RDNA 2 cards (e.g., RX 6800 XT, RX 6900 XT) are not officially supported by ROCm, but work with a GFX version override:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;HSA_OVERRIDE_GFX_VERSION&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;10.3.0
dotnet run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Set the variable before starting your process. The override makes ROCm treat the card as a supported variant. Output quality is unaffected; performance is slightly lower than on officially supported cards.&lt;/p&gt;
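&lt;p&gt;The override can also be applied in-process, provided it runs before the native HIP backend initializes. A sketch using standard .NET APIs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

// Must be set before AsposeLLMApi.Create initializes ROCm.
Environment.SetEnvironmentVariable(&amp;#34;HSA_OVERRIDE_GFX_VERSION&amp;#34;, &amp;#34;10.3.0&amp;#34;);

using var api = AsposeLLMApi.Create(preset);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;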
&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — supported on recent ROCm releases; enable via &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partial offload&lt;/strong&gt; — for consumer Radeon cards with limited VRAM (16-24 GB), tune &lt;code&gt;GpuLayers&lt;/code&gt; to slightly below full to leave room for KV cache.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV quantization&lt;/strong&gt; — aggressive &lt;code&gt;TypeV = GgmlType.Q8_0&lt;/code&gt; or even &lt;code&gt;Q4_0&lt;/code&gt; claws back meaningful VRAM on long contexts.&lt;/li&gt;
&lt;/ul&gt;
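&lt;p&gt;For a 24 GB consumer Radeon, the tips above might combine like this (the layer count is illustrative; benchmark on your card):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.HIP;
preset.BaseModelInferenceParameters.GpuLayers = 30;   // slightly below full; leaves VRAM for the KV cache
preset.ContextParameters.TypeV = GgmlType.Q8_0;       // or GgmlType.Q4_0 for very long contexts
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;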
&lt;h2 id=&#34;windows-amd-users-use-vulkan&#34;&gt;Windows AMD users: use Vulkan&lt;/h2&gt;
&lt;p&gt;Aspose.LLM does not ship Windows HIP binaries. If your AMD GPU is on Windows, use &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; — AMD&amp;rsquo;s Vulkan driver is excellent and gets you GPU acceleration without ROCm.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rocblas_status_internal_error&lt;/code&gt; on load&lt;/td&gt;
&lt;td&gt;Incompatible ROCm version.&lt;/td&gt;
&lt;td&gt;Match ROCm 6.x; upgrade if older.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsupported GPU at startup&lt;/td&gt;
&lt;td&gt;Card not on ROCm&amp;rsquo;s support list.&lt;/td&gt;
&lt;td&gt;Use Vulkan as fallback, or try &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference on CPU despite HIP binary&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GpuLayers = 0&lt;/code&gt; or ROCm runtime missing.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;GpuLayers = 999&lt;/code&gt;; verify with &lt;code&gt;rocminfo&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-GPU instability&lt;/td&gt;
&lt;td&gt;Mixing different RDNA generations.&lt;/td&gt;
&lt;td&gt;Stick to same-generation GPUs; try &lt;code&gt;LLAMA_SPLIT_MODE_LAYER&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; — Windows alternative for AMD.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt; selection.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — GPU offload and multi-GPU settings.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: CPU</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/</guid>
      <description>
        
        
        &lt;p&gt;CPU is the fallback backend for Aspose.LLM for .NET when no supported GPU is available. It works on every supported platform and requires no special drivers or runtimes. Throughput is lower than GPU inference, but reasonable for small models (3B-7B parameters) on modern CPUs.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: any x64 CPU. ARM64 is supported on Linux and macOS (Apple Silicon). The SDK picks the optimal CPU variant automatically based on detected instruction sets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Windows 10+, Linux (glibc 2.28+), macOS 11+. Intel Macs use CPU; Apple Silicon Macs prefer &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/&#34;&gt;Metal&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cpu-variants&#34;&gt;CPU variants&lt;/h2&gt;
&lt;p&gt;The SDK downloads one of four CPU variants depending on your CPU&amp;rsquo;s instruction sets:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Requires&lt;/th&gt;
&lt;th&gt;Typical CPUs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVX512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AVX-512 instructions&lt;/td&gt;
&lt;td&gt;Intel Xeon Scalable (2nd gen and later), recent Intel Core (Cannon Lake and later), AMD Zen 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVX2&lt;/code&gt; (default fallback)&lt;/td&gt;
&lt;td&gt;AVX2 instructions&lt;/td&gt;
&lt;td&gt;Most CPUs from 2014 onwards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AVX instructions&lt;/td&gt;
&lt;td&gt;Sandy Bridge / Ivy Bridge era (2011-2013)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NoAVX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;x64 baseline&lt;/td&gt;
&lt;td&gt;Pre-2011 CPUs (very slow)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Auto-detection picks the highest available variant. To force a specific variant:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;AVX2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// or AVX512, AVX, NoAVX
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;CPU download size: typically 80-150 MB.&lt;/p&gt;
&lt;h2 id=&#34;thread-configuration&#34;&gt;Thread configuration&lt;/h2&gt;
&lt;p&gt;On CPU, threading is the primary lever for throughput. Two settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ContextParameters.NThreads&lt;/code&gt;&lt;/strong&gt; — threads for generation (token-by-token decode). Typically &lt;code&gt;ProcessorCount / 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ContextParameters.NThreadsBatch&lt;/code&gt;&lt;/strong&gt; — threads for prompt processing (initial tokenization and context fill). Typically the full &lt;code&gt;ProcessorCount&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreads&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreadsBatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When &lt;code&gt;NThreads&lt;/code&gt; is &lt;code&gt;null&lt;/code&gt;, the engine falls back to &lt;code&gt;EngineParameters.DefaultThreads&lt;/code&gt; (default: &lt;code&gt;ProcessorCount - 1&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&#34;why-different-thread-counts&#34;&gt;Why different thread counts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prompt processing&lt;/strong&gt; is embarrassingly parallel — more threads help.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generation&lt;/strong&gt; is sequential at the token level and bound by memory bandwidth. Beyond a certain point (often around 8-12 threads), adding threads does not help and sometimes slows generation due to NUMA effects or cache contention.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Benchmark on your specific hardware for the best numbers. Start with &lt;code&gt;NThreads = ProcessorCount / 2&lt;/code&gt; and &lt;code&gt;NThreadsBatch = ProcessorCount&lt;/code&gt;, then tune.&lt;/p&gt;
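&lt;p&gt;That starting point, expressed with &lt;code&gt;Environment.ProcessorCount&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

int cores = Environment.ProcessorCount;
preset.ContextParameters.NThreads = cores / 2;  // generation: memory-bandwidth bound
preset.ContextParameters.NThreadsBatch = cores; // prompt processing: embarrassingly parallel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;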
&lt;h2 id=&#34;performance-expectations&#34;&gt;Performance expectations&lt;/h2&gt;
&lt;p&gt;Rough tokens-per-second for a 7B Q4_K_M model on CPU only:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right&#34;&gt;CPU class&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Approximate t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Old x64 (2015)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Modern Core i5 / Ryzen 5&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;5-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Modern Core i7 / Ryzen 7 / Xeon&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;High-end Core i9 / Threadripper / Xeon Platinum&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;12-25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These are ballpark figures; real-world numbers depend on clock speed, memory bandwidth, AVX level, and how busy the CPU is with other work.&lt;/p&gt;
&lt;p&gt;For sustained CPU inference, expect 5-15 t/s on mainstream desktop CPUs. That is acceptable for chat UIs and batch jobs, but slow for real-time applications.&lt;/p&gt;
&lt;h2 id=&#34;memory&#34;&gt;Memory&lt;/h2&gt;
&lt;p&gt;Model weights plus KV cache live in system RAM. Typical memory for a 7B Q4_K_M model with 32K context: 8-12 GB. See &lt;a href=&#34;https://docs.aspose.com/llm/net/system-requirements/#memory&#34;&gt;System requirements&lt;/a&gt; for per-preset estimates.&lt;/p&gt;
&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use AVX512 when available&lt;/strong&gt; — on Zen 4 or recent Xeon, AVX512 is 20-40 % faster than AVX2 for the same model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt; helps on long contexts even on CPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize the KV cache&lt;/strong&gt; — &lt;code&gt;TypeV = GgmlType.Q8_0&lt;/code&gt; halves V-cache memory with minimal quality impact.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smaller models first&lt;/strong&gt; — 3B models (Llama 3.2, Phi 4 mini) run 2-3× faster than 7B models on the same CPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disable other CPU-heavy work&lt;/strong&gt; during inference. CPU throughput is sensitive to contention.&lt;/li&gt;
&lt;/ul&gt;
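&lt;p&gt;The first tip can be made explicit with .NET&amp;rsquo;s hardware-intrinsics flags (&lt;code&gt;Avx512F.IsSupported&lt;/code&gt; requires .NET 8 or later):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System.Runtime.Intrinsics.X86;
using Aspose.LLM.Abstractions.Acceleration;

// Pick the best CPU variant explicitly instead of trusting auto-detection.
if (Avx512F.IsSupported)
    preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX512;
else if (Avx2.IsSupported)
    preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX2;
preset.BaseModelInferenceParameters.GpuLayers = 0; // CPU-only
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;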
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Very slow inference&lt;/td&gt;
&lt;td&gt;Using &lt;code&gt;NoAVX&lt;/code&gt; or &lt;code&gt;AVX&lt;/code&gt; variant on a modern CPU.&lt;/td&gt;
&lt;td&gt;Verify &lt;code&gt;PreferredAcceleration&lt;/code&gt; auto-detection; force &lt;code&gt;AVX2&lt;/code&gt; or &lt;code&gt;AVX512&lt;/code&gt; explicitly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High CPU usage, low throughput&lt;/td&gt;
&lt;td&gt;Too many threads — contention.&lt;/td&gt;
&lt;td&gt;Reduce &lt;code&gt;NThreads&lt;/code&gt;; try &lt;code&gt;NThreads = ProcessorCount / 2&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stalls between tokens&lt;/td&gt;
&lt;td&gt;Memory pressure or page thrashing.&lt;/td&gt;
&lt;td&gt;Check memory; set &lt;code&gt;UseMemoryMapping = true&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crashes on &lt;code&gt;AVX512&lt;/code&gt; variant&lt;/td&gt;
&lt;td&gt;CPU advertises AVX-512 but its implementation is buggy or incomplete.&lt;/td&gt;
&lt;td&gt;Drop to &lt;code&gt;AVX2&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers = 0&lt;/code&gt; pattern for CPU-only.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — &lt;code&gt;NThreads&lt;/code&gt; and &lt;code&gt;NThreadsBatch&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/&#34;&gt;Metal&lt;/a&gt; — macOS-preferred alternative on Apple Silicon.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
