# CUDA
CUDA is the fastest backend for Aspose.LLM for .NET on NVIDIA GPUs. It supports single-GPU and multi-GPU setups, aggressive memory offload, and the full llama.cpp feature set.
## Requirements
- GPU: NVIDIA with compute capability 5.0 or higher.
- Driver: version 525 or later.
- CUDA runtime: the one bundled with the downloaded native binary (typically CUDA 11.7 or 12.x). You do not install CUDA separately — the SDK’s binary ships with the runtime.
- OS: Windows 10+ or Linux (glibc 2.28+). Not supported on macOS.
Verify the driver and GPU with `nvidia-smi`:

```shell
nvidia-smi
```

Check the Driver Version (≥ 525) and the GPU model. Any GeForce, Quadro, Tesla, or data-center GPU from the Maxwell generation (2014) onward meets the compute capability 5.0 requirement.
## Select CUDA

```csharp
using Aspose.LLM.Abstractions.Acceleration;

var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 999; // full offload
using var api = AsposeLLMApi.Create(preset);
```

On the first run, the SDK downloads the CUDA variant of the llama.cpp binaries (typically 400-800 MB) and caches it at `BinaryManagerParameters.BinaryPath`.
## Single GPU

With one GPU, the defaults work. The engine places the offloaded layers on GPU 0.

```csharp
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 999;
```
## Specific GPU selection

On hosts with multiple GPUs, `MainGpu` picks which one receives the entire model when the split mode is `None`.

```csharp
using Aspose.LLM.Abstractions.Parameters;

preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_NONE;
preset.BaseModelInferenceParameters.MainGpu = 1; // use GPU index 1
preset.BaseModelInferenceParameters.GpuLayers = 999;
```

Use the `CUDA_VISIBLE_DEVICES` environment variable to constrain which GPUs the process sees; this is standard NVIDIA tooling.
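The same restriction can be applied from managed code with the standard .NET `Environment` API. A minimal sketch, assuming it runs before the SDK first initializes CUDA (the CUDA runtime reads the variable only once, at initialization):

```csharp
using System;
using Aspose.LLM.Abstractions.Acceleration;

// Hide every GPU except physical index 1 from this process.
Environment.SetEnvironmentVariable("CUDA_VISIBLE_DEVICES", "1");

// Within the process, the one visible GPU is now index 0.
var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.MainGpu = 0;
preset.BaseModelInferenceParameters.GpuLayers = 999;
```

Setting the variable in the launch environment is equivalent; the in-process form is just convenient when you cannot control how the application is started.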
## Multi-GPU split

Distribute the model across multiple GPUs:

```csharp
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.GpuLayers = 999;
```
Split modes:

| Mode | Behavior |
|---|---|
| `LLAMA_SPLIT_MODE_NONE` | Single GPU only. Whole model on `MainGpu`. |
| `LLAMA_SPLIT_MODE_LAYER` | Split layers across GPUs. A good default for multi-GPU. |
| `LLAMA_SPLIT_MODE_ROW` | Split rows, enabling tensor parallelism where supported. Fastest with high-bandwidth GPU interconnects (NVLink). |
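Selecting the row split is the same one-line change as the layer split; this sketch assumes the `LLAMA_SPLIT_MODE_ROW` member follows the enum naming shown above:

```csharp
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
// Row split: tensors are sharded across GPUs instead of whole layers.
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_ROW;
preset.BaseModelInferenceParameters.GpuLayers = 999;
```

Benchmark ROW against LAYER on your hardware: row splitting moves far more data between GPUs per token, so over plain PCIe it is often slower than the layer split despite the extra parallelism.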
For GPUs with unequal memory sizes, set `TensorSplit` to balance the load:

```csharp
// 24 GB GPU + 12 GB GPU: 2:1 split.
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
```

The values are normalized to sum to 1, so `{ 2.0f, 1.0f }` assigns roughly two thirds of the model to the first GPU and one third to the second.
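The normalization itself is plain arithmetic and can be checked without the SDK; this standalone sketch shows what a `{ 2, 1 }` split works out to:

```csharp
using System;
using System.Linq;

float[] tensorSplit = { 2.0f, 1.0f };
float sum = tensorSplit.Sum();

for (int gpu = 0; gpu < tensorSplit.Length; gpu++)
{
    // Fractions come out as ~0.67 for GPU 0 and ~0.33 for GPU 1,
    // matching the 24 GB : 12 GB VRAM ratio above.
    Console.WriteLine($"GPU {gpu}: fraction {tensorSplit[gpu] / sum:F2}");
}
```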
## Partial offload on memory-tight GPUs

If the model does not fit entirely in VRAM, offload the first N layers and leave the rest on the CPU.

```csharp
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 28; // first 28 layers on GPU
```

Benchmark to find the right split. A good rule of thumb is to offload layers until VRAM usage sits about 1-2 GB short of full, because the KV cache also claims VRAM in proportion to the number of offloaded layers.
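A rough budget calculation gives a sensible starting point for the benchmark. The per-layer sizes below are illustrative assumptions for a quantized 7B-class model, not values reported by the SDK:

```csharp
using System;

// Illustrative assumptions, not SDK-reported values:
double freeVramGb = 10.0;        // free VRAM after driver/display overhead
double weightsPerLayerGb = 0.12; // quantized weights per transformer layer
double kvPerLayerGb = 0.06;      // KV cache per layer at the chosen context size
double headroomGb = 1.5;         // slack for compute buffers, per the rule of thumb

double perLayerGb = weightsPerLayerGb + kvPerLayerGb;
int gpuLayers = (int)((freeVramGb - headroomGb) / perLayerGb);

// With these numbers: (10 - 1.5) / 0.18, i.e. 47 layers.
Console.WriteLine($"Start benchmarking around GpuLayers = {gpuLayers}");
```

Measure your actual per-layer cost once (load with a small `GpuLayers`, read VRAM use from `nvidia-smi`, divide) and substitute it for the guesses above.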
## Memory tips

- KV cache quantization: set `ContextParameters.TypeV = GgmlType.Q8_0` to halve V-cache memory with minor quality impact.
- Flash Attention: `ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled` reduces memory use at long contexts.
- Shorter context: lower `ContextParameters.ContextSize` if you do not need the preset’s default length.
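Applied together, the tips look like the sketch below. Where exactly `ContextParameters` hangs off the preset is an assumption here, as is the 8192-token context; the type names are the ones listed above:

```csharp
// Quantize the V-cache: roughly halves V-cache memory, minor quality impact.
preset.ContextParameters.TypeV = GgmlType.Q8_0;

// Flash Attention: lowers memory use at long contexts.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

// Example only: cap the context below the preset default if you do not need it.
preset.ContextParameters.ContextSize = 8192;
```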
## Common issues

| Symptom | Likely cause | Fix |
|---|---|---|
| `cudaErrorInsufficientDriver` in logs | Driver too old. | Upgrade the NVIDIA driver to 525 or later. |
| CUDA binary downloaded but inference runs on the CPU | `GpuLayers = 0` or not set. | Set `GpuLayers = 999`. |
| Out-of-memory on model load | VRAM too small for full offload. | Lower `GpuLayers`, enable flash attention, quantize the KV cache. |
| Slower than expected on multi-GPU | `SplitMode = None`. | Switch to `LAYER` or `ROW` depending on the interconnect. |
## What’s next

- Binary manager parameters: `PreferredAcceleration`.
- Model inference parameters: `GpuLayers`, `SplitMode`, `TensorSplit`, `MainGpu`.
- Vulkan: a cross-vendor GPU alternative.