GPU deployment with CUDA
CUDA is the fastest backend for Aspose.LLM for .NET on NVIDIA hardware. This use case walks through a production GPU deployment — single GPU, multi-GPU, and VRAM sizing.
When to use this pattern
- Production host with NVIDIA GPU.
- You want maximum throughput for a given model size.
- Multi-GPU server where you need to split a large model.
- Cloud VM with CUDA-capable instance (e.g., g5, p3/p4).
Prerequisites
- Install the NuGet package.
- Apply a license.
- NVIDIA GPU with compute capability 5.0+.
- NVIDIA driver 525 or later.
- See the System requirements → GPU section for the full matrix.
Verify with nvidia-smi:
nvidia-smi
# Check Driver Version >= 525 and note GPU model + VRAM.
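If you prefer to verify from code, for example in a startup health check, a minimal sketch is to shell out to nvidia-smi with standard query flags and read the result. This uses only the .NET Process API and nvidia-smi itself; it is not part of Aspose.LLM.
using System.Diagnostics;
// Query driver version, GPU name, and total VRAM from nvidia-smi.
var psi = new ProcessStartInfo("nvidia-smi",
    "--query-gpu=driver_version,name,memory.total --format=csv,noheader")
{
    RedirectStandardOutput = true
};
using var proc = Process.Start(psi)!;
Console.WriteLine(proc.StandardOutput.ReadToEnd());
proc.WaitForExit();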
Single GPU, full offload
The common case: one GPU, whole model on it.
using Aspose.LLM;
using Aspose.LLM.Abstractions.Acceleration;
using Aspose.LLM.Abstractions.Parameters.Presets;
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");
var preset = new Qwen25Preset();
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.GpuLayers = 999; // full offload
using var api = AsposeLLMApi.Create(preset);
string reply = await api.SendMessageAsync("Explain Kubernetes in one paragraph.");
Console.WriteLine(reply);
GpuLayers = 999 is idiomatic for “put everything on the GPU”. The engine caps the value at the model’s actual layer count.
Single GPU, partial offload
For a model that does not quite fit in VRAM, offload the first N layers and keep the rest on CPU. Find the sweet spot by benchmarking.
var preset = new Qwen3Preset(); // 8B, 32 layers
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
// First 28 layers on GPU, last 4 on CPU.
preset.BaseModelInferenceParameters.GpuLayers = 28;
using var api = AsposeLLMApi.Create(preset);
The KV cache also consumes VRAM proportional to GPU-resident layers. Leave ~1-2 GB of VRAM headroom.
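To find that sweet spot empirically, one approach is to time the same prompt at a few GpuLayers values and keep the highest count that loads and runs fastest. A rough sketch, assuming the candidate layer counts and the test prompt are placeholders; the exception type thrown on a failed load is not specified here, so the sketch catches Exception.
using Aspose.LLM;
using Aspose.LLM.Abstractions.Acceleration;
using Aspose.LLM.Abstractions.Parameters.Presets;
foreach (int layers in new[] { 32, 28, 24, 20 }) // candidate offload counts
{
    var preset = new Qwen3Preset();
    preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
    preset.BaseModelInferenceParameters.GpuLayers = layers;
    try
    {
        using var api = AsposeLLMApi.Create(preset);
        var sw = System.Diagnostics.Stopwatch.StartNew();
        await api.SendMessageAsync("Summarize the plot of Hamlet in five sentences.");
        sw.Stop();
        Console.WriteLine($"GpuLayers={layers}: {sw.Elapsed.TotalSeconds:F1}s");
    }
    catch (Exception ex) // assumption: a load that does not fit surfaces as an exception
    {
        Console.WriteLine($"GpuLayers={layers}: failed ({ex.Message})");
    }
}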
Multi-GPU split
Two or more NVIDIA GPUs — distribute the model across them.
using Aspose.LLM.Abstractions.Parameters;
var preset = new Oss20Preset(); // 20B — typically needs 16-24 GB
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.GpuLayers = 999;
// For equal-VRAM GPUs, leave TensorSplit null (equal split).
// Or set the split explicitly; for unequal GPUs, weight the values by VRAM (see below):
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 1.0f, 1.0f }; // 50/50
using var api = AsposeLLMApi.Create(preset);
Split modes:
| Mode | Good for |
|---|---|
| LLAMA_SPLIT_MODE_NONE | Single GPU; all on MainGpu. |
| LLAMA_SPLIT_MODE_LAYER | General multi-GPU. Splits layers across devices. |
| LLAMA_SPLIT_MODE_ROW | Multi-GPU with high-bandwidth interconnect (NVLink). Best throughput with tensor parallelism. |
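For NVLink-connected cards, only the split-mode line changes relative to the layer-split example above:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_ROW;
preset.BaseModelInferenceParameters.GpuLayers = 999;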
Unequal VRAM
If the cards differ (e.g., 24 GB RTX 4090 + 12 GB RTX 3080), bias the split toward the larger:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_LAYER;
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 2.0f, 1.0f };
// 2:1 proportion — values normalized.
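Because the values are normalized, it can be convenient to pass the cards' VRAM sizes in gigabytes directly and let the engine work out the ratio; this is equivalent to the 2:1 example above.
// 24 GB and 12 GB cards; same effect as { 2.0f, 1.0f }.
preset.BaseModelInferenceParameters.TensorSplit = new float[] { 24f, 12f };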
Select a specific GPU
MainGpu picks the device when split mode is None. On hosts with multiple GPUs where you only want one for inference:
preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_NONE;
preset.BaseModelInferenceParameters.MainGpu = 1; // index 1
preset.BaseModelInferenceParameters.GpuLayers = 999;
Or use the standard NVIDIA env var to hide GPUs from the process:
CUDA_VISIBLE_DEVICES=1 dotnet run
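If you cannot control the launch environment, the variable can also be set from code before any CUDA work happens, that is, before the first AsposeLLMApi.Create call. This relies on standard CUDA behavior rather than anything specific to Aspose.LLM, and whether the native runtime picks up a variable set from managed code can be platform-dependent; setting it in the shell remains the most reliable route.
// Must run before any CUDA initialization in the process.
Environment.SetEnvironmentVariable("CUDA_VISIBLE_DEVICES", "1");
using var api = AsposeLLMApi.Create(preset);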
VRAM sizing
VRAM budget breakdown:
- Model weights (scales with quantization and layer count).
- KV cache (scales with ContextSize × layers on GPU).
- Intermediate buffers (50-500 MB).
Example: Qwen 3 8B Q4_K_M at 32K context on full GPU offload:
- Weights: ~5 GB
- KV cache (32K tokens × 32 layers, F16 by default): ~4 GB
- Buffers: ~300 MB
- Total: ~9-10 GB → fits on a 12 GB card comfortably
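As a sanity check, the KV-cache term from this breakdown can be estimated directly. A sketch of the arithmetic, assuming grouped-query attention with a KV width of 1024 (8 KV heads × head size 128) and F16 (2-byte) cache entries; substitute your model's real dimensions.
// K and V caches: 2 tensors × layers × context × KV width × bytes per element.
int gpuLayers = 32, contextSize = 32_768, kvWidth = 1024, bytesPerElem = 2; // F16
long kvCacheBytes = 2L * gpuLayers * contextSize * kvWidth * bytesPerElem;
Console.WriteLine($"KV cache ≈ {kvCacheBytes / (1024.0 * 1024 * 1024):F1} GB"); // ≈ 4.0 GB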
To cut KV cache memory, quantize the V tensor:
preset.ContextParameters.TypeV = GgmlType.Q8_0; // half V-cache size
Full example
using System;
using System.Threading.Tasks;
using Aspose.LLM;
using Aspose.LLM.Abstractions.Acceleration;
using Aspose.LLM.Abstractions.Models;
using Aspose.LLM.Abstractions.Parameters;
using Aspose.LLM.Abstractions.Parameters.Presets;

internal class CudaDemo
{
    public static async Task Main()
    {
        var license = new Aspose.LLM.License();
        license.SetLicense("Aspose.LLM.lic");

        var preset = new Qwen3Preset();

        // CUDA backend.
        preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;

        // Full GPU offload.
        preset.BaseModelInferenceParameters.GpuLayers = 999;
        preset.BaseModelInferenceParameters.SplitMode = LlamaSplitMode.LLAMA_SPLIT_MODE_NONE;

        // Save VRAM on long sessions.
        preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
        preset.ContextParameters.TypeV = GgmlType.Q8_0;

        using var api = AsposeLLMApi.Create(preset);

        var sw = System.Diagnostics.Stopwatch.StartNew();
        string reply = await api.SendMessageAsync(
            "Write a 200-word overview of microservices architecture trade-offs.");
        sw.Stop();

        Console.WriteLine(reply);
        Console.WriteLine($"\nGenerated in {sw.Elapsed.TotalSeconds:F1}s ({reply.Length} chars).");
    }
}
Throughput expectations
Full-offload tokens-per-second for a 7B Q4_K_M model at 32K context:
| GPU | t/s |
|---|---|
| RTX 3060 (12 GB) | 40-60 |
| RTX 3090 / 4070 | 60-90 |
| RTX 4080 / A6000 | 80-120 |
| RTX 4090 | 100-140 |
| A100 40 GB | 120-160 |
| H100 | 180-250+ |
Numbers vary with batch size, context depth, driver version, and prompt length.
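To compare a host against this table, the stopwatch from the full example above can be extended into a rough tokens-per-second estimate. reply and sw are the variables from that example, and the 4-characters-per-token figure is a common approximation for English text, not an exact tokenizer count.
double approxTokens = reply.Length / 4.0;                        // rough chars-per-token heuristic
double tokensPerSecond = approxTokens / sw.Elapsed.TotalSeconds; // generation speed estimate
Console.WriteLine($"~{tokensPerSecond:F0} tokens/s (approximate)");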
Cloud deployment
Common CUDA-capable cloud instances:
| Provider | Instance type | GPU |
|---|---|---|
| AWS | g5.xlarge and up | A10G |
| AWS | p3, p4 | V100, A100 |
| Azure | NC-series | T4, A100 |
| GCP | A2, G2 | A100, L4 |
Ensure the image ships with NVIDIA driver 525+ and CUDA runtime. Most GPU AMIs are current; for custom base images, install nvidia-driver-525 (or later) before deploying.
Common issues
- CUDA binary downloaded but inference runs on CPU — GpuLayers is 0 or not set. Set it to 999.
- cudaErrorInsufficientDriver — upgrade the driver to 525 or later.
- Out-of-memory at load — use partial offload or a smaller preset (see the sketch after this list).
- Slower multi-GPU than expected — switch from LAYER to ROW if the GPUs have NVLink; otherwise LAYER is usually best.
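For the out-of-memory case, a defensive loading pattern is to fall back to a smaller offload automatically. A sketch only, assuming a preset configured for CUDA as in the earlier examples; the exact exception type thrown on a failed load is not documented here, so it catches Exception.
// Assumes 'preset' is already configured for CUDA as shown above.
AsposeLLMApi? api = null;
foreach (int layers in new[] { 999, 24, 16 }) // full offload first, then smaller fallbacks
{
    try
    {
        preset.BaseModelInferenceParameters.GpuLayers = layers;
        api = AsposeLLMApi.Create(preset);
        Console.WriteLine($"Loaded with GpuLayers={layers}.");
        break;
    }
    catch (Exception) // assumption: a failed load surfaces as an exception
    {
        Console.WriteLine($"Load failed with GpuLayers={layers}; trying fewer layers.");
    }
}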
What’s next
- CUDA acceleration — full CUDA reference.
- Model inference parameters — GpuLayers, TensorSplit, SplitMode, MainGpu.
- Context parameters — flash attention and KV dtype.