GPU not detected
You wanted the SDK to use the GPU, but inference is slow and nvidia-smi (or equivalent) shows no activity from the process. This page walks through the detection pipeline and common misconfigurations.
Symptom
- Inference runs at CPU speed (5-15 tokens/sec) despite a GPU being present.
- `nvidia-smi` does not list the Aspose.LLM process. `rocm-smi` shows zero utilization.
- Logs do not mention CUDA / HIP / Metal / Vulkan initialization.
Cause
The SDK picks a backend in two stages:
1. `BinaryManager` downloads a native binary matching `BinaryManagerParameters.PreferredAcceleration` (or auto-detection). The binary dictates which GPU APIs are available.
2. `Engine` respects `BaseModelInferenceParameters.GpuLayers` — if `0`, the model stays on CPU even if the binary supports GPU.
Either stage can silently fall back to CPU.
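Under the assumptions used elsewhere on this page (a `preset` object exposing `BinaryManagerParameters` and `BaseModelInferenceParameters`), pinning both stages explicitly looks like this — a sketch, not the only valid configuration:

```csharp
using Aspose.LLM.Abstractions.Acceleration;

// Stage 1: force the BinaryManager to fetch the CUDA binary
// instead of relying on auto-detection.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;

// Stage 2: ask the Engine to offload all layers; leaving this at 0
// keeps the model on CPU even with a GPU-capable binary.
preset.BaseModelInferenceParameters.GpuLayers = 999;
```

With both set, neither stage can silently fall back to CPU without a visible error.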
Resolution
1. Check the downloaded binary
Enable debug logging and look for the binary selection line in logs:
[BinaryManager] resolved asset: llama-b8816-bin-win-cuda-cu12.4-x64.zip
If the asset name says cpu or does not mention CUDA/HIP/Metal/Vulkan, the BinaryManager did not detect the GPU. Fix by forcing:
using Aspose.LLM.Abstractions.Acceleration;
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
Then clear the binary cache at BinaryManagerParameters.BinaryPath and re-run to download the GPU variant.
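As a sketch of the cache-clearing step (the directory below is an assumption — substitute whatever `BinaryManagerParameters.BinaryPath` is set to in your configuration):

```shell
# BINARY_CACHE is hypothetical -- use the directory your
# BinaryManagerParameters.BinaryPath actually points to.
BINARY_CACHE="$HOME/.aspose_llm/binaries"

# Remove the cached CPU binary; the next run re-resolves and
# downloads the GPU variant.
rm -rf "$BINARY_CACHE"
```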
2. Check the driver
On Linux / Windows with NVIDIA:
nvidia-smi
# Must show Driver Version >= 525 and the GPU model.
If nvidia-smi does not find the GPU, the driver is not installed or the GPU is not accessible (for example, a container started without the --gpus all flag, or a host with no NVIDIA GPU).
On Linux with AMD:
rocminfo
# Must list your GPU under Agent information.
On macOS:
system_profiler SPDisplaysDataType | grep "Chipset Model"
# Must show Apple M-series for Metal support.
3. Verify GpuLayers
Even with the right binary, GpuLayers = 0 forces CPU. Set it explicitly:
preset.BaseModelInferenceParameters.GpuLayers = 999;
999 is the idiomatic “full offload” value; the engine caps it at the model’s actual layer count.
4. Check for conflicting environment variables
NVIDIA environment variables can hide GPUs from the process:
echo $CUDA_VISIBLE_DEVICES
# If set to empty or -1, no GPU is visible.
Unset or set to 0 (or a valid GPU index):
unset CUDA_VISIBLE_DEVICES
# or
export CUDA_VISIBLE_DEVICES=0
For HIP: ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES play the same role.
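The check above can be scripted so deployments fail fast. This is a hypothetical helper (`check_cuda_visible` is not part of the SDK); the same pattern applies to `ROCR_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES`:

```shell
# check_cuda_visible: prints "hidden" if CUDA_VISIBLE_DEVICES is set to
# an empty string or -1 (both hide every GPU from the process), else "ok".
check_cuda_visible() {
  if [ "${CUDA_VISIBLE_DEVICES+set}" = "set" ]; then
    case "$CUDA_VISIBLE_DEVICES" in
      ""|-1) echo "hidden"; return ;;
    esac
  fi
  echo "ok"
}
```

Note that an *unset* variable is fine (all GPUs visible); only the empty/`-1` values are the trap.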
5. Container / WSL2 specifics
Docker: you must start containers with --gpus all (NVIDIA) or --device=/dev/kfd --device=/dev/dri (AMD ROCm). Without these flags, the container has no GPU access.
WSL2 on Windows: install the NVIDIA driver on the Windows side, then install the CUDA toolkit inside WSL following NVIDIA’s WSL2 guide. Older Windows + WSL combinations do not support CUDA in WSL — upgrade to Windows 11 (or a recent Windows 10 build) and update WSL to the latest version.
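The container flags can be sanity-checked with a throwaway container before deploying the SDK; the image tags below are illustrative — any recent CUDA or ROCm image works:

```shell
# NVIDIA: requires the NVIDIA Container Toolkit on the host.
# If nvidia-smi inside the container lists the GPU, the SDK will see it too.
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

# AMD ROCm: pass the kernel driver and render devices through.
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/rocm-terminal rocminfo
```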
6. Fall back to Vulkan
If CUDA / HIP setup is impractical (custom kernels, container limitations), try Vulkan:
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.Vulkan;
preset.BaseModelInferenceParameters.GpuLayers = 999;
Vulkan runs on NVIDIA, AMD, and Intel GPUs with standard drivers. Performance is typically 20-40% below CUDA but far better than CPU.
7. Windows users with AMD — use Vulkan
Aspose.LLM does not ship HIP binaries for Windows. On Windows with AMD, Vulkan is the only GPU path.
Prevention
- During deployment, assert GPU is active with a small probe:

  // After Create, a short inference should be fast on GPU.
  var sw = System.Diagnostics.Stopwatch.StartNew();
  string reply = await api.SendMessageAsync("Say ok.");
  sw.Stop();
  if (sw.Elapsed.TotalSeconds > 2)
      _logger.LogWarning("Inference is slow - GPU may not be active.");

- Log the chosen acceleration at startup for auditability.
- Monitor GPU utilization in production (Datadog, Prometheus) to catch silent CPU fallback.
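A minimal sketch of the startup log, assuming the `preset` and `_logger` objects used elsewhere on this page:

```csharp
// Sketch: record the requested acceleration and layer count so a silent
// CPU fallback is auditable. PreferredAcceleration reflects the request,
// not proof of GPU use -- pair this with GPU utilization monitoring.
_logger.LogInformation(
    "Requested acceleration: {Accel}, GpuLayers: {Layers}",
    preset.BinaryManagerParameters.PreferredAcceleration,
    preset.BaseModelInferenceParameters.GpuLayers);
```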
What’s next
- Acceleration — detailed per-backend setup.
- Binary manager parameters — `PreferredAcceleration`.
- Model inference parameters — `GpuLayers`, `SplitMode`, `MainGpu`.