<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation – Troubleshooting</title>
    <link>https://docs.aspose.com/llm/net/troubleshooting/</link>
    <description>Recent content in Troubleshooting on Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://docs.aspose.com/llm/net/troubleshooting/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Net: Binary download fails</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/binary-download-fails/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/binary-download-fails/</guid>
      <description>
        
        
        &lt;p&gt;On first &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt;, the SDK downloads native &lt;code&gt;llama.cpp&lt;/code&gt; binaries from GitHub. In corporate or restricted environments, the download can fail. This page covers the common causes.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;HttpRequestException&lt;/code&gt; or a wrapped &lt;code&gt;InvalidOperationException&lt;/code&gt; during &lt;code&gt;Create&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Log entries mentioning GitHub, &lt;code&gt;BinaryManager&lt;/code&gt;, or HTTP status codes (403, 404, 429, 500).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Create&lt;/code&gt; hangs for a long time and eventually times out.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;The SDK contacts &lt;code&gt;github.com/ggml-org/llama.cpp/releases/tag/&amp;lt;ReleaseTag&amp;gt;&lt;/code&gt; to list release assets, then downloads the asset matching your platform and acceleration. Any of the following can block the download:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Corporate firewall or proxy blocks github.com.&lt;/li&gt;
&lt;li&gt;TLS interception replaces the expected certificate chain.&lt;/li&gt;
&lt;li&gt;GitHub rate-limit (HTTP 429) on unauthenticated requests.&lt;/li&gt;
&lt;li&gt;The requested &lt;code&gt;ReleaseTag&lt;/code&gt; does not exist upstream.&lt;/li&gt;
&lt;li&gt;Insufficient disk space for the download or extraction.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-verify-network-access&#34;&gt;1. Verify network access&lt;/h3&gt;
&lt;p&gt;Manually confirm GitHub is reachable from the host:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;curl -I https://api.github.com/repos/ggml-org/llama.cpp/releases/tags/b8816
&lt;span class=&#34;c1&#34;&gt;# Expect HTTP/2 200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the curl command fails, the problem is at the network level; fix that first (proxy configuration, firewall rule).&lt;/p&gt;
&lt;h3 id=&#34;2-configure-an-http-proxy&#34;&gt;2. Configure an HTTP proxy&lt;/h3&gt;
&lt;p&gt;On Windows, set the &lt;code&gt;HTTP_PROXY&lt;/code&gt; and &lt;code&gt;HTTPS_PROXY&lt;/code&gt; environment variables before starting the process. On Linux / macOS, export them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;HTTPS_PROXY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;http://proxy.example.com:8080
dotnet run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;.NET&amp;rsquo;s default &lt;code&gt;HttpClient&lt;/code&gt; honors these variables.&lt;/p&gt;
&lt;h3 id=&#34;3-pre-populate-the-cache&#34;&gt;3. Pre-populate the cache&lt;/h3&gt;
&lt;p&gt;If the host cannot reach GitHub at all, pre-download on a machine with internet access and copy the cache to the target. Full workflow: &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;4-check-the-releasetag&#34;&gt;4. Check the &lt;code&gt;ReleaseTag&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Confirm &lt;code&gt;BinaryManagerParameters.ReleaseTag&lt;/code&gt; matches a real upstream release:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ReleaseTag&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;b8816&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// default in SDK v26.5.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Visit &lt;code&gt;https://github.com/ggml-org/llama.cpp/releases/tag/&amp;lt;tag&amp;gt;&lt;/code&gt; in a browser to verify.&lt;/p&gt;
&lt;h3 id=&#34;5-free-up-disk-space&#34;&gt;5. Free up disk space&lt;/h3&gt;
&lt;p&gt;Binary archives are 100-500 MB; extracted trees are similar in size. Ensure at least 1-2 GB free at &lt;code&gt;BinaryManagerParameters.BinaryPath&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Linux&lt;/span&gt;
df -h ~/.local/share/Aspose.LLM/runtimes

&lt;span class=&#34;c1&#34;&gt;# Windows&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Check %LOCALAPPDATA%\Aspose.LLM\runtimes free space.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;6-handle-github-rate-limits&#34;&gt;6. Handle GitHub rate limits&lt;/h3&gt;
&lt;p&gt;If logs mention HTTP 429, you are hitting GitHub&amp;rsquo;s unauthenticated API limit (60 requests/hour per IP). Options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Wait and retry.&lt;/li&gt;
&lt;li&gt;Use an authenticated &lt;code&gt;HttpClient&lt;/code&gt; (advanced — requires a custom &lt;code&gt;IModelFileProvider&lt;/code&gt; in the extensibility layer).&lt;/li&gt;
&lt;li&gt;Pre-populate the cache so subsequent runs do not hit the API.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;7-tls-interception&#34;&gt;7. TLS interception&lt;/h3&gt;
&lt;p&gt;Corporate TLS-inspection proxies replace GitHub&amp;rsquo;s certificate with a corporate one. By default, .NET rejects the substituted certificate chain.&lt;/p&gt;
&lt;p&gt;Options (choose one):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install the corporate root certificate into the host&amp;rsquo;s certificate store.&lt;/li&gt;
&lt;li&gt;Bypass interception for &lt;code&gt;*.github.com&lt;/code&gt; and &lt;code&gt;*.githubusercontent.com&lt;/code&gt; on the proxy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; disable TLS validation in production — it is a security regression.&lt;/p&gt;
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;For production&lt;/strong&gt;: always pre-populate caches in your deployment pipeline. Do not rely on first-run downloads in production environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For development&lt;/strong&gt;: keep &lt;code&gt;BinaryPath&lt;/code&gt; and &lt;code&gt;ModelCachePath&lt;/code&gt; on a shared network drive across your team so downloads happen once per team, not once per developer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin &lt;code&gt;ReleaseTag&lt;/code&gt;&lt;/strong&gt; explicitly in your preset — the default can change across SDK upgrades.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt; — full pre-population workflow.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;BinaryPath&lt;/code&gt;, &lt;code&gt;ReleaseTag&lt;/code&gt;, &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/architecture/&#34;&gt;Architecture&lt;/a&gt; — the binary deployment stage.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Out of memory</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/out-of-memory/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/out-of-memory/</guid>
      <description>
        
        
        &lt;p&gt;Out-of-memory failures happen at model load, during long sessions, or when running several models on the same host. The remedy depends on which memory pool is exhausted.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;At &lt;code&gt;Create&lt;/code&gt;: &lt;code&gt;InvalidOperationException&lt;/code&gt; mentioning memory allocation, &lt;code&gt;cudaErrorOutOfMemory&lt;/code&gt;, or &lt;code&gt;rocblas&lt;/code&gt; memory errors.&lt;/li&gt;
&lt;li&gt;During generation: unexpectedly slow responses (swap thrashing on Linux/macOS).&lt;/li&gt;
&lt;li&gt;On Windows: a &lt;code&gt;System.OutOfMemoryException&lt;/code&gt; or the process being killed.&lt;/li&gt;
&lt;li&gt;On GPU: &lt;code&gt;nvidia-smi&lt;/code&gt; showing the process using close to or exceeding the card&amp;rsquo;s VRAM before the failure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Model weights plus KV cache exceed the available memory pool.&lt;/li&gt;
&lt;li&gt;KV cache grows as sessions accumulate — each active session claims a slice of &lt;code&gt;ContextSize&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Multiple sessions, long prompts, and long responses compound the problem.&lt;/li&gt;
&lt;li&gt;On Apple Silicon (unified memory), system RAM is the shared ceiling.&lt;/li&gt;
&lt;/ul&gt;
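&lt;p&gt;The KV-cache growth above follows standard transformer arithmetic. A sketch in Python with illustrative 7B-class dimensions (all numbers are assumptions, not SDK defaults):&lt;/p&gt;

```python
# Standard transformer KV-cache size arithmetic. The dimensions below describe
# a hypothetical 7B-class model with grouped-query attention; they are
# illustrative assumptions, not SDK defaults.
def kv_cache_bytes(n_layers, context, n_kv_heads, head_dim, bytes_per_elem):
    # One K tensor and one V tensor per layer, one slot per context position.
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem

gib = kv_cache_bytes(32, 4096, 8, 128, 2) / (1024 ** 3)  # f16 cache
print(gib)  # 0.5
```

&lt;p&gt;The size scales linearly with context, which is why shrinking &lt;code&gt;ContextSize&lt;/code&gt; is the first lever to pull.&lt;/p&gt;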
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-identify-which-pool-ran-out&#34;&gt;1. Identify which pool ran out&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you see&lt;/th&gt;
&lt;th&gt;Exhausted pool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaErrorOutOfMemory&lt;/code&gt;, &lt;code&gt;hipErrorOutOfMemory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;System.OutOfMemoryException&lt;/code&gt;, or the process killed by the Linux OOM killer&lt;/td&gt;
&lt;td&gt;System RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swap activity (&lt;code&gt;free -h&lt;/code&gt; shows swap in use), very slow generation&lt;/td&gt;
&lt;td&gt;RAM exhausted, OS paging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;2-quick-wins&#34;&gt;2. Quick wins&lt;/h3&gt;
&lt;p&gt;Apply these changes and re-test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Smaller context:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// down from 32K or 131K
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Quantize V cache:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q8_0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Enable flash attention:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Partial GPU offload:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// Offload some layers, leave rest on CPU.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low memory tuning&lt;/a&gt; for the full recipe.&lt;/p&gt;
&lt;h3 id=&#34;3-switch-to-a-smaller-preset&#34;&gt;3. Switch to a smaller preset&lt;/h3&gt;
&lt;p&gt;If quick wins do not help, step down to a smaller model:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;From&lt;/th&gt;
&lt;th&gt;To&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (7B)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Llama32Preset&lt;/code&gt; (3B)&lt;/td&gt;
&lt;td&gt;~3-4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; (20B)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (7B)&lt;/td&gt;
&lt;td&gt;~6-8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any F16/Q8 preset&lt;/td&gt;
&lt;td&gt;Q4_K_M equivalent&lt;/td&gt;
&lt;td&gt;~50 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
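&lt;p&gt;The last row&amp;rsquo;s savings follow from bits-per-weight. A back-of-envelope sketch in Python (the bpw figures are rough community estimates for llama.cpp quantizations, not SDK-published numbers):&lt;/p&gt;

```python
# Back-of-envelope GGUF size from bits-per-weight. The bpw values are rough
# community estimates for llama.cpp quantizations, not SDK-published figures.
def approx_size_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8

q8 = approx_size_gb(7, 8.5)    # roughly 7.4 GB for a 7B model
q4 = approx_size_gb(7, 4.85)   # roughly 4.2 GB
print(round(q4 / q8, 2))  # 0.57
```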
&lt;h3 id=&#34;4-cap-concurrent-sessions&#34;&gt;4. Cap concurrent sessions&lt;/h3&gt;
&lt;p&gt;In multi-user hosts, cap the active session count. A back-of-envelope budget:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;max_sessions = (available_memory - model_weights - overhead) / per_session_kv_budget
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Use &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt; for concrete numbers.&lt;/p&gt;
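&lt;p&gt;The budget can be instantiated with illustrative numbers for a 24 GB host (all values below are assumptions, not measured SDK figures):&lt;/p&gt;

```python
# The session budget formula with illustrative numbers (assumptions,
# not measured SDK figures) for a 24 GB host.
available_gb = 24.0
model_weights_gb = 4.6   # e.g. a 7B Q4_K_M model
overhead_gb = 1.5        # runtime buffers and OS headroom
per_session_kv_gb = 0.5  # KV slice per active session at a 4K context

max_sessions = int((available_gb - model_weights_gb - overhead_gb) / per_session_kv_gb)
print(max_sessions)  # 35
```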
&lt;p&gt;Evict idle sessions by periodically disposing &lt;code&gt;AsposeLLMApi&lt;/code&gt; and recreating it. The current SDK does not provide an explicit per-session evict API.&lt;/p&gt;
&lt;h3 id=&#34;5-recycle-the-api-instance&#34;&gt;5. Recycle the API instance&lt;/h3&gt;
&lt;p&gt;In long-running hosts, the KV cache and native buffers can grow beyond initial estimates. Periodically:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dispose the current &lt;code&gt;AsposeLLMApi&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Wait for native memory to be released.&lt;/li&gt;
&lt;li&gt;Create a new instance.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Expect a 5-30 second restart cost even with warm caches.&lt;/p&gt;
&lt;h3 id=&#34;6-on-unified-memory-apple-silicon&#34;&gt;6. On unified memory (Apple Silicon)&lt;/h3&gt;
&lt;p&gt;There is no separate VRAM to optimize — everything is RAM. Apply system-RAM reductions: smaller model, shorter context, KV quantization.&lt;/p&gt;
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Measure peak memory during load tests. Budget against the measured peak, not theoretical estimates.&lt;/li&gt;
&lt;li&gt;Run with &lt;code&gt;EnableDebugLogging = true&lt;/code&gt; in staging and watch &lt;code&gt;[KV]&lt;/code&gt; lines to track cache growth.&lt;/li&gt;
&lt;li&gt;Size the host for your expected session concurrency at your chosen preset — do not size for the minimum case.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low memory tuning&lt;/a&gt; — recipes for memory-constrained hosts.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt; — predictive sizing.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — KV cache dtype and flash attention.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: GPU not detected</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/gpu-not-detected/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/gpu-not-detected/</guid>
      <description>
        
        
        &lt;p&gt;You wanted the SDK to use the GPU, but inference is slow and &lt;code&gt;nvidia-smi&lt;/code&gt; (or equivalent) shows no activity from the process. This page walks through the detection pipeline and common misconfigurations.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Inference runs at CPU speed (5-15 tokens/sec) despite a GPU being present.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nvidia-smi&lt;/code&gt; does not list the Aspose.LLM process.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rocm-smi&lt;/code&gt; shows zero utilization.&lt;/li&gt;
&lt;li&gt;Logs do not mention CUDA / HIP / Metal / Vulkan initialization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;The SDK picks a backend in two stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;BinaryManager&lt;/code&gt;&lt;/strong&gt; downloads a native binary matching &lt;code&gt;BinaryManagerParameters.PreferredAcceleration&lt;/code&gt; (or auto-detection). The binary dictates what GPU APIs are available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Engine&lt;/code&gt;&lt;/strong&gt; respects &lt;code&gt;BaseModelInferenceParameters.GpuLayers&lt;/code&gt; — if &lt;code&gt;0&lt;/code&gt;, the model stays on CPU even if the binary supports GPU.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Either stage can silently fall back to CPU.&lt;/p&gt;
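&lt;p&gt;The two-stage selection can be sketched as follows (illustrative logic, not SDK source):&lt;/p&gt;

```python
# Sketch of the two-stage fallback (illustrative logic, not SDK source):
# stage 1 fixes the binary's capabilities, stage 2 applies GpuLayers.
def effective_backend(binary_supports_gpu, gpu_layers):
    if not binary_supports_gpu or gpu_layers == 0:
        return "cpu"
    return "gpu"

print(effective_backend(True, 0))    # cpu: GPU binary, but no layers offloaded
print(effective_backend(False, 999)) # cpu: CPU-only binary wins regardless
print(effective_backend(True, 999))  # gpu
```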
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-check-the-downloaded-binary&#34;&gt;1. Check the downloaded binary&lt;/h3&gt;
&lt;p&gt;Enable debug logging and look for the binary selection line in logs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[BinaryManager] resolved asset: llama-b8816-bin-win-cuda-cu12.4-x64.zip
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If the asset name says &lt;code&gt;cpu&lt;/code&gt; or does not mention CUDA/HIP/Metal/Vulkan, the &lt;code&gt;BinaryManager&lt;/code&gt; did not detect the GPU. Fix it by forcing the acceleration type:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then clear the binary cache at &lt;code&gt;BinaryManagerParameters.BinaryPath&lt;/code&gt; and re-run to download the GPU variant.&lt;/p&gt;
&lt;h3 id=&#34;2-check-the-driver&#34;&gt;2. Check the driver&lt;/h3&gt;
&lt;p&gt;On Linux / Windows with NVIDIA:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;nvidia-smi
&lt;span class=&#34;c1&#34;&gt;# Must show Driver Version &amp;gt;= 525 and the GPU model.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; does not find the GPU, the driver is not installed or the GPU is not accessible (a containerized environment without the &lt;code&gt;--gpus all&lt;/code&gt; flag, or a host with no NVIDIA GPU).&lt;/p&gt;
&lt;p&gt;On Linux with AMD:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;rocminfo
&lt;span class=&#34;c1&#34;&gt;# Must list your GPU under Agent information.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On macOS:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;system_profiler SPDisplaysDataType &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; grep &lt;span class=&#34;s2&#34;&gt;&amp;#34;Chipset Model&amp;#34;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Must show Apple M-series for Metal support.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;3-verify-gpulayers&#34;&gt;3. Verify &lt;code&gt;GpuLayers&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Even with the right binary, &lt;code&gt;GpuLayers = 0&lt;/code&gt; forces CPU. Set it explicitly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;999&lt;/code&gt; is the idiomatic &amp;ldquo;full offload&amp;rdquo; value; the engine caps it at the model&amp;rsquo;s actual layer count.&lt;/p&gt;
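&lt;p&gt;The capping behavior amounts to a simple minimum (illustrative sketch, not SDK source):&lt;/p&gt;

```python
# Sketch of the capping behavior (illustrative, not SDK source): requesting
# 999 layers on a 32-layer model offloads all 32; a smaller request is honored.
def effective_gpu_layers(requested, model_layer_count):
    return min(requested, model_layer_count)

print(effective_gpu_layers(999, 32))  # 32
print(effective_gpu_layers(20, 32))   # 20
```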
&lt;h3 id=&#34;4-check-for-conflicting-environment-variables&#34;&gt;4. Check for conflicting environment variables&lt;/h3&gt;
&lt;p&gt;NVIDIA environment variables can hide GPUs from the process:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;$CUDA_VISIBLE_DEVICES&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# If set to empty or -1, no GPU is visible.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Unset or set to &lt;code&gt;0&lt;/code&gt; (or a valid GPU index):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;unset&lt;/span&gt; CUDA_VISIBLE_DEVICES
&lt;span class=&#34;c1&#34;&gt;# or&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For HIP: &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; and &lt;code&gt;HIP_VISIBLE_DEVICES&lt;/code&gt; play the same role.&lt;/p&gt;
&lt;h3 id=&#34;5-container--wsl2-specifics&#34;&gt;5. Container / WSL2 specifics&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;: you must start containers with &lt;code&gt;--gpus all&lt;/code&gt; (NVIDIA) or &lt;code&gt;--device=/dev/kfd --device=/dev/dri&lt;/code&gt; (AMD ROCm). Without these flags, the container has no GPU access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WSL2&lt;/strong&gt; on Windows: install the NVIDIA driver on the Windows side; install CUDA inside WSL following NVIDIA&amp;rsquo;s WSL2 guide. Older Windows + WSL combinations do not support CUDA in WSL — upgrade to Windows 11 and a current WSL build.&lt;/p&gt;
&lt;h3 id=&#34;6-fall-back-to-vulkan&#34;&gt;6. Fall back to Vulkan&lt;/h3&gt;
&lt;p&gt;If CUDA / HIP setup is impractical (custom kernels, container limitations), try Vulkan:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Vulkan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Vulkan runs on NVIDIA, AMD, and Intel GPUs with standard drivers. Performance is typically 20-40 % below CUDA but well above CPU.&lt;/p&gt;
&lt;h3 id=&#34;7-windows-users-with-amd--use-vulkan&#34;&gt;7. Windows users with AMD — use Vulkan&lt;/h3&gt;
&lt;p&gt;Aspose.LLM does not ship HIP binaries for Windows. On Windows with AMD, Vulkan is the only GPU path.&lt;/p&gt;
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;During deployment, assert GPU is active with a small probe:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// After Create, a short inference should be fast on GPU.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Say ok.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Elapsed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TotalSeconds&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;_logger&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LogWarning&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Inference is slow - GPU may not be active.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Log the chosen acceleration at startup for auditability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor GPU utilization in production (Datadog, Prometheus) to catch silent CPU fallback.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/&#34;&gt;Acceleration&lt;/a&gt; — detailed per-backend setup.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers&lt;/code&gt;, &lt;code&gt;SplitMode&lt;/code&gt;, &lt;code&gt;MainGpu&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Model not loading</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/model-not-loading/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/model-not-loading/</guid>
      <description>
        
        
&lt;p&gt;The SDK fails during model load: the stage of &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt; that downloads and initializes the GGUF file. This page covers the usual causes.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;InvalidOperationException&lt;/code&gt; during &lt;code&gt;Create&lt;/code&gt;, often wrapping a lower-level error.&lt;/li&gt;
&lt;li&gt;Log messages mentioning &amp;ldquo;failed to load model&amp;rdquo;, &amp;ldquo;invalid magic number&amp;rdquo;, or a specific &lt;code&gt;llama_load_*&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Native segfaults or access-violation exceptions during load (rare).&lt;/li&gt;
&lt;li&gt;Download completes but subsequent load fails.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;Several distinct failure modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Corrupted download&lt;/strong&gt; — partial or interrupted download; bad cached file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unsupported model architecture&lt;/strong&gt; — a GGUF whose architecture is not supported by the pinned &lt;code&gt;ReleaseTag&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wrong file name&lt;/strong&gt; — the file exists but does not match the expected quantization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disk or permission issues&lt;/strong&gt; — the cache directory is not writable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;llama.cpp&lt;/code&gt; release mismatch&lt;/strong&gt; — a tag older than the model&amp;rsquo;s architecture.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-verify-the-cached-file&#34;&gt;1. Verify the cached file&lt;/h3&gt;
&lt;p&gt;Look at &lt;code&gt;EngineParameters.ModelCachePath&lt;/code&gt; (default &lt;code&gt;&amp;lt;LocalAppData&amp;gt;/Aspose.LLM/models&lt;/code&gt;) for the file the preset references. Check its size against the Hugging Face listing:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;ls -la ~/.local/share/Aspose.LLM/models
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the size is far below the expected value, the download was truncated.&lt;/p&gt;
&lt;h3 id=&#34;2-delete-the-cached-file-and-retry&#34;&gt;2. Delete the cached file and retry&lt;/h3&gt;
&lt;p&gt;Clear the partial/corrupt file and let the SDK re-download:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;rm ~/.local/share/Aspose.LLM/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Re-run your program. The SDK downloads the file again.&lt;/p&gt;
&lt;h3 id=&#34;3-validate-the-gguf&#34;&gt;3. Validate the GGUF&lt;/h3&gt;
&lt;p&gt;Enable &lt;code&gt;CheckTensors&lt;/code&gt; on the inference parameters to validate every tensor during load:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CheckTensors&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Start-up takes longer, but you get clear errors on malformed tensors. If validation fails, the file is corrupt — delete and re-download.&lt;/p&gt;
&lt;p&gt;Disable &lt;code&gt;CheckTensors&lt;/code&gt; in production after confirming the file is good.&lt;/p&gt;
&lt;h3 id=&#34;4-confirm-the-model-is-supported&#34;&gt;4. Confirm the model is supported&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; supports a fixed set of architectures per release. A brand-new model might not be supported by the default &lt;code&gt;ReleaseTag = &amp;quot;b8816&amp;quot;&lt;/code&gt;. Check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The architecture name in the model&amp;rsquo;s Hugging Face README (e.g., &amp;ldquo;Qwen2&amp;rdquo;, &amp;ldquo;Llama&amp;rdquo;, &amp;ldquo;Gemma&amp;rdquo;, &amp;ldquo;Phi&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; release notes for the tag you are using.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the architecture is newer than the tag supports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Switch to a newer &lt;code&gt;ReleaseTag&lt;/code&gt; if one exists (tested and validated by the Aspose team).&lt;/li&gt;
&lt;li&gt;Fall back to a comparable model with a supported architecture.&lt;/li&gt;
&lt;li&gt;File a &lt;a href=&#34;https://forum.aspose.com/&#34;&gt;support request&lt;/a&gt; if the architecture is critical for your use case.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;5-confirm-the-file-name-is-correct&#34;&gt;5. Confirm the file name is correct&lt;/h3&gt;
&lt;p&gt;Hugging Face repos often have many quantization variants. The preset&amp;rsquo;s default file name matches one of them; if the file has been removed or renamed upstream, the download fails.&lt;/p&gt;
&lt;p&gt;Check the repo:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/tree/main
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Verify &lt;code&gt;BaseModelSourceParameters.HuggingFaceFileName&lt;/code&gt; matches an existing file. Override if needed:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelSourceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HuggingFaceFileName&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Qwen2.5-7B-Instruct-Q5_K_M.gguf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;6-check-directory-permissions&#34;&gt;6. Check directory permissions&lt;/h3&gt;
&lt;p&gt;On Linux / macOS, confirm write access to the cache folder:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;ls -la ~/.local/share/Aspose.LLM/
&lt;span class=&#34;c1&#34;&gt;# Expect the folder to be owned by the user running the process.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On Windows, check folder ACLs — especially when the process runs under a service account different from the install user.&lt;/p&gt;
&lt;h3 id=&#34;7-check-disk-space&#34;&gt;7. Check disk space&lt;/h3&gt;
&lt;p&gt;Large models (20B+) need 10-20+ GB free at both the cache location and temp directories used during extraction.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;df -h ~/.local/share/Aspose.LLM/models
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;vision-specific-mmproj&#34;&gt;Vision-specific: mmproj&lt;/h2&gt;
&lt;p&gt;Vision presets load both the base model and a projector. If the base model loads but the projector fails, the error message mentions &lt;code&gt;mmproj&lt;/code&gt; or &lt;code&gt;mtmd_init_from_file&lt;/code&gt;. Apply the same checks to &lt;code&gt;MmprojSourceParameters&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MmprojSourceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HuggingFaceFileName&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;mmproj-F16.gguf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Pre-download and validate models in your CI / build pipeline. Failing early in CI beats failing at runtime.&lt;/li&gt;
&lt;li&gt;Commit manifest files that record the expected model hash alongside the preset selection. Compare on load.&lt;/li&gt;
&lt;li&gt;Pin a tested &lt;code&gt;ReleaseTag&lt;/code&gt; — do not float on defaults across SDK upgrades without testing.&lt;/li&gt;
&lt;/ul&gt;
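&lt;p&gt;A manifest check can be as simple as comparing a SHA-256 digest of the cached GGUF against a committed value. The snippet below is a minimal sketch, not an SDK API — the manifest file name and model file name are placeholders for your own setup:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Hypothetical manifest check — adjust paths and file names to your deployment.
using System;
using System.IO;
using System.Security.Cryptography;

// The documented default cache location: &amp;lt;LocalAppData&amp;gt;/Aspose.LLM/models.
var cacheDir = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    &amp;#34;Aspose.LLM&amp;#34;, &amp;#34;models&amp;#34;);
var modelPath = Path.Combine(cacheDir, &amp;#34;Qwen2.5-7B-Instruct-Q4_K_M.gguf&amp;#34;);

// &amp;#34;model.sha256&amp;#34; is a placeholder manifest committed alongside the preset selection.
var expected = File.ReadAllText(&amp;#34;model.sha256&amp;#34;).Trim();

using var sha = SHA256.Create();
using var fs = File.OpenRead(modelPath);
var actual = Convert.ToHexString(sha.ComputeHash(fs));

if (!string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase))
    throw new InvalidOperationException($&amp;#34;Model hash mismatch: {actual}&amp;#34;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run the same check in CI against a freshly downloaded file so a corrupt or replaced upstream asset is caught before deployment.&lt;/p&gt;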
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-source/&#34;&gt;Model source parameters&lt;/a&gt; — priority and resolution.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt; — confirmed compatible models and files.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/troubleshooting/binary-download-fails/&#34;&gt;Binary download fails&lt;/a&gt; — related network issues.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Garbled output</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/garbled-output/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/garbled-output/</guid>
      <description>
        
        
        &lt;p&gt;The model loaded successfully, but its replies are nonsensical, repetitive, or broken in a way that points to misconfiguration rather than a model problem.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Replies contain literal marker tokens like &lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;image&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The model produces the same phrase or token repeatedly.&lt;/li&gt;
&lt;li&gt;Output is truncated mid-sentence after exactly N tokens.&lt;/li&gt;
&lt;li&gt;Output is coherent at first, then devolves into nonsense after a few hundred tokens.&lt;/li&gt;
&lt;li&gt;Vision replies describe something different from the actual image.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chat template mismatch&lt;/strong&gt; — the engine picked the wrong template for the model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repetition penalty too low&lt;/strong&gt; (or zero) — the model loops.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;MaxTokens&lt;/code&gt; too low&lt;/strong&gt; for a reasoning model — truncation mid-reasoning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV cache cleanup dropped important context&lt;/strong&gt; — typically in the middle of a long session.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wrong preset for the model&lt;/strong&gt; — a custom GGUF paired with a preset that does not match its architecture.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggressive KV quantization&lt;/strong&gt; — on long contexts, Q4 K/V can degrade quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-literal-marker-tokens-in-the-output&#34;&gt;1. Literal marker tokens in the output&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: template mismatch. The engine fell back to a generic template; the model&amp;rsquo;s actual markers appear as text.&lt;/p&gt;
&lt;p&gt;Enable debug logging and look for &lt;code&gt;[MM] selected template: ...&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EngineParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EnableDebugLogging&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the template is &lt;code&gt;fallback&lt;/code&gt; or does not match your model family, the model needs a supported template. Options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Verify you are using the correct built-in preset (see &lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;For a custom GGUF, try a different export from the same model with richer metadata.&lt;/li&gt;
&lt;li&gt;For vision: see &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/chat-templates/&#34;&gt;Chat templates&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-repetition-loops&#34;&gt;2. Repetition loops&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: the model generates the same phrase in a loop.&lt;/p&gt;
&lt;p&gt;Raise repetition penalty:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;RepetitionPenalty&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.15f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// default 1.1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If that still loops, try DRY (Don&amp;rsquo;t Repeat Yourself):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DryMultiplier&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.8f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DryBase&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.75f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DryAllowedLength&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/sampler/&#34;&gt;Sampler parameters&lt;/a&gt; for the full repetition knob set.&lt;/p&gt;
&lt;h3 id=&#34;3-truncated-output-cuts-off-mid-sentence&#34;&gt;3. Truncated output (cuts off mid-sentence)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: &lt;code&gt;MaxTokens&lt;/code&gt; limit hit before the model finished.&lt;/p&gt;
&lt;p&gt;Raise the budget:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MaxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// up from the default 2048; raise further for long answers
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For reasoning models (Qwen3, DeepSeek-R1) that emit &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks, budget an extra 1024-2048 tokens for the thinking block on top of the visible answer.&lt;/p&gt;
&lt;h3 id=&#34;4-coherent-start-garbled-after-a-while&#34;&gt;4. Coherent start, garbled after a while&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: KV cache cleanup evicted critical context mid-session, or KV quantization is too aggressive at long contexts.&lt;/p&gt;
&lt;p&gt;Change the cleanup policy:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CacheCleanupStrategy&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;CacheCleanupStrategy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;KeepSystemPromptAndFirstUserMessage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or upgrade KV precision:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;F16 is the safe default; drop to Q8_0 only under memory pressure.&lt;/p&gt;
&lt;h3 id=&#34;5-model-produces-the-wrong-answer-for-clear-questions&#34;&gt;5. Model produces the wrong answer for clear questions&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: custom GGUF paired with a wrong preset, or the preset&amp;rsquo;s sampler settings do not match the model&amp;rsquo;s training distribution.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use a built-in preset for the model family, or set up a &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/custom-preset/&#34;&gt;custom preset&lt;/a&gt; from scratch rather than reusing an unrelated built-in.&lt;/li&gt;
&lt;li&gt;Start with conservative sampler settings: &lt;code&gt;Temperature = 0.3&lt;/code&gt;, &lt;code&gt;TopP = 0.9&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Test with the reference prompt from the model&amp;rsquo;s Hugging Face page to rule out sampling issues.&lt;/li&gt;
&lt;/ul&gt;
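&lt;p&gt;As a concrete starting point, the conservative settings above can be applied like this (&lt;code&gt;Temperature&lt;/code&gt; and &lt;code&gt;TopP&lt;/code&gt; are assumed to follow the &lt;code&gt;SamplerParameters&lt;/code&gt; naming used elsewhere on this page — verify against the sampler parameters reference):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Conservative sampling: low temperature, standard nucleus cutoff.
preset.SamplerParameters.Temperature = 0.3f;
preset.SamplerParameters.TopP = 0.9f;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Once the model answers correctly at these settings, relax them gradually toward the values recommended on the model&amp;rsquo;s Hugging Face page.&lt;/p&gt;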
&lt;h3 id=&#34;6-vision-reply-describes-the-wrong-thing&#34;&gt;6. Vision: reply describes the wrong thing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: the image was not delivered correctly to the model.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enable &lt;code&gt;MtmdContextParameters.PrintTimings = true&lt;/code&gt; to verify the image was processed.&lt;/li&gt;
&lt;li&gt;Enable debug logging and look for &lt;code&gt;[MM]&lt;/code&gt; lines — confirm image chunks are tokenized.&lt;/li&gt;
&lt;li&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/debugging-vision/&#34;&gt;Debugging vision&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
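&lt;p&gt;Both diagnostics can be switched on together before &lt;code&gt;Create&lt;/code&gt; — a short sketch using the parameters named above:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Log image-processing timings and emit [MM] template/tokenization lines.
preset.MtmdContextParameters.PrintTimings = true;
preset.EngineParameters.EnableDebugLogging = true;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;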
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Stick to built-in presets when possible.&lt;/li&gt;
&lt;li&gt;Test new models with simple prompts first; check the output matches the model&amp;rsquo;s reference outputs.&lt;/li&gt;
&lt;li&gt;Run with debug logging in staging to catch template fallbacks before production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/sampler/&#34;&gt;Sampler parameters&lt;/a&gt; — repetition penalties, DRY.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/chat/&#34;&gt;Chat parameters&lt;/a&gt; — &lt;code&gt;MaxTokens&lt;/code&gt;, &lt;code&gt;CacheCleanupStrategy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/chat-templates/&#34;&gt;Chat templates&lt;/a&gt; — vision template selection.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/debugging-vision/&#34;&gt;Debugging vision&lt;/a&gt; — multimodal-specific diagnosis.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: License errors</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/license-errors/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/license-errors/</guid>
      <description>
        
        
        &lt;p&gt;License errors appear when a chat method is called without a valid license. Aspose.LLM does not have an evaluation fallback for inference — every chat operation requires a license.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;p&gt;The chat APIs throw:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;System.Exception: Not licensed for this method
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or &lt;code&gt;License.IsLicensed&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt; when you expected &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;Several distinct cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SetLicense&lt;/code&gt; was not called before the chat method.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SetLicense&lt;/code&gt; threw, and your code continued past the failure.&lt;/li&gt;
&lt;li&gt;The license file path does not resolve to an actual file.&lt;/li&gt;
&lt;li&gt;The embedded resource name is wrong.&lt;/li&gt;
&lt;li&gt;The license is expired (especially temporary licenses, typically 30 days).&lt;/li&gt;
&lt;li&gt;The license is corrupted (partial download, file system damage).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-confirm-setlicense-was-called&#34;&gt;1. Confirm &lt;code&gt;SetLicense&lt;/code&gt; was called&lt;/h3&gt;
&lt;p&gt;Every process that calls chat methods must apply the license once. A common mistake is calling &lt;code&gt;SetLicense&lt;/code&gt; in one context (e.g., a helper class constructor) while the chat API runs in a different process where the license was never applied.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Aspose&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLM&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;License&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;// Immediately after, verify:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;IsLicensed: {Aspose.LLM.License.IsLicensed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Should print True. If False, SetLicense failed.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;2-catch-exceptions-from-setlicense&#34;&gt;2. Catch exceptions from &lt;code&gt;SetLicense&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;SetLicense&lt;/code&gt; itself can throw when the file is missing, corrupt, or wrong format. Catch and log:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Aspose&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLM&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;License&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Exception&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;_logger&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LogError&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;License could not be applied.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;throw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The exception message usually states the cause (file not found, parse error, signature check failed).&lt;/p&gt;
&lt;h3 id=&#34;3-verify-the-license-file-path&#34;&gt;3. Verify the license file path&lt;/h3&gt;
&lt;p&gt;When you pass only a file name, &lt;code&gt;SetLicense&lt;/code&gt; searches several locations (see &lt;a href=&#34;https://docs.aspose.com/llm/net/licensing/&#34;&gt;Licensing&lt;/a&gt;). If the file lives elsewhere, pass the full path:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;@&amp;#34;C:\licenses\Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm the file is copied to the process working directory or bin folder during build:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span class=&#34;nt&#34;&gt;&amp;lt;ItemGroup&amp;gt;&lt;/span&gt;
  &lt;span class=&#34;nt&#34;&gt;&amp;lt;None&lt;/span&gt; &lt;span class=&#34;na&#34;&gt;Update=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&#34;nt&#34;&gt;&amp;lt;CopyToOutputDirectory&amp;gt;&lt;/span&gt;PreserveNewest&lt;span class=&#34;nt&#34;&gt;&amp;lt;/CopyToOutputDirectory&amp;gt;&lt;/span&gt;
  &lt;span class=&#34;nt&#34;&gt;&amp;lt;/None&amp;gt;&lt;/span&gt;
&lt;span class=&#34;nt&#34;&gt;&amp;lt;/ItemGroup&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;4-embedded-resource--check-the-name&#34;&gt;4. Embedded resource — check the name&lt;/h3&gt;
&lt;p&gt;For embedded licenses, the resource name must match the file name exactly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span class=&#34;nt&#34;&gt;&amp;lt;ItemGroup&amp;gt;&lt;/span&gt;
  &lt;span class=&#34;nt&#34;&gt;&amp;lt;EmbeddedResource&lt;/span&gt; &lt;span class=&#34;na&#34;&gt;Include=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&#34;nt&#34;&gt;&amp;lt;/ItemGroup&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// matches the file name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the file is in a subfolder (e.g., &lt;code&gt;Resources/Aspose.LLM.lic&lt;/code&gt;), the resource name becomes &lt;code&gt;&amp;lt;Namespace&amp;gt;.Resources.Aspose.LLM.lic&lt;/code&gt;. Match that name exactly, or move the file to the project root.&lt;/p&gt;
&lt;h3 id=&#34;5-temporary-license-expired&#34;&gt;5. Temporary license expired&lt;/h3&gt;
&lt;p&gt;Temporary licenses issued by &lt;a href=&#34;https://purchase.aspose.com/temporary-license&#34;&gt;purchase.aspose.com/temporary-license&lt;/a&gt; have a fixed expiry date (typically 30 days). After expiry, chat methods throw the same &lt;code&gt;Not licensed for this method&lt;/code&gt; error.&lt;/p&gt;
&lt;p&gt;Options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Request a new temporary license.&lt;/li&gt;
&lt;li&gt;Purchase a commercial license.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The application code does not change — swap the &lt;code&gt;.lic&lt;/code&gt; file.&lt;/p&gt;
&lt;h3 id=&#34;6-corrupt-license-file&#34;&gt;6. Corrupt license file&lt;/h3&gt;
&lt;p&gt;If the file is truncated or modified after issue, signature validation fails. Re-download from the Aspose purchase portal.&lt;/p&gt;
&lt;h3 id=&#34;7-stream-based-license-from-a-failed-source&#34;&gt;7. Stream-based license from a failed source&lt;/h3&gt;
&lt;p&gt;If you load via a stream:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stream&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;File&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OpenRead&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stream&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Ensure the stream is at position 0 before &lt;code&gt;SetLicense&lt;/code&gt;. If an earlier read consumed bytes, &lt;code&gt;SetLicense&lt;/code&gt; sees truncated data.&lt;/p&gt;
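&lt;p&gt;If the same stream may have been read earlier, rewind it before applying the license — a minimal sketch using standard &lt;code&gt;Stream&lt;/code&gt; members:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Rewind in case an earlier read (e.g., logging or validation) consumed bytes.
if (stream.CanSeek)
    stream.Position = 0;
license.SetLicense(stream);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;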
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Apply the license at application startup, once, with explicit exception handling.&lt;/li&gt;
&lt;li&gt;Log &lt;code&gt;License.IsLicensed&lt;/code&gt; immediately after &lt;code&gt;SetLicense&lt;/code&gt; to confirm.&lt;/li&gt;
&lt;li&gt;Monitor temporary license expiry dates — have a calendar reminder 7 days before expiry.&lt;/li&gt;
&lt;li&gt;In CI/CD, use an embedded license or pull from a secret store rather than bundling files.&lt;/li&gt;
&lt;li&gt;For air-gapped deployments, copy the license with the rest of the deployment artifacts.&lt;/li&gt;
&lt;/ul&gt;
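&lt;p&gt;The first two points can be sketched together at startup (a minimal sketch; the &lt;code&gt;License&lt;/code&gt; type and &lt;code&gt;License.IsLicensed&lt;/code&gt; are described on the licensing pages, and the console calls are placeholders for your logger):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-csharp&#34;&gt;var license = new License();
try
{
    license.SetLicense(&amp;#34;Aspose.LLM.lic&amp;#34;);
}
catch (Exception ex)
{
    // Fail fast: a broken license surfaces here, not on the first chat call.
    Console.Error.WriteLine($&amp;#34;License could not be applied: {ex.Message}&amp;#34;);
    throw;
}

// Confirm immediately after SetLicense.
Console.WriteLine($&amp;#34;Licensed: {License.IsLicensed}&amp;#34;);
&lt;/code&gt;&lt;/pre&gt;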
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/licensing/&#34;&gt;Licensing&lt;/a&gt; — full license setup (file, stream, embedded resource, temporary).&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/license/&#34;&gt;License class reference&lt;/a&gt; — API surface of &lt;code&gt;License&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/asposellmapi/&#34;&gt;AsposeLLMApi facade&lt;/a&gt; — where license checks sit in the chat API surface.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Performance issues</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/performance-issues/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/performance-issues/</guid>
      <description>
        
        
        &lt;p&gt;The SDK loads and runs, but throughput is below expectations or first-token latency is too high. This page covers the levers that affect performance.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Fewer tokens per second than expected for your hardware.&lt;/li&gt;
&lt;li&gt;First-token latency of several seconds even after warm-up.&lt;/li&gt;
&lt;li&gt;Throughput fluctuates: fast for a while, then slow.&lt;/li&gt;
&lt;li&gt;Occasional stalls mid-response.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Wrong acceleration backend or silent CPU fallback.&lt;/li&gt;
&lt;li&gt;Suboptimal threading configuration.&lt;/li&gt;
&lt;li&gt;Flash attention not enabled.&lt;/li&gt;
&lt;li&gt;Memory pressure (swap thrashing, KV cache too large for VRAM).&lt;/li&gt;
&lt;li&gt;Competing CPU-heavy processes on the same host.&lt;/li&gt;
&lt;li&gt;Context size too large for the hardware.&lt;/li&gt;
&lt;li&gt;First request of a new session (fresh prefill).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-confirm-the-right-backend&#34;&gt;1. Confirm the right backend&lt;/h3&gt;
&lt;p&gt;Enable debug logging and confirm the binary variant and acceleration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[BinaryManager] resolved asset: llama-b8816-bin-win-cuda-cu12.4-x64.zip
[Engine] inference on CUDA with 32/32 layers offloaded
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If the variant says &lt;code&gt;cpu&lt;/code&gt; while you have a GPU, see &lt;a href=&#34;https://docs.aspose.com/llm/net/troubleshooting/gpu-not-detected/&#34;&gt;GPU not detected&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;2-verify-gpulayers&#34;&gt;2. Verify &lt;code&gt;GpuLayers&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Make sure &lt;code&gt;GpuLayers&lt;/code&gt; is high enough to offload the model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Partial offload (e.g., &lt;code&gt;GpuLayers = 20&lt;/code&gt;) on an 8B model leaves a substantial share of the layers on the CPU; the GPU cannot accelerate layers it does not hold.&lt;/p&gt;
&lt;h3 id=&#34;3-enable-flash-attention&#34;&gt;3. Enable flash attention&lt;/h3&gt;
&lt;p&gt;Flash attention is a near-universal win and rarely hurts:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Particularly important for contexts beyond 8K tokens.&lt;/p&gt;
&lt;h3 id=&#34;4-tune-threads-on-cpu&#34;&gt;4. Tune threads on CPU&lt;/h3&gt;
&lt;p&gt;For CPU inference, &lt;code&gt;NThreads&lt;/code&gt; and &lt;code&gt;NThreadsBatch&lt;/code&gt; matter:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreads&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ProcessorCount&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreadsBatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ProcessorCount&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/&#34;&gt;CPU acceleration&lt;/a&gt; for the full rationale. Benchmark on your specific host to find the sweet spot; adding generation threads beyond 8-12 often reduces throughput.&lt;/p&gt;
&lt;h3 id=&#34;5-warm-up-sessions&#34;&gt;5. Warm up sessions&lt;/h3&gt;
&lt;p&gt;First token on a fresh session includes prefill time (tokenizing and evaluating the system prompt + history). Amortize by reusing sessions:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// Reuse one session per user instead of creating fresh each request.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/&#34;&gt;Reduce first-token latency&lt;/a&gt;.&lt;/p&gt;
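&lt;p&gt;As a sketch, keep one long-lived session per user instead of paying full prefill on every request (the session type and &lt;code&gt;CreateChatSession&lt;/code&gt; call are hypothetical placeholders; substitute the session API your SDK version exposes):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-csharp&#34;&gt;// Hypothetical per-user session cache: first request per user pays
// prefill once; later requests reuse the warmed session.
static readonly ConcurrentDictionary&amp;lt;string, IChatSession&amp;gt; Sessions = new();

IChatSession GetSession(string userId) =&amp;gt;
    Sessions.GetOrAdd(userId, _ =&amp;gt; api.CreateChatSession());
&lt;/code&gt;&lt;/pre&gt;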
&lt;h3 id=&#34;6-shrink-contextsize-if-you-do-not-use-the-full-window&#34;&gt;6. Shrink &lt;code&gt;ContextSize&lt;/code&gt; if you do not use the full window&lt;/h3&gt;
&lt;p&gt;Longer contexts are slower per token even when mostly empty, because attention over the KV cache scales with the number of positions. Drop &lt;code&gt;ContextSize&lt;/code&gt; to the actual maximum you need:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;8192&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;7-check-for-competing-cpu-load&#34;&gt;7. Check for competing CPU load&lt;/h3&gt;
&lt;p&gt;Another CPU-heavy process on the same host steals throughput:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;top&lt;/code&gt; / &lt;code&gt;htop&lt;/code&gt; on Linux/macOS.&lt;/li&gt;
&lt;li&gt;Task Manager → Details on Windows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Schedule inference and other CPU-heavy tasks so they do not overlap; avoid running antivirus scans, backup jobs, or compilers alongside inference.&lt;/p&gt;
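&lt;p&gt;Within your own process, a shared gate keeps inference and other heavy jobs from overlapping (a generic .NET pattern, not an SDK feature; &lt;code&gt;api.SendMessageAsync&lt;/code&gt; is used as in the measurement example on this page):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-csharp&#34;&gt;// One permit: inference and other CPU-heavy jobs take turns.
static readonly SemaphoreSlim CpuGate = new(1, 1);

async Task&amp;lt;string&amp;gt; InferSerializedAsync(string prompt)
{
    await CpuGate.WaitAsync();
    try
    {
        return await api.SendMessageAsync(prompt);
    }
    finally
    {
        CpuGate.Release();
    }
}
&lt;/code&gt;&lt;/pre&gt;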
&lt;h3 id=&#34;8-watch-for-memory-pressure&#34;&gt;8. Watch for memory pressure&lt;/h3&gt;
&lt;p&gt;Swap thrashing silently destroys throughput:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;free -h
&lt;span class=&#34;c1&#34;&gt;# Check &amp;#34;swap used&amp;#34;. Nonzero during inference = bad.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the host is swapping, reduce memory footprint (smaller model, shorter context, KV quantization).&lt;/p&gt;
&lt;h3 id=&#34;9-check-for-thermal-throttling&#34;&gt;9. Check for thermal throttling&lt;/h3&gt;
&lt;p&gt;Sustained high load heats the CPU and GPU; thermal throttling drops clocks and cuts throughput.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On laptops, plug into AC power.&lt;/li&gt;
&lt;li&gt;Verify cooling — clean dust, check fan RPM.&lt;/li&gt;
&lt;li&gt;On CPU: &lt;code&gt;watch -n 1 &#39;cat /proc/cpuinfo | grep MHz&#39;&lt;/code&gt; (Linux).&lt;/li&gt;
&lt;li&gt;On NVIDIA GPU: &lt;code&gt;nvidia-smi -q -d CLOCK&lt;/code&gt; (look for &lt;code&gt;Current&lt;/code&gt;-vs-&lt;code&gt;Base&lt;/code&gt; clock).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;10-mirostat--dynatemp-overhead&#34;&gt;10. Mirostat / dynatemp overhead&lt;/h3&gt;
&lt;p&gt;Advanced samplers like Mirostat and dynamic temperature add a small per-token overhead. If you are chasing the last 10% of throughput, disable them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Mirostat&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DynatempRange&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;measure&#34;&gt;Measure&lt;/h2&gt;
&lt;p&gt;Before optimizing, establish a baseline:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;approxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sc&#34;&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;).&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Length&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// rough conversion
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;double&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tps&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;approxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Elapsed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TotalSeconds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;~{tps:F1} tok/s&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run with a representative prompt size; numbers vary wildly by prompt length.&lt;/p&gt;
&lt;h2 id=&#34;reference-throughput-numbers&#34;&gt;Reference throughput numbers&lt;/h2&gt;
&lt;p&gt;7B Q4_K_M at 4K context:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (i5, AVX2)&lt;/td&gt;
&lt;td&gt;5-10 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU (i7/i9, AVX-512)&lt;/td&gt;
&lt;td&gt;8-15 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 (CUDA)&lt;/td&gt;
&lt;td&gt;40-60 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 (CUDA)&lt;/td&gt;
&lt;td&gt;100-140 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple M2 Pro (Metal)&lt;/td&gt;
&lt;td&gt;30-50 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple M3 Max (Metal)&lt;/td&gt;
&lt;td&gt;50-80 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your numbers are substantially below these, work through the resolution steps in order.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/tune-for-speed-vs-quality/&#34;&gt;Tune for speed vs quality&lt;/a&gt; — speed-biased configuration.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/&#34;&gt;Reduce first-token latency&lt;/a&gt; — cut TTFT.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/&#34;&gt;Acceleration&lt;/a&gt; — backend-specific tuning.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — batch sizes and threading.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
