<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation – Troubleshooting</title>
    <link>https://docs.aspose.com/llm/net/troubleshooting/</link>
    <description>Recent content in Troubleshooting on Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://docs.aspose.com/llm/net/troubleshooting/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Net: Binary download fails</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/binary-download-fails/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/binary-download-fails/</guid>
      <description>
        
        
        &lt;p&gt;On first &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt;, the SDK downloads native &lt;code&gt;llama.cpp&lt;/code&gt; binaries from GitHub. In corporate or restricted environments, the download can fail. This page covers the common causes.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;HttpRequestException&lt;/code&gt; or a wrapped &lt;code&gt;InvalidOperationException&lt;/code&gt; during &lt;code&gt;Create&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Log entries mentioning GitHub, &lt;code&gt;BinaryManager&lt;/code&gt;, or HTTP status codes (403, 404, 429, 500).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Create&lt;/code&gt; hangs for a long time and eventually times out.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;The SDK contacts &lt;code&gt;github.com/ggml-org/llama.cpp/releases/tag/&amp;lt;ReleaseTag&amp;gt;&lt;/code&gt; to list release assets, then downloads the asset matching your platform and acceleration. Any of the following can block the download:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Corporate firewall or proxy blocks github.com.&lt;/li&gt;
&lt;li&gt;TLS interception replaces the expected certificate chain.&lt;/li&gt;
&lt;li&gt;GitHub rate-limit (HTTP 429) on unauthenticated requests.&lt;/li&gt;
&lt;li&gt;The requested &lt;code&gt;ReleaseTag&lt;/code&gt; does not exist upstream.&lt;/li&gt;
&lt;li&gt;Insufficient disk space for the download or extraction.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-verify-network-access&#34;&gt;1. Verify network access&lt;/h3&gt;
&lt;p&gt;Manually confirm GitHub is reachable from the host:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;curl -I https://api.github.com/repos/ggml-org/llama.cpp/releases/tags/b8816
&lt;span class=&#34;c1&#34;&gt;# Expect HTTP/2 200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the curl command fails, the problem is at the network level; fix that first (proxy configuration, firewall rule).&lt;/p&gt;
&lt;h3 id=&#34;2-configure-an-http-proxy&#34;&gt;2. Configure an HTTP proxy&lt;/h3&gt;
&lt;p&gt;On Windows, set the &lt;code&gt;HTTP_PROXY&lt;/code&gt; and &lt;code&gt;HTTPS_PROXY&lt;/code&gt; environment variables before starting the process. On Linux / macOS, export them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;HTTPS_PROXY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;http://proxy.example.com:8080
dotnet run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;.NET&amp;rsquo;s default &lt;code&gt;HttpClient&lt;/code&gt; honors these variables.&lt;/p&gt;
&lt;h3 id=&#34;3-pre-populate-the-cache&#34;&gt;3. Pre-populate the cache&lt;/h3&gt;
&lt;p&gt;If the host cannot reach GitHub at all, pre-download on a machine with internet access and copy the cache to the target. Full workflow: &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;4-check-the-releasetag&#34;&gt;4. Check the &lt;code&gt;ReleaseTag&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Confirm &lt;code&gt;BinaryManagerParameters.ReleaseTag&lt;/code&gt; matches a real upstream release:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ReleaseTag&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;b8816&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// default in SDK v26.5.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Visit &lt;code&gt;https://github.com/ggml-org/llama.cpp/releases/tag/&amp;lt;tag&amp;gt;&lt;/code&gt; in a browser to verify.&lt;/p&gt;
&lt;h3 id=&#34;5-free-up-disk-space&#34;&gt;5. Free up disk space&lt;/h3&gt;
&lt;p&gt;Binary archives are 100-500 MB; extracted trees are similar in size. Ensure at least 1-2 GB free at &lt;code&gt;BinaryManagerParameters.BinaryPath&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Linux&lt;/span&gt;
df -h ~/.local/share/Aspose.LLM/runtimes

&lt;span class=&#34;c1&#34;&gt;# Windows&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Check %LOCALAPPDATA%\Aspose.LLM\runtimes free space.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;6-handle-github-rate-limits&#34;&gt;6. Handle GitHub rate limits&lt;/h3&gt;
&lt;p&gt;If logs mention HTTP 429, you are hitting GitHub&amp;rsquo;s unauthenticated API limit (60 requests/hour per IP). Options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Wait and retry.&lt;/li&gt;
&lt;li&gt;Use an authenticated &lt;code&gt;HttpClient&lt;/code&gt; (advanced — requires a custom &lt;code&gt;IModelFileProvider&lt;/code&gt; in the extensibility layer).&lt;/li&gt;
&lt;li&gt;Pre-populate the cache so subsequent runs do not hit the API.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;7-tls-interception&#34;&gt;7. TLS interception&lt;/h3&gt;
&lt;p&gt;Corporate TLS-inspection proxies replace GitHub&amp;rsquo;s certificate with a corporate one. By default, .NET rejects the substituted certificate chain.&lt;/p&gt;
&lt;p&gt;Options (choose one):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install the corporate root certificate into the host&amp;rsquo;s certificate store.&lt;/li&gt;
&lt;li&gt;Bypass interception for &lt;code&gt;*.github.com&lt;/code&gt; and &lt;code&gt;*.githubusercontent.com&lt;/code&gt; on the proxy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; disable TLS validation in production — it is a security regression.&lt;/p&gt;
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;For production&lt;/strong&gt;: always pre-populate caches in your deployment pipeline. Do not rely on first-run downloads in production environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For development&lt;/strong&gt;: keep &lt;code&gt;BinaryPath&lt;/code&gt; and &lt;code&gt;ModelCachePath&lt;/code&gt; on a shared network drive across your team so downloads happen once per team, not once per developer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin &lt;code&gt;ReleaseTag&lt;/code&gt;&lt;/strong&gt; explicitly in your preset — the default can change across SDK upgrades.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt; — full pre-population workflow.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;BinaryPath&lt;/code&gt;, &lt;code&gt;ReleaseTag&lt;/code&gt;, &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/architecture/&#34;&gt;Architecture&lt;/a&gt; — the binary deployment stage.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Out of memory</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/out-of-memory/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/out-of-memory/</guid>
      <description>
        
        
        &lt;p&gt;Out-of-memory failures happen at model load, during long sessions, or when running several models on the same host. The remedy depends on which memory pool is exhausted.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;At &lt;code&gt;Create&lt;/code&gt;: &lt;code&gt;InvalidOperationException&lt;/code&gt; mentioning memory allocation, &lt;code&gt;cudaErrorOutOfMemory&lt;/code&gt;, or &lt;code&gt;rocblas&lt;/code&gt; memory errors.&lt;/li&gt;
&lt;li&gt;During generation: unexpectedly slow responses (swap thrashing on Linux/macOS).&lt;/li&gt;
&lt;li&gt;On Windows: a &lt;code&gt;System.OutOfMemoryException&lt;/code&gt; or the process being killed.&lt;/li&gt;
&lt;li&gt;On GPU: &lt;code&gt;nvidia-smi&lt;/code&gt; showing the process using close to or exceeding the card&amp;rsquo;s VRAM before the failure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Model weights plus KV cache exceed the available memory pool.&lt;/li&gt;
&lt;li&gt;KV cache grows as sessions accumulate — each active session claims a slice of &lt;code&gt;ContextSize&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Multiple sessions, long prompts, and long responses compound the problem.&lt;/li&gt;
&lt;li&gt;On Apple Silicon (unified memory), system RAM is the shared ceiling.&lt;/li&gt;
&lt;/ul&gt;
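&lt;p&gt;The KV-cache growth above follows standard transformer arithmetic. A sketch in Python with illustrative 7B-class dimensions (all numbers are assumptions, not SDK defaults):&lt;/p&gt;

```python
# Standard transformer KV-cache size arithmetic. The dimensions below describe
# a hypothetical 7B-class model with grouped-query attention; they are
# illustrative assumptions, not SDK defaults.
def kv_cache_bytes(n_layers, context, n_kv_heads, head_dim, bytes_per_elem):
    # One K tensor and one V tensor per layer, one slot per context position.
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem

gib = kv_cache_bytes(32, 4096, 8, 128, 2) / (1024 ** 3)  # f16 cache
print(gib)  # 0.5
```

&lt;p&gt;The size scales linearly with context, which is why shrinking &lt;code&gt;ContextSize&lt;/code&gt; is the first lever to pull.&lt;/p&gt;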
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-identify-which-pool-ran-out&#34;&gt;1. Identify which pool ran out&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you see&lt;/th&gt;
&lt;th&gt;Exhausted pool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaErrorOutOfMemory&lt;/code&gt;, &lt;code&gt;hipErrorOutOfMemory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;System.OutOfMemoryException&lt;/code&gt;, or the process killed by the Linux OOM killer&lt;/td&gt;
&lt;td&gt;System RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swap activity (&lt;code&gt;free -h&lt;/code&gt; shows swap in use), very slow generation&lt;/td&gt;
&lt;td&gt;RAM exhausted, OS paging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;2-quick-wins&#34;&gt;2. Quick wins&lt;/h3&gt;
&lt;p&gt;Apply these changes and re-test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Smaller context:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// down from 32K or 131K
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Quantize V cache:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q8_0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Enable flash attention:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Partial GPU offload:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// Offload some layers, leave rest on CPU.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low memory tuning&lt;/a&gt; for the full recipe.&lt;/p&gt;
&lt;h3 id=&#34;3-switch-to-a-smaller-preset&#34;&gt;3. Switch to a smaller preset&lt;/h3&gt;
&lt;p&gt;If quick wins do not help, step down to a smaller model:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;From&lt;/th&gt;
&lt;th&gt;To&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (7B)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Llama32Preset&lt;/code&gt; (3B)&lt;/td&gt;
&lt;td&gt;~3-4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; (20B)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (7B)&lt;/td&gt;
&lt;td&gt;~6-8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any F16/Q8 preset&lt;/td&gt;
&lt;td&gt;Q4_K_M equivalent&lt;/td&gt;
&lt;td&gt;~50 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
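&lt;p&gt;The last row&amp;rsquo;s savings follow from bits-per-weight. A back-of-envelope sketch in Python (the bpw figures are rough community estimates for llama.cpp quantizations, not SDK-published numbers):&lt;/p&gt;

```python
# Back-of-envelope GGUF size from bits-per-weight. The bpw values are rough
# community estimates for llama.cpp quantizations, not SDK-published figures.
def approx_size_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8

q8 = approx_size_gb(7, 8.5)    # roughly 7.4 GB for a 7B model
q4 = approx_size_gb(7, 4.85)   # roughly 4.2 GB
print(round(q4 / q8, 2))  # 0.57
```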
&lt;h3 id=&#34;4-cap-concurrent-sessions&#34;&gt;4. Cap concurrent sessions&lt;/h3&gt;
&lt;p&gt;In multi-user hosts, cap the active session count. A back-of-envelope budget:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;max_sessions = (available_memory - model_weights - overhead) / per_session_kv_budget
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Use &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt; for concrete numbers.&lt;/p&gt;
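&lt;p&gt;The budget can be instantiated with illustrative numbers for a 24 GB host (all values below are assumptions, not measured SDK figures):&lt;/p&gt;

```python
# The session budget formula with illustrative numbers (assumptions,
# not measured SDK figures) for a 24 GB host.
available_gb = 24.0
model_weights_gb = 4.6   # e.g. a 7B Q4_K_M model
overhead_gb = 1.5        # runtime buffers and OS headroom
per_session_kv_gb = 0.5  # KV slice per active session at a 4K context

max_sessions = int((available_gb - model_weights_gb - overhead_gb) / per_session_kv_gb)
print(max_sessions)  # 35
```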
&lt;p&gt;Evict idle sessions by periodically disposing &lt;code&gt;AsposeLLMApi&lt;/code&gt; and recreating it. The current SDK does not provide an explicit per-session evict API.&lt;/p&gt;
&lt;h3 id=&#34;5-recycle-the-api-instance&#34;&gt;5. Recycle the API instance&lt;/h3&gt;
&lt;p&gt;In long-running hosts, the KV cache and native buffers can grow beyond initial estimates. Periodically:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dispose the current &lt;code&gt;AsposeLLMApi&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Wait for native memory to be released.&lt;/li&gt;
&lt;li&gt;Create a new instance.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Expect a 5-30 second restart cost even with warm caches.&lt;/p&gt;
&lt;h3 id=&#34;6-on-unified-memory-apple-silicon&#34;&gt;6. On unified memory (Apple Silicon)&lt;/h3&gt;
&lt;p&gt;There is no separate VRAM to optimize — everything is RAM. Apply system-RAM reductions: smaller model, shorter context, KV quantization.&lt;/p&gt;
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Measure peak memory during load tests. Budget against the measured peak, not theoretical estimates.&lt;/li&gt;
&lt;li&gt;Run with &lt;code&gt;EnableDebugLogging = true&lt;/code&gt; in staging and watch &lt;code&gt;[KV]&lt;/code&gt; lines to track cache growth.&lt;/li&gt;
&lt;li&gt;Size the host for your expected session concurrency at your chosen preset — do not size for the minimum case.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low memory tuning&lt;/a&gt; — recipes for memory-constrained hosts.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt; — predictive sizing.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — KV cache dtype and flash attention.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: GPU not detected</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/gpu-not-detected/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/gpu-not-detected/</guid>
      <description>
        
        
        &lt;p&gt;You wanted the SDK to use the GPU, but inference is slow and &lt;code&gt;nvidia-smi&lt;/code&gt; (or equivalent) shows no activity from the process. This page walks through the detection pipeline and common misconfigurations.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Inference runs at CPU speed (5-15 tokens/sec) despite a GPU being present.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nvidia-smi&lt;/code&gt; does not list the Aspose.LLM process.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rocm-smi&lt;/code&gt; shows zero utilization.&lt;/li&gt;
&lt;li&gt;Logs do not mention CUDA / HIP / Metal / Vulkan initialization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;The SDK picks a backend in two stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;BinaryManager&lt;/code&gt;&lt;/strong&gt; downloads a native binary matching &lt;code&gt;BinaryManagerParameters.PreferredAcceleration&lt;/code&gt; (or auto-detection). The binary dictates what GPU APIs are available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Engine&lt;/code&gt;&lt;/strong&gt; respects &lt;code&gt;BaseModelInferenceParameters.GpuLayers&lt;/code&gt; — if &lt;code&gt;0&lt;/code&gt;, the model stays on CPU even if the binary supports GPU.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Either stage can silently fall back to CPU.&lt;/p&gt;
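&lt;p&gt;The two-stage selection can be sketched as follows (illustrative logic, not SDK source):&lt;/p&gt;

```python
# Sketch of the two-stage fallback (illustrative logic, not SDK source):
# stage 1 fixes the binary's capabilities, stage 2 applies GpuLayers.
def effective_backend(binary_supports_gpu, gpu_layers):
    if not binary_supports_gpu or gpu_layers == 0:
        return "cpu"
    return "gpu"

print(effective_backend(True, 0))    # cpu: GPU binary, but no layers offloaded
print(effective_backend(False, 999)) # cpu: CPU-only binary wins regardless
print(effective_backend(True, 999))  # gpu
```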
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-check-the-downloaded-binary&#34;&gt;1. Check the downloaded binary&lt;/h3&gt;
&lt;p&gt;Enable debug logging and look for the binary selection line in logs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[BinaryManager] resolved asset: llama-b8816-bin-win-cuda-cu12.4-x64.zip
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If the asset name says &lt;code&gt;cpu&lt;/code&gt; or does not mention CUDA/HIP/Metal/Vulkan, the &lt;code&gt;BinaryManager&lt;/code&gt; did not detect the GPU. Fix it by forcing the acceleration type:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then clear the binary cache at &lt;code&gt;BinaryManagerParameters.BinaryPath&lt;/code&gt; and re-run to download the GPU variant.&lt;/p&gt;
&lt;h3 id=&#34;2-check-the-driver&#34;&gt;2. Check the driver&lt;/h3&gt;
&lt;p&gt;On Linux / Windows with NVIDIA:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;nvidia-smi
&lt;span class=&#34;c1&#34;&gt;# Must show Driver Version &amp;gt;= 525 and the GPU model.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; does not find the GPU, the driver is not installed or the GPU is not accessible (a containerized environment without the &lt;code&gt;--gpus all&lt;/code&gt; flag, or a host with no NVIDIA GPU).&lt;/p&gt;
&lt;p&gt;On Linux with AMD:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;rocminfo
&lt;span class=&#34;c1&#34;&gt;# Must list your GPU under Agent information.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On macOS:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;system_profiler SPDisplaysDataType &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; grep &lt;span class=&#34;s2&#34;&gt;&amp;#34;Chipset Model&amp;#34;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Must show Apple M-series for Metal support.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;3-verify-gpulayers&#34;&gt;3. Verify &lt;code&gt;GpuLayers&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Even with the right binary, &lt;code&gt;GpuLayers = 0&lt;/code&gt; forces CPU. Set it explicitly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;999&lt;/code&gt; is the idiomatic &amp;ldquo;full offload&amp;rdquo; value; the engine caps it at the model&amp;rsquo;s actual layer count.&lt;/p&gt;
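&lt;p&gt;The capping behavior amounts to a simple minimum (illustrative sketch, not SDK source):&lt;/p&gt;

```python
# Sketch of the capping behavior (illustrative, not SDK source): requesting
# 999 layers on a 32-layer model offloads all 32; a smaller request is honored.
def effective_gpu_layers(requested, model_layer_count):
    return min(requested, model_layer_count)

print(effective_gpu_layers(999, 32))  # 32
print(effective_gpu_layers(20, 32))   # 20
```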
&lt;h3 id=&#34;4-check-for-conflicting-environment-variables&#34;&gt;4. Check for conflicting environment variables&lt;/h3&gt;
&lt;p&gt;NVIDIA environment variables can hide GPUs from the process:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;$CUDA_VISIBLE_DEVICES&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# If set to empty or -1, no GPU is visible.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Unset or set to &lt;code&gt;0&lt;/code&gt; (or a valid GPU index):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;unset&lt;/span&gt; CUDA_VISIBLE_DEVICES
&lt;span class=&#34;c1&#34;&gt;# or&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For HIP: &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; and &lt;code&gt;HIP_VISIBLE_DEVICES&lt;/code&gt; play the same role.&lt;/p&gt;
&lt;h3 id=&#34;5-container--wsl2-specifics&#34;&gt;5. Container / WSL2 specifics&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;: you must start containers with &lt;code&gt;--gpus all&lt;/code&gt; (NVIDIA) or &lt;code&gt;--device=/dev/kfd --device=/dev/dri&lt;/code&gt; (AMD ROCm). Without these flags, the container has no GPU access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WSL2&lt;/strong&gt; on Windows: install the NVIDIA driver on the Windows side; install CUDA inside WSL following NVIDIA&amp;rsquo;s WSL2 guide. Older Windows + WSL combinations do not support CUDA in WSL — upgrade to Windows 11 and a current WSL build.&lt;/p&gt;
&lt;h3 id=&#34;6-fall-back-to-vulkan&#34;&gt;6. Fall back to Vulkan&lt;/h3&gt;
&lt;p&gt;If CUDA / HIP setup is impractical (custom kernels, container limitations), try Vulkan:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Vulkan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Vulkan runs on NVIDIA, AMD, and Intel GPUs with standard drivers. Performance is typically 20-40 % below CUDA but well above CPU.&lt;/p&gt;
&lt;h3 id=&#34;7-windows-users-with-amd--use-vulkan&#34;&gt;7. Windows users with AMD — use Vulkan&lt;/h3&gt;
&lt;p&gt;Aspose.LLM does not ship HIP binaries for Windows. On Windows with AMD, Vulkan is the only GPU path.&lt;/p&gt;
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;During deployment, assert GPU is active with a small probe:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// After Create, a short inference should be fast on GPU.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Say ok.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Elapsed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TotalSeconds&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;_logger&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LogWarning&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Inference is slow - GPU may not be active.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Log the chosen acceleration at startup for auditability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor GPU utilization in production (Datadog, Prometheus) to catch silent CPU fallback.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/&#34;&gt;Acceleration&lt;/a&gt; — detailed per-backend setup.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers&lt;/code&gt;, &lt;code&gt;SplitMode&lt;/code&gt;, &lt;code&gt;MainGpu&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Model not loading</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/model-not-loading/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/model-not-loading/</guid>
      <description>
        
        
&lt;p&gt;The SDK fails during model load: the stage of &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt; that downloads and initializes the GGUF file. This page covers the usual causes.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;InvalidOperationException&lt;/code&gt; during &lt;code&gt;Create&lt;/code&gt;, often wrapping a lower-level error.&lt;/li&gt;
&lt;li&gt;Log messages mentioning &amp;ldquo;failed to load model&amp;rdquo;, &amp;ldquo;invalid magic number&amp;rdquo;, or a specific &lt;code&gt;llama_load_*&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Native segfaults or access-violation exceptions during load (rare).&lt;/li&gt;
&lt;li&gt;Download completes but subsequent load fails.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;Several distinct failure modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Corrupted download&lt;/strong&gt; — partial or interrupted download; bad cached file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unsupported model architecture&lt;/strong&gt; — a GGUF whose architecture is not supported by the pinned &lt;code&gt;ReleaseTag&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wrong file name&lt;/strong&gt; — the file exists but does not match the expected quantization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disk or permission issues&lt;/strong&gt; — the cache directory is not writable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;llama.cpp&lt;/code&gt; release mismatch&lt;/strong&gt; — a tag older than the model&amp;rsquo;s architecture.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-verify-the-cached-file&#34;&gt;1. Verify the cached file&lt;/h3&gt;
&lt;p&gt;Look at &lt;code&gt;EngineParameters.ModelCachePath&lt;/code&gt; (default &lt;code&gt;&amp;lt;LocalAppData&amp;gt;/Aspose.LLM/models&lt;/code&gt;) for the file the preset references. Check its size against the Hugging Face listing:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;ls -la ~/.local/share/Aspose.LLM/models
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the size is far below the expected value, the download was truncated.&lt;/p&gt;
&lt;h3 id=&#34;2-delete-the-cached-file-and-retry&#34;&gt;2. Delete the cached file and retry&lt;/h3&gt;
&lt;p&gt;Clear the partial/corrupt file and let the SDK re-download:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;rm ~/.local/share/Aspose.LLM/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Re-run your program. The SDK downloads the file again.&lt;/p&gt;
&lt;h3 id=&#34;3-validate-the-gguf&#34;&gt;3. Validate the GGUF&lt;/h3&gt;
&lt;p&gt;Enable &lt;code&gt;CheckTensors&lt;/code&gt; on the inference parameters to validate every tensor during load:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CheckTensors&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Start-up takes longer, but you get clear errors on malformed tensors. If validation fails, the file is corrupt — delete and re-download.&lt;/p&gt;
&lt;p&gt;Disable &lt;code&gt;CheckTensors&lt;/code&gt; in production after confirming the file is good.&lt;/p&gt;
&lt;h3 id=&#34;4-confirm-the-model-is-supported&#34;&gt;4. Confirm the model is supported&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; supports a fixed set of architectures per release. A brand-new model might not be supported by the default &lt;code&gt;ReleaseTag = &amp;quot;b8816&amp;quot;&lt;/code&gt;. Check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The architecture name in the model&amp;rsquo;s Hugging Face README (e.g., &amp;ldquo;Qwen2&amp;rdquo;, &amp;ldquo;Llama&amp;rdquo;, &amp;ldquo;Gemma&amp;rdquo;, &amp;ldquo;Phi&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; release notes for the tag you are using.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the architecture is newer than the tag supports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Switch to a newer &lt;code&gt;ReleaseTag&lt;/code&gt; if one exists (tested and validated by the Aspose team).&lt;/li&gt;
&lt;li&gt;Fall back to a comparable model with a supported architecture.&lt;/li&gt;
&lt;li&gt;File a &lt;a href=&#34;https://forum.aspose.com/&#34;&gt;support request&lt;/a&gt; if the architecture is critical for your use case.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;5-confirm-the-file-name-is-correct&#34;&gt;5. Confirm the file name is correct&lt;/h3&gt;
&lt;p&gt;Hugging Face repos often have many quantization variants. The preset&amp;rsquo;s default file name matches one of them; if the file has been removed or renamed upstream, the download fails.&lt;/p&gt;
&lt;p&gt;Check the repo:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/tree/main
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Verify &lt;code&gt;BaseModelSourceParameters.HuggingFaceFileName&lt;/code&gt; matches an existing file. Override if needed:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelSourceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HuggingFaceFileName&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Qwen2.5-7B-Instruct-Q5_K_M.gguf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;6-check-directory-permissions&#34;&gt;6. Check directory permissions&lt;/h3&gt;
&lt;p&gt;On Linux / macOS, confirm write access to the cache folder:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;ls -la ~/.local/share/Aspose.LLM/
&lt;span class=&#34;c1&#34;&gt;# Expect the folder to be owned by the user running the process.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On Windows, check folder ACLs — especially when the process runs under a service account different from the install user.&lt;/p&gt;
&lt;h3 id=&#34;7-check-disk-space&#34;&gt;7. Check disk space&lt;/h3&gt;
&lt;p&gt;Large models (20B+) need 10-20+ GB free at both the cache location and temp directories used during extraction.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;df -h ~/.local/share/Aspose.LLM/models
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;vision-specific-mmproj&#34;&gt;Vision-specific: mmproj&lt;/h2&gt;
&lt;p&gt;Vision presets load both the base model and a projector. If the base model loads but the projector fails, the error message mentions &lt;code&gt;mmproj&lt;/code&gt; or &lt;code&gt;mtmd_init_from_file&lt;/code&gt;. Apply the same checks to &lt;code&gt;MmprojSourceParameters&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MmprojSourceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HuggingFaceFileName&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;mmproj-F16.gguf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Pre-download and validate models in your CI / build pipeline. Failing early in CI beats failing at runtime.&lt;/li&gt;
&lt;li&gt;Commit manifest files that record the expected model hash alongside the preset selection. Compare on load.&lt;/li&gt;
&lt;li&gt;Pin a tested &lt;code&gt;ReleaseTag&lt;/code&gt; — do not float on defaults across SDK upgrades without testing.&lt;/li&gt;
&lt;/ul&gt;
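&lt;p&gt;A manifest check can be as simple as comparing a SHA-256 digest of the cached GGUF against a committed value. The snippet below is a minimal sketch, not an SDK API — the manifest file name and model file name are placeholders for your own setup:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Hypothetical manifest check — adjust paths and file names to your deployment.
using System;
using System.IO;
using System.Security.Cryptography;

// The documented default cache location: &amp;lt;LocalAppData&amp;gt;/Aspose.LLM/models.
var cacheDir = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    &amp;#34;Aspose.LLM&amp;#34;, &amp;#34;models&amp;#34;);
var modelPath = Path.Combine(cacheDir, &amp;#34;Qwen2.5-7B-Instruct-Q4_K_M.gguf&amp;#34;);

// &amp;#34;model.sha256&amp;#34; is a placeholder manifest committed alongside the preset selection.
var expected = File.ReadAllText(&amp;#34;model.sha256&amp;#34;).Trim();

using var sha = SHA256.Create();
using var fs = File.OpenRead(modelPath);
var actual = Convert.ToHexString(sha.ComputeHash(fs));

if (!string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase))
    throw new InvalidOperationException($&amp;#34;Model hash mismatch: {actual}&amp;#34;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run the same check in CI against a freshly downloaded file so a corrupt or replaced upstream asset is caught before deployment.&lt;/p&gt;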
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-source/&#34;&gt;Model source parameters&lt;/a&gt; — priority and resolution.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt; — confirmed compatible models and files.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/troubleshooting/binary-download-fails/&#34;&gt;Binary download fails&lt;/a&gt; — related network issues.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Garbled output</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/garbled-output/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/garbled-output/</guid>
      <description>
        
        
        &lt;p&gt;The model loaded successfully, but its replies are nonsensical, repetitive, or broken in a way that points to misconfiguration rather than a model problem.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Replies contain literal marker tokens like &lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;image&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The model produces the same phrase or token repeatedly.&lt;/li&gt;
&lt;li&gt;Output is truncated mid-sentence after exactly N tokens.&lt;/li&gt;
&lt;li&gt;Output is coherent at first, then devolves into nonsense after a few hundred tokens.&lt;/li&gt;
&lt;li&gt;Vision replies describe something different from the actual image.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chat template mismatch&lt;/strong&gt; — the engine picked the wrong template for the model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repetition penalty too low&lt;/strong&gt; (or zero) — the model loops.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;MaxTokens&lt;/code&gt; too low&lt;/strong&gt; for a reasoning model — truncation mid-reasoning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV cache cleanup dropped important context&lt;/strong&gt; — typically in the middle of a long session.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wrong preset for the model&lt;/strong&gt; — a custom GGUF paired with a preset that does not match its architecture.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggressive KV quantization&lt;/strong&gt; — on long contexts, Q4 K/V can degrade quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-literal-marker-tokens-in-the-output&#34;&gt;1. Literal marker tokens in the output&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: template mismatch. The engine fell back to a generic template; the model&amp;rsquo;s actual markers appear as text.&lt;/p&gt;
&lt;p&gt;Enable debug logging and look for &lt;code&gt;[MM] selected template: ...&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EngineParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EnableDebugLogging&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the template is &lt;code&gt;fallback&lt;/code&gt; or does not match your model family, the model needs a supported template. Options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Verify you are using the correct built-in preset (see &lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;For a custom GGUF, try a different export from the same model with richer metadata.&lt;/li&gt;
&lt;li&gt;For vision: see &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/chat-templates/&#34;&gt;Chat templates&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-repetition-loops&#34;&gt;2. Repetition loops&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: the model generates the same phrase in a loop.&lt;/p&gt;
&lt;p&gt;Raise repetition penalty:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;RepetitionPenalty&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.15f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// default 1.1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If that still loops, try DRY (Don&amp;rsquo;t Repeat Yourself):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DryMultiplier&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.8f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DryBase&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.75f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DryAllowedLength&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/sampler/&#34;&gt;Sampler parameters&lt;/a&gt; for the full repetition knob set.&lt;/p&gt;
&lt;h3 id=&#34;3-truncated-output-cuts-off-mid-sentence&#34;&gt;3. Truncated output (cuts off mid-sentence)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: &lt;code&gt;MaxTokens&lt;/code&gt; limit hit before the model finished.&lt;/p&gt;
&lt;p&gt;Raise the budget:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MaxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// up from the default 2048; raise further for long answers
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For reasoning models (Qwen3, DeepSeek-R1) that emit &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks, budget an extra 1024-2048 tokens for the thinking block on top of the visible answer.&lt;/p&gt;
&lt;h3 id=&#34;4-coherent-start-garbled-after-a-while&#34;&gt;4. Coherent start, garbled after a while&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: KV cache cleanup evicted critical context mid-session, or KV quantization is too aggressive at long contexts.&lt;/p&gt;
&lt;p&gt;Change the cleanup policy:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CacheCleanupStrategy&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;CacheCleanupStrategy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;KeepSystemPromptAndFirstUserMessage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or upgrade KV precision:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;F16 is the safe default; drop to Q8_0 only under memory pressure.&lt;/p&gt;
&lt;h3 id=&#34;5-model-produces-the-wrong-answer-for-clear-questions&#34;&gt;5. Model produces the wrong answer for clear questions&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: custom GGUF paired with a wrong preset, or the preset&amp;rsquo;s sampler settings do not match the model&amp;rsquo;s training distribution.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use a built-in preset for the model family, or set up a &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/custom-preset/&#34;&gt;custom preset&lt;/a&gt; from scratch rather than reusing an unrelated built-in.&lt;/li&gt;
&lt;li&gt;Start with conservative sampler settings: &lt;code&gt;Temperature = 0.3&lt;/code&gt;, &lt;code&gt;TopP = 0.9&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Test with the reference prompt from the model&amp;rsquo;s Hugging Face page to rule out sampling issues.&lt;/li&gt;
&lt;/ul&gt;
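&lt;p&gt;As a concrete starting point, the conservative settings above can be applied like this (&lt;code&gt;Temperature&lt;/code&gt; and &lt;code&gt;TopP&lt;/code&gt; are assumed to follow the &lt;code&gt;SamplerParameters&lt;/code&gt; naming used elsewhere on this page — verify against the sampler parameters reference):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Conservative sampling: low temperature, standard nucleus cutoff.
preset.SamplerParameters.Temperature = 0.3f;
preset.SamplerParameters.TopP = 0.9f;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Once the model answers correctly at these settings, relax them gradually toward the values recommended on the model&amp;rsquo;s Hugging Face page.&lt;/p&gt;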
&lt;h3 id=&#34;6-vision-reply-describes-the-wrong-thing&#34;&gt;6. Vision: reply describes the wrong thing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: the image was not delivered correctly to the model.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enable &lt;code&gt;MtmdContextParameters.PrintTimings = true&lt;/code&gt; to verify the image was processed.&lt;/li&gt;
&lt;li&gt;Enable debug logging and look for &lt;code&gt;[MM]&lt;/code&gt; lines — confirm image chunks are tokenized.&lt;/li&gt;
&lt;li&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/debugging-vision/&#34;&gt;Debugging vision&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
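&lt;p&gt;Both diagnostics can be switched on together before &lt;code&gt;Create&lt;/code&gt; — a short sketch using the parameters named above:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Log image-processing timings and emit [MM] template/tokenization lines.
preset.MtmdContextParameters.PrintTimings = true;
preset.EngineParameters.EnableDebugLogging = true;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;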
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Stick to built-in presets when possible.&lt;/li&gt;
&lt;li&gt;Test new models with simple prompts first; check the output matches the model&amp;rsquo;s reference outputs.&lt;/li&gt;
&lt;li&gt;Run with debug logging in staging to catch template fallbacks before production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/sampler/&#34;&gt;Sampler parameters&lt;/a&gt; — repetition penalties, DRY.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/chat/&#34;&gt;Chat parameters&lt;/a&gt; — &lt;code&gt;MaxTokens&lt;/code&gt;, &lt;code&gt;CacheCleanupStrategy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/chat-templates/&#34;&gt;Chat templates&lt;/a&gt; — vision template selection.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/multimodal/debugging-vision/&#34;&gt;Debugging vision&lt;/a&gt; — multimodal-specific diagnosis.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: License errors</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/license-errors/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/license-errors/</guid>
      <description>
        
        
        &lt;p&gt;License errors appear when a chat method is called without a valid license. Aspose.LLM does not have an evaluation fallback for inference — every chat operation requires a license.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;p&gt;The chat APIs throw:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;System.Exception: Not licensed for this method
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or &lt;code&gt;License.IsLicensed&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt; when you expected &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;p&gt;Several distinct cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SetLicense&lt;/code&gt; was not called before the chat method.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SetLicense&lt;/code&gt; threw, and your code continued past the failure.&lt;/li&gt;
&lt;li&gt;The license file path does not resolve to an actual file.&lt;/li&gt;
&lt;li&gt;The embedded resource name is wrong.&lt;/li&gt;
&lt;li&gt;The license is expired (especially temporary licenses, typically 30 days).&lt;/li&gt;
&lt;li&gt;The license is corrupted (partial download, file system damage).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-confirm-setlicense-was-called&#34;&gt;1. Confirm &lt;code&gt;SetLicense&lt;/code&gt; was called&lt;/h3&gt;
&lt;p&gt;Every process that calls chat methods must apply the license once. A common mistake is calling &lt;code&gt;SetLicense&lt;/code&gt; in one context (e.g., a helper class constructor) while the chat API runs in a different process where the license was never applied.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Aspose&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLM&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;License&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;// Immediately after, verify:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;IsLicensed: {Aspose.LLM.License.IsLicensed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Should print True. If False, SetLicense failed.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;2-catch-exceptions-from-setlicense&#34;&gt;2. Catch exceptions from &lt;code&gt;SetLicense&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;SetLicense&lt;/code&gt; itself can throw when the file is missing, corrupt, or wrong format. Catch and log:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Aspose&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLM&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;License&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Exception&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;_logger&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LogError&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ex&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;License could not be applied.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;throw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The exception message usually states the cause (file not found, parse error, signature check failed).&lt;/p&gt;
&lt;h3 id=&#34;3-verify-the-license-file-path&#34;&gt;3. Verify the license file path&lt;/h3&gt;
&lt;p&gt;When you pass only a file name, &lt;code&gt;SetLicense&lt;/code&gt; searches several locations (see &lt;a href=&#34;https://docs.aspose.com/llm/net/licensing/&#34;&gt;Licensing&lt;/a&gt;). If the file lives elsewhere, pass the full path:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;@&amp;#34;C:\licenses\Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm the file is copied to the process working directory or bin folder during build:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span class=&#34;nt&#34;&gt;&amp;lt;ItemGroup&amp;gt;&lt;/span&gt;
  &lt;span class=&#34;nt&#34;&gt;&amp;lt;None&lt;/span&gt; &lt;span class=&#34;na&#34;&gt;Update=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&#34;nt&#34;&gt;&amp;lt;CopyToOutputDirectory&amp;gt;&lt;/span&gt;PreserveNewest&lt;span class=&#34;nt&#34;&gt;&amp;lt;/CopyToOutputDirectory&amp;gt;&lt;/span&gt;
  &lt;span class=&#34;nt&#34;&gt;&amp;lt;/None&amp;gt;&lt;/span&gt;
&lt;span class=&#34;nt&#34;&gt;&amp;lt;/ItemGroup&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;4-embedded-resource--check-the-name&#34;&gt;4. Embedded resource — check the name&lt;/h3&gt;
&lt;p&gt;For embedded licenses, the resource name must match the file name exactly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span class=&#34;nt&#34;&gt;&amp;lt;ItemGroup&amp;gt;&lt;/span&gt;
  &lt;span class=&#34;nt&#34;&gt;&amp;lt;EmbeddedResource&lt;/span&gt; &lt;span class=&#34;na&#34;&gt;Include=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&#34;nt&#34;&gt;&amp;lt;/ItemGroup&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// matches the file name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the file is in a subfolder (e.g., &lt;code&gt;Resources/Aspose.LLM.lic&lt;/code&gt;), the resource name becomes &lt;code&gt;&amp;lt;Namespace&amp;gt;.Resources.Aspose.LLM.lic&lt;/code&gt;. Match that name exactly, or move the file to the project root.&lt;/p&gt;
&lt;h3 id=&#34;5-temporary-license-expired&#34;&gt;5. Temporary license expired&lt;/h3&gt;
&lt;p&gt;Temporary licenses issued by &lt;a href=&#34;https://purchase.aspose.com/temporary-license&#34;&gt;purchase.aspose.com/temporary-license&lt;/a&gt; have a fixed expiry date (typically 30 days). After expiry, chat methods throw the same &lt;code&gt;Not licensed for this method&lt;/code&gt; error.&lt;/p&gt;
&lt;p&gt;Options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Request a new temporary license.&lt;/li&gt;
&lt;li&gt;Purchase a commercial license.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The application code does not change — swap the &lt;code&gt;.lic&lt;/code&gt; file.&lt;/p&gt;
&lt;h3 id=&#34;6-corrupt-license-file&#34;&gt;6. Corrupt license file&lt;/h3&gt;
&lt;p&gt;If the file is truncated or modified after issue, signature validation fails. Re-download from the Aspose purchase portal.&lt;/p&gt;
&lt;h3 id=&#34;7-stream-based-license-from-a-failed-source&#34;&gt;7. Stream-based license from a failed source&lt;/h3&gt;
&lt;p&gt;If you load via a stream:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stream&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;File&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OpenRead&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stream&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Ensure the stream is at position 0 before &lt;code&gt;SetLicense&lt;/code&gt;. If an earlier read consumed bytes, &lt;code&gt;SetLicense&lt;/code&gt; sees truncated data.&lt;/p&gt;
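&lt;p&gt;If the same stream may have been read earlier, rewind it before applying the license — a minimal sketch using standard &lt;code&gt;Stream&lt;/code&gt; members:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Rewind in case an earlier read (e.g., logging or validation) consumed bytes.
if (stream.CanSeek)
    stream.Position = 0;
license.SetLicense(stream);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;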
&lt;h2 id=&#34;prevention&#34;&gt;Prevention&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Apply the license at application startup, once, with explicit exception handling.&lt;/li&gt;
&lt;li&gt;Log &lt;code&gt;License.IsLicensed&lt;/code&gt; immediately after &lt;code&gt;SetLicense&lt;/code&gt; to confirm.&lt;/li&gt;
&lt;li&gt;Monitor temporary license expiry dates — have a calendar reminder 7 days before expiry.&lt;/li&gt;
&lt;li&gt;In CI/CD, use an embedded license or pull from a secret store rather than bundling files.&lt;/li&gt;
&lt;li&gt;For air-gapped deployments, copy the license with the rest of the deployment artifacts.&lt;/li&gt;
&lt;/ul&gt;
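&lt;p&gt;The first two points can be sketched together at startup (a minimal sketch; the &lt;code&gt;License&lt;/code&gt; type and &lt;code&gt;License.IsLicensed&lt;/code&gt; are described on the licensing pages, and the console calls are placeholders for your logger):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-csharp&#34;&gt;var license = new License();
try
{
    license.SetLicense(&amp;#34;Aspose.LLM.lic&amp;#34;);
}
catch (Exception ex)
{
    // Fail fast: a broken license surfaces here, not on the first chat call.
    Console.Error.WriteLine($&amp;#34;License could not be applied: {ex.Message}&amp;#34;);
    throw;
}

// Confirm immediately after SetLicense.
Console.WriteLine($&amp;#34;Licensed: {License.IsLicensed}&amp;#34;);
&lt;/code&gt;&lt;/pre&gt;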
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/licensing/&#34;&gt;Licensing&lt;/a&gt; — full license setup (file, stream, embedded resource, temporary).&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/license/&#34;&gt;License class reference&lt;/a&gt; — API surface of &lt;code&gt;License&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/asposellmapi/&#34;&gt;AsposeLLMApi facade&lt;/a&gt; — where license checks sit in the chat API surface.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Performance issues</title>
      <link>https://docs.aspose.com/llm/net/troubleshooting/performance-issues/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/troubleshooting/performance-issues/</guid>
      <description>
        
        
        &lt;p&gt;The SDK loads and runs, but throughput is below expectations or first-token latency is too high. This page covers the levers that affect performance.&lt;/p&gt;
&lt;h2 id=&#34;symptom&#34;&gt;Symptom&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Fewer tokens per second than expected for your hardware.&lt;/li&gt;
&lt;li&gt;First-token latency of several seconds even after warm-up.&lt;/li&gt;
&lt;li&gt;Throughput fluctuates: fast for a while, then slow.&lt;/li&gt;
&lt;li&gt;Occasional stalls mid-response.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cause&#34;&gt;Cause&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Wrong acceleration backend or silent CPU fallback.&lt;/li&gt;
&lt;li&gt;Suboptimal threading configuration.&lt;/li&gt;
&lt;li&gt;Flash attention not enabled.&lt;/li&gt;
&lt;li&gt;Memory pressure (swap thrashing, KV cache too large for VRAM).&lt;/li&gt;
&lt;li&gt;Competing CPU-heavy processes on the same host.&lt;/li&gt;
&lt;li&gt;Context size too large for the hardware.&lt;/li&gt;
&lt;li&gt;First request of a new session (fresh prefill).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resolution&#34;&gt;Resolution&lt;/h2&gt;
&lt;h3 id=&#34;1-confirm-the-right-backend&#34;&gt;1. Confirm the right backend&lt;/h3&gt;
&lt;p&gt;Enable debug logging and confirm the binary variant and acceleration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[BinaryManager] resolved asset: llama-b8816-bin-win-cuda-cu12.4-x64.zip
[Engine] inference on CUDA with 32/32 layers offloaded
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If the variant says &lt;code&gt;cpu&lt;/code&gt; while you have a GPU, see &lt;a href=&#34;https://docs.aspose.com/llm/net/troubleshooting/gpu-not-detected/&#34;&gt;GPU not detected&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;2-verify-gpulayers&#34;&gt;2. Verify &lt;code&gt;GpuLayers&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Make sure &lt;code&gt;GpuLayers&lt;/code&gt; is high enough to offload the model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Partial offload (e.g., &lt;code&gt;GpuLayers = 20&lt;/code&gt;) on an 8B model leaves a substantial share of the layers on the CPU; the GPU cannot accelerate layers it does not hold.&lt;/p&gt;
&lt;h3 id=&#34;3-enable-flash-attention&#34;&gt;3. Enable flash attention&lt;/h3&gt;
&lt;p&gt;Flash attention is a near-universal win and rarely hurts:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Particularly important for contexts beyond 8K tokens.&lt;/p&gt;
&lt;h3 id=&#34;4-tune-threads-on-cpu&#34;&gt;4. Tune threads on CPU&lt;/h3&gt;
&lt;p&gt;For CPU inference, &lt;code&gt;NThreads&lt;/code&gt; and &lt;code&gt;NThreadsBatch&lt;/code&gt; matter:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreads&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ProcessorCount&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreadsBatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ProcessorCount&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/&#34;&gt;CPU acceleration&lt;/a&gt; for the full rationale. Benchmark on your specific host to find the sweet spot; adding generation threads beyond 8-12 often reduces throughput.&lt;/p&gt;
&lt;h3 id=&#34;5-warm-up-sessions&#34;&gt;5. Warm up sessions&lt;/h3&gt;
&lt;p&gt;First token on a fresh session includes prefill time (tokenizing and evaluating the system prompt + history). Amortize by reusing sessions:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// Reuse one session per user instead of creating fresh each request.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/&#34;&gt;Reduce first-token latency&lt;/a&gt;.&lt;/p&gt;
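&lt;p&gt;As a sketch, keep one long-lived session per user instead of paying full prefill on every request (the session type and &lt;code&gt;CreateChatSession&lt;/code&gt; call are hypothetical placeholders; substitute the session API your SDK version exposes):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-csharp&#34;&gt;// Hypothetical per-user session cache: first request per user pays
// prefill once; later requests reuse the warmed session.
static readonly ConcurrentDictionary&amp;lt;string, IChatSession&amp;gt; Sessions = new();

IChatSession GetSession(string userId) =&amp;gt;
    Sessions.GetOrAdd(userId, _ =&amp;gt; api.CreateChatSession());
&lt;/code&gt;&lt;/pre&gt;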
&lt;h3 id=&#34;6-shrink-contextsize-if-you-do-not-use-the-full-window&#34;&gt;6. Shrink &lt;code&gt;ContextSize&lt;/code&gt; if you do not use the full window&lt;/h3&gt;
&lt;p&gt;Longer contexts are slower per token even when mostly empty, because attention over the KV cache scales with the number of positions. Drop &lt;code&gt;ContextSize&lt;/code&gt; to the actual maximum you need:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;8192&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;7-check-for-competing-cpu-load&#34;&gt;7. Check for competing CPU load&lt;/h3&gt;
&lt;p&gt;Another CPU-heavy process on the same host steals throughput:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;top&lt;/code&gt; / &lt;code&gt;htop&lt;/code&gt; on Linux/macOS.&lt;/li&gt;
&lt;li&gt;Task Manager → Details on Windows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Schedule inference and other CPU-heavy tasks so they do not overlap; avoid running antivirus scans, backup jobs, or compilers alongside inference.&lt;/p&gt;
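&lt;p&gt;Within your own process, a shared gate keeps inference and other heavy jobs from overlapping (a generic .NET pattern, not an SDK feature; &lt;code&gt;api.SendMessageAsync&lt;/code&gt; is used as in the measurement example on this page):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-csharp&#34;&gt;// One permit: inference and other CPU-heavy jobs take turns.
static readonly SemaphoreSlim CpuGate = new(1, 1);

async Task&amp;lt;string&amp;gt; InferSerializedAsync(string prompt)
{
    await CpuGate.WaitAsync();
    try
    {
        return await api.SendMessageAsync(prompt);
    }
    finally
    {
        CpuGate.Release();
    }
}
&lt;/code&gt;&lt;/pre&gt;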
&lt;h3 id=&#34;8-watch-for-memory-pressure&#34;&gt;8. Watch for memory pressure&lt;/h3&gt;
&lt;p&gt;Swap thrashing silently destroys throughput:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;free -h
&lt;span class=&#34;c1&#34;&gt;# Check &amp;#34;swap used&amp;#34;. Nonzero during inference = bad.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the host is swapping, reduce memory footprint (smaller model, shorter context, KV quantization).&lt;/p&gt;
&lt;h3 id=&#34;9-check-for-thermal-throttling&#34;&gt;9. Check for thermal throttling&lt;/h3&gt;
&lt;p&gt;Sustained high load heats the CPU and GPU; thermal throttling drops clocks and cuts throughput.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On laptops, plug into AC power.&lt;/li&gt;
&lt;li&gt;Verify cooling — clean dust, check fan RPM.&lt;/li&gt;
&lt;li&gt;On CPU: &lt;code&gt;watch -n 1 &#39;cat /proc/cpuinfo | grep MHz&#39;&lt;/code&gt; (Linux).&lt;/li&gt;
&lt;li&gt;On NVIDIA GPU: &lt;code&gt;nvidia-smi -q -d CLOCK&lt;/code&gt; (look for &lt;code&gt;Current&lt;/code&gt;-vs-&lt;code&gt;Base&lt;/code&gt; clock).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;10-mirostat--dynatemp-overhead&#34;&gt;10. Mirostat / dynatemp overhead&lt;/h3&gt;
&lt;p&gt;Advanced samplers like Mirostat and dynamic temperature add a small per-token overhead. If you are chasing the last 10% of throughput, disable them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Mirostat&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;DynatempRange&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;measure&#34;&gt;Measure&lt;/h2&gt;
&lt;p&gt;Before optimizing, establish a baseline:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;approxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sc&#34;&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;).&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Length&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// rough conversion
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;double&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tps&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;approxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Elapsed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TotalSeconds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;~{tps:F1} tok/s&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run with a representative prompt size; numbers vary wildly by prompt length.&lt;/p&gt;
&lt;h2 id=&#34;reference-throughput-numbers&#34;&gt;Reference throughput numbers&lt;/h2&gt;
&lt;p&gt;7B Q4_K_M at 4K context:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (i5, AVX2)&lt;/td&gt;
&lt;td&gt;5-10 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU (i7/i9, AVX-512)&lt;/td&gt;
&lt;td&gt;8-15 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 (CUDA)&lt;/td&gt;
&lt;td&gt;40-60 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090 (CUDA)&lt;/td&gt;
&lt;td&gt;100-140 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple M2 Pro (Metal)&lt;/td&gt;
&lt;td&gt;30-50 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple M3 Max (Metal)&lt;/td&gt;
&lt;td&gt;50-80 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your numbers are substantially below these, work through the resolution steps in order.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/tune-for-speed-vs-quality/&#34;&gt;Tune for speed vs quality&lt;/a&gt; — speed-biased configuration.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/&#34;&gt;Reduce first-token latency&lt;/a&gt; — cut TTFT.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/&#34;&gt;Acceleration&lt;/a&gt; — backend-specific tuning.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — batch sizes and threading.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
