<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation – Acceleration</title>
    <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/</link>
    <description>Recent content in Acceleration on Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://docs.aspose.com/llm/net/developer-reference/acceleration/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Net: CUDA</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/</guid>
      <description>
        
        
        &lt;p&gt;CUDA is the fastest backend for Aspose.LLM for .NET on NVIDIA GPUs. It supports single-GPU and multi-GPU setups, aggressive memory offload, and the full &lt;code&gt;llama.cpp&lt;/code&gt; feature set.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA with compute capability 5.0 or higher.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;: version 525 or later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CUDA runtime&lt;/strong&gt;: the one bundled with the downloaded native binary (typically CUDA 11.7 or 12.x). You do &lt;strong&gt;not&lt;/strong&gt; need to install CUDA separately; the SDK&amp;rsquo;s binary ships with the runtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Windows 10+ or Linux (glibc 2.28+). Not supported on macOS.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Verify driver and GPU with &lt;code&gt;nvidia-smi&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Check the &lt;code&gt;Driver Version&lt;/code&gt; (≥ 525) and the GPU model. Compute capability 5.0 covers every GeForce from the GTX 900 series (Maxwell) onward, plus recent Quadro, Tesla, and data-center GPUs.&lt;/p&gt;
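&lt;p&gt;The OS constraint can also be checked from managed code before committing to CUDA. A minimal sketch using standard .NET platform detection:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System.Runtime.InteropServices;

// CUDA binaries target Windows and Linux only (see the requirements above).
bool cudaEligible = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
                 || RuntimeInformation.IsOSPlatform(OSPlatform.Linux);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;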
&lt;h2 id=&#34;select-cuda&#34;&gt;Select CUDA&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// full offload
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On the first run, the SDK downloads the CUDA variant of &lt;code&gt;llama.cpp&lt;/code&gt; binaries (typically 400-800 MB) and caches it at &lt;code&gt;BinaryManagerParameters.BinaryPath&lt;/code&gt;.&lt;/p&gt;
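&lt;p&gt;For CI agents or offline hosts it can help to pin the cache to a known location. A sketch, assuming &lt;code&gt;BinaryPath&lt;/code&gt; accepts a writable directory (see the binary manager parameters page for its exact semantics):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Assumption: BinaryPath takes a writable cache directory.
preset.BinaryManagerParameters.BinaryPath = &amp;#34;./llm-binaries&amp;#34;;

using var api = AsposeLLMApi.Create(preset); // first run downloads here; later runs reuse the cache
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;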
&lt;h2 id=&#34;single-gpu&#34;&gt;Single GPU&lt;/h2&gt;
&lt;p&gt;With one GPU, the defaults work. The engine places the offloaded layers on GPU 0.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;specific-gpu-selection&#34;&gt;Specific GPU selection&lt;/h2&gt;
&lt;p&gt;On hosts with multiple GPUs, &lt;code&gt;MainGpu&lt;/code&gt; picks which one gets the entire model when split mode is &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Parameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_NONE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MainGpu&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;// use GPU index 1
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Use the &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; environment variable to constrain which GPUs the process sees — standard NVIDIA tooling.&lt;/p&gt;
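&lt;p&gt;The variable can also be set from managed code, as long as it happens before the native CUDA backend loads. A sketch using standard .NET APIs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

// Must run before AsposeLLMApi.Create loads the native CUDA backend.
// With &amp;#34;1&amp;#34;, only physical GPU 1 is visible and it becomes device index 0.
Environment.SetEnvironmentVariable(&amp;#34;CUDA_VISIBLE_DEVICES&amp;#34;, &amp;#34;1&amp;#34;);

using var api = AsposeLLMApi.Create(preset);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;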
&lt;h2 id=&#34;multi-gpu-split&#34;&gt;Multi-GPU split&lt;/h2&gt;
&lt;p&gt;Distribute the model across multiple GPUs.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_LAYER&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Split modes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLAMA_SPLIT_MODE_NONE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Single GPU only. Whole model on &lt;code&gt;MainGpu&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLAMA_SPLIT_MODE_LAYER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Split layers across GPUs. Good default for multi-GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LLAMA_SPLIT_MODE_ROW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Split rows — enables tensor parallelism where supported. Fastest on setups with high-bandwidth GPU interconnects (NVLink).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For unequal GPU memory sizes, set &lt;code&gt;TensorSplit&lt;/code&gt; to balance the load:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// 24 GB GPU + 12 GB GPU: 2:1 split.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TensorSplit&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;float&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.0f&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Values are normalized to sum to 1, so &lt;code&gt;{ 2.0f, 1.0f }&lt;/code&gt; places roughly two thirds of the layers on GPU 0 and one third on GPU 1.&lt;/p&gt;
&lt;h2 id=&#34;partial-offload-on-memory-tight-gpus&#34;&gt;Partial offload on memory-tight GPUs&lt;/h2&gt;
&lt;p&gt;If the model does not fit entirely in VRAM, offload the first N layers and leave the rest on CPU.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;28&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;// first 28 layers on GPU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Benchmark to find the right split. A good rule of thumb is to offload layers until VRAM usage sits about 1-2 GB short of full, because the KV cache also claims VRAM in proportion to the number of offloaded layers.&lt;/p&gt;
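&lt;p&gt;One way to benchmark is to time a fixed prompt at several &lt;code&gt;GpuLayers&lt;/code&gt; values. The sketch below is illustrative only: &lt;code&gt;api.Generate&lt;/code&gt; is a hypothetical stand-in for your actual inference call, which this page does not show.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;
using System.Diagnostics;
using Aspose.LLM.Abstractions.Acceleration;

foreach (var layers in new[] { 20, 24, 28, 32 })
{
    var p = new Qwen25Preset();
    p.BinaryManagerParameters.PreferredAcceleration = AccelerationType.CUDA;
    p.BaseModelInferenceParameters.GpuLayers = layers;

    using var api = AsposeLLMApi.Create(p);
    var sw = Stopwatch.StartNew();
    api.Generate(&amp;#34;Summarize this paragraph ...&amp;#34;); // hypothetical call; substitute your inference entry point
    Console.WriteLine($&amp;#34;{layers} layers: {sw.Elapsed.TotalSeconds:F1} s&amp;#34;);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;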
&lt;h2 id=&#34;memory-tips&#34;&gt;Memory tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;KV cache quantization&lt;/strong&gt; — set &lt;code&gt;ContextParameters.TypeV = GgmlType.Q8_0&lt;/code&gt; to halve V-cache memory with minor quality impact.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt; reduces memory at long contexts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shorter context&lt;/strong&gt; — drop &lt;code&gt;ContextParameters.ContextSize&lt;/code&gt; if you do not need the preset&amp;rsquo;s default length.&lt;/li&gt;
&lt;/ul&gt;
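&lt;p&gt;Applied together, the three tips look like this (the &lt;code&gt;ContextSize&lt;/code&gt; value is an example; defaults are preset-specific):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;// Quantize the V cache, enable flash attention, and trim the context window.
preset.ContextParameters.TypeV = GgmlType.Q8_0;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.ContextSize = 8192;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;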
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaErrorInsufficientDriver&lt;/code&gt; in logs&lt;/td&gt;
&lt;td&gt;Driver too old.&lt;/td&gt;
&lt;td&gt;Upgrade NVIDIA driver to 525+.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA binary downloaded but inference is on CPU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GpuLayers = 0&lt;/code&gt; or not set.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;GpuLayers = 999&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-memory on model load&lt;/td&gt;
&lt;td&gt;VRAM too small for full offload.&lt;/td&gt;
&lt;td&gt;Lower &lt;code&gt;GpuLayers&lt;/code&gt;, enable flash attention, quantize KV.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slower than expected on multi-GPU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SplitMode = None&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;Switch to &lt;code&gt;LAYER&lt;/code&gt; or &lt;code&gt;ROW&lt;/code&gt; depending on interconnect.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers&lt;/code&gt;, &lt;code&gt;SplitMode&lt;/code&gt;, &lt;code&gt;TensorSplit&lt;/code&gt;, &lt;code&gt;MainGpu&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; — cross-vendor GPU alternative.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Metal</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/</guid>
      <description>
        
        
        &lt;p&gt;Metal is Aspose.LLM for .NET&amp;rsquo;s preferred backend on Apple Silicon Macs (M1, M2, M3, M4). It uses the unified memory architecture — the CPU and GPU share the same physical RAM — so there is no separate VRAM budget to worry about.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Apple Silicon Mac (M-series chip). Intel Macs are not supported.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: macOS 11 (Big Sur) or later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No driver install&lt;/strong&gt;: Metal ships with macOS; nothing to configure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;select-metal&#34;&gt;Select Metal&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Metal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On first run, the SDK downloads the Metal variant of &lt;code&gt;llama.cpp&lt;/code&gt; binaries (typically 100-200 MB). Subsequent runs use the cache.&lt;/p&gt;
&lt;p&gt;On Apple Silicon with &lt;code&gt;PreferredAcceleration = null&lt;/code&gt;, auto-detection picks Metal by default — the explicit setting is redundant but harmless.&lt;/p&gt;
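&lt;p&gt;Relying on auto-detection therefore works just as well:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;var preset = new Qwen25Preset();
// null = auto-detect; on Apple Silicon this resolves to Metal.
preset.BinaryManagerParameters.PreferredAcceleration = null;
preset.BaseModelInferenceParameters.GpuLayers = 999;

using var api = AsposeLLMApi.Create(preset);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;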
&lt;h2 id=&#34;unified-memory&#34;&gt;Unified memory&lt;/h2&gt;
&lt;p&gt;Because RAM and GPU memory are the same physical chip, &lt;code&gt;GpuLayers = 999&lt;/code&gt; does not create a separate VRAM claim — it tells the Metal backend to run the layers on the GPU compute units. The total memory footprint is the same whether you run on CPU or Metal, but Metal is significantly faster for matrix math.&lt;/p&gt;
&lt;p&gt;You do not need to worry about &amp;ldquo;VRAM fits&amp;rdquo; calculations on Apple Silicon. If the model plus KV cache fit in system RAM, they fit for Metal too.&lt;/p&gt;
&lt;p&gt;Typical memory ceilings per Mac:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mac model&lt;/th&gt;
&lt;th&gt;Unified memory options&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M1 / M2 / M3 base&lt;/td&gt;
&lt;td&gt;8 GB, 16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M1 / M2 / M3 Pro&lt;/td&gt;
&lt;td&gt;16 GB, 32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M1 / M2 / M3 Max&lt;/td&gt;
&lt;td&gt;32 GB, 64 GB, 96 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M2 / M3 Ultra&lt;/td&gt;
&lt;td&gt;64 GB, 128 GB, 192 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M4 / M4 Pro / M4 Max&lt;/td&gt;
&lt;td&gt;16 GB up to 128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A 7B Q4_K_M model needs ~8-12 GB including KV cache — comfortable on any 16 GB Mac. A 70B Q4 needs 40+ GB and realistically wants an M2/M3 Ultra or M4 Max with 64 GB or more.&lt;/p&gt;
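&lt;p&gt;The &amp;ldquo;8-12 GB&amp;rdquo; figure can be sanity-checked with standard &lt;code&gt;llama.cpp&lt;/code&gt; sizing arithmetic. The sketch below assumes a typical 7-8B architecture (32 layers, 8 KV heads, head dimension 128, f16 KV cache); adjust the constants for your model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

// Weights: Q4_K_M averages roughly 0.6 bytes per parameter.
double weightsGb = 7e9 * 0.6 / 1e9;                  // ~4.2 GB

// KV cache: 2 (K and V) * layers * context * kvHeads * headDim * bytesPerElement.
double kvGb = 2.0 * 32 * 32768 * 8 * 128 * 2 / 1e9;  // ~4.3 GB at 32K context

Console.WriteLine($&amp;#34;~{weightsGb + kvGb:F1} GB total&amp;#34;); // ~8.5 GB, inside the 8-12 GB band
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;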
&lt;h2 id=&#34;single-chip-only&#34;&gt;Single-chip only&lt;/h2&gt;
&lt;p&gt;Multi-GPU is not applicable on Apple Silicon. &lt;code&gt;MainGpu&lt;/code&gt;, &lt;code&gt;SplitMode&lt;/code&gt;, and &lt;code&gt;TensorSplit&lt;/code&gt; have no effect — leave them at defaults.&lt;/p&gt;
&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full offload&lt;/strong&gt; — &lt;code&gt;GpuLayers = 999&lt;/code&gt; always. Partial offload is rarely useful because there is no separate VRAM budget.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt;. Metal implements flash attention kernels efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer smaller quantizations&lt;/strong&gt; — Apple Silicon&amp;rsquo;s memory bandwidth, not compute, is often the bottleneck. Q4 models run noticeably faster than Q8 on the same hardware for this reason.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;power-management&#34;&gt;Power management&lt;/h2&gt;
&lt;p&gt;Metal respects macOS power policy. On battery, macOS may throttle GPU clocks and reduce inference speed. Plug into AC for sustained throughput when benchmarking or running long jobs.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;ldquo;This backend is not supported&amp;rdquo; on Intel Mac&lt;/td&gt;
&lt;td&gt;Intel hardware.&lt;/td&gt;
&lt;td&gt;Use CPU acceleration instead; upgrade to Apple Silicon for GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slower than expected&lt;/td&gt;
&lt;td&gt;macOS throttling on battery.&lt;/td&gt;
&lt;td&gt;Connect to AC power; close GPU-heavy apps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-memory&lt;/td&gt;
&lt;td&gt;Combined model + KV cache exceeds system RAM.&lt;/td&gt;
&lt;td&gt;Smaller preset, lower context size, quantize KV cache.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers&lt;/code&gt; configuration.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — flash attention and KV cache settings.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Vulkan</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/</guid>
      <description>
        
        
        &lt;p&gt;Vulkan is a cross-vendor GPU backend that works on most modern discrete and integrated GPUs. It does not need CUDA or ROCm installed — only a recent graphics driver with Vulkan support. Use it when CUDA or HIP are unavailable, or when you want one codepath that works across NVIDIA, AMD, and Intel hardware.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: any GPU with Vulkan 1.2 or later support. This covers most discrete GPUs from 2018 onward, NVIDIA cards back to Maxwell, and integrated GPUs from Intel (Xe, UHD) and AMD (RDNA, Vega).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;NVIDIA: 470 or later.&lt;/li&gt;
&lt;li&gt;AMD: modern Adrenalin or Mesa RADV on Linux.&lt;/li&gt;
&lt;li&gt;Intel: current Arc / Xe driver.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Windows 10+ or Linux (glibc 2.28+). Not supported on macOS — Metal is the macOS equivalent.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Verify Vulkan support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Windows&lt;/strong&gt;: install the Vulkan SDK and run &lt;code&gt;vulkaninfoSDK.exe&lt;/code&gt;, or run any Vulkan demo app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linux&lt;/strong&gt;: &lt;code&gt;vulkaninfo&lt;/code&gt; from the &lt;code&gt;vulkan-tools&lt;/code&gt; package.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;select-vulkan&#34;&gt;Select Vulkan&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Vulkan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The SDK downloads the Vulkan variant (typically 200-400 MB) on first run.&lt;/p&gt;
&lt;h2 id=&#34;when-to-prefer-vulkan&#34;&gt;When to prefer Vulkan&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cross-vendor deployments&lt;/strong&gt; — the same binary works on NVIDIA, AMD, and Intel GPUs without swapping backends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrated GPUs&lt;/strong&gt; — Intel Xe and AMD integrated Radeon work via Vulkan with no vendor-specific runtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Containers without CUDA/ROCm&lt;/strong&gt; — Vulkan runs without NVIDIA Container Toolkit or ROCm Docker setup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Older hardware&lt;/strong&gt; — many GPUs that no longer get CUDA updates still have Vulkan drivers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-to-prefer-cuda-or-hip-instead&#34;&gt;When to prefer CUDA or HIP instead&lt;/h2&gt;
&lt;p&gt;For dedicated NVIDIA or AMD workloads, vendor-specific backends are usually faster:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/&#34;&gt;CUDA&lt;/a&gt; is typically 20-40 % faster than Vulkan on NVIDIA.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/&#34;&gt;HIP&lt;/a&gt; is faster than Vulkan on AMD for most workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use Vulkan when portability trumps raw speed.&lt;/p&gt;
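&lt;p&gt;A small helper can encode that trade-off, using only the platform facts from this page (Metal on macOS, Vulkan elsewhere):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System.Runtime.InteropServices;
using Aspose.LLM.Abstractions.Acceleration;

static AccelerationType PortableBackend()
{
    // Vulkan is not available on macOS; Metal is the equivalent there.
    if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX))
        return AccelerationType.Metal;
    return AccelerationType.Vulkan;
}

preset.BinaryManagerParameters.PreferredAcceleration = PortableBackend();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;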
&lt;h2 id=&#34;multi-gpu&#34;&gt;Multi-GPU&lt;/h2&gt;
&lt;p&gt;Vulkan supports multi-GPU setups via &lt;code&gt;SplitMode&lt;/code&gt; and &lt;code&gt;TensorSplit&lt;/code&gt; like CUDA, but driver support for multi-device Vulkan is less mature. Test on your specific hardware before committing — single-GPU Vulkan is well-trodden; multi-GPU Vulkan is hit-or-miss.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_LAYER&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Optional TensorSplit for unequal GPU sizes.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — supported on most Vulkan drivers; enable via &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrated GPU memory&lt;/strong&gt; — iGPUs use system RAM for GPU memory. Short contexts and heavy KV quantization help fit within the shared pool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver version matters&lt;/strong&gt; — recent Vulkan drivers have significant perf improvements for LLM inference. Keep drivers current.&lt;/li&gt;
&lt;/ul&gt;
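&lt;p&gt;For an integrated GPU drawing on the shared system pool, the tips above might combine like this (the context size is an example value):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.Vulkan;
preset.BaseModelInferenceParameters.GpuLayers = 999;
preset.ContextParameters.ContextSize = 4096;    // keep the shared pool small
preset.ContextParameters.TypeV = GgmlType.Q8_0; // heavy KV quantization
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;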
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vulkan binary downloaded but runs on CPU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GpuLayers = 0&lt;/code&gt;, or Vulkan driver missing.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;GpuLayers = 999&lt;/code&gt;; verify Vulkan with &lt;code&gt;vulkaninfo&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crash at model load&lt;/td&gt;
&lt;td&gt;Very old driver or unsupported GPU.&lt;/td&gt;
&lt;td&gt;Update driver; fall back to CPU if hardware is pre-Vulkan-1.2.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significantly slower than CUDA on NVIDIA&lt;/td&gt;
&lt;td&gt;Expected — use CUDA instead for best performance.&lt;/td&gt;
&lt;td&gt;Switch &lt;code&gt;PreferredAcceleration&lt;/code&gt; to &lt;code&gt;CUDA&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cuda/&#34;&gt;CUDA&lt;/a&gt; — faster on NVIDIA.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/&#34;&gt;HIP / ROCm&lt;/a&gt; — faster on AMD.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/&#34;&gt;CPU&lt;/a&gt; — last-resort fallback.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — GPU offload configuration.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: HIP / ROCm</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/hip-rocm/</guid>
      <description>
        
        
        &lt;p&gt;HIP (and its underlying ROCm stack) is the AMD-specific GPU backend for Aspose.LLM for .NET. It is Linux-only and targeted at AMD Instinct data-center GPUs and recent Radeon consumer cards.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: ROCm-supported AMD GPU.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instinct&lt;/strong&gt;: MI100, MI210, MI250, MI300 series.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Radeon&lt;/strong&gt;: RDNA 3 (RX 7900 series) and newer are officially supported. Some RDNA 2 cards (RX 6800/6900) work with &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; workarounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Linux with ROCm 6.x. Ubuntu 22.04 LTS and RHEL 9 are the commonly tested hosts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;: install the ROCm stack via the official AMD packages. Verify with &lt;code&gt;rocminfo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No Windows support&lt;/strong&gt;: AMD has Windows ROCm in preview, but Aspose.LLM&amp;rsquo;s HIP binaries currently target Linux only. On Windows with AMD GPUs, use &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Verify ROCm:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;rocminfo &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; grep &lt;span class=&#34;s2&#34;&gt;&amp;#34;Name:&amp;#34;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output should list your GPU by name (e.g., &lt;code&gt;gfx1100&lt;/code&gt; for RX 7900 XTX).&lt;/p&gt;
&lt;h2 id=&#34;select-hip&#34;&gt;Select HIP&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HIP&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The SDK downloads the HIP variant (typically 300-500 MB) on first run.&lt;/p&gt;
&lt;h2 id=&#34;multi-gpu&#34;&gt;Multi-GPU&lt;/h2&gt;
&lt;p&gt;HIP supports multi-GPU across AMD cards of the same ROCm generation.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SplitMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LlamaSplitMode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLAMA_SPLIT_MODE_LAYER&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Optionally tune TensorSplit per-GPU VRAM.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Use &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; to control which GPUs the process sees.&lt;/p&gt;
&lt;h2 id=&#34;rdna-2-workaround&#34;&gt;RDNA 2 workaround&lt;/h2&gt;
&lt;p&gt;Some RDNA 2 cards (e.g., RX 6800 XT, RX 6900 XT) are not officially supported by ROCm, but work with a GFX version override:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;HSA_OVERRIDE_GFX_VERSION&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;10.3.0
dotnet run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Set the variable before starting your process. The override makes ROCm treat the card as a supported variant. Output quality is unaffected; performance is slightly lower than on officially supported cards.&lt;/p&gt;
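&lt;p&gt;The override can also be applied in-process, provided it runs before the native HIP backend initializes. A sketch using standard .NET APIs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

// Must be set before AsposeLLMApi.Create initializes ROCm.
Environment.SetEnvironmentVariable(&amp;#34;HSA_OVERRIDE_GFX_VERSION&amp;#34;, &amp;#34;10.3.0&amp;#34;);

using var api = AsposeLLMApi.Create(preset);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;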
&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — supported on recent ROCm releases; enable via &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partial offload&lt;/strong&gt; — for consumer Radeon cards with limited VRAM (16-24 GB), tune &lt;code&gt;GpuLayers&lt;/code&gt; to slightly below full to leave room for KV cache.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV quantization&lt;/strong&gt; — aggressive &lt;code&gt;TypeV = GgmlType.Q8_0&lt;/code&gt; or even &lt;code&gt;Q4_0&lt;/code&gt; claws back meaningful VRAM on long contexts.&lt;/li&gt;
&lt;/ul&gt;
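&lt;p&gt;For a 24 GB consumer Radeon, the tips above might combine like this (the layer count is illustrative; benchmark on your card):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.HIP;
preset.BaseModelInferenceParameters.GpuLayers = 30;   // slightly below full; leaves VRAM for the KV cache
preset.ContextParameters.TypeV = GgmlType.Q8_0;       // or GgmlType.Q4_0 for very long contexts
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;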
&lt;h2 id=&#34;windows-amd-users-use-vulkan&#34;&gt;Windows AMD users: use Vulkan&lt;/h2&gt;
&lt;p&gt;Aspose.LLM does not ship Windows HIP binaries. If your AMD GPU is on Windows, use &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; — AMD&amp;rsquo;s Vulkan driver is excellent and gets you GPU acceleration without ROCm.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rocblas_status_internal_error&lt;/code&gt; on load&lt;/td&gt;
&lt;td&gt;Incompatible ROCm version.&lt;/td&gt;
&lt;td&gt;Match ROCm 6.x; upgrade if older.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsupported GPU at startup&lt;/td&gt;
&lt;td&gt;Card not on ROCm&amp;rsquo;s support list.&lt;/td&gt;
&lt;td&gt;Use Vulkan as fallback, or try &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference on CPU despite HIP binary&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GpuLayers = 0&lt;/code&gt; or ROCm runtime missing.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;GpuLayers = 999&lt;/code&gt;; verify with &lt;code&gt;rocminfo&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-GPU instability&lt;/td&gt;
&lt;td&gt;Mixing different RDNA generations.&lt;/td&gt;
&lt;td&gt;Stick to same-generation GPUs; try &lt;code&gt;LLAMA_SPLIT_MODE_LAYER&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/vulkan/&#34;&gt;Vulkan&lt;/a&gt; — Windows alternative for AMD.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/binary-manager/&#34;&gt;Binary manager parameters&lt;/a&gt; — &lt;code&gt;PreferredAcceleration&lt;/code&gt; selection.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — GPU offload and multi-GPU settings.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: CPU</title>
      <link>https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/developer-reference/acceleration/cpu/</guid>
      <description>
        
        
        &lt;p&gt;CPU is the fallback backend for Aspose.LLM for .NET when no supported GPU is available. It works on every supported platform and requires no special drivers or runtimes. Throughput is lower than GPU inference, but reasonable for small models (3B-7B parameters) on modern CPUs.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: any x64 CPU. ARM64 is supported on Linux and macOS (Apple Silicon). The SDK picks the optimal CPU variant automatically based on detected instruction sets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Windows 10+, Linux (glibc 2.28+), macOS 11+. Intel Macs use CPU; Apple Silicon Macs prefer &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/&#34;&gt;Metal&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cpu-variants&#34;&gt;CPU variants&lt;/h2&gt;
&lt;p&gt;The SDK downloads one of four CPU variants depending on your CPU&amp;rsquo;s instruction sets:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Requires&lt;/th&gt;
&lt;th&gt;Typical CPUs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVX512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AVX-512 instructions&lt;/td&gt;
&lt;td&gt;Intel Xeon Scalable (2nd gen and later), recent Intel Core (Cannon Lake and later), AMD Zen 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVX2&lt;/code&gt; (default fallback)&lt;/td&gt;
&lt;td&gt;AVX2 instructions&lt;/td&gt;
&lt;td&gt;Most CPUs from 2014 onwards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AVX instructions&lt;/td&gt;
&lt;td&gt;Sandy Bridge / Ivy Bridge era (2011-2013)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NoAVX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;x64 baseline&lt;/td&gt;
&lt;td&gt;Pre-2011 CPUs (very slow)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Auto-detection picks the highest available variant. To force a specific variant:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;Aspose.LLM.Abstractions.Acceleration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;AVX2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// or AVX512, AVX, NoAVX
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;CPU download size: typically 80-150 MB.&lt;/p&gt;
&lt;h2 id=&#34;thread-configuration&#34;&gt;Thread configuration&lt;/h2&gt;
&lt;p&gt;On CPU, threading is the primary lever for throughput. Two settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ContextParameters.NThreads&lt;/code&gt;&lt;/strong&gt; — threads for generation (token-by-token decode). Typically &lt;code&gt;ProcessorCount / 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ContextParameters.NThreadsBatch&lt;/code&gt;&lt;/strong&gt; — threads for prompt processing (initial tokenization and context fill). Typically the full &lt;code&gt;ProcessorCount&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreads&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NThreadsBatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When &lt;code&gt;NThreads&lt;/code&gt; is &lt;code&gt;null&lt;/code&gt;, the engine falls back to &lt;code&gt;EngineParameters.DefaultThreads&lt;/code&gt; (default: &lt;code&gt;ProcessorCount - 1&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&#34;why-different-thread-counts&#34;&gt;Why different thread counts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prompt processing&lt;/strong&gt; is embarrassingly parallel — more threads help.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generation&lt;/strong&gt; is sequential at the token level and bound by memory bandwidth. Beyond a certain point (often around 8-12 threads), adding threads does not help and sometimes slows generation due to NUMA effects or cache contention.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Benchmark on your specific hardware for the best numbers. Start with &lt;code&gt;NThreads = ProcessorCount / 2&lt;/code&gt; and &lt;code&gt;NThreadsBatch = ProcessorCount&lt;/code&gt;, then tune.&lt;/p&gt;
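&lt;p&gt;That starting point, expressed with &lt;code&gt;Environment.ProcessorCount&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System;

int cores = Environment.ProcessorCount;
preset.ContextParameters.NThreads = cores / 2;  // generation: memory-bandwidth bound
preset.ContextParameters.NThreadsBatch = cores; // prompt processing: embarrassingly parallel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;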
&lt;h2 id=&#34;performance-expectations&#34;&gt;Performance expectations&lt;/h2&gt;
&lt;p&gt;Rough tokens-per-second for a 7B Q4_K_M model on CPU only:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right&#34;&gt;CPU class&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Approximate t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Old x64 (2015)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Modern Core i5 / Ryzen 5&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;5-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;Modern Core i7 / Ryzen 7 / Xeon&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;High-end Core i9 / Threadripper / Xeon Platinum&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;12-25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These are ballpark figures; real-world numbers depend on clock speed, memory bandwidth, AVX level, and how busy the CPU is with other work.&lt;/p&gt;
&lt;p&gt;For sustained CPU inference, expect 5-15 t/s on mainstream desktop CPUs. That is acceptable for chat UIs and batch jobs, but slow for real-time applications.&lt;/p&gt;
&lt;h2 id=&#34;memory&#34;&gt;Memory&lt;/h2&gt;
&lt;p&gt;Model weights plus KV cache live in system RAM. Typical memory for a 7B Q4_K_M model with 32K context: 8-12 GB. See &lt;a href=&#34;https://docs.aspose.com/llm/net/system-requirements/#memory&#34;&gt;System requirements&lt;/a&gt; for per-preset estimates.&lt;/p&gt;
&lt;h2 id=&#34;performance-tips&#34;&gt;Performance tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use AVX512 when available&lt;/strong&gt; — on Zen 4 or recent Xeon, AVX512 is 20-40 % faster than AVX2 for the same model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flash Attention&lt;/strong&gt; — &lt;code&gt;ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled&lt;/code&gt; helps on long contexts even on CPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize the KV cache&lt;/strong&gt; — &lt;code&gt;TypeV = GgmlType.Q8_0&lt;/code&gt; halves V-cache memory with minimal quality impact.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smaller models first&lt;/strong&gt; — 3B models (Llama 3.2, Phi 4 mini) run 2-3× faster than 7B models on the same CPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disable other CPU-heavy work&lt;/strong&gt; during inference. CPU throughput is sensitive to contention.&lt;/li&gt;
&lt;/ul&gt;
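&lt;p&gt;The first tip can be made explicit with .NET&amp;rsquo;s hardware-intrinsics flags (&lt;code&gt;Avx512F.IsSupported&lt;/code&gt; requires .NET 8 or later):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;using System.Runtime.Intrinsics.X86;
using Aspose.LLM.Abstractions.Acceleration;

// Pick the best CPU variant explicitly instead of trusting auto-detection.
if (Avx512F.IsSupported)
    preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX512;
else if (Avx2.IsSupported)
    preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX2;
preset.BaseModelInferenceParameters.GpuLayers = 0; // CPU-only
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;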
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Very slow inference&lt;/td&gt;
&lt;td&gt;Using &lt;code&gt;NoAVX&lt;/code&gt; or &lt;code&gt;AVX&lt;/code&gt; variant on a modern CPU.&lt;/td&gt;
&lt;td&gt;Verify &lt;code&gt;PreferredAcceleration&lt;/code&gt; auto-detection; force &lt;code&gt;AVX2&lt;/code&gt; or &lt;code&gt;AVX512&lt;/code&gt; explicitly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High CPU usage, low throughput&lt;/td&gt;
&lt;td&gt;Too many threads — contention.&lt;/td&gt;
&lt;td&gt;Reduce &lt;code&gt;NThreads&lt;/code&gt;; try &lt;code&gt;NThreads = ProcessorCount / 2&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stalls between tokens&lt;/td&gt;
&lt;td&gt;Memory pressure or page thrashing.&lt;/td&gt;
&lt;td&gt;Check memory; set &lt;code&gt;UseMemoryMapping = true&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crashes on &lt;code&gt;AVX512&lt;/code&gt; variant&lt;/td&gt;
&lt;td&gt;CPU advertises AVX-512 but its implementation is buggy or incomplete.&lt;/td&gt;
&lt;td&gt;Drop to &lt;code&gt;AVX2&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-inference/&#34;&gt;Model inference parameters&lt;/a&gt; — &lt;code&gt;GpuLayers = 0&lt;/code&gt; pattern for CPU-only.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — &lt;code&gt;NThreads&lt;/code&gt; and &lt;code&gt;NThreadsBatch&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/acceleration/metal/&#34;&gt;Metal&lt;/a&gt; — macOS-preferred alternative on Apple Silicon.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
