<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation – How-to recipes</title>
    <link>https://docs.aspose.com/llm/net/how-to/</link>
    <description>Recent content in How-to recipes on Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://docs.aspose.com/llm/net/how-to/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Net: Select a model by task</title>
      <link>https://docs.aspose.com/llm/net/how-to/select-model-by-task/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/select-model-by-task/</guid>
      <description>
        
        
        &lt;p&gt;Match a built-in preset to the task. Start small; move up only if output quality does not meet your bar.&lt;/p&gt;
&lt;h2 id=&#34;quick-picker&#34;&gt;Quick picker&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your task&lt;/th&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General chat, mid-complexity tasks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (7B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latest general-purpose model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen3Preset&lt;/code&gt; (8B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small footprint, fast, long context&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Llama32Preset&lt;/code&gt; (3B, 131K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smallest possible model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Phi4Preset&lt;/code&gt; (mini)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding tasks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepSeekCoder2Preset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step-by-step reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepseekR1Qwen3Preset&lt;/code&gt; or &lt;code&gt;Oss20Preset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest model, strongest reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; (20B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding, small&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen3VL2BPreset&lt;/code&gt; (2B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding, mid&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen25VL3BPreset&lt;/code&gt; (3B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-heavy images (OCR-style)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Gemma3VisionPreset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strongest vision reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ministral3VisionPreset&lt;/code&gt; (8B)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;decision-tree&#34;&gt;Decision tree&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Do you need vision (image input)?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes → pick a vision preset based on size and image type.&lt;/li&gt;
&lt;li&gt;No → continue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Is the task coding?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes → &lt;code&gt;DeepSeekCoder2Preset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;No → continue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Does the task require explicit step-by-step reasoning?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes → &lt;code&gt;DeepseekR1Qwen3Preset&lt;/code&gt; or &lt;code&gt;Oss20Preset&lt;/code&gt; (budget 1024-2048 &lt;code&gt;MaxTokens&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;No → continue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How much memory do you have?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4-8 GB → &lt;code&gt;Llama32Preset&lt;/code&gt; or &lt;code&gt;Phi4Preset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;12-16 GB → &lt;code&gt;Qwen25Preset&lt;/code&gt; or &lt;code&gt;Qwen3Preset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;24+ GB → any preset; &lt;code&gt;Oss20Preset&lt;/code&gt; for best quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
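&lt;p&gt;Walking the tree usually ends in one or two lines of code. A minimal sketch (preset names and the &lt;code&gt;ChatParameters.MaxTokens&lt;/code&gt; property as listed on this page):&lt;/p&gt;

```csharp
// Step 1: no vision needed. Step 2: the task is coding → DeepSeekCoder2Preset.
var codingPreset = new DeepSeekCoder2Preset();

// Step 3: for a step-by-step reasoning task instead, pick a reasoning preset
// and budget extra output tokens, as suggested above.
var reasoningPreset = new DeepseekR1Qwen3Preset();
reasoningPreset.ChatParameters.MaxTokens = 2048;
```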
&lt;h2 id=&#34;after-you-pick&#34;&gt;After you pick&lt;/h2&gt;
&lt;p&gt;Override the default values where they do not fit your scenario. See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/presets/customizing/&#34;&gt;Customizing presets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If none of the built-ins fit, &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/bring-your-own-gguf/&#34;&gt;bring your own GGUF&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt; — catalog with Hugging Face sources.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/presets/using-built-in/&#34;&gt;Using built-in presets&lt;/a&gt; — full picker guidance.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/custom-preset/&#34;&gt;Custom preset&lt;/a&gt; — patterns for tuning.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Understand quantization</title>
      <link>https://docs.aspose.com/llm/net/how-to/understand-quantization/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/understand-quantization/</guid>
      <description>
        
        
        &lt;p&gt;Quantization reduces the precision of model weights from the full-precision training format (usually F16 or BF16) to fewer bits per value. Smaller weights mean smaller files, less memory, and faster inference — at some cost in output quality.&lt;/p&gt;
&lt;h2 id=&#34;the-basic-trade-off&#34;&gt;The basic trade-off&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;File size (vs F16)&lt;/th&gt;
&lt;th&gt;Quality loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F32 (32-bit float)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2.0×&lt;/td&gt;
&lt;td&gt;None (reference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F16 (16-bit float)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.0×&lt;/td&gt;
&lt;td&gt;Essentially none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BF16 (brain float 16)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.0×&lt;/td&gt;
&lt;td&gt;Essentially none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0 (8-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.5×&lt;/td&gt;
&lt;td&gt;Very small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q6_K (6-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.38×&lt;/td&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q5_K_M (5-bit medium)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.33×&lt;/td&gt;
&lt;td&gt;Small-to-moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M (4-bit medium)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.27×&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_0 (4-bit classic)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.25×&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K (3-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.22×&lt;/td&gt;
&lt;td&gt;Noticeable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2_K (2-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.18×&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IQ4_XS / IQ3_S (importance quant.)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.23-0.30×&lt;/td&gt;
&lt;td&gt;Lower than Q variants of similar size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IQ2_XXS (very aggressive)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.15×&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Values are approximate; actual size depends on model architecture and specific quantizer.&lt;/p&gt;
&lt;h2 id=&#34;popular-picks&#34;&gt;Popular picks&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Q4_K_M&lt;/strong&gt; — the default for most community-uploaded GGUFs. Good balance for 7B+ models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Q5_K_M&lt;/strong&gt; — slightly bigger, slightly better quality. Worth it when you have memory headroom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Q8_0&lt;/strong&gt; — near-lossless. Use when you want the best quality that is not F16 and have 2× the memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IQ4_XS&lt;/strong&gt; — aggressive importance quantization; often better quality-per-byte than Q4_0 for the same size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;F16&lt;/strong&gt; — full precision. Useful for reproducibility or benchmarks; rarely worth the memory otherwise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-to-pick-for-your-preset&#34;&gt;How to pick for your preset&lt;/h2&gt;
&lt;p&gt;Built-in presets already choose a quantization. To change it, override &lt;code&gt;BaseModelSourceParameters.HuggingFaceFileName&lt;/code&gt; to a different file from the same repo:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Default is Qwen2.5-7B-Instruct-Q4_K_M.gguf; switch to Q8_0:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelSourceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HuggingFaceFileName&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Qwen2.5-7B-Instruct-Q8_0.gguf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm the file exists in the repository (check the Hugging Face page).&lt;/p&gt;
&lt;h2 id=&#34;memory-rough-estimate&#34;&gt;Rough memory estimate&lt;/h2&gt;
&lt;p&gt;For a model with N parameters:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Bytes per parameter&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;7B model&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;70B model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2.0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~14 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~140 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~7 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~70 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.625&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~4.4 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~44 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.5&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.5 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~35 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.5&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.5 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~35 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IQ4_XS&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.45&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.2 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.375&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~2.6 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~26 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Add KV cache and intermediate buffers on top — see &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt;.&lt;/p&gt;
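&lt;p&gt;The table reduces to a one-line formula: weight memory in GB ≈ parameters (in billions) × bytes per parameter, since 10&lt;sup&gt;9&lt;/sup&gt; parameters at B bytes each is B GB. A quick sketch (the helper name is illustrative):&lt;/p&gt;

```csharp
// Rough weight-only memory estimate, using the bytes-per-parameter
// values from the table above. Billions of parameters × bytes per
// parameter gives gigabytes directly.
static double EstimateWeightGb(double paramsInBillions, double bytesPerParam) =>
    paramsInBillions * bytesPerParam;

// 7B at Q4_K_M (0.5 bytes/param) → 3.5 GB, matching the table.
Console.WriteLine(EstimateWeightGb(7, 0.5));
```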
&lt;h2 id=&#34;when-to-pick-a-smaller-quantization&#34;&gt;When to pick a smaller quantization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Memory is the binding constraint: the model does not fit at a higher-precision quantization but fits at a lower one.&lt;/li&gt;
&lt;li&gt;You are running many models and want to pack several into one machine.&lt;/li&gt;
&lt;li&gt;You accept some quality loss for speed — smaller quantizations run slightly faster due to reduced memory bandwidth pressure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-to-avoid-aggressive-quantization&#34;&gt;When to avoid aggressive quantization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tasks sensitive to precise output (code generation, math, legal/medical reasoning).&lt;/li&gt;
&lt;li&gt;Long reasoning chains where errors compound.&lt;/li&gt;
&lt;li&gt;When you have the memory — use Q5_K_M or Q8_0 when you can.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;kv-cache-quantization&#34;&gt;KV cache quantization&lt;/h2&gt;
&lt;p&gt;Separate from model weight quantization, you can quantize the KV cache at runtime via &lt;code&gt;ContextParameters.TypeK&lt;/code&gt; and &lt;code&gt;TypeV&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q8_0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This saves memory on long contexts with minor quality impact. See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; for the full enum.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-source/&#34;&gt;Model source parameters&lt;/a&gt; — how to select a specific GGUF file.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt; — factor quantization into your sizing.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt; — built-in preset quantizations.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Tune for speed vs quality</title>
      <link>https://docs.aspose.com/llm/net/how-to/tune-for-speed-vs-quality/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/tune-for-speed-vs-quality/</guid>
      <description>
        
        
        &lt;p&gt;Several knobs move a preset along the speed-quality curve. This how-to summarizes them with concrete recommendations.&lt;/p&gt;
&lt;h2 id=&#34;speed-biased-configuration&#34;&gt;Speed-biased configuration&lt;/h2&gt;
&lt;p&gt;When throughput matters most — bulk processing, real-time chat, short answers.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Llama32Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 3B model
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q8_0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Temperature&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.3f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TopP&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.9f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TopK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MaxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;256&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected throughput: 50-100 tokens/sec on a mid-range GPU, 15-30 on a modern CPU.&lt;/p&gt;
&lt;h2 id=&#34;quality-biased-configuration&#34;&gt;Quality-biased configuration&lt;/h2&gt;
&lt;p&gt;When the best possible output matters — deep analysis, complex reasoning, long-form writing.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Oss20Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 20B model
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;32768&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Temperature&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.7f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TopP&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.95f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MinP&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.05f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;RepetitionPenalty&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.05f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MaxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2048&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected throughput: 10-30 tokens/sec on a high-end GPU. CPU not recommended.&lt;/p&gt;
&lt;h2 id=&#34;balanced-configuration&#34;&gt;Balanced configuration&lt;/h2&gt;
&lt;p&gt;The default for most scenarios.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 7B model
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;// All other settings at preset defaults.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected throughput: 30-60 tokens/sec on a mid-range GPU.&lt;/p&gt;
&lt;h2 id=&#34;knobs-cheat-sheet&#34;&gt;Knobs cheat sheet&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Knob&lt;/th&gt;
&lt;th&gt;Faster&lt;/th&gt;
&lt;th&gt;Better quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model size&lt;/td&gt;
&lt;td&gt;Smaller (&lt;code&gt;Phi4Preset&lt;/code&gt;, &lt;code&gt;Llama32Preset&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Larger (&lt;code&gt;Oss20Preset&lt;/code&gt;, &lt;code&gt;Qwen3Preset&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;Q4_K_M, Q4_0&lt;/td&gt;
&lt;td&gt;Q8_0, F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ContextSize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shorter&lt;/td&gt;
&lt;td&gt;Longer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FlashAttentionMode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Enabled&lt;/code&gt; (helps both)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TypeV&lt;/code&gt; (KV cache)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;, &lt;code&gt;Q4_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;F16&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TypeK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;F16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;F16&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0-0.3&lt;/code&gt; (more deterministic)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7-0.9&lt;/code&gt; (more creative, nuanced)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TopP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.8-0.9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.9-0.95&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TopK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;20&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;40&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RepetitionPenalty&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.05&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MaxTokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GpuLayers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;999&lt;/code&gt; (full offload)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;999&lt;/code&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;measure-do-not-guess&#34;&gt;Measure, do not guess&lt;/h2&gt;
&lt;p&gt;Throughput is hardware-specific. Benchmark on your actual target machine with realistic prompts before committing to a configuration.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Tokens: ~{reply.Split(&amp;#39; &amp;#39;).Length}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Time: {sw.Elapsed.TotalSeconds:F2}s&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Rate: ~{reply.Split(&amp;#39; &amp;#39;).Length / sw.Elapsed.TotalSeconds:F1} tok/s&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Word count is only an approximation — real token counts for English text are usually 1.3-1.5× the word count.&lt;/p&gt;
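If you want the estimate itself to account for that ratio, you can scale the word count. A minimal sketch (the 1.4 multiplier is an assumed midpoint of the 1.3-1.5 range, not an SDK constant):

```csharp
using System;

// Rough token estimate for English text: whitespace word count scaled by
// an assumed 1.4 tokens-per-word midpoint. Not an exact tokenizer count.
static int EstimateTokens(string text)
{
    int words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
    return (int)Math.Round(words * 1.4);
}
```

For exact counts you would need the model's own tokenizer; treat this as a logging heuristic only.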
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/sampler/&#34;&gt;Sampler parameters&lt;/a&gt; — fine-grained sampler control.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — context, flash attention, KV cache.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/understand-quantization/&#34;&gt;Understand quantization&lt;/a&gt; — how quantization affects throughput.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Handle cancellation</title>
      <link>https://docs.aspose.com/llm/net/how-to/handle-cancellation/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/handle-cancellation/</guid>
      <description>
        
        
        &lt;p&gt;Both &lt;code&gt;SendMessageAsync&lt;/code&gt; and &lt;code&gt;SendMessageToSessionAsync&lt;/code&gt; accept a &lt;code&gt;CancellationToken&lt;/code&gt;. Cancelling it stops token generation promptly. The session state remains intact — you can continue the conversation with the next call.&lt;/p&gt;
&lt;h2 id=&#34;cancel-on-timeout&#34;&gt;Cancel on timeout&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TimeSpan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FromSeconds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;30&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;));&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
        &lt;span class=&#34;s&#34;&gt;&amp;#34;Write a 500-word essay about migration patterns of the Arctic tern.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Generation timed out.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;cancel-on-user-action-ctrlc&#34;&gt;Cancel on user action (Ctrl+C)&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CancelKeyPress&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;+=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&amp;gt;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Cancel&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;   &lt;span class=&#34;c1&#34;&gt;// prevent default process termination
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Cancel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;};&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Cancelled by user.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;cancel-from-an-http-host&#34;&gt;Cancel from an HTTP host&lt;/h2&gt;
&lt;p&gt;ASP.NET Core passes a request cancellation token to endpoint handlers. Forward it to the SDK:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;app&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MapPost&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;/chat&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;async&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatRequest&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;req&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Engine&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationToken&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&amp;gt;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;req&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SessionId&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;??&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;InitiateNewSession&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GetChatSessionResponse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;req&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Message&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;null&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Ok&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StatusCode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;499&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// client closed request
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When the client disconnects, &lt;code&gt;ct&lt;/code&gt; fires; the SDK stops generating.&lt;/p&gt;
&lt;h2 id=&#34;session-state-after-cancellation&#34;&gt;Session state after cancellation&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The partial output is &lt;strong&gt;discarded&lt;/strong&gt; — the user&amp;rsquo;s message goes into the history, but no assistant message is recorded.&lt;/li&gt;
&lt;li&gt;The session remains alive and can accept a new message immediately.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageToSessionAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Long prompt...&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;// Session is still usable.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;// Continue without issue:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageToSessionAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Short follow-up.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;combine-timeout-with-linked-tokens&#34;&gt;Combine timeout with linked tokens&lt;/h2&gt;
&lt;p&gt;For a handler that has both a user-cancellation token and a timeout:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;timeoutCts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TimeSpan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FromSeconds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;60&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;));&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linkedCts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CreateLinkedTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;externalCancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;timeoutCts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linkedCts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Either source cancels the operation. Inspect which one fired if your UX needs to tell them apart:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;when&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;timeoutCts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IsCancellationRequested&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;// Timed out.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;// User or external cancellation.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;what-cancellation-does-not-cover&#34;&gt;What cancellation does not cover&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model load&lt;/strong&gt; — the synchronous model-load inside &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt; is not interruptible via &lt;code&gt;CancellationToken&lt;/code&gt;. Budget for the cold start; do not try to cancel it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Binary download&lt;/strong&gt; — same. The first-run binary deployment runs during &lt;code&gt;Create&lt;/code&gt; and is synchronous.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For application-level time limits on startup, wrap &lt;code&gt;Create&lt;/code&gt; in a &lt;code&gt;Task.Run&lt;/code&gt; with an external watchdog — but be aware that even if you stop waiting on the task, the background work continues until it completes or the process terminates.&lt;/p&gt;
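That watchdog can be sketched like this (using the `AsposeLLMApi.Create` and `Qwen25Preset` names from the other recipes; the 120-second budget is an arbitrary assumption):

```csharp
using System;
using System.Threading.Tasks;

// Watchdog around the non-cancellable Create call. If the timeout wins,
// we only stop waiting; the load keeps running in the background until
// it finishes or the process exits.
var loadTask = Task.Run(() => AsposeLLMApi.Create(new Qwen25Preset()));
var winner = await Task.WhenAny(loadTask, Task.Delay(TimeSpan.FromSeconds(120)));

if (winner != loadTask)
    throw new TimeoutException("Model load exceeded the startup budget.");

var api = await loadTask; // already completed; propagates load failures
```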
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/asposellmapi/&#34;&gt;AsposeLLMApi facade&lt;/a&gt; — method signatures including &lt;code&gt;CancellationToken&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/integration-with-aspnet-core/&#34;&gt;Integration with ASP.NET Core&lt;/a&gt; — cancellation in HTTP hosts.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/chat-sessions/&#34;&gt;Chat sessions&lt;/a&gt; — session lifecycle around cancellation.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Reduce first-token latency</title>
      <link>https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/</guid>
      <description>
        
        
        &lt;p&gt;&amp;ldquo;First-token latency&amp;rdquo; is the time between sending a message and starting to see output. It has two components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cold-start&lt;/strong&gt; — binary download, model load, session creation. Happens once per &lt;code&gt;AsposeLLMApi&lt;/code&gt; instance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-message&lt;/strong&gt; — prompt tokenization, KV cache prefill, first-token generation.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Both can be reduced.&lt;/p&gt;
&lt;h2 id=&#34;warm-up-at-startup&#34;&gt;Warm up at startup&lt;/h2&gt;
&lt;p&gt;The first &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt; call is slow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First ever: downloads binaries + model (100-500 MB + 2-15 GB). Several minutes.&lt;/li&gt;
&lt;li&gt;Cached: model load only. 5-30 seconds.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do it during application startup, not on the first user request.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// At application startup:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Aspose&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLM&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;License&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EngineParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EnableDebugLogging&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;false&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// slow the first time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In ASP.NET Core, warm up from &lt;code&gt;ApplicationStarted&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;app&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Lifetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ApplicationStarted&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Register&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(()&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&amp;gt;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;app&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Services&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GetRequiredService&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;&amp;gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// triggers model load
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In a Worker Service, do it inside &lt;code&gt;ExecuteAsync&lt;/code&gt; before entering the main loop.&lt;/p&gt;
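A sketch of that Worker Service shape (the `ChatWorker` class and its work loop are hypothetical; `AsposeLLMApi.Create` and `Qwen25Preset` are the SDK names used above):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Pays the cold-start cost once, before the worker serves any work items.
public sealed class ChatWorker : BackgroundService
{
    private AsposeLLMApi? _api;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Warm up off the host's startup path; Create itself is synchronous.
        _api = await Task.Run(() => AsposeLLMApi.Create(new Qwen25Preset()), stoppingToken);

        while (!stoppingToken.IsCancellationRequested)
        {
            // ... dequeue a prompt and call _api.SendMessageAsync(...) here ...
            await Task.Delay(1000, stoppingToken);
        }
    }
}
```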
&lt;h2 id=&#34;pre-create-a-session&#34;&gt;Pre-create a session&lt;/h2&gt;
&lt;p&gt;Starting a session takes tens to hundreds of milliseconds. For a chat server that processes user requests, create a session before the first request arrives:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;warmupSessionId&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNewChatAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageToSessionAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;warmupSessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;ping&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Keep this session; the engine is now fully warm.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;shorten-system-prompts&#34;&gt;Shorten system prompts&lt;/h2&gt;
&lt;p&gt;Every new session tokenizes and evaluates the system prompt before the first user turn. A 500-token system prompt costs hundreds of milliseconds on CPU, tens on GPU. Keep system prompts short.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// 50 tokens — fast first-turn.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SystemPrompt&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;You are a concise assistant. Answer briefly.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;// 500 tokens of preamble — slow.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;// preset.ChatParameters.SystemPrompt = &amp;#34;&amp;lt;long preamble with many instructions and examples&amp;gt;&amp;#34;;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you need extensive priming, use &lt;code&gt;ChatParameters.History&lt;/code&gt; with a few-shot example set — the examples are tokenized once per session creation and cached across turns.&lt;/p&gt;
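A hedged sketch of the few-shot history approach (the `ChatMessage` role/content shape is assumed for illustration; check the SDK reference for the concrete `History` element type):

```csharp
using System.Collections.Generic;

// Short system prompt plus few-shot turns in History: the examples are
// prefetched once at session creation instead of re-sent every turn.
// ChatMessage(role, content) is an assumed shape, not a confirmed SDK type.
preset.ChatParameters.SystemPrompt = "You are a concise assistant. Answer briefly.";
preset.ChatParameters.History = new List<ChatMessage>
{
    new("user", "Summarize: The cat sat on the mat all afternoon."),
    new("assistant", "A cat spent the afternoon on a mat."),
};
```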
&lt;h2 id=&#34;size-nbatch-correctly&#34;&gt;Size &lt;code&gt;NBatch&lt;/code&gt; correctly&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;ContextParameters.NBatch&lt;/code&gt; controls how many tokens the engine processes per &lt;code&gt;llama_decode&lt;/code&gt; call during prompt ingestion. Larger &lt;code&gt;NBatch&lt;/code&gt; is faster &lt;strong&gt;as long as it fits in memory&lt;/strong&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// Built-in presets typically set NBatch = 2048. For prompt-heavy scenarios:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NBatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NUbatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Too large and you exceed your VRAM budget; too small and prompt prefill slows down. Tune the value on your hardware.&lt;/p&gt;
&lt;h2 id=&#34;enable-flash-attention&#34;&gt;Enable flash attention&lt;/h2&gt;
&lt;p&gt;Flash attention dramatically reduces prefill time on long prompts:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Enable it whenever your hardware and backend support it.&lt;/p&gt;
&lt;h2 id=&#34;keep-sessions-alive&#34;&gt;Keep sessions alive&lt;/h2&gt;
&lt;p&gt;Reuse sessions across requests instead of creating a fresh one each time. Session creation costs prefill time; reusing amortizes it across turns.&lt;/p&gt;
&lt;p&gt;In HTTP hosts, map user IDs to session IDs — see &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/multiple-concurrent-sessions/&#34;&gt;Multiple concurrent sessions&lt;/a&gt;.&lt;/p&gt;
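One way to keep that mapping is a concurrent dictionary keyed by user ID (a sketch; `StartNewChatAsync` appears in the other recipes, the surrounding glue is hypothetical):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// One warm session per user, reused across HTTP requests.
static readonly ConcurrentDictionary<string, string> Sessions = new();

static async Task<string> GetOrCreateSessionAsync(AsposeLLMApi api, string userId)
{
    if (Sessions.TryGetValue(userId, out var existing))
        return existing;

    string created = await api.StartNewChatAsync();
    // If two requests raced, GetOrAdd keeps the first stored session.
    return Sessions.GetOrAdd(userId, created);
}
```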
&lt;h2 id=&#34;pre-populate-binary-and-model-caches&#34;&gt;Pre-populate binary and model caches&lt;/h2&gt;
&lt;p&gt;In offline or container deployments, download binaries and models on the build machine; ship them with your image. The runtime host skips the multi-minute initial download.&lt;/p&gt;
&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;measure&#34;&gt;Measure&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;totalSw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Create: {totalSw.Elapsed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;firstSw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Say hello.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;First message: {firstSw.Elapsed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;secondSw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply2&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Say hello again.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Second message: {secondSw.Elapsed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The second message is noticeably faster than the first because the session is already warm.&lt;/p&gt;
&lt;h2 id=&#34;typical-numbers-modern-gpu&#34;&gt;Typical numbers (modern GPU)&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Create&lt;/code&gt; (cold, first ever)&lt;/td&gt;
&lt;td&gt;2-10 min (download + load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Create&lt;/code&gt; (cached)&lt;/td&gt;
&lt;td&gt;5-30 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;StartNewChatAsync&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50-200 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First token after a 100-token prompt&lt;/td&gt;
&lt;td&gt;200-500 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subsequent tokens&lt;/td&gt;
&lt;td&gt;10-20 ms each&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;CPU numbers are roughly 5-10× higher for each stage.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/architecture/&#34;&gt;Architecture&lt;/a&gt; — what happens during &lt;code&gt;Create&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — &lt;code&gt;NBatch&lt;/code&gt;, flash attention.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt; — skip the initial download at runtime.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Estimate memory requirements</title>
      <link>https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/</guid>
      <description>
        
        
        &lt;p&gt;Four things claim memory when the SDK runs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model weights.&lt;/li&gt;
&lt;li&gt;KV cache.&lt;/li&gt;
&lt;li&gt;Vision projector (vision presets only).&lt;/li&gt;
&lt;li&gt;Intermediate buffers and sampler state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This how-to helps you predict the total before deployment.&lt;/p&gt;
&lt;h2 id=&#34;rule-of-thumb-sizes&#34;&gt;Rule-of-thumb sizes&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Approximate size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model weights&lt;/td&gt;
&lt;td&gt;&lt;code&gt;parameters × bytes_per_parameter&lt;/code&gt;. For a 7B Q4_K_M model, ~3.5 GB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;&lt;code&gt;layers × kv_heads × head_dim × context × 2 × bytes_per_kv&lt;/code&gt;. Actual numbers below.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision projector&lt;/td&gt;
&lt;td&gt;200 MB – 2 GB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermediate buffers&lt;/td&gt;
&lt;td&gt;50 MB – 500 MB.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;step-1-weights-from-quantization&#34;&gt;Step 1. Weights from quantization&lt;/h2&gt;
&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/understand-quantization/&#34;&gt;Understand quantization&lt;/a&gt; for the per-parameter bytes table.&lt;/p&gt;
&lt;p&gt;Roughly: &lt;code&gt;weights_bytes ≈ parameters × bytes_per_param&lt;/code&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Parameters&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Q4_K_M&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Q8_0&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;F16&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;3B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~1.8 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.2 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;7B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.5 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~7 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~14 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~4 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~8 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;20B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~11 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~21 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~40 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;70B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~35 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~70 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~140 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
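The rule is easy to sanity-check in code. A throwaway helper (not part of the SDK) using the rough per-parameter bytes behind the table: about 0.5 for Q4_K_M, about 1.0 for Q8_0, and 2.0 for F16:

```csharp
using System;

// Hypothetical helper implementing the rule above. One billion parameters at
// N bytes each is N GB, so billions times bytes-per-param gives GB directly.
double WeightsGb(double paramsBillions, double bytesPerParam) =>
    paramsBillions * bytesPerParam;

Console.WriteLine(WeightsGb(7, 0.5)); // 3.5, the 7B Q4_K_M row
Console.WriteLine(WeightsGb(8, 2.0)); // 16, the 8B F16 row
```

Use the exact bytes-per-parameter figures from the quantization page when you need a tighter estimate.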
&lt;h2 id=&#34;step-2-kv-cache&#34;&gt;Step 2. KV cache&lt;/h2&gt;
&lt;p&gt;KV size depends on model architecture (number of layers, KV heads, head dimension), context size, and KV dtype. The underlying formula is complex; below are empirical numbers for common presets at their default contexts:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;KV at default context (F16)&lt;/th&gt;
&lt;th&gt;KV at default context (Q8_0 V)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Llama32Preset&lt;/code&gt; (131K)&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;td&gt;~5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (32K)&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;~1.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen3Preset&lt;/code&gt; (32K)&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;~1.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DeepseekR1Qwen3Preset&lt;/code&gt; (131K)&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; (131K)&lt;/td&gt;
&lt;td&gt;~10 GB&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Phi4Preset&lt;/code&gt; (16K)&lt;/td&gt;
&lt;td&gt;~0.8 GB&lt;/td&gt;
&lt;td&gt;~0.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen3VL2BPreset&lt;/code&gt; (262K)&lt;/td&gt;
&lt;td&gt;~12 GB&lt;/td&gt;
&lt;td&gt;~7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;KV usage scales roughly linearly with actual session length: a 32K-capable preset holding only 4K of context uses about one-eighth of the listed KV.&lt;/p&gt;
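The formula from the rule-of-thumb table, and the linear scaling with context, can be checked with a throwaway helper. The architecture numbers below (32 layers, 8 KV heads, head_dim 128) are illustrative placeholders, not a specific preset; read the real values from your model's card:

```csharp
using System;

// Formula from the rule-of-thumb table: layers x kv_heads x head_dim x
// context x 2 (K and V) x bytes per element, converted from bytes to GiB.
double KvCacheGb(int layers, int kvHeads, int headDim, int contextTokens, double bytesPerKv) =>
    (double)layers * kvHeads * headDim * contextTokens * 2 * bytesPerKv / 1073741824.0;

// Placeholder architecture, F16 cache (2 bytes per element):
double at32K = KvCacheGb(32, 8, 128, 32768, 2.0); // 4 GiB at full 32K
double at4K  = KvCacheGb(32, 8, 128, 4096, 2.0);  // 0.5 GiB at 4K, one-eighth
Console.WriteLine($"{at32K} GiB at 32K vs {at4K} GiB at 4K");
```

Halving `bytesPerKv` for the V half models the Q8_0 V-cache option from the table headers.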
&lt;h2 id=&#34;step-3-vision-projector-if-applicable&#34;&gt;Step 3. Vision projector (if applicable)&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Projector quantization&lt;/th&gt;
&lt;th&gt;Typical size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;td&gt;800 MB – 2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;500 MB – 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;250 MB – 500 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each vision preset declares its &lt;code&gt;mmproj&lt;/code&gt; file in &lt;code&gt;MmprojSourceParameters&lt;/code&gt; — see &lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/#vision-presets&#34;&gt;Supported presets&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;step-4-add-overhead&#34;&gt;Step 4. Add overhead&lt;/h2&gt;
&lt;p&gt;Sampler state, tokenizer, scratch buffers: 50 MB – 500 MB, depending on batch size and context length.&lt;/p&gt;
&lt;p&gt;For a conservative budget, add &lt;strong&gt;500 MB&lt;/strong&gt; on top of weights + KV + projector.&lt;/p&gt;
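The budget arithmetic as a one-line helper (hypothetical, for illustration only):

```csharp
using System;

// Conservative budget from this page: weights + KV + projector + 0.5 GB overhead.
double TotalGb(double weightsGb, double kvGb, double projectorGb) =>
    weightsGb + kvGb + projectorGb + 0.5;

Console.WriteLine(TotalGb(3.5, 2.0, 0)); // 6, the Qwen25Preset worked example
Console.WriteLine(TotalGb(11, 10, 0));   // 21.5, the Oss20Preset worked example
```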
&lt;h2 id=&#34;worked-examples&#34;&gt;Worked examples&lt;/h2&gt;
&lt;h3 id=&#34;qwen25preset-on-a-12-gb-gpu&#34;&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; on a 12 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weights (7B Q4_K_M): 3.5 GB&lt;/li&gt;
&lt;li&gt;KV at 32K F16: 2 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~6 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A comfortable fit, with headroom for longer sessions or a higher-precision KV dtype.&lt;/p&gt;
&lt;h3 id=&#34;qwen3preset-at-full-32k-on-a-16-gb-gpu&#34;&gt;&lt;code&gt;Qwen3Preset&lt;/code&gt; at full 32K on a 16 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weights (8B Q4_K_M): 4 GB&lt;/li&gt;
&lt;li&gt;KV at 32K F16: 2 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~6.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fits with room for growth.&lt;/p&gt;
&lt;h3 id=&#34;oss20preset-at-full-131k-on-a-24-gb-gpu&#34;&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; at full 131K on a 24 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weights (20B Q4_K_M): 11 GB&lt;/li&gt;
&lt;li&gt;KV at 131K F16: 10 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~21.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fits a 24 GB card tightly. To leave more headroom:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Quantize V cache: KV drops to ~6 GB. Total ~17.5 GB. Comfortable.&lt;/li&gt;
&lt;li&gt;Shorten context to 32K: KV drops to ~2.5 GB. Total ~14 GB.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ministral3visionpreset-on-a-16-gb-gpu&#34;&gt;&lt;code&gt;Ministral3VisionPreset&lt;/code&gt; on a 16 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Base weights (8B Q4_K_M): 4 GB&lt;/li&gt;
&lt;li&gt;Projector (BF16): ~2 GB&lt;/li&gt;
&lt;li&gt;KV at 32K F16 (shortened from 262K default): ~2 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~8.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Shortening context is the easy win here.&lt;/p&gt;
&lt;h2 id=&#34;shrinking-memory&#34;&gt;Shrinking memory&lt;/h2&gt;
&lt;p&gt;In order of quality impact (least to most):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shorten &lt;code&gt;ContextSize&lt;/code&gt;&lt;/strong&gt; to what you actually use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize V cache&lt;/strong&gt; (&lt;code&gt;TypeV = Q8_0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable flash attention&lt;/strong&gt; — reduces KV memory at long contexts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize K cache&lt;/strong&gt; (&lt;code&gt;TypeK = Q8_0&lt;/code&gt;) — larger quality impact than V.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use a smaller preset&lt;/strong&gt; — last resort when the model itself is too large.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low-memory tuning&lt;/a&gt; for the full recipe.&lt;/p&gt;
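As a sketch, steps 1 through 3 translate to context parameters roughly like this. Property names follow the Context parameters reference, but `GgmlType` is an assumed enum name; verify both against your SDK version:

```csharp
// Sketch of shrink steps 1-3 on a preset object as used elsewhere on this page.
// GgmlType is an assumed enum name, not confirmed SDK API.
preset.ContextParameters.ContextSize = 8192;                              // 1. only the context you use
preset.ContextParameters.TypeV = GgmlType.Q8_0;                           // 2. quantize V cache
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled; // 3. flash attention
```

Apply step 4 (`TypeK = Q8_0`) only after verifying output quality with the V cache quantized.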
&lt;h2 id=&#34;measuring-actual-usage&#34;&gt;Measuring actual usage&lt;/h2&gt;
&lt;p&gt;After &lt;code&gt;Create&lt;/code&gt; and a few messages, check real memory:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Linux&lt;/span&gt;
nvidia-smi      &lt;span class=&#34;c1&#34;&gt;# VRAM per GPU&lt;/span&gt;
top / htop      &lt;span class=&#34;c1&#34;&gt;# system RAM&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Windows&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Task Manager → Performance → GPU / Memory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The number you read includes OS page cache of memory-mapped files — some of it is reclaimable under pressure. Still, treat the reading as a ceiling estimate.&lt;/p&gt;
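From inside the app, .NET can also report the host process's RAM directly via `System.Diagnostics.Process`; VRAM still needs the external tools above:

```csharp
using System;
using System.Diagnostics;

// Managed-side view of the current process's memory. Like the external tools,
// the working set includes memory-mapped model files, so treat it as a ceiling.
var proc = Process.GetCurrentProcess();
Console.WriteLine($"Working set:   {proc.WorkingSet64 / 1048576.0:F0} MB");
Console.WriteLine($"Private bytes: {proc.PrivateMemorySize64 / 1048576.0:F0} MB");
```

Logging these after `Create` and again after the first few messages separates the weight-loading cost from KV growth.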
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/understand-quantization/&#34;&gt;Understand quantization&lt;/a&gt; — precision impact on weights.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/system-requirements/&#34;&gt;System requirements&lt;/a&gt; — per-preset memory ranges.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low-memory tuning&lt;/a&gt; — when the numbers do not fit your budget.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
