<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation – How-to recipes</title>
    <link>https://docs.aspose.com/llm/net/how-to/</link>
    <description>Recent content in How-to recipes on Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://docs.aspose.com/llm/net/how-to/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Net: Select a model by task</title>
      <link>https://docs.aspose.com/llm/net/how-to/select-model-by-task/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/select-model-by-task/</guid>
      <description>
        
        
        &lt;p&gt;Match a built-in preset to the task. Start small; move up only if output quality does not meet your bar.&lt;/p&gt;
&lt;h2 id=&#34;quick-picker&#34;&gt;Quick picker&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your task&lt;/th&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General chat, mid-complexity tasks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (7B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latest general-purpose model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen3Preset&lt;/code&gt; (8B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small footprint, fast, long context&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Llama32Preset&lt;/code&gt; (3B, 131K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smallest possible model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Phi4Preset&lt;/code&gt; (mini)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding tasks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepSeekCoder2Preset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step-by-step reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepseekR1Qwen3Preset&lt;/code&gt; or &lt;code&gt;Oss20Preset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest model, strongest reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; (20B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding, small&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen3VL2BPreset&lt;/code&gt; (2B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding, mid&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen25VL3BPreset&lt;/code&gt; (3B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-heavy images (OCR-style)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Gemma3VisionPreset&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strongest vision reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ministral3VisionPreset&lt;/code&gt; (8B)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;decision-tree&#34;&gt;Decision tree&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Do you need vision (image input)?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes → pick a vision preset based on size and image type.&lt;/li&gt;
&lt;li&gt;No → continue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Is the task coding?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes → &lt;code&gt;DeepSeekCoder2Preset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;No → continue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Does the task require explicit step-by-step reasoning?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes → &lt;code&gt;DeepseekR1Qwen3Preset&lt;/code&gt; or &lt;code&gt;Oss20Preset&lt;/code&gt; (budget 1024-2048 &lt;code&gt;MaxTokens&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;No → continue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How much memory do you have?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4-8 GB → &lt;code&gt;Llama32Preset&lt;/code&gt; or &lt;code&gt;Phi4Preset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;12-16 GB → &lt;code&gt;Qwen25Preset&lt;/code&gt; or &lt;code&gt;Qwen3Preset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;24+ GB → any preset; &lt;code&gt;Oss20Preset&lt;/code&gt; for best quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
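&lt;p&gt;Walking the tree usually ends in one or two lines of code. A minimal sketch (preset names and the &lt;code&gt;ChatParameters.MaxTokens&lt;/code&gt; property as listed on this page):&lt;/p&gt;

```csharp
// Step 1: no vision needed. Step 2: the task is coding → DeepSeekCoder2Preset.
var codingPreset = new DeepSeekCoder2Preset();

// Step 3: for a step-by-step reasoning task instead, pick a reasoning preset
// and budget extra output tokens, as suggested above.
var reasoningPreset = new DeepseekR1Qwen3Preset();
reasoningPreset.ChatParameters.MaxTokens = 2048;
```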
&lt;h2 id=&#34;after-you-pick&#34;&gt;After you pick&lt;/h2&gt;
&lt;p&gt;Override the default values where they do not fit your scenario. See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/presets/customizing/&#34;&gt;Customizing presets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If none of the built-ins fit, &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/bring-your-own-gguf/&#34;&gt;bring your own GGUF&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt; — catalog with Hugging Face sources.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/presets/using-built-in/&#34;&gt;Using built-in presets&lt;/a&gt; — full picker guidance.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/custom-preset/&#34;&gt;Custom preset&lt;/a&gt; — patterns for tuning.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Understand quantization</title>
      <link>https://docs.aspose.com/llm/net/how-to/understand-quantization/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/understand-quantization/</guid>
      <description>
        
        
        &lt;p&gt;Quantization reduces the precision of model weights from the full-precision training format (usually F16 or BF16) to fewer bits per value. Smaller weights mean smaller files, less memory, and faster inference — at some cost in output quality.&lt;/p&gt;
&lt;h2 id=&#34;the-basic-trade-off&#34;&gt;The basic trade-off&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;File size (vs F16)&lt;/th&gt;
&lt;th&gt;Quality loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F32 (32-bit float)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2.0×&lt;/td&gt;
&lt;td&gt;None (reference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F16 (16-bit float)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.0×&lt;/td&gt;
&lt;td&gt;Essentially none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BF16 (brain float 16)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.0×&lt;/td&gt;
&lt;td&gt;Essentially none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0 (8-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.5×&lt;/td&gt;
&lt;td&gt;Very small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q6_K (6-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.38×&lt;/td&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q5_K_M (5-bit medium)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.33×&lt;/td&gt;
&lt;td&gt;Small-to-moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M (4-bit medium)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.27×&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_0 (4-bit classic)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.25×&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K (3-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.22×&lt;/td&gt;
&lt;td&gt;Noticeable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2_K (2-bit)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.18×&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IQ4_XS / IQ3_S (importance quant.)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.23-0.30×&lt;/td&gt;
&lt;td&gt;Lower than Q variants of similar size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IQ2_XXS (very aggressive)&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~0.15×&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Values are approximate; actual size depends on model architecture and specific quantizer.&lt;/p&gt;
&lt;h2 id=&#34;popular-picks&#34;&gt;Popular picks&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Q4_K_M&lt;/strong&gt; — the default for most community-uploaded GGUFs. Good balance for 7B+ models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Q5_K_M&lt;/strong&gt; — slightly bigger, slightly better quality. Worth it when you have memory headroom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Q8_0&lt;/strong&gt; — near-lossless. Use when you want the best quality that is not F16 and have 2× the memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IQ4_XS&lt;/strong&gt; — aggressive importance quantization; often better quality-per-byte than Q4_0 for the same size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;F16&lt;/strong&gt; — full precision. Useful for reproducibility or benchmarks; rarely worth the memory otherwise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-to-pick-for-your-preset&#34;&gt;How to pick for your preset&lt;/h2&gt;
&lt;p&gt;Built-in presets already choose a quantization. To change it, override &lt;code&gt;BaseModelSourceParameters.HuggingFaceFileName&lt;/code&gt; to a different file from the same repo:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Default is Qwen2.5-7B-Instruct-Q4_K_M.gguf; switch to Q8_0:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelSourceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;HuggingFaceFileName&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Qwen2.5-7B-Instruct-Q8_0.gguf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm the file exists in the repository (check the Hugging Face page).&lt;/p&gt;
&lt;h2 id=&#34;memory-rough-estimate&#34;&gt;Rough memory estimate&lt;/h2&gt;
&lt;p&gt;For a model with N parameters:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Bytes per parameter&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;7B model&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;70B model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2.0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~14 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~140 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1.0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~7 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~70 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.625&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~4.4 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~44 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.5&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.5 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~35 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.5&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.5 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~35 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IQ4_XS&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.45&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.2 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;0.375&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~2.6 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~26 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Add KV cache and intermediate buffers on top — see &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt;.&lt;/p&gt;
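&lt;p&gt;The table reduces to a one-line formula: weight memory in GB ≈ parameters (in billions) × bytes per parameter, since 10&lt;sup&gt;9&lt;/sup&gt; parameters at B bytes each is B GB. A quick sketch (the helper name is illustrative):&lt;/p&gt;

```csharp
// Rough weight-only memory estimate, using the bytes-per-parameter
// values from the table above. Billions of parameters × bytes per
// parameter gives gigabytes directly.
static double EstimateWeightGb(double paramsInBillions, double bytesPerParam) =>
    paramsInBillions * bytesPerParam;

// 7B at Q4_K_M (0.5 bytes/param) → 3.5 GB, matching the table.
Console.WriteLine(EstimateWeightGb(7, 0.5));
```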
&lt;h2 id=&#34;when-to-pick-a-smaller-quantization&#34;&gt;When to pick a smaller quantization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Memory is the binding constraint: the model does not fit at a higher-precision quantization but fits at a lower one.&lt;/li&gt;
&lt;li&gt;You are running many models and want to pack several into one machine.&lt;/li&gt;
&lt;li&gt;You accept some quality loss for speed — smaller quantizations run slightly faster due to reduced memory bandwidth pressure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-to-avoid-aggressive-quantization&#34;&gt;When to avoid aggressive quantization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tasks sensitive to precise output (code generation, math, legal/medical reasoning).&lt;/li&gt;
&lt;li&gt;Long reasoning chains where errors compound.&lt;/li&gt;
&lt;li&gt;When you have the memory — use Q5_K_M or Q8_0 when you can.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;kv-cache-quantization&#34;&gt;KV cache quantization&lt;/h2&gt;
&lt;p&gt;Separate from model weight quantization, you can quantize the KV cache at runtime via &lt;code&gt;ContextParameters.TypeK&lt;/code&gt; and &lt;code&gt;TypeV&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q8_0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This saves memory on long contexts with minor quality impact. See &lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; for the full enum.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/model-source/&#34;&gt;Model source parameters&lt;/a&gt; — how to select a specific GGUF file.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/&#34;&gt;Estimate memory requirements&lt;/a&gt; — factor quantization into your sizing.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/&#34;&gt;Supported presets&lt;/a&gt; — built-in preset quantizations.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Tune for speed vs quality</title>
      <link>https://docs.aspose.com/llm/net/how-to/tune-for-speed-vs-quality/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/tune-for-speed-vs-quality/</guid>
      <description>
        
        
        &lt;p&gt;Several knobs move a preset along the speed-quality curve. This how-to summarizes them with concrete recommendations.&lt;/p&gt;
&lt;h2 id=&#34;speed-biased-configuration&#34;&gt;Speed-biased configuration&lt;/h2&gt;
&lt;p&gt;When throughput matters most — bulk processing, real-time chat, short answers.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Llama32Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 3B model
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q8_0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Temperature&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.3f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TopP&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.9f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TopK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MaxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;256&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BinaryManagerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PreferredAcceleration&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AccelerationType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CUDA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected throughput: 50-100 tokens/sec on a mid-range GPU, 15-30 on a modern CPU.&lt;/p&gt;
&lt;h2 id=&#34;quality-biased-configuration&#34;&gt;Quality-biased configuration&lt;/h2&gt;
&lt;p&gt;When the best possible output matters — deep analysis, complex reasoning, long-form writing.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Oss20Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 20B model
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextSize&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;32768&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeK&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TypeV&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;GgmlType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;F16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Temperature&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.7f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TopP&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.95f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MinP&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;0.05f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SamplerParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;RepetitionPenalty&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1.05f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MaxTokens&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2048&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;BaseModelInferenceParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GpuLayers&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;999&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected throughput: 10-30 tokens/sec on a high-end GPU. CPU not recommended.&lt;/p&gt;
&lt;h2 id=&#34;balanced-configuration&#34;&gt;Balanced configuration&lt;/h2&gt;
&lt;p&gt;The default for most scenarios.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// 7B model
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;// All other settings at preset defaults.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected throughput: 30-60 tokens/sec on a mid-range GPU.&lt;/p&gt;
&lt;h2 id=&#34;knobs-cheat-sheet&#34;&gt;Knobs cheat sheet&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Knob&lt;/th&gt;
&lt;th&gt;Faster&lt;/th&gt;
&lt;th&gt;Better quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model size&lt;/td&gt;
&lt;td&gt;Smaller (&lt;code&gt;Phi4Preset&lt;/code&gt;, &lt;code&gt;Llama32Preset&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Larger (&lt;code&gt;Oss20Preset&lt;/code&gt;, &lt;code&gt;Qwen3Preset&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;Q4_K_M, Q4_0&lt;/td&gt;
&lt;td&gt;Q8_0, F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ContextSize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shorter&lt;/td&gt;
&lt;td&gt;Longer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FlashAttentionMode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Enabled&lt;/code&gt; (helps both)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TypeV&lt;/code&gt; (KV cache)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;, &lt;code&gt;Q4_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;F16&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TypeK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;F16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;F16&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0-0.3&lt;/code&gt; (more deterministic)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7-0.9&lt;/code&gt; (more creative, nuanced)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TopP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.8-0.9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.9-0.95&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TopK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;20&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;40&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RepetitionPenalty&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.05&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MaxTokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GpuLayers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;999&lt;/code&gt; (full offload)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;999&lt;/code&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;measure-do-not-guess&#34;&gt;Measure, do not guess&lt;/h2&gt;
&lt;p&gt;Throughput is hardware-specific. Benchmark on your actual target machine with realistic prompts before committing to a configuration.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;sw&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;

&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Tokens: ~{reply.Split(&amp;#39; &amp;#39;).Length}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Time: {sw.Elapsed.TotalSeconds:F2}s&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Rate: ~{reply.Split(&amp;#39; &amp;#39;).Length / sw.Elapsed.TotalSeconds:F1} tok/s&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Word count is only an approximation — real token counts for English text are usually 1.3-1.5× the word count.&lt;/p&gt;
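If you want the estimate itself to account for that ratio, you can scale the word count. A minimal sketch (the 1.4 multiplier is an assumed midpoint of the 1.3-1.5 range, not an SDK constant):

```csharp
using System;

// Rough token estimate for English text: whitespace word count scaled by
// an assumed 1.4 tokens-per-word midpoint. Not an exact tokenizer count.
static int EstimateTokens(string text)
{
    int words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
    return (int)Math.Round(words * 1.4);
}
```

For exact counts you would need the model's own tokenizer; treat this as a logging heuristic only.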
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/sampler/&#34;&gt;Sampler parameters&lt;/a&gt; — fine-grained sampler control.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — context, flash attention, KV cache.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/understand-quantization/&#34;&gt;Understand quantization&lt;/a&gt; — how quantization affects throughput.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Handle cancellation</title>
      <link>https://docs.aspose.com/llm/net/how-to/handle-cancellation/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/handle-cancellation/</guid>
      <description>
        
        
        &lt;p&gt;Both &lt;code&gt;SendMessageAsync&lt;/code&gt; and &lt;code&gt;SendMessageToSessionAsync&lt;/code&gt; accept a &lt;code&gt;CancellationToken&lt;/code&gt;. Cancelling it stops token generation promptly. The session state remains intact — you can continue the conversation with the next call.&lt;/p&gt;
&lt;h2 id=&#34;cancel-on-timeout&#34;&gt;Cancel on timeout&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TimeSpan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FromSeconds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;30&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;));&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
        &lt;span class=&#34;s&#34;&gt;&amp;#34;Write a 500-word essay about migration patterns of the Arctic tern.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
        &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Generation timed out.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;cancel-on-user-action-ctrlc&#34;&gt;Cancel on user action (Ctrl+C)&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CancelKeyPress&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;+=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&amp;gt;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Cancel&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;   &lt;span class=&#34;c1&#34;&gt;// prevent default process termination
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Cancel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;};&lt;/span&gt;

&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Cancelled by user.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;cancel-from-an-http-host&#34;&gt;Cancel from an HTTP host&lt;/h2&gt;
&lt;p&gt;ASP.NET Core passes a request cancellation token to endpoint handlers. Forward it to the SDK:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;app&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MapPost&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;/chat&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;async&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatRequest&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;req&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Engine&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationToken&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&amp;gt;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;req&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SessionId&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;??&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;InitiateNewSession&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GetChatSessionResponse&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;req&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Message&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;null&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Ok&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StatusCode&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;499&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// client closed request
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When the client disconnects, &lt;code&gt;ct&lt;/code&gt; fires; the SDK stops generating.&lt;/p&gt;
&lt;h2 id=&#34;session-state-after-cancellation&#34;&gt;Session state after cancellation&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The partial output is &lt;strong&gt;discarded&lt;/strong&gt; — the user&amp;rsquo;s message goes into the history, but no assistant message is recorded.&lt;/li&gt;
&lt;li&gt;The session remains alive and can accept a new message immediately.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;try&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageToSessionAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Long prompt...&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;// Session is still usable.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;// Continue without issue:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageToSessionAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Short follow-up.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;combine-timeout-with-linked-tokens&#34;&gt;Combine timeout with linked tokens&lt;/h2&gt;
&lt;p&gt;For a handler that has both a user-cancellation token and a timeout:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;timeoutCts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TimeSpan&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FromSeconds&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;60&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;));&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linkedCts&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CancellationTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;CreateLinkedTokenSource&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;externalCancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;timeoutCts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cancellationToken&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;linkedCts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Token&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Either source cancels the operation. Inspect which one fired if your UX needs to tell them apart:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;when&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;timeoutCts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IsCancellationRequested&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;// Timed out.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;catch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OperationCanceledException&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;c1&#34;&gt;// User or external cancellation.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;what-cancellation-does-not-cover&#34;&gt;What cancellation does not cover&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model load&lt;/strong&gt; — the synchronous model-load inside &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt; is not interruptible via &lt;code&gt;CancellationToken&lt;/code&gt;. Budget for the cold start; do not try to cancel it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Binary download&lt;/strong&gt; — same. The first-run binary deployment runs during &lt;code&gt;Create&lt;/code&gt; and is synchronous.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For application-level time limits on startup, wrap &lt;code&gt;Create&lt;/code&gt; in a &lt;code&gt;Task.Run&lt;/code&gt; with an external watchdog — but be aware that even if you stop waiting on the task, the background work continues until it completes or the process terminates.&lt;/p&gt;
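That watchdog can be sketched like this (using the `AsposeLLMApi.Create` and `Qwen25Preset` names from the other recipes; the 120-second budget is an arbitrary assumption):

```csharp
using System;
using System.Threading.Tasks;

// Watchdog around the non-cancellable Create call. If the timeout wins,
// we only stop waiting; the load keeps running in the background until
// it finishes or the process exits.
var loadTask = Task.Run(() => AsposeLLMApi.Create(new Qwen25Preset()));
var winner = await Task.WhenAny(loadTask, Task.Delay(TimeSpan.FromSeconds(120)));

if (winner != loadTask)
    throw new TimeoutException("Model load exceeded the startup budget.");

var api = await loadTask; // already completed; propagates load failures
```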
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/asposellmapi/&#34;&gt;AsposeLLMApi facade&lt;/a&gt; — method signatures including &lt;code&gt;CancellationToken&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/integration-with-aspnet-core/&#34;&gt;Integration with ASP.NET Core&lt;/a&gt; — cancellation in HTTP hosts.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/chat-sessions/&#34;&gt;Chat sessions&lt;/a&gt; — session lifecycle around cancellation.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Reduce first-token latency</title>
      <link>https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/reduce-first-token-latency/</guid>
      <description>
        
        
        &lt;p&gt;&amp;ldquo;First-token latency&amp;rdquo; is the time between sending a message and starting to see output. It has two components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cold-start&lt;/strong&gt; — binary download, model load, session creation. Happens once per &lt;code&gt;AsposeLLMApi&lt;/code&gt; instance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-message&lt;/strong&gt; — prompt tokenization, KV cache prefill, first-token generation.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Both can be reduced.&lt;/p&gt;
&lt;h2 id=&#34;warm-up-at-startup&#34;&gt;Warm up at startup&lt;/h2&gt;
&lt;p&gt;The first &lt;code&gt;AsposeLLMApi.Create&lt;/code&gt; call is slow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First ever: downloads binaries + model (100-500 MB + 2-15 GB). Several minutes.&lt;/li&gt;
&lt;li&gt;Cached: model load only. 5-30 seconds.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do it during application startup, not on the first user request.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// At application startup:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;license&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Aspose&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;LLM&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;License&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;license&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetLicense&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Aspose.LLM.lic&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;new&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qwen25Preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EngineParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;EnableDebugLogging&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;false&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// slow the first time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In ASP.NET Core, warm up from &lt;code&gt;ApplicationStarted&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;app&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Lifetime&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ApplicationStarted&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Register&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(()&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&amp;gt;&lt;/span&gt;
&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
    &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;app&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Services&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;GetRequiredService&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Engine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;&amp;gt;();&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;// triggers model load
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In a Worker Service, do it inside &lt;code&gt;ExecuteAsync&lt;/code&gt; before entering the main loop.&lt;/p&gt;
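A sketch of that Worker Service shape (the `ChatWorker` class and its work loop are hypothetical; `AsposeLLMApi.Create` and `Qwen25Preset` are the SDK names used above):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Pays the cold-start cost once, before the worker serves any work items.
public sealed class ChatWorker : BackgroundService
{
    private AsposeLLMApi? _api;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Warm up off the host's startup path; Create itself is synchronous.
        _api = await Task.Run(() => AsposeLLMApi.Create(new Qwen25Preset()), stoppingToken);

        while (!stoppingToken.IsCancellationRequested)
        {
            // ... dequeue a prompt and call _api.SendMessageAsync(...) here ...
            await Task.Delay(1000, stoppingToken);
        }
    }
}
```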
&lt;h2 id=&#34;pre-create-a-session&#34;&gt;Pre-create a session&lt;/h2&gt;
&lt;p&gt;Starting a session takes tens to hundreds of milliseconds. For a chat server that processes user requests, create a session before the first request arrives:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;warmupSessionId&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNewChatAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageToSessionAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;warmupSessionId&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;ping&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;// Keep this session; the engine is now fully warm.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;shorten-system-prompts&#34;&gt;Shorten system prompts&lt;/h2&gt;
&lt;p&gt;Every new session tokenizes and evaluates the system prompt before the first user turn. A 500-token system prompt costs hundreds of milliseconds on CPU, tens on GPU. Keep system prompts short.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// 50 tokens — fast first-turn.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChatParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SystemPrompt&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;You are a concise assistant. Answer briefly.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;// 500 tokens of preamble — slow.
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;// preset.ChatParameters.SystemPrompt = &amp;#34;&amp;lt;long preamble with many instructions and examples&amp;gt;&amp;#34;;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you need extensive priming, use &lt;code&gt;ChatParameters.History&lt;/code&gt; with a few-shot example set — the examples are tokenized once per session creation and cached across turns.&lt;/p&gt;
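A hedged sketch of the few-shot history approach (the `ChatMessage` role/content shape is assumed for illustration; check the SDK reference for the concrete `History` element type):

```csharp
using System.Collections.Generic;

// Short system prompt plus few-shot turns in History: the examples are
// prefetched once at session creation instead of re-sent every turn.
// ChatMessage(role, content) is an assumed shape, not a confirmed SDK type.
preset.ChatParameters.SystemPrompt = "You are a concise assistant. Answer briefly.";
preset.ChatParameters.History = new List<ChatMessage>
{
    new("user", "Summarize: The cat sat on the mat all afternoon."),
    new("assistant", "A cat spent the afternoon on a mat."),
};
```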
&lt;h2 id=&#34;size-nbatch-correctly&#34;&gt;Size &lt;code&gt;NBatch&lt;/code&gt; correctly&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;ContextParameters.NBatch&lt;/code&gt; controls how many tokens the engine processes per &lt;code&gt;llama_decode&lt;/code&gt; call during prompt ingestion. Larger &lt;code&gt;NBatch&lt;/code&gt; is faster &lt;strong&gt;as long as it fits in memory&lt;/strong&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// Built-in presets typically set NBatch = 2048. For prompt-heavy scenarios:
&lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NBatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NUbatch&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Too large and you exceed your VRAM budget; too small and prompt prefill slows down. Tune the value on your hardware.&lt;/p&gt;
&lt;h2 id=&#34;enable-flash-attention&#34;&gt;Enable flash attention&lt;/h2&gt;
&lt;p&gt;Flash attention dramatically reduces prefill time on long prompts:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ContextParameters&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;FlashAttentionMode&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;FlashAttentionType&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enabled&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Enable it whenever your hardware and backend support it.&lt;/p&gt;
&lt;h2 id=&#34;keep-sessions-alive&#34;&gt;Keep sessions alive&lt;/h2&gt;
&lt;p&gt;Reuse sessions across requests instead of creating a fresh one each time. Session creation costs prefill time; reusing amortizes it across turns.&lt;/p&gt;
&lt;p&gt;In HTTP hosts, map user IDs to session IDs — see &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/multiple-concurrent-sessions/&#34;&gt;Multiple concurrent sessions&lt;/a&gt;.&lt;/p&gt;
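One way to keep that mapping is a concurrent dictionary keyed by user ID (a sketch; `StartNewChatAsync` appears in the other recipes, the surrounding glue is hypothetical):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// One warm session per user, reused across HTTP requests.
static readonly ConcurrentDictionary<string, string> Sessions = new();

static async Task<string> GetOrCreateSessionAsync(AsposeLLMApi api, string userId)
{
    if (Sessions.TryGetValue(userId, out var existing))
        return existing;

    string created = await api.StartNewChatAsync();
    // If two requests raced, GetOrAdd keeps the first stored session.
    return Sessions.GetOrAdd(userId, created);
}
```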
&lt;h2 id=&#34;pre-populate-binary-and-model-caches&#34;&gt;Pre-populate binary and model caches&lt;/h2&gt;
&lt;p&gt;In offline or container deployments, download binaries and models on the build machine; ship them with your image. The runtime host skips the multi-minute initial download.&lt;/p&gt;
&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;measure&#34;&gt;Measure&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csharp&#34; data-lang=&#34;csharp&#34;&gt;&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;totalSw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AsposeLLMApi&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;preset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Create: {totalSw.Elapsed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;firstSw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Say hello.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;First message: {firstSw.Elapsed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;

&lt;span class=&#34;kt&#34;&gt;var&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;secondSw&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Diagnostics&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Stopwatch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;StartNew&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt;
&lt;span class=&#34;kt&#34;&gt;string&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reply2&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;api&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SendMessageAsync&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;Say hello again.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;Console&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;WriteLine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;$&amp;#34;Second message: {secondSw.Elapsed}&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The second message is noticeably faster than the first because the session is already warm.&lt;/p&gt;
&lt;h2 id=&#34;typical-numbers-modern-gpu&#34;&gt;Typical numbers (modern GPU)&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Create&lt;/code&gt; (cold, first ever)&lt;/td&gt;
&lt;td&gt;2-10 min (download + load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Create&lt;/code&gt; (cached)&lt;/td&gt;
&lt;td&gt;5-30 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;StartNewChatAsync&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50-200 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First token after a 100-token prompt&lt;/td&gt;
&lt;td&gt;200-500 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subsequent tokens&lt;/td&gt;
&lt;td&gt;10-20 ms each&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;CPU numbers are roughly 5-10× higher for each stage.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/architecture/&#34;&gt;Architecture&lt;/a&gt; — what happens during &lt;code&gt;Create&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/developer-reference/parameters/context/&#34;&gt;Context parameters&lt;/a&gt; — &lt;code&gt;NBatch&lt;/code&gt;, flash attention.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/offline-deployment/&#34;&gt;Offline deployment&lt;/a&gt; — skip the initial download at runtime.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Net: Estimate memory requirements</title>
      <link>https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://docs.aspose.com/llm/net/how-to/estimate-memory-requirements/</guid>
      <description>
        
        
        &lt;p&gt;Four things claim memory when the SDK runs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model weights.&lt;/li&gt;
&lt;li&gt;KV cache.&lt;/li&gt;
&lt;li&gt;Vision projector (vision presets only).&lt;/li&gt;
&lt;li&gt;Intermediate buffers and sampler state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This how-to helps you predict the total before deployment.&lt;/p&gt;
&lt;h2 id=&#34;rule-of-thumb-sizes&#34;&gt;Rule-of-thumb sizes&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Approximate size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model weights&lt;/td&gt;
&lt;td&gt;&lt;code&gt;parameters × bytes_per_parameter&lt;/code&gt;. For a 7B Q4_K_M model, ~3.5 GB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;&lt;code&gt;layers × kv_heads × head_dim × context × 2 × bytes_per_kv&lt;/code&gt;. Actual numbers below.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision projector&lt;/td&gt;
&lt;td&gt;200 MB – 2 GB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermediate buffers&lt;/td&gt;
&lt;td&gt;50 MB – 500 MB.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;step-1-weights-from-quantization&#34;&gt;Step 1. Weights from quantization&lt;/h2&gt;
&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/understand-quantization/&#34;&gt;Understand quantization&lt;/a&gt; for the per-parameter bytes table.&lt;/p&gt;
&lt;p&gt;Roughly: &lt;code&gt;weights_bytes ≈ parameters × bytes_per_param&lt;/code&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Parameters&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Q4_K_M&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Q8_0&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;F16&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;3B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~1.8 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.2 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;7B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~3.5 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~7 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~14 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~4 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~8 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;20B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~11 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~21 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~40 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right&#34;&gt;70B&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~35 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~70 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;~140 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
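The rule is easy to sanity-check in code. A throwaway helper (not part of the SDK) using the rough per-parameter bytes behind the table: about 0.5 for Q4_K_M, about 1.0 for Q8_0, and 2.0 for F16:

```csharp
using System;

// Hypothetical helper implementing the rule above. One billion parameters at
// N bytes each is N GB, so billions times bytes-per-param gives GB directly.
double WeightsGb(double paramsBillions, double bytesPerParam) =>
    paramsBillions * bytesPerParam;

Console.WriteLine(WeightsGb(7, 0.5)); // 3.5, the 7B Q4_K_M row
Console.WriteLine(WeightsGb(8, 2.0)); // 16, the 8B F16 row
```

Use the exact bytes-per-parameter figures from the quantization page when you need a tighter estimate.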
&lt;h2 id=&#34;step-2-kv-cache&#34;&gt;Step 2. KV cache&lt;/h2&gt;
&lt;p&gt;KV size depends on model architecture (number of layers, KV heads, head dimension), context size, and KV dtype. The underlying formula is complex; below are empirical numbers for common presets at their default contexts:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;KV at default context (F16)&lt;/th&gt;
&lt;th&gt;KV at default context (Q8_0 V)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Llama32Preset&lt;/code&gt; (131K)&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;td&gt;~5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; (32K)&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;~1.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen3Preset&lt;/code&gt; (32K)&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;~1.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DeepseekR1Qwen3Preset&lt;/code&gt; (131K)&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; (131K)&lt;/td&gt;
&lt;td&gt;~10 GB&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Phi4Preset&lt;/code&gt; (16K)&lt;/td&gt;
&lt;td&gt;~0.8 GB&lt;/td&gt;
&lt;td&gt;~0.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen3VL2BPreset&lt;/code&gt; (262K)&lt;/td&gt;
&lt;td&gt;~12 GB&lt;/td&gt;
&lt;td&gt;~7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;KV usage scales roughly linearly with actual session length: a 32K-capable preset holding only 4K of context uses about one-eighth of the listed KV.&lt;/p&gt;
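The formula from the rule-of-thumb table, and the linear scaling with context, can be checked with a throwaway helper. The architecture numbers below (32 layers, 8 KV heads, head_dim 128) are illustrative placeholders, not a specific preset; read the real values from your model's card:

```csharp
using System;

// Formula from the rule-of-thumb table: layers x kv_heads x head_dim x
// context x 2 (K and V) x bytes per element, converted from bytes to GiB.
double KvCacheGb(int layers, int kvHeads, int headDim, int contextTokens, double bytesPerKv) =>
    (double)layers * kvHeads * headDim * contextTokens * 2 * bytesPerKv / 1073741824.0;

// Placeholder architecture, F16 cache (2 bytes per element):
double at32K = KvCacheGb(32, 8, 128, 32768, 2.0); // 4 GiB at full 32K
double at4K  = KvCacheGb(32, 8, 128, 4096, 2.0);  // 0.5 GiB at 4K, one-eighth
Console.WriteLine($"{at32K} GiB at 32K vs {at4K} GiB at 4K");
```

Halving `bytesPerKv` for the V half models the Q8_0 V-cache option from the table headers.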
&lt;h2 id=&#34;step-3-vision-projector-if-applicable&#34;&gt;Step 3. Vision projector (if applicable)&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Projector quantization&lt;/th&gt;
&lt;th&gt;Typical size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;td&gt;800 MB – 2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;500 MB – 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;250 MB – 500 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each vision preset declares its &lt;code&gt;mmproj&lt;/code&gt; file in &lt;code&gt;MmprojSourceParameters&lt;/code&gt; — see &lt;a href=&#34;https://docs.aspose.com/llm/net/product-overview/supported-presets/#vision-presets&#34;&gt;Supported presets&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;step-4-add-overhead&#34;&gt;Step 4. Add overhead&lt;/h2&gt;
&lt;p&gt;Sampler state, tokenizer, scratch buffers: 50 MB – 500 MB, depending on batch size and context length.&lt;/p&gt;
&lt;p&gt;For a conservative budget, add &lt;strong&gt;500 MB&lt;/strong&gt; on top of weights + KV + projector.&lt;/p&gt;
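The budget arithmetic as a one-line helper (hypothetical, for illustration only):

```csharp
using System;

// Conservative budget from this page: weights + KV + projector + 0.5 GB overhead.
double TotalGb(double weightsGb, double kvGb, double projectorGb) =>
    weightsGb + kvGb + projectorGb + 0.5;

Console.WriteLine(TotalGb(3.5, 2.0, 0)); // 6, the Qwen25Preset worked example
Console.WriteLine(TotalGb(11, 10, 0));   // 21.5, the Oss20Preset worked example
```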
&lt;h2 id=&#34;worked-examples&#34;&gt;Worked examples&lt;/h2&gt;
&lt;h3 id=&#34;qwen25preset-on-a-12-gb-gpu&#34;&gt;&lt;code&gt;Qwen25Preset&lt;/code&gt; on a 12 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weights (7B Q4_K_M): 3.5 GB&lt;/li&gt;
&lt;li&gt;KV at 32K F16: 2 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~6 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A comfortable fit, with headroom for longer sessions or a higher-precision KV dtype.&lt;/p&gt;
&lt;h3 id=&#34;qwen3preset-at-full-32k-on-a-16-gb-gpu&#34;&gt;&lt;code&gt;Qwen3Preset&lt;/code&gt; at full 32K on a 16 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weights (8B Q4_K_M): 4 GB&lt;/li&gt;
&lt;li&gt;KV at 32K F16: 2 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~6.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fits with room for growth.&lt;/p&gt;
&lt;h3 id=&#34;oss20preset-at-full-131k-on-a-24-gb-gpu&#34;&gt;&lt;code&gt;Oss20Preset&lt;/code&gt; at full 131K on a 24 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Weights (20B Q4_K_M): 11 GB&lt;/li&gt;
&lt;li&gt;KV at 131K F16: 10 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~21.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fits a 24 GB card tightly. To leave more headroom:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Quantize V cache: KV drops to ~6 GB. Total ~17.5 GB. Comfortable.&lt;/li&gt;
&lt;li&gt;Shorten context to 32K: KV drops to ~2.5 GB. Total ~14 GB.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ministral3visionpreset-on-a-16-gb-gpu&#34;&gt;&lt;code&gt;Ministral3VisionPreset&lt;/code&gt; on a 16 GB GPU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Base weights (8B Q4_K_M): 4 GB&lt;/li&gt;
&lt;li&gt;Projector (BF16): ~2 GB&lt;/li&gt;
&lt;li&gt;KV at 32K F16 (shortened from 262K default): ~2 GB&lt;/li&gt;
&lt;li&gt;Overhead: 0.5 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~8.5 GB&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Shortening context is the easy win here.&lt;/p&gt;
&lt;h2 id=&#34;shrinking-memory&#34;&gt;Shrinking memory&lt;/h2&gt;
&lt;p&gt;In order of quality impact (least to most):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shorten &lt;code&gt;ContextSize&lt;/code&gt;&lt;/strong&gt; to what you actually use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize V cache&lt;/strong&gt; (&lt;code&gt;TypeV = Q8_0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable flash attention&lt;/strong&gt; — reduces KV memory at long contexts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantize K cache&lt;/strong&gt; (&lt;code&gt;TypeK = Q8_0&lt;/code&gt;) — larger quality impact than V.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use a smaller preset&lt;/strong&gt; — last resort when the model itself is too large.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;See &lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low-memory tuning&lt;/a&gt; for the full recipe.&lt;/p&gt;
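As a sketch, steps 1 through 3 translate to context parameters roughly like this. Property names follow the Context parameters reference, but `GgmlType` is an assumed enum name; verify both against your SDK version:

```csharp
// Sketch of shrink steps 1-3 on a preset object as used elsewhere on this page.
// GgmlType is an assumed enum name, not confirmed SDK API.
preset.ContextParameters.ContextSize = 8192;                              // 1. only the context you use
preset.ContextParameters.TypeV = GgmlType.Q8_0;                           // 2. quantize V cache
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled; // 3. flash attention
```

Apply step 4 (`TypeK = Q8_0`) only after verifying output quality with the V cache quantized.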
&lt;h2 id=&#34;measuring-actual-usage&#34;&gt;Measuring actual usage&lt;/h2&gt;
&lt;p&gt;After &lt;code&gt;Create&lt;/code&gt; and a few messages, check real memory:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Linux&lt;/span&gt;
nvidia-smi      &lt;span class=&#34;c1&#34;&gt;# VRAM per GPU&lt;/span&gt;
top / htop      &lt;span class=&#34;c1&#34;&gt;# system RAM&lt;/span&gt;

&lt;span class=&#34;c1&#34;&gt;# Windows&lt;/span&gt;
&lt;span class=&#34;c1&#34;&gt;# Task Manager → Performance → GPU / Memory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The number you read includes OS page cache of memory-mapped files — some of it is reclaimable under pressure. Still, treat the reading as a ceiling estimate.&lt;/p&gt;
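From inside the app, .NET can also report the host process's RAM directly via `System.Diagnostics.Process`; VRAM still needs the external tools above:

```csharp
using System;
using System.Diagnostics;

// Managed-side view of the current process's memory. Like the external tools,
// the working set includes memory-mapped model files, so treat it as a ceiling.
var proc = Process.GetCurrentProcess();
Console.WriteLine($"Working set:   {proc.WorkingSet64 / 1048576.0:F0} MB");
Console.WriteLine($"Private bytes: {proc.PrivateMemorySize64 / 1048576.0:F0} MB");
```

Logging these after `Create` and again after the first few messages separates the weight-loading cost from KV growth.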
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/how-to/understand-quantization/&#34;&gt;Understand quantization&lt;/a&gt; — precision impact on weights.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/system-requirements/&#34;&gt;System requirements&lt;/a&gt; — per-preset memory ranges.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aspose.com/llm/net/use-cases/low-memory-tuning/&#34;&gt;Low-memory tuning&lt;/a&gt; — when the numbers do not fit your budget.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
