Reduce first-token latency
“First-token latency” is the time between sending a message and starting to see output. It has two components:
- Cold-start — binary download, model load, session creation. Happens once per AsposeLLMApi instance.
- Per-message — prompt tokenization, KV cache prefill, first-token generation.
Both can be reduced.
Warm up at startup
The first AsposeLLMApi.Create call is slow:
- First ever: downloads binaries + model (100-500 MB + 2-15 GB). Several minutes.
- Cached: model load only. 5-30 seconds.
Call Create during application startup, not on the first user request.
// At application startup:
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");
var preset = new Qwen25Preset();
preset.EngineParameters.EnableDebugLogging = false;
var api = AsposeLLMApi.Create(preset); // slow the first time
In ASP.NET Core, warm up from ApplicationStarted:
app.Lifetime.ApplicationStarted.Register(() =>
{
_ = app.Services.GetRequiredService<Engine>(); // triggers model load
});
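The Engine service above is your own wrapper; registering it as a singleton makes the first resolution build the model. A minimal sketch of such a wrapper (the class name and shape are illustrative, not part of the Aspose.LLM API):
// Program.cs — register the wrapper as a singleton so one resolution builds it once.
builder.Services.AddSingleton<Engine>();

// Engine.cs — hypothetical wrapper; the slow Create call runs in the constructor.
public sealed class Engine : IDisposable
{
    public AsposeLLMApi Api { get; }

    public Engine()
    {
        var preset = new Qwen25Preset();
        Api = AsposeLLMApi.Create(preset); // slow part happens here, at startup
    }

    public void Dispose() => Api.Dispose();
}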
In a Worker Service, do it inside ExecuteAsync before entering the main loop.
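A minimal sketch of that pattern (request handling elided; the preset mirrors the startup example above):
public class LlmWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Warm up before entering the main loop.
        var preset = new Qwen25Preset();
        using var api = AsposeLLMApi.Create(preset); // slow once; warm afterwards

        while (!stoppingToken.IsCancellationRequested)
        {
            // ... dequeue and serve requests with the warm api ...
            await Task.Delay(100, stoppingToken);
        }
    }
}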
Pre-create a session
Starting a session takes tens to hundreds of milliseconds. For a chat server, create a session before the first request arrives:
string warmupSessionId = await api.StartNewChatAsync();
await api.SendMessageToSessionAsync(warmupSessionId, "ping");
// Keep this session; the engine is now fully warm.
Shorten system prompts
Every new session tokenizes and evaluates the system prompt before the first user turn. A 500-token system prompt costs hundreds of milliseconds on CPU, tens on GPU. Keep system prompts short.
// 50 tokens — fast first-turn.
preset.ChatParameters.SystemPrompt = "You are a concise assistant. Answer briefly.";
// 500 tokens of preamble — slow.
// preset.ChatParameters.SystemPrompt = "<long preamble with many instructions and examples>";
If you need extensive priming, use ChatParameters.History with a few-shot example set — the examples are tokenized once per session creation and cached across turns.
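A sketch of the idea — the ChatMessage/ChatRole shape below is illustrative; adjust it to whatever type ChatParameters.History actually accepts:
// Hypothetical message type — match this to the real ChatParameters.History API.
preset.ChatParameters.SystemPrompt = "You are a concise assistant.";
preset.ChatParameters.History = new[]
{
    // Few-shot pair: tokenized once at session creation, cached across turns.
    new ChatMessage(ChatRole.User, "Summarize: The cat sat on the mat."),
    new ChatMessage(ChatRole.Assistant, "A cat sat on a mat."),
};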
Size NBatch correctly
ContextParameters.NBatch controls how many tokens the engine processes per llama_decode call during prompt ingestion. Larger NBatch is faster as long as it fits in memory.
// Built-in presets typically set NBatch = 2048. For prompt-heavy scenarios:
preset.ContextParameters.NBatch = 4096;
preset.ContextParameters.NUbatch = 4096;
Too large, and you exceed your VRAM budget; too small, and prompt prefill is slow. Tune on your hardware; one approach is sketched below.
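One way to tune it: time the first message over a long prompt at a few candidate values. A minimal sketch (the prompt and value list are illustrative):
// Sweep NBatch and time prompt prefill + first reply on this hardware.
string longPrompt = string.Join(" ", Enumerable.Repeat("benchmark sentence", 400));

foreach (int nBatch in new[] { 512, 1024, 2048, 4096 })
{
    var preset = new Qwen25Preset();
    preset.ContextParameters.NBatch = nBatch;
    preset.ContextParameters.NUbatch = nBatch;

    using var api = AsposeLLMApi.Create(preset); // model is cached after the first run
    var sw = System.Diagnostics.Stopwatch.StartNew();
    await api.SendMessageAsync(longPrompt);
    Console.WriteLine($"NBatch={nBatch}: {sw.Elapsed}");
}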
Enable flash attention
Flash attention dramatically reduces prefill time on long prompts:
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
Always enable when supported.
Keep sessions alive
Reuse sessions across requests instead of creating a fresh one each time. Session creation costs prefill time; reusing amortizes it across turns.
In HTTP hosts, map user IDs to session IDs — see Multiple concurrent sessions.
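A minimal sketch of such a mapping; SessionRegistry is a hypothetical helper, and the Lazy wrapper ensures each user gets exactly one session even when two first requests race:
using System.Collections.Concurrent;

// Hypothetical helper: one live session per user, created lazily.
public class SessionRegistry
{
    private readonly AsposeLLMApi _api;
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _sessions = new();

    public SessionRegistry(AsposeLLMApi api) => _api = api;

    public Task<string> GetSessionIdAsync(string userId) =>
        // Lazy guarantees StartNewChatAsync runs once per user, even if
        // two requests for the same user arrive at the same time.
        _sessions.GetOrAdd(userId, _ => new Lazy<Task<string>>(
            () => _api.StartNewChatAsync())).Value;
}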
Pre-populate binary and model caches
In offline or container deployments, download binaries and models on the build machine; ship them with your image. The runtime host skips the multi-minute initial download.
See Offline deployment.
Measure
var preset = new Qwen25Preset(); // same preset as above; defined here so the snippet runs standalone
var totalSw = System.Diagnostics.Stopwatch.StartNew();
using var api = AsposeLLMApi.Create(preset);
Console.WriteLine($"Create: {totalSw.Elapsed}");
var firstSw = System.Diagnostics.Stopwatch.StartNew();
string reply = await api.SendMessageAsync("Say hello.");
Console.WriteLine($"First message: {firstSw.Elapsed}");
var secondSw = System.Diagnostics.Stopwatch.StartNew();
string reply2 = await api.SendMessageAsync("Say hello again.");
Console.WriteLine($"Second message: {secondSw.Elapsed}");
The second message is noticeably faster than the first — the session is already warm.
Typical numbers (modern GPU)
| Stage | Time |
|---|---|
| Create (cold, first ever) | 2-10 min (download + load) |
| Create (cached) | 5-30 s |
| StartNewChatAsync | 50-200 ms |
| First token after a 100-token prompt | 200-500 ms |
| Subsequent tokens | 10-20 ms each |
CPU numbers are roughly 5-10× higher for each stage.
What’s next
- Architecture — what happens during Create.
- Context parameters — NBatch, flash attention.
- Offline deployment — skip the initial download at runtime.