NBatch
NBatch is the logical maximum batch size — the upper bound on the number of tokens submitted in one call to the native llama_decode function. Larger batch sizes speed up prompt processing at the cost of more temporary memory.
Quick reference
| Property | Value |
|---|---|
| Type | uint? |
| Default | null (native default, typically 2048) |
| Range | 512 – 8192 typical; power-of-two values recommended |
| Category | Context size and batching |
| Field on | ContextParameters.NBatch |
What it does
When the engine processes a prompt (system message + conversation history + new user turn), it feeds tokens to the model in batches. NBatch caps the largest batch sent in one call.
- Smaller NBatch (512) — lower memory footprint, slower prompt processing.
- Larger NBatch (4096, 8192) — faster prompt processing, more temporary memory.
NBatch affects prompt processing time, not generation throughput. Once the first output token is produced, subsequent tokens come one at a time regardless of batch size.
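In practice, the number of llama_decode calls needed to ingest a prompt is the prompt length divided by NBatch, rounded up. A minimal sketch of that arithmetic (the token count here is illustrative, not from the library):

// Illustrative only: how NBatch splits prompt ingestion into decode calls.
int promptTokens = 6000; // hypothetical prompt length
int nBatch = 2048;       // the typical native default

int decodeCalls = (promptTokens + nBatch - 1) / nBatch; // ceiling division = 3
Console.WriteLine($"Prompt ingested in {decodeCalls} llama_decode calls.");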
When to change it
| Scenario | Value |
|---|---|
| Default | null (use native default) |
| Fast prompt processing, ample memory | 4096 |
| Memory-constrained | 512 or 1024 |
| Very long prompts (summarization, long context) | 4096 – 8192 |
Built-in presets set NBatch based on the model’s needs — Qwen25Preset uses 3072, Llama32Preset uses 2048, vision presets often use 4096.
Example
var preset = new Qwen25Preset();
preset.ContextParameters.NBatch = 4096;  // logical batch size: faster prompt processing
preset.ContextParameters.NUbatch = 4096; // physical batch size kept equal to NBatch
using var api = AsposeLLMApi.Create(preset);
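Conversely, a memory-constrained configuration lowers both values. A sketch along the same lines (the preset choice and values are illustrative, taken from the table above):

var lowMem = new Llama32Preset();
lowMem.ContextParameters.NBatch = 512;  // smaller batches: lower temporary memory
lowMem.ContextParameters.NUbatch = 512; // keep the physical batch within the logical cap
using var lowMemApi = AsposeLLMApi.Create(lowMem);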
Interactions
- NUbatch — physical batch size; typically set equal to or less than NBatch.
- ContextSize — NBatch should not exceed ContextSize (see the sketch below).
- NThreadsBatch — threads that process the batch.
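A minimal sketch of keeping these settings consistent, assuming ContextSize is also a uint? field on ContextParameters as the list above suggests (the helper itself is hypothetical, not a library API):

static void ClampBatchSettings(ContextParameters p)
{
    // NBatch should not exceed ContextSize.
    if (p.NBatch.HasValue && p.ContextSize.HasValue && p.NBatch > p.ContextSize)
        p.NBatch = p.ContextSize;

    // NUbatch (physical) should not exceed NBatch (logical).
    if (p.NUbatch.HasValue && p.NBatch.HasValue && p.NUbatch > p.NBatch)
        p.NUbatch = p.NBatch;
}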
What’s next
- NUbatch — physical batch size.
- NThreadsBatch — prompt-processing threads.
- Reduce first-token latency — batch size’s role in TTFT (time to first token).