NBatch
NBatch is the logical maximum batch size — the upper bound on the number of tokens submitted in one call to the native llama_decode function. Larger batch sizes speed up prompt processing at the cost of more temporary memory.
Quick reference
| Property | Value |
|---|---|
| Type | uint? |
| Default | null (native default, typically 2048) |
| Range | 512 – 8192 typical; power-of-two values recommended |
| Category | Context size and batching |
| Field on | ContextParameters.NBatch |
What it does
When the engine processes a prompt (system message + conversation history + new user turn), it feeds tokens to the model in batches. NBatch caps the largest batch sent in one call.
- Smaller NBatch (512) — lower memory footprint, slower prompt processing.
- Larger NBatch (4096, 8192) — faster prompt processing, more temporary memory.
NBatch affects prompt processing time, not generation throughput. Once the first output token is produced, subsequent tokens come one at a time regardless of batch size.
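In practice, the number of llama_decode calls needed to ingest a prompt is the prompt length divided by NBatch, rounded up. A minimal sketch of that arithmetic (the token count here is illustrative, not from the library):

// Illustrative only: how NBatch splits prompt ingestion into decode calls.
int promptTokens = 6000; // hypothetical prompt length
int nBatch = 2048;       // the typical native default

int decodeCalls = (promptTokens + nBatch - 1) / nBatch; // ceiling division = 3
Console.WriteLine($"Prompt ingested in {decodeCalls} llama_decode calls.");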
When to change it
| Scenario | Value |
|---|---|
| Default | null (use native default) |
| Fast prompt processing, ample memory | 4096 |
| Memory-constrained | 512 or 1024 |
| Very long prompts (summarization, long context) | 4096 – 8192 |
Built-in presets set NBatch based on the model’s needs — Qwen25Preset uses 3072, Llama32Preset uses 2048, vision presets often use 4096.
Example
var preset = new Qwen25Preset();
preset.ContextParameters.NBatch = 4096;  // logical batch size: faster prompt processing
preset.ContextParameters.NUbatch = 4096; // physical batch size kept equal to NBatch
using var api = AsposeLLMApi.Create(preset);
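Conversely, a memory-constrained configuration lowers both values. A sketch along the same lines (the preset choice and values are illustrative, taken from the table above):

var lowMem = new Llama32Preset();
lowMem.ContextParameters.NBatch = 512;  // smaller batches: lower temporary memory
lowMem.ContextParameters.NUbatch = 512; // keep the physical batch within the logical cap
using var lowMemApi = AsposeLLMApi.Create(lowMem);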
Interactions
- NUbatch — physical batch size; typically set equal to or less than NBatch.
- ContextSize — NBatch should not exceed ContextSize (see the sketch below).
- NThreadsBatch — threads that process the batch.
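A minimal sketch of keeping these settings consistent, assuming ContextSize is also a uint? field on ContextParameters as the list above suggests (the helper itself is hypothetical, not a library API):

static void ClampBatchSettings(ContextParameters p)
{
    // NBatch should not exceed ContextSize.
    if (p.NBatch.HasValue && p.ContextSize.HasValue && p.NBatch > p.ContextSize)
        p.NBatch = p.ContextSize;

    // NUbatch (physical) should not exceed NBatch (logical).
    if (p.NUbatch.HasValue && p.NBatch.HasValue && p.NUbatch > p.NBatch)
        p.NUbatch = p.NBatch;
}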
What’s next
- NUbatch — physical batch size.
- NThreadsBatch — prompt-processing threads.
- Reduce first-token latency — batch size’s role in TTFT (time to first token).