NThreadsBatch
NThreadsBatch is the number of CPU threads the engine uses during prompt processing (the initial prefill phase). Prompt processing is embarrassingly parallel and benefits from using most or all available cores.
Quick reference
| Property | Value |
|---|---|
| Type | int? |
| Default | null (falls back to EngineParameters.DefaultThreads) |
| Range | 1 and above |
| Category | Threading |
| Field on | ContextParameters.NThreadsBatch |
What it does
During prompt processing, the engine runs matrix multiplications over many tokens at once. This workload parallelizes well: more threads directly translate to higher throughput, up to memory-bandwidth limits.
- NThreadsBatch = ProcessorCount — typical. Use all cores for fast prompt ingestion.
- NThreadsBatch = half ProcessorCount — leave room for other CPU workloads.
- NThreadsBatch < NThreads — unusual, almost always wrong for modern CPUs.
Prompt processing happens once per incoming message (on user input); generation (governed by NThreads) happens once per output token. For chat, prompt-processing time dominates when the prompt is long; generation time dominates when the output is long.
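To see which phase dominates a turn, a rough back-of-envelope calculation helps. The sketch below uses placeholder throughput numbers (they are illustrative assumptions, not measurements); substitute figures from your own hardware.
// Illustrative back-of-envelope arithmetic: which phase dominates a chat turn.
// All numbers below are placeholders, not measurements from real hardware.
double prefillTokensPerSec = 400;  // prompt processing; scales with NThreadsBatch
double genTokensPerSec = 25;       // generation; scales with NThreads (up to a point)

int promptTokens = 4000;           // long prompt (documents, chat history)
int outputTokens = 200;            // short reply

double prefillSeconds = promptTokens / prefillTokensPerSec; // 10.0 s before the first token
double genSeconds = outputTokens / genTokensPerSec;         // 8.0 s of streaming afterwards

Console.WriteLine($"Prefill: {prefillSeconds:F1} s, generation: {genSeconds:F1} s");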
When to change it
| Scenario | Value |
|---|---|
| Default | null (use DefaultThreads) |
| Dedicated inference machine | ProcessorCount (all cores) |
| Shared machine | ProcessorCount / 2 |
| Very long prompts, memory-bound hardware | Benchmark — adding threads may not help past 16 |
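On memory-bound hardware the only reliable answer is a quick measurement. Below is a minimal sketch of that benchmark, assuming the Qwen25Preset and ContextParameters shown in the Example section; RunPrefill is a hypothetical placeholder for whatever call feeds a representative long prompt to the engine in your application, not a library API.
using System.Diagnostics;

// Hypothetical benchmark loop: time prompt ingestion at several thread counts
// and keep the fastest.
foreach (int threads in new[] { 8, 16, 24, Environment.ProcessorCount })
{
    var preset = new Qwen25Preset();
    preset.ContextParameters.NThreadsBatch = threads;

    var sw = Stopwatch.StartNew();
    RunPrefill(preset);            // hypothetical stand-in for your prefill call
    sw.Stop();

    Console.WriteLine($"{threads} threads: {sw.ElapsedMilliseconds} ms");
}

void RunPrefill(Qwen25Preset preset)
{
    // Placeholder: run your engine's prompt-processing path here.
}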
Example
var preset = new Qwen25Preset();
preset.ContextParameters.NThreads = 8; // generation
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount; // prompt prefill
Interactions
- NThreads — generation threads; typically different from NThreadsBatch.
- NBatch — larger batch sizes better utilize a high NThreadsBatch.
- EngineParameters.DefaultThreads — fallback when NThreadsBatch is null.
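A short sketch of the fallback and the NBatch pairing. It assumes NBatch is a sibling field on ContextParameters; check the NBatch page for its exact location and default.
var preset = new Qwen25Preset();

// Leaving NThreadsBatch at null lets the engine fall back to
// EngineParameters.DefaultThreads for prompt processing.

// When set explicitly, pair it with a batch size large enough to keep the
// extra threads busy. Assumption: NBatch lives on ContextParameters.
preset.ContextParameters.NThreadsBatch = Environment.ProcessorCount;
preset.ContextParameters.NBatch = 512;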
What’s next
- NThreads — generation-phase threads.
- NBatch — batch size.
- Reduce first-token latency — prompt-processing throughput’s role in TTFT.