UseMemoryMapping
UseMemoryMapping controls whether the engine memory-maps the GGUF file instead of reading it into RAM. Memory mapping lets the OS stream the model on demand, which cuts startup time and peak memory.
Quick reference

| Property | Value |
|---|---|
| Type | `bool?` |
| Default | `null` (native default — usually `true`) |
| Category | Model loading |
| Field on | `ModelInferenceParameters.UseMemoryMapping` |
What it does
- `true` (the usual native default) — the OS maps the GGUF file into address space, and pages are brought into memory on first access. Startup is fast; peak memory is bounded by the working set.
- `false` — the engine reads the full file into RAM before model init. Startup is slower, and peak memory roughly doubles during load (read buffer + allocation).
- `null` — defer to the native default, which behaves as `true` on most platforms.
Memory mapping is preferred unless your filesystem does not support mmap (e.g., some network filesystems or container volume drivers).
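If you are unsure whether a given volume supports mmap, one option is to probe the model file with .NET's built-in `System.IO.MemoryMappedFiles` API before loading, and fall back to `UseMemoryMapping = false` when the probe fails. This is a sketch, not part of the library's API:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static bool CanMemoryMap(string path)
{
    try
    {
        // Try to map the file read-only and touch one byte; failure
        // suggests the filesystem (e.g., some NFS mounts or container
        // volume drivers) does not support mmap for this path.
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, mapName: null, capacity: 0,
            MemoryMappedFileAccess.Read);
        using var view = mmf.CreateViewAccessor(
            0, 1, MemoryMappedFileAccess.Read);
        view.ReadByte(0);
        return true;
    }
    catch (IOException) { return false; }
    catch (UnauthorizedAccessException) { return false; }
}
```

A caller could then set `preset.BaseModelInferenceParameters.UseMemoryMapping = CanMemoryMap(modelPath);` before loading.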
When to change it
| Scenario | Value |
|---|---|
| Default (recommended) | null |
| Network filesystem without mmap support | false |
| Need full model loaded into RAM upfront | false |
| Debugging mmap-specific issues | false |
Example
```csharp
var preset = new Qwen25Preset();
preset.BaseModelInferenceParameters.UseMemoryMapping = true; // default, shown explicitly
```
For an NFS-mounted model:
```csharp
preset.BaseModelInferenceParameters.UseMemoryMapping = false;
// Loading takes longer, but works on filesystems where mmap fails.
```
Interactions
- `UseMemoryLocking` — lock the working set to prevent paging.
- `GpuLayers` — offloaded layers are copied from the mapped file to GPU memory.
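A common pairing is mapping plus locking, so mapped pages stay resident under memory pressure. A minimal sketch using the same preset API as the examples above (platform support for locking varies):

```csharp
var preset = new Qwen25Preset();
// Map the GGUF file on demand...
preset.BaseModelInferenceParameters.UseMemoryMapping = true;
// ...and lock resident pages so the OS does not page them back out.
preset.BaseModelInferenceParameters.UseMemoryLocking = true;
```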
What’s next
- UseMemoryLocking — prevent paging.
- GpuLayers — GPU offload.
- Model inference hub — all inference knobs.