Low memory tuning
When memory is tight — edge devices, small VMs, constrained containers, laptops with 16 GB RAM — pick a small preset and tune several levers together. This use case walks through the moves that matter most.
When to use this pattern
- Deploying on a laptop with 8-16 GB of RAM.
- Shared containers with a 4-8 GB RAM cap.
- Edge devices (NUCs, small ARM boards) with limited memory.
- Sharing a GPU with other processes, leaving only a few GB of VRAM free.
Prerequisites
Pick a small model
Start with the smallest preset that meets your quality bar:
| Preset | Model size | Memory footprint (Q4 + default context) |
|---|---|---|
| `Llama32Preset` | 3B | ~6-8 GB |
| `Phi4Preset` | Mini | ~4-6 GB |
| `Qwen25VL3BPreset` | 3B VL | ~4-6 GB + projector |
| `Qwen3VL2BPreset` | 2B VL | ~3-5 GB + projector |
Larger presets (7B, 8B, 20B) push past 8 GB — avoid them when memory is scarce.
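If you want to fail fast when the host cannot fit the preset you picked, a small pre-flight check against the footprints in the table is enough. This is a sketch, not an Aspose.LLM feature; the threshold is simply the upper end of the `Llama32Preset` row:

```csharp
// Compare the runtime-reported memory budget (container-limit aware on modern
// .NET) against the rough footprint from the table. The 8 GB threshold is
// illustrative, taken from the upper end of the ~6-8 GB Llama32Preset row.
long budget = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes;
const long llama32UpperBound = 8L * 1024 * 1024 * 1024;

if (budget < llama32UpperBound)
    Console.WriteLine("Tight fit for Llama32Preset; consider Phi4Preset or Qwen3VL2BPreset.");
```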
Shrink context size
Built-in presets often declare very long contexts. Shorten to what you actually need:
var preset = new Llama32Preset();
preset.ContextParameters.ContextSize = 4096; // default is 131072
The KV cache scales linearly with context size, so a shorter context directly reduces memory.
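To get a feel for the scaling, here is a back-of-envelope estimate of KV-cache size. The architecture numbers (28 layers, 8 KV heads, head dimension 128) are assumed Llama 3.2 3B-class values, not values read from the preset; substitute your model's actual dimensions:

```csharp
// KV cache = 2 tensors (K and V) per layer, kvHeads * headDim values per
// token, bytesPerValue each. Architecture numbers below are assumptions for
// a Llama 3.2 3B-class model, not values exposed by the preset.
const int layers = 28, kvHeads = 8, headDim = 128;
const double bytesPerValue = 2.0; // F16 for both K and V

double KvCacheGb(int contextSize) =>
    2.0 * layers * kvHeads * headDim * bytesPerValue * contextSize
    / (1024.0 * 1024 * 1024);

Console.WriteLine($"131072-token default context: {KvCacheGb(131072):F1} GB"); // ~14 GB
Console.WriteLine($"4096-token context:           {KvCacheGb(4096):F2} GB");   // ~0.44 GB
```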
Quantize the KV cache
using Aspose.LLM.Abstractions.Models;
preset.ContextParameters.TypeK = GgmlType.F16; // keep K precise
preset.ContextParameters.TypeV = GgmlType.Q8_0; // V to 8-bit
This halves V-cache memory with minor quality impact. For even tighter budgets:
preset.ContextParameters.TypeV = GgmlType.Q4_0; // aggressive
Be aware: aggressive KV quantization degrades output quality, especially over long contexts. Validate on your workload.
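For a sense of the savings, the sketch below compares V-cache size at the three element types for a 4096-token context. The per-value sizes come from the GGML block formats (34 bytes per 32 values for Q8_0, 18 bytes for Q4_0) and should be treated as approximations; the per-token dimensions reuse the assumed 3B-class numbers from the estimate above:

```csharp
// Approximate bytes per stored value for common GGML types (32-value blocks
// plus a small scale); these are approximations, not constants from Aspose.LLM.
double f16  = 2.0;
double q8_0 = 34.0 / 32; // ~1.06 bytes/value
double q4_0 = 18.0 / 32; // ~0.56 bytes/value

double valuesPerToken = 28 * 8 * 128; // layers * KV heads * head dim (assumed)
double contextSize = 4096;

Console.WriteLine($"F16  V-cache: {valuesPerToken * contextSize * f16  / (1 << 20):F0} MB"); // ~224 MB
Console.WriteLine($"Q8_0 V-cache: {valuesPerToken * contextSize * q8_0 / (1 << 20):F0} MB"); // ~119 MB
Console.WriteLine($"Q4_0 V-cache: {valuesPerToken * contextSize * q4_0 / (1 << 20):F0} MB"); // ~63 MB
```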
Enable memory mapping
UseMemoryMapping = true (the default) lets the OS page model weights from disk on demand rather than copying them all into RAM at load time. Keep this enabled:
preset.BaseModelInferenceParameters.UseMemoryMapping = true;
On very tight memory, disable memory locking (if enabled):
preset.BaseModelInferenceParameters.UseMemoryLocking = false; // explicit
Enable flash attention
Flash attention avoids materializing the full attention matrix, so it reduces peak memory during generation even at short contexts:
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
Drop MaxTokens
Cap response length so the model does not generate long outputs that extend the KV cache needlessly:
preset.ChatParameters.MaxTokens = 256;
Full minimal-memory example
using Aspose.LLM;
using Aspose.LLM.Abstractions.Acceleration;
using Aspose.LLM.Abstractions.Models;
using Aspose.LLM.Abstractions.Parameters.Presets;
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");
var preset = new Llama32Preset();
// CPU-only to avoid GPU memory.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX2;
preset.BaseModelInferenceParameters.GpuLayers = 0;
preset.BaseModelInferenceParameters.UseMemoryMapping = true;
// Short context.
preset.ContextParameters.ContextSize = 4096;
// Moderate KV quantization (K at F16, V at Q8_0).
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
// Flash attention.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
// Short response cap.
preset.ChatParameters.MaxTokens = 256;
// Conservative threading.
preset.ContextParameters.NThreads = 2;
preset.ContextParameters.NThreadsBatch = 4;
using var api = AsposeLLMApi.Create(preset);
string reply = await api.SendMessageAsync("Summarize cloud computing in one sentence.");
Console.WriteLine(reply);
Expected footprint: 4-5 GB RAM, fits comfortably in an 8 GB machine.
Monitor actual usage
Check real memory usage after load:
- Windows: Task Manager → Process details → Memory.
- Linux: `top`, `htop`, or `/proc/<pid>/status`.
- macOS: Activity Monitor → Memory.
The real number includes OS page cache of memory-mapped model files — it may look high but most of it is reclaimable under memory pressure.
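To capture the same numbers programmatically, for example to log them right after model load, the standard System.Diagnostics API is enough; nothing here is specific to Aspose.LLM:

```csharp
using System.Diagnostics;

// Working set = physical RAM currently in use (includes pages of the
// memory-mapped model file); private bytes exclude shareable mappings and
// are closer to the memory only this process can release.
var proc = Process.GetCurrentProcess();
proc.Refresh();
Console.WriteLine($"Working set:   {proc.WorkingSet64 / (1024.0 * 1024):F0} MB");
Console.WriteLine($"Private bytes: {proc.PrivateMemorySize64 / (1024.0 * 1024):F0} MB");
Console.WriteLine($"Managed heap:  {GC.GetTotalMemory(false) / (1024.0 * 1024):F0} MB");
```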
Trade-offs
| What you cut | Pain |
|---|---|
| Context size | Less room for long prompts; no long-form analysis. |
| KV quantization to Q8_0 | Minor quality drop on long contexts. |
| KV quantization to Q4_0 | Noticeable quality drop. |
| `MaxTokens` | Truncated answers; bad for reasoning models (see Chat parameters). |
| Small model | Lower reasoning quality. |
| CPU-only | Slower throughput (5-15 t/s vs 40-100+ on GPU). |
Start by cutting context and enabling KV quantization. Only if memory is still tight should you drop to an even smaller model.
What’s next
- Context parameters — every knob that affects memory.
- CPU-only deployment — when GPU memory is unavailable.
- Long context tuning — the opposite direction.