Low memory tuning

When memory is tight (edge devices, small VMs, constrained containers, laptops with 8-16 GB of RAM), pick a small preset and tune several levers together. This use case walks through the moves that matter most.

When to use this pattern

  • Deploying on a laptop with 8-16 GB of RAM.
  • Shared containers with a 4-8 GB RAM cap.
  • Edge devices (NUCs, small ARM boards) with limited memory.
  • Sharing a GPU with other processes, leaving only a few GB of VRAM free.

Prerequisites

Pick a small model

Start with the smallest preset that meets your quality bar:

Preset             Model size   Memory footprint (Q4 + default context)
Llama32Preset      3B           ~6-8 GB
Phi4Preset         Mini         ~4-6 GB
Qwen25VL3BPreset   3B VL        ~4-6 GB + projector
Qwen3VL2BPreset    2B VL        ~3-5 GB + projector

Larger presets (7B, 8B, 20B) push past 8 GB — avoid them when memory is scarce.
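For example, start with the smallest preset in the table and step up only if output quality falls short (class names as listed above):

// Smallest preset from the table above; swap in Phi4Preset or
// Llama32Preset if output quality misses your bar.
var preset = new Qwen3VL2BPreset();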

Shrink context size

Built-in presets often declare very long contexts. Shorten to what you actually need:

var preset = new Llama32Preset();
preset.ContextParameters.ContextSize = 4096; // default is 131072

The KV cache grows linearly with context length, so a shorter context directly reduces memory.
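As a rough sanity check, you can estimate the cache yourself. A sketch only: the shape numbers are assumptions for a Llama-3.2-3B-class model (28 layers, 8 KV heads, head dimension 128), not values read from the preset:

// Back-of-envelope KV cache size. The shape constants below are assumed,
// not queried from the preset.
const int layers = 28, kvHeads = 8, headDim = 128;
const int bytesPerElem = 2; // F16

long KvBytes(int contextSize) =>
    2L * layers * kvHeads * headDim * bytesPerElem * contextSize; // K + V

Console.WriteLine($"ctx 131072: {KvBytes(131072) / (1024.0 * 1024 * 1024):F1} GB"); // ~14.0 GB
Console.WriteLine($"ctx   4096: {KvBytes(4096) / (1024.0 * 1024):F0} MB");          // ~448 MB

Under these assumptions a 131072-token context costs about 14 GB of F16 KV cache, while 4096 tokens costs about 448 MB, which is why this is usually the first lever to pull.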

Quantize the KV cache

using Aspose.LLM.Abstractions.Models;

preset.ContextParameters.TypeK = GgmlType.F16;    // keep K precise
preset.ContextParameters.TypeV = GgmlType.Q8_0;   // V to 8-bit

Relative to F16, this roughly halves V-cache memory with minor quality impact. For even tighter budgets:

preset.ContextParameters.TypeV = GgmlType.Q4_0;   // aggressive

Be aware: aggressive KV quantization degrades output quality, especially over long contexts. Validate on your own workload.
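One way to validate is a quick side-by-side run. A sketch that uses only calls shown elsewhere on this page; swap in prompts from your real workload:

// A/B check: same prompt under each V-cache type, compare outputs by eye
// or with your own metric.
foreach (var typeV in new[] { GgmlType.Q8_0, GgmlType.Q4_0 })
{
    var p = new Llama32Preset();
    p.ContextParameters.ContextSize = 4096;
    p.ContextParameters.TypeV = typeV;
    using var api = AsposeLLMApi.Create(p);
    Console.WriteLine($"{typeV}: {await api.SendMessageAsync("Summarize cloud computing in one sentence.")}");
}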

Enable memory mapping

UseMemoryMapping = true (the default) lets the OS page model weights from disk on demand rather than copying them all into RAM at load time. Keep this enabled:

preset.BaseModelInferenceParameters.UseMemoryMapping = true;

When memory is very tight, also disable memory locking if it was enabled; locked pages cannot be paged back out, which works against memory mapping on constrained machines:

preset.BaseModelInferenceParameters.UseMemoryLocking = false; // explicit

Enable flash attention

Even on short contexts, flash attention reduces memory during generation:

preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

Drop MaxTokens

Cap response length so the model does not generate long outputs that extend the KV cache needlessly:

preset.ChatParameters.MaxTokens = 256;

Full minimal-memory example

using Aspose.LLM;
using Aspose.LLM.Abstractions.Acceleration;
using Aspose.LLM.Abstractions.Models;
using Aspose.LLM.Abstractions.Parameters.Presets;

var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");

var preset = new Llama32Preset();

// CPU-only to avoid GPU memory.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.AVX2;
preset.BaseModelInferenceParameters.GpuLayers = 0;
preset.BaseModelInferenceParameters.UseMemoryMapping = true;

// Short context.
preset.ContextParameters.ContextSize = 4096;

// Aggressive KV quantization.
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;

// Flash attention.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

// Short response cap.
preset.ChatParameters.MaxTokens = 256;

// Conservative threading.
preset.ContextParameters.NThreads = 2;
preset.ContextParameters.NThreadsBatch = 4;

using var api = AsposeLLMApi.Create(preset);

string reply = await api.SendMessageAsync("Summarize cloud computing in one sentence.");
Console.WriteLine(reply);

Expected footprint: roughly 4-5 GB of RAM, which fits comfortably on an 8 GB machine.
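If you are instead sharing a GPU with a few GB of VRAM free (the last case in the list at the top), partial offload is a middle ground. A sketch only: AccelerationType.Cuda is an assumed member name, so check the enum for the actual GPU option before using it:

// Hypothetical partial offload: most layers stay on CPU, a few move to GPU.
// AccelerationType.Cuda is an assumption; verify the real enum member.
preset.BinaryManagerParameters.PreferredAcceleration = AccelerationType.Cuda;
preset.BaseModelInferenceParameters.GpuLayers = 10; // raise or lower to fit free VRAM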

Monitor actual usage

Check real memory usage after load:

  • Windows: Task Manager → Process details → Memory.
  • Linux: top, htop, or /proc/<pid>/status.
  • macOS: Activity Monitor → memory.

The reported number includes the OS page cache backing the memory-mapped model file; it may look high, but most of it is reclaimable under memory pressure.
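You can also log usage from inside the process with standard .NET counters, independent of Aspose.LLM. The working set counts resident mapped pages while private bytes exclude them, so the gap between the two is roughly the reclaimable, memory-mapped portion:

using System.Diagnostics;

var proc = Process.GetCurrentProcess();
proc.Refresh(); // refresh cached counters after the model has loaded
Console.WriteLine($"Working set:   {proc.WorkingSet64 / (1024.0 * 1024):F0} MB");
Console.WriteLine($"Private bytes: {proc.PrivateMemorySize64 / (1024.0 * 1024):F0} MB");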

Trade-offs

What you cut              Pain
Context size              Less room for long prompts; no long-form analysis.
KV quantization to Q8_0   Minor quality drop on long contexts.
KV quantization to Q4_0   Noticeable quality drop.
MaxTokens                 Truncated answers; bad for reasoning models (see Chat parameters).
Small model               Lower reasoning quality.
CPU-only                  Slower throughput (5-15 t/s vs 40-100+ on GPU).

Start by cutting context and enabling KV quantization. Only if memory is still tight should you drop to an even smaller model.

What’s next