Estimate memory requirements

Four things claim memory when the SDK runs:

  • Model weights.
  • KV cache.
  • Vision projector (vision presets only).
  • Intermediate buffers and sampler state.

This how-to helps you predict the total before deployment.

Rule-of-thumb sizes

Component              Approximate size
Model weights          parameters × bytes_per_parameter. For a 7B Q4_K_M model, ~3.5 GB.
KV cache               layers × kv_heads × head_dim × context × 2 × bytes_per_kv. Actual numbers below.
Vision projector       200 MB – 2 GB.
Intermediate buffers   50 MB – 500 MB.

Step 1. Weights from quantization

See Understand quantization for the per-parameter bytes table.

Rough: weights_bytes ≈ parameters × bytes_per_param.

Parameters   Q4_K_M    Q8_0      F16
3B           ~1.8 GB   ~3.2 GB   ~6 GB
7B           ~3.5 GB   ~7 GB     ~14 GB
8B           ~4 GB     ~8 GB     ~16 GB
20B          ~11 GB    ~21 GB    ~40 GB
70B          ~35 GB    ~70 GB    ~140 GB
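
To sanity-check these numbers, multiply the parameter count by an approximate bytes-per-parameter for the chosen quantization. The sketch below is a back-of-envelope calculation; the bits-per-weight figures are rough averages and vary by model, so the GGUF file size on disk remains the most reliable number.

# Back-of-envelope weight sizes. Bits-per-weight values are rough averages
# for each quantization; the GGUF file on disk is the authoritative size.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16}

def weights_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB: parameters x bytes_per_param."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"7B Q4_K_M: ~{weights_gb(7, 'Q4_K_M'):.1f} GB")   # ~3.9 GB
print(f"70B F16:   ~{weights_gb(70, 'F16'):.0f} GB")     # 140 GB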

Step 2. KV cache

Depends on model architecture (number of layers, heads, head dimension), context size, and KV dtype. The underlying formula is complex; below are empirical numbers for common presets at their default contexts:

Preset                         KV at default context (F16)   KV at default context (Q8_0 V)
Llama32Preset (131K)           ~8 GB                         ~5 GB
Qwen25Preset (32K)             ~2 GB                         ~1.3 GB
Qwen3Preset (32K)              ~2 GB                         ~1.3 GB
DeepseekR1Qwen3Preset (131K)   ~6 GB                         ~4 GB
Oss20Preset (131K)             ~10 GB                        ~6 GB
Phi4Preset (16K)               ~0.8 GB                       ~0.5 GB
Qwen3VL2BPreset (262K)         ~12 GB                        ~7 GB

Scales roughly linearly with actual session length. A 32K-capable preset at only 4K of actual context uses ~1/8th of the listed KV.
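
If you know the model's architecture numbers, the per-token formula from the rule-of-thumb table gives a usable estimate directly. A minimal sketch, assuming a GQA-style model where each layer caches one K and one V tensor of kv_heads × head_dim per token; the architecture values used here are illustrative for a typical 7B-class GQA model, not read from any preset, so check the model metadata before relying on the output.

# Rough KV-cache estimate: per token, each layer stores a K and a V tensor of
# kv_heads x head_dim elements in the chosen KV dtype.
BYTES_PER_ELEM = {"F16": 2.0, "Q8_0": 8.5 / 8}  # approximate

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, context: int,
                k_dtype: str = "F16", v_dtype: str = "F16") -> float:
    """Approximate KV-cache size in GB at a given context length."""
    per_token = layers * kv_heads * head_dim * (
        BYTES_PER_ELEM[k_dtype] + BYTES_PER_ELEM[v_dtype])
    return per_token * context / 1e9

# Illustrative values for a 7B-class GQA model; read the real ones from model metadata.
print(kv_cache_gb(28, 4, 128, 32_768))                   # ~1.9 GB, F16 K and V
print(kv_cache_gb(28, 4, 128, 32_768, v_dtype="Q8_0"))   # ~1.4 GB, Q8_0 V
print(kv_cache_gb(28, 4, 128, 4_096))                    # ~0.2 GB at 4K of actual context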

Step 3. Vision projector (if applicable)

Projector quantization   Typical size
F16                      800 MB – 2 GB
Q8_0                     500 MB – 1 GB
Q4_K_M                   250 MB – 500 MB

Each vision preset declares its mmproj file in MmprojSourceParameters — see Supported presets.
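
If the mmproj file is already on disk, its size is a reasonable lower bound on the projector's weight memory (image-encoding scratch buffers come on top). A minimal check, with a hypothetical file path:

import os

# Hypothetical path -- substitute the mmproj file your preset actually uses.
mmproj_path = "models/mmproj-model-f16.gguf"

print(f"Projector weights on disk: ~{os.path.getsize(mmproj_path) / 1e9:.2f} GB")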

Step 4. Add overhead

Sampler state, tokenizer, scratch buffers: 50 MB – 500 MB. Depends on batch size and context length.

For a conservative budget, add 500 MB on top of weights + KV + projector.
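
Putting the steps together, the whole budget reduces to a single sum. A small helper sketch, with the 500 MB overhead figure from above baked in as the default:

def memory_budget_gb(weights_gb: float, kv_gb: float,
                     projector_gb: float = 0.0, overhead_gb: float = 0.5) -> float:
    """Conservative total: weights + KV cache + projector + fixed overhead."""
    return weights_gb + kv_gb + projector_gb + overhead_gb

# Matches the first worked example below: 7B Q4_K_M weights plus 32K F16 KV.
print(memory_budget_gb(weights_gb=3.5, kv_gb=2.0))   # 6.0 GB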

Worked examples

Qwen25Preset on a 12 GB GPU

  • Weights (7B Q4_K_M): 3.5 GB
  • KV at 32K F16: 2 GB
  • Overhead: 0.5 GB
  • Total: ~6 GB

Comfortable fit; headroom for longer sessions or a higher-precision KV dtype.

Qwen3Preset at full 32K on a 16 GB GPU

  • Weights (8B Q4_K_M): 4 GB
  • KV at 32K F16: 2 GB
  • Overhead: 0.5 GB
  • Total: ~6.5 GB

Fits with room for growth.

Oss20Preset at full 131K on a 24 GB GPU

  • Weights (20B Q4_K_M): 11 GB
  • KV at 131K F16: 10 GB
  • Overhead: 0.5 GB
  • Total: ~21.5 GB

Fits a 24 GB card tightly. To leave more headroom:

  • Quantize V cache: KV drops to ~6 GB. Total ~17.5 GB. Comfortable.
  • Shorten context to 32K: KV drops to ~2.5 GB. Total ~14 GB.

Ministral3VisionPreset on a 16 GB GPU

  • Base weights (8B Q4_K_M): 4 GB
  • Projector (BF16): ~2 GB
  • KV at 32K F16 (shortened from 262K default): ~2 GB
  • Overhead: 0.5 GB
  • Total: ~8.5 GB

Shortening context is the easy win here.

Shrinking memory

In order of quality impact (least to most):

  1. Shorten ContextSize to what you actually use.
  2. Quantize V cache (TypeV = Q8_0).
  3. Enable flash attention — reduces attention scratch-buffer memory at long contexts.
  4. Quantize K cache (TypeK = Q8_0) — larger quality impact than V.
  5. Use a smaller preset — last resort when the model itself is too large.

See Low-memory tuning for the full recipe.

Measuring actual usage

After Create and a few messages, check real memory:

# Linux
nvidia-smi      # VRAM per GPU
top / htop      # system RAM

# Windows
# Task Manager → Performance → GPU / Memory

The number you read includes the OS page cache for memory-mapped files; some of it is reclaimable under pressure. Still, treat the reading as a ceiling estimate.
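
The same check can be scripted. A minimal sketch for NVIDIA GPUs, assuming nvidia-smi is on the PATH:

import subprocess

# Query used VRAM per GPU (reported in MiB by nvidia-smi).
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True).stdout
for i, mib in enumerate(out.strip().splitlines()):
    print(f"GPU {i}: {int(mib) / 1024:.1f} GiB used")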

What’s next