Understand quantization
Quantization reduces the precision of model weights from the format the model is trained and released in (usually 16-bit floats: F16 or BF16) to fewer bits per value. Smaller weights mean smaller files, less memory, and faster inference, at some cost in output quality.
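To make the principle concrete, here is a minimal sketch of block quantization, loosely modeled on the Q8_0 layout (blocks of 32 values sharing one scale). It is an illustration of the idea only, not this library's actual storage format or kernels:

```csharp
using System;
using System.Linq;

// Illustration: quantize a block of 32 floats to int8 plus one shared scale, then
// dequantize. Real GGUF formats (Q8_0, Q4_K_M, ...) use similar block layouts with
// more elaborate scale/offset schemes.
float[] block = Enumerable.Range(0, 32).Select(i => MathF.Sin(i)).ToArray(); // stand-in weights

float scale = block.Max(MathF.Abs) / 127f;             // largest magnitude maps to [-127, 127]
sbyte[] q = block.Select(x => (sbyte)MathF.Round(x / scale)).ToArray();
float[] restored = q.Select(v => v * scale).ToArray(); // reconstruct with small rounding error

// Storage per block: 32 bytes of int8 + one scale, vs. 64 bytes at F16 (~2x smaller).
Console.WriteLine($"max error: {block.Zip(restored, (a, b) => MathF.Abs(a - b)).Max():F4}");
```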
The basic trade-off
| Precision | File size (relative) | Quality loss |
|---|---|---|
| F32 (32-bit float) | 2.0× | None (reference) |
| F16 (16-bit float) | 1.0× | Essentially none |
| BF16 (brain float 16) | 1.0× | Essentially none |
| Q8_0 (8-bit) | ~0.5× | Very small |
| Q6_K (6-bit) | ~0.38× | Small |
| Q5_K_M (5-bit medium) | ~0.33× | Small-to-moderate |
| Q4_K_M (4-bit medium) | ~0.27× | Moderate |
| Q4_0 (4-bit classic) | ~0.25× | Moderate |
| Q3_K (3-bit) | ~0.22× | Noticeable |
| Q2_K (2-bit) | ~0.18× | Large |
| IQ4_XS / IQ3_S (importance quant.) | ~0.23-0.30× | Lower than same-size Q variants |
| IQ2_XXS (very aggressive) | ~0.15× | Large |
Values are approximate; actual size depends on model architecture and specific quantizer.
Popular picks
- Q4_K_M — the default for most community-uploaded GGUFs. Good balance for 7B+ models.
- Q5_K_M — slightly bigger, slightly better quality. Worth it when you have memory headroom.
- Q8_0 — near-lossless. Use when you want the best quality short of F16 and can spare roughly twice the memory of Q4_K_M.
- IQ4_XS — aggressive importance quantization; often better quality-per-byte than Q4_0 for the same size.
- F16 — full precision. Useful for reproducibility or benchmarks; rarely worth the memory otherwise.
How to pick for your preset
Built-in presets already choose a quantization. To change it, set BaseModelSourceParameters.HuggingFaceFileName to a different file from the same repository:
```csharp
var preset = new Qwen25Preset();
// Default is Qwen2.5-7B-Instruct-Q4_K_M.gguf; switch to Q8_0:
preset.BaseModelSourceParameters.HuggingFaceFileName = "Qwen2.5-7B-Instruct-Q8_0.gguf";
```
Confirm the file exists in the repository (check the Hugging Face page).
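To confirm programmatically, one option is an HTTP HEAD request against the standard Hugging Face download URL (https://huggingface.co/{repo}/resolve/main/{file}). A minimal sketch; the repo id below is an assumption for illustration:

```csharp
using System;
using System.Net.Http;

const string repo = "Qwen/Qwen2.5-7B-Instruct-GGUF";   // assumed repo id for this example
const string file = "Qwen2.5-7B-Instruct-Q8_0.gguf";

using var http = new HttpClient();
using var request = new HttpRequestMessage(HttpMethod.Head,
    $"https://huggingface.co/{repo}/resolve/main/{file}");
using var response = await http.SendAsync(request);

// Non-success can also mean a gated repo requiring authentication, not just a missing file.
Console.WriteLine(response.IsSuccessStatusCode
    ? $"{file} exists in {repo}"
    : $"{file} not found (HTTP {(int)response.StatusCode})");
```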
Rough memory estimate
For a model with N parameters:
| Quantization | Bytes per parameter | 7B model | 70B model |
|---|---|---|---|
| F16 | 2.0 | ~14 GB | ~140 GB |
| Q8_0 | 1.0 | ~7 GB | ~70 GB |
| Q5_K_M | 0.625 | ~4.4 GB | ~44 GB |
| Q4_K_M | 0.5 | ~3.5 GB | ~35 GB |
| Q4_0 | 0.5 | ~3.5 GB | ~35 GB |
| IQ4_XS | 0.45 | ~3.2 GB | ~32 GB |
| Q3_K | 0.375 | ~2.6 GB | ~26 GB |
Add KV cache and intermediate buffers on top — see Estimate memory requirements.
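As a sanity check in code, the weights-only numbers above reduce to one multiplication. A small sketch mirroring the table (decimal GB; KV cache and buffers are extra):

```csharp
using System;
using System.Collections.Generic;

// Approximate bytes per parameter, from the table above.
var bytesPerParam = new Dictionary<string, double>
{
    ["F16"] = 2.0, ["Q8_0"] = 1.0, ["Q5_K_M"] = 0.625, ["Q4_K_M"] = 0.5,
    ["Q4_0"] = 0.5, ["IQ4_XS"] = 0.45, ["Q3_K"] = 0.375,
};

double paramsBillions = 7; // a 7B model
foreach (var (quant, bpp) in bytesPerParam)
    Console.WriteLine($"{quant,-7} ~{paramsBillions * bpp:F1} GB weights, before KV cache/buffers");
```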
When to pick a smaller quantization
- Memory is the binding constraint: the model does not fit at a higher-precision quantization but does fit at a lower one (see the sketch after this list).
- You are running many models and want to pack several into one machine.
- You accept some quality loss for speed — smaller quantizations run slightly faster due to reduced memory bandwidth pressure.
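The first point can be turned into a rough heuristic: walk the tiers from highest precision down and take the first whose weights fit the budget. An illustrative sketch, weights only, with an arbitrary ~10% headroom factor (real sizing should account for KV cache, see above):

```csharp
using System;

// Illustrative only: bytes/param mirror the table above; headroom factor is arbitrary.
static string PickQuant(double paramsBillions, double budgetGb)
{
    (string Name, double BytesPerParam)[] tiers =
    {
        ("Q8_0", 1.0), ("Q5_K_M", 0.625), ("Q4_K_M", 0.5), ("IQ4_XS", 0.45), ("Q3_K", 0.375),
    };
    foreach (var (name, bpp) in tiers)
        if (paramsBillions * bpp <= budgetGb * 0.9)   // keep ~10% for cache and buffers
            return name;
    return "nothing fits; consider a smaller model";
}

Console.WriteLine(PickQuant(7, 6));   // -> Q5_K_M (4.4 GB weights within a 6 GB budget)
Console.WriteLine(PickQuant(70, 40)); // -> Q4_K_M (35 GB weights within a 40 GB budget)
```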
When to avoid aggressive quantization
- Tasks sensitive to precise output (code generation, math, legal/medical reasoning).
- Long reasoning chains where errors compound.
- You have the memory to spare. Prefer Q5_K_M or Q8_0 when you can.
KV cache quantization
Separately from weight quantization, you can quantize the KV cache at runtime via ContextParameters.TypeK and TypeV:
```csharp
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;
```
Saves memory on long contexts with minor quality impact. See Context parameters for the full enum.
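To gauge the savings, the cache size is roughly 2 (K and V) × layers × context length × KV heads × head dimension × bytes per element. A sketch with illustrative architecture numbers (assumed here, roughly 7B-class with grouped-query attention; check your model's config):

```csharp
using System;

// KV cache ~= 2 (K and V) * layers * contextLen * kvHeads * headDim * bytesPerElement.
// The architecture numbers below are assumptions for illustration, not read from any model.
static double KvCacheGb(int layers, int kvHeads, int headDim, int contextLen, double bytesPerElem)
    => 2.0 * layers * contextLen * kvHeads * headDim * bytesPerElem / 1e9;

Console.WriteLine(KvCacheGb(28, 4, 128, 32768, 2.0)); // F16 K and V at 32k context: ~1.9 GB
Console.WriteLine(KvCacheGb(28, 4, 128, 32768, 1.0)); // Q8_0 K and V at 32k context: ~0.9 GB
```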
What’s next
- Model source parameters — how to select a specific GGUF file.
- Estimate memory requirements — factor quantization into your sizing.
- Supported presets — built-in preset quantizations.