Vision presets
The SDK ships four built-in vision presets. Each preset configures both the base language model and its multimodal projector (mmproj); the two files load together on the first call to Create.
Available presets
| Preset | Model | Base context | Quantization | mmproj file |
|---|---|---|---|---|
| Qwen25VL3BPreset | Qwen 2.5 VL 3B Instruct | 128 000 | UD-IQ2_XXS | mmproj-F16.gguf |
| Qwen3VL2BPreset | Qwen 3 VL 2B Instruct | 262 144 | Q4_K_M | mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf |
| Gemma3VisionPreset | Gemma 3 Vision (Latex fine-tune) | 8 096 | Q4_K_M | Gemma-3-Vision-Latex.mmproj-f16.gguf |
| Ministral3VisionPreset | Ministral 3 8B Instruct (Mistral AI, 2512 release) | 262 144 | Q4_K_M | Ministral-3-8B-Instruct-2512-BF16-mmproj.gguf |
See Supported presets for the Hugging Face source repositories.
Picker
| Need | Try |
|---|---|
| Smallest footprint, long context | Qwen3VL2BPreset (2B parameters, 262K context) |
| General-purpose vision Q&A | Qwen25VL3BPreset (3B, 128K) |
| Text-heavy images (documents, LaTeX) | Gemma3VisionPreset |
| Strongest reasoning on complex images | Ministral3VisionPreset (8B, 262K) |
All four produce reasonable image descriptions and simple spatial reasoning. For OCR-style tasks on dense text, lean toward Gemma 3 Vision or Ministral 3 — the larger projectors handle small text better.
Memory
Vision presets load two files: the base model and the projector. Add the projector memory footprint on top of the base model — typically 200 MB to 2 GB depending on precision.
| Preset | Base (VRAM/RAM) | Projector | Total |
|---|---|---|---|
| Qwen25VL3BPreset (UD-IQ2_XXS) | ~2 GB | ~0.8 GB (F16) | ~3 GB |
| Qwen3VL2BPreset | ~2 GB | ~0.5 GB (Q8_0) | ~2.5 GB |
| Gemma3VisionPreset | ~3 GB | ~1 GB (F16) | ~4 GB |
| Ministral3VisionPreset | ~6 GB | ~2 GB (BF16) | ~8 GB |
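The footprints in the table can be turned into a quick fit check against a memory budget. The sketch below copies the approximate figures from the table; the helper `PresetsThatFit` and the headroom parameter are hypothetical and not part of the SDK.

```csharp
using System;
using System.Linq;

// Approximate footprints in GB, copied from the table (KV cache excluded).
var footprints = new (string Preset, double BaseGb, double ProjectorGb)[]
{
    ("Qwen25VL3BPreset",       2.0, 0.8),
    ("Qwen3VL2BPreset",        2.0, 0.5),
    ("Gemma3VisionPreset",     3.0, 1.0),
    ("Ministral3VisionPreset", 6.0, 2.0),
};

// Hypothetical helper (not an SDK API): presets whose base model plus
// projector fit in the budget after reserving headroom for the KV cache.
string[] PresetsThatFit(double budgetGb, double kvHeadroomGb) =>
    footprints
        .Where(f => f.BaseGb + f.ProjectorGb + kvHeadroomGb <= budgetGb)
        .Select(f => f.Preset)
        .ToArray();

// With 6 GB available and 1 GB reserved, every preset except Ministral 3 fits.
Console.WriteLine(string.Join(", ", PresetsThatFit(budgetGb: 6.0, kvHeadroomGb: 1.0)));
```

The figures are rounded estimates, so treat a result that lands close to the budget as a prompt to measure on the target machine.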
The KV cache comes on top of these totals and scales with ContextParameters.ContextSize. For long contexts, lower TypeV to Q8_0 to reclaim some of that memory.
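To get a feel for how the KV cache grows with context size, here is a rough estimate for a standard attention layout: layers × KV heads × head dimension × context length × bytes per element, counted once for keys and once for values. The model dimensions are hypothetical placeholders, not values read from any preset. The sketch also assumes that TypeV selects the storage type of cached values and that Q8_0 follows the ggml block layout (34 bytes per 32 values).

```csharp
using System;

// Bytes per cached element. F16 is 2 bytes; ggml's Q8_0 stores 32 values
// in a 34-byte block (32 int8 values plus one 16-bit scale).
const double F16Bytes = 2.0;
const double Q8_0Bytes = 34.0 / 32.0;

// Estimated KV cache size in GiB: one key and one value vector
// per layer, per KV head, per token in the context.
static double KvCacheGiB(int layers, int kvHeads, int headDim, long contextSize,
                         double keyBytes, double valueBytes)
{
    double perToken = (double)layers * kvHeads * headDim * (keyBytes + valueBytes);
    return perToken * contextSize / (1024.0 * 1024.0 * 1024.0);
}

// Hypothetical dimensions for illustration only; check the model's own config.
int layers = 28, kvHeads = 8, headDim = 128;
long ctx = 32_768;

double f16Both = KvCacheGiB(layers, kvHeads, headDim, ctx, F16Bytes, F16Bytes);
double q8Values = KvCacheGiB(layers, kvHeads, headDim, ctx, F16Bytes, Q8_0Bytes);

Console.WriteLine($"K=F16, V=F16:  {f16Both:F2} GiB");
Console.WriteLine($"K=F16, V=Q8_0: {q8Values:F2} GiB");
```

With these placeholder dimensions the estimate is 3.5 GiB for an all-F16 cache and roughly 2.68 GiB once the values are stored as Q8_0, which is why the saving matters most at long context sizes.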
Using a vision preset
Same pattern as any other preset. Pass images via the media parameter of SendMessageAsync or SendMessageToSessionAsync:
```csharp
using System;
using System.IO;
using Aspose.LLM;
using Aspose.LLM.Abstractions.Parameters.Presets;

// Create loads the base model and its projector together.
var preset = new Qwen3VL2BPreset();
using var api = AsposeLLMApi.Create(preset);

// Images are passed as raw bytes through the media parameter.
byte[] imageBytes = File.ReadAllBytes("document.png");
string reply = await api.SendMessageAsync(
    "Transcribe the text in this image verbatim.",
    media: new[] { imageBytes });

Console.WriteLine(reply);
```
See Attaching images for format and size rules.
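Because the media parameter takes an array of byte arrays, a small loader can gather several files in one step. The sketch below is a hypothetical helper, not part of the SDK, and the demo writes placeholder files to a temporary folder so it runs without real images. Whether a given preset handles several images in one message, and which formats are accepted, is an assumption to check against the format and size rules.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of the SDK): read every file with one of the
// given extensions from a folder, in path order, as raw byte arrays.
static byte[][] LoadImages(string folder, params string[] extensions) =>
    Directory.EnumerateFiles(folder)
        .Where(p => extensions.Contains(Path.GetExtension(p), StringComparer.OrdinalIgnoreCase))
        .OrderBy(p => p, StringComparer.Ordinal)
        .Select(p => File.ReadAllBytes(p))
        .ToArray();

// Demo: a temporary folder with two placeholder "images" and one unrelated file.
string folder = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(folder);
File.WriteAllBytes(Path.Combine(folder, "page-1.png"), new byte[] { 1 });
File.WriteAllBytes(Path.Combine(folder, "page-2.PNG"), new byte[] { 2, 2 });
File.WriteAllText(Path.Combine(folder, "notes.txt"), "skip me");

byte[][] images = LoadImages(folder, ".png", ".jpg");
Console.WriteLine(images.Length); // 2

// The result has the same shape as the media argument shown earlier, e.g.:
// string reply = await api.SendMessageAsync("Compare these pages.", media: images);
```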
Customizing a vision preset
The same override patterns as text presets apply. See Presets for the three approaches:
- Override before Create: tweak fields on the preset instance.
- Subclass: inherit from a built-in vision preset and set defaults in the constructor.
- From scratch: extend PresetCoreBase and populate both BaseModelSourceParameters and MmprojSourceParameters.
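The first two approaches might look like the following sketch. It assumes the preset exposes ContextParameters as an initialized, settable object with a numeric ContextSize, and that the built-in preset classes are not sealed; neither detail is confirmed here, so verify both against the Presets page. The class name CompactQwen3VL2BPreset is hypothetical.

```csharp
using Aspose.LLM;
using Aspose.LLM.Abstractions.Parameters.Presets;

// Approach 1: override before Create.
var preset = new Qwen3VL2BPreset();
// Assumption: ContextSize is a settable numeric property.
preset.ContextParameters.ContextSize = 32_768;

using var api = AsposeLLMApi.Create(preset);

// Approach 2: subclass and set defaults in the constructor.
// (Hypothetical class; in a top-level program, types go after the statements.)
public class CompactQwen3VL2BPreset : Qwen3VL2BPreset
{
    public CompactQwen3VL2BPreset()
    {
        // Same assumption as above about ContextParameters.ContextSize.
        ContextParameters.ContextSize = 32_768;
    }
}
```

Overriding on the instance suits one-off experiments, while a subclass keeps the defaults in one place when several parts of an application share them.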
Additional vision-only knobs live on MtmdContextParameters — control projector GPU offload, threading, and verbosity.
What’s next
- Attaching images: how to pass image bytes.
- Chat templates: how the SDK selects the right template per model.
- Multimodal context parameters: vision-side tuning knobs.
- Model source parameters: configure the mmproj download source.