Vision presets

The SDK ships four built-in vision presets. Each preset configures both the base language model and its multimodal projector (mmproj) — the two files load together on first Create.

Available presets

Preset	Model	Base context	Quantization	mmproj file
`Qwen25VL3BPreset`	Qwen 2.5 VL 3B Instruct	128 000	UD-IQ2_XXS	`mmproj-F16.gguf`
`Qwen3VL2BPreset`	Qwen 3 VL 2B Instruct	262 144	Q4_K_M	`mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf`
`Gemma3VisionPreset`	Gemma 3 Vision (Latex fine-tune)	8 096	Q4_K_M	`Gemma-3-Vision-Latex.mmproj-f16.gguf`
`Ministral3VisionPreset`	Ministral 3 8B Instruct (Mistral AI, 2512 release)	262 144	Q4_K_M	`Ministral-3-8B-Instruct-2512-BF16-mmproj.gguf`

See Supported presets for the Hugging Face source repositories.

Picker

Need	Try
Smallest footprint, long context	`Qwen3VL2BPreset` (2B parameters, 262K context)
General-purpose vision Q&A	`Qwen25VL3BPreset` (3B, 128K)
Text-heavy images (documents, LaTeX)	`Gemma3VisionPreset`
Strongest reasoning on complex images	`Ministral3VisionPreset` (8B, 262K)

All four produce reasonable image descriptions and simple spatial reasoning. For OCR-style tasks on dense text, lean toward Gemma 3 Vision or Ministral 3 — the larger projectors handle small text better.

Memory

Vision presets load two files: the base model and the projector. Add the projector memory footprint on top of the base model — typically 200 MB to 2 GB depending on precision.

Preset	Base (VRAM/RAM)	Projector	Total
`Qwen25VL3BPreset` (UD-IQ2_XXS)	~2 GB	~0.8 GB (F16)	~3 GB
`Qwen3VL2BPreset`	~2 GB	~0.5 GB (Q8_0)	~2.5 GB
`Gemma3VisionPreset`	~3 GB	~1 GB (F16)	~4 GB
`Ministral3VisionPreset`	~6 GB	~2 GB (BF16)	~8 GB

Add KV cache on top (scales with ContextParameters.ContextSize). For long contexts, reduce TypeV to Q8_0 to claw back memory.

Using a vision preset

Same pattern as any other preset. Pass images via the media parameter of SendMessageAsync or SendMessageToSessionAsync:

using Aspose.LLM;
using Aspose.LLM.Abstractions.Parameters.Presets;

var preset = new Qwen3VL2BPreset();
using var api = AsposeLLMApi.Create(preset);

byte[] imageBytes = File.ReadAllBytes("document.png");

string reply = await api.SendMessageAsync(
    "Transcribe the text in this image verbatim.",
    media: new[] { imageBytes });

Console.WriteLine(reply);

See Attaching images for format and size rules.

Customizing a vision preset

The same override patterns as text presets apply. See Presets for the three approaches:

Override before Create — tweak fields on the preset instance.
Subclass — inherit from a built-in vision preset and set defaults in the constructor.
From scratch — extend PresetCoreBase and populate both BaseModelSourceParameters and MmprojSourceParameters.

Additional vision-only knobs live on MtmdContextParameters — control projector GPU offload, threading, and verbosity.

What’s next

Attaching images — how to pass image bytes.
Chat templates — how the SDK selects the right template per model.
Multimodal context parameters — vision-side tuning knobs.
Model source parameters — configure the mmproj download source.

Attaching images