Multimodal
Aspose.LLM for .NET supports image input alongside text through mtmd — the llama.cpp multimodal layer — and a small set of built-in vision presets. This section covers everything you need to work with images: picking a preset, attaching images to messages, understanding chat templates, and diagnosing common problems.
The SDK does not support audio input in the current release, even though the underlying mtmd layer can handle audio chunks.
Sections
- Vision presets — built-in presets with their model sources, projector sources, and picker guidance.
- Attaching images — MediaAttachment, supported formats (JPEG, PNG, BMP, GIF, WebP), the 50 MB limit, and how to pass images to SendMessageAsync.
- Chat templates — the eight vision chat templates the SDK recognizes and how auto-selection works.
- Debugging vision — tagged logs, parse_mm_logs.zsh, and common misalignments.
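The format and size limits above can be pre-checked on the client before a message is sent. A minimal sketch follows; the MediaChecks helper, the MaxMediaBytes constant, and the magic-byte signatures are illustrative assumptions, not part of the SDK API:

```csharp
using System;
using System.Linq;

static class MediaChecks
{
    // 50 MB limit from the docs above; the constant name is our own.
    const long MaxMediaBytes = 50L * 1024 * 1024;

    // Leading magic bytes for the formats the SDK accepts.
    // Note: "RIFF" alone also matches WAV/AVI containers; a stricter
    // check would also verify the "WEBP" tag at offset 8.
    static readonly byte[][] Signatures =
    {
        new byte[] { 0xFF, 0xD8, 0xFF },       // JPEG
        new byte[] { 0x89, 0x50, 0x4E, 0x47 }, // PNG
        new byte[] { 0x42, 0x4D },             // BMP
        new byte[] { 0x47, 0x49, 0x46, 0x38 }, // GIF
        new byte[] { 0x52, 0x49, 0x46, 0x46 }, // WebP (RIFF header)
    };

    public static bool LooksAttachable(byte[] data) =>
        data != null &&
        data.LongLength <= MaxMediaBytes &&
        Signatures.Any(sig => data.Length >= sig.Length &&
                              data.Take(sig.Length).SequenceEqual(sig));
}
```

Running a check like this before SendMessageAsync lets you reject an oversized or unsupported file fast, instead of waiting for the SDK to load and then reject it.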
At a glance
Minimum vision flow:
using System;
using System.IO;
using Aspose.LLM;
using Aspose.LLM.Abstractions.Parameters.Presets;

// Apply the license before creating the API.
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");

var preset = new Qwen25VL3BPreset(); // or any vision preset
using var api = AsposeLLMApi.Create(preset);

// Raw image bytes; the SDK detects the format from the data.
byte[] imageBytes = File.ReadAllBytes("photo.jpg");

string reply = await api.SendMessageAsync(
    "Describe this image in one short sentence.",
    media: new[] { imageBytes });

Console.WriteLine(reply);
media is IEnumerable<byte[]> — you can pass one or several images per message.
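Because media accepts several images, a comparative prompt can attach more than one byte array to a single call. A short sketch, reusing the api instance from the snippet above (the file names are placeholders):

```csharp
// Two images attached to the same message, in the order given.
byte[] before = File.ReadAllBytes("before.png");
byte[] after  = File.ReadAllBytes("after.png");

string diff = await api.SendMessageAsync(
    "What changed between these two images?",
    media: new[] { before, after });

Console.WriteLine(diff);
```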
What’s next
- Vision presets — pick the right built-in preset.
- Attaching images — formats and limits.
- Supported presets — quick catalog.