Multimodal
Aspose.LLM for .NET supports image input alongside text through mtmd — the llama.cpp multimodal layer — and a small set of built-in vision presets. This section covers everything you need to work with images: picking a preset, attaching images to messages, understanding chat templates, and diagnosing common problems.
The SDK does not support audio input in the current release, even though the underlying mtmd layer can handle audio chunks.
Sections
- Vision presets — built-in presets with their model sources, projector sources, and picker guidance.
- Attaching images — MediaAttachment, supported formats (JPEG, PNG, BMP, GIF, WebP), the 50 MB limit, and how to pass images to SendMessageAsync.
- Chat templates — the eight vision chat templates the SDK recognizes and how auto-selection works.
- Debugging vision — tagged logs, parse_mm_logs.zsh, and common misalignments.
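The format and size limits above can be pre-checked on the client before a message is sent. A minimal sketch follows; the MediaChecks helper, the MaxMediaBytes constant, and the magic-byte signatures are illustrative assumptions, not part of the SDK API:

```csharp
using System;
using System.Linq;

static class MediaChecks
{
    // 50 MB limit from the docs above; the constant name is our own.
    const long MaxMediaBytes = 50L * 1024 * 1024;

    // Leading magic bytes for the formats the SDK accepts.
    // Note: "RIFF" alone also matches WAV/AVI containers; a stricter
    // check would also verify the "WEBP" tag at offset 8.
    static readonly byte[][] Signatures =
    {
        new byte[] { 0xFF, 0xD8, 0xFF },       // JPEG
        new byte[] { 0x89, 0x50, 0x4E, 0x47 }, // PNG
        new byte[] { 0x42, 0x4D },             // BMP
        new byte[] { 0x47, 0x49, 0x46, 0x38 }, // GIF
        new byte[] { 0x52, 0x49, 0x46, 0x46 }, // WebP (RIFF header)
    };

    public static bool LooksAttachable(byte[] data) =>
        data != null &&
        data.LongLength <= MaxMediaBytes &&
        Signatures.Any(sig => data.Length >= sig.Length &&
                              data.Take(sig.Length).SequenceEqual(sig));
}
```

Running a check like this before SendMessageAsync lets you reject an oversized or unsupported file fast, instead of waiting for the SDK to load and then reject it.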
At a glance
Minimum vision flow:
using System;
using System.IO;
using Aspose.LLM;
using Aspose.LLM.Abstractions.Parameters.Presets;

// Apply the license before creating the API.
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");

var preset = new Qwen25VL3BPreset(); // or any vision preset
using var api = AsposeLLMApi.Create(preset);

// Raw image bytes; the SDK detects the format from the data.
byte[] imageBytes = File.ReadAllBytes("photo.jpg");

string reply = await api.SendMessageAsync(
    "Describe this image in one short sentence.",
    media: new[] { imageBytes });

Console.WriteLine(reply);
media is IEnumerable<byte[]> — you can pass one or several images per message.
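Because media accepts several images, a comparative prompt can attach more than one byte array to a single call. A short sketch, reusing the api instance from the snippet above (the file names are placeholders):

```csharp
// Two images attached to the same message, in the order given.
byte[] before = File.ReadAllBytes("before.png");
byte[] after  = File.ReadAllBytes("after.png");

string diff = await api.SendMessageAsync(
    "What changed between these two images?",
    media: new[] { before, after });

Console.WriteLine(diff);
```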
What’s next
- Vision presets — pick the right built-in preset.
- Attaching images — formats and limits.
- Supported presets — quick catalog.