Garbled output
The model loaded successfully, but its replies are nonsensical, repetitive, or broken in a way that points to misconfiguration rather than a model problem.
Symptom
- Replies contain literal marker tokens like
<|im_start|>,<image>,<|endoftext|>. - The model produces the same phrase or token repeatedly.
- Output is truncated mid-sentence after exactly N tokens.
- Output is coherent at first, then devolves into nonsense after a few hundred tokens.
- Vision replies describe something different from the actual image.
Cause
- Chat template mismatch — the engine picked the wrong template for the model.
- Repetition penalty too low (or zero) — the model loops.
MaxTokenstoo low for a reasoning model — truncation mid-reasoning.- KV cache cleanup dropped important context — middle of a long session.
- Wrong preset for the model — a custom GGUF paired with a preset that does not match its architecture.
- Aggressive KV quantization — on long contexts, Q4 K/V can degrade quality.
Resolution
1. Literal marker tokens in the output
Diagnosis: template mismatch. The engine fell back to a generic template; the model’s actual markers appear as text.
Enable debug logging and look for [MM] selected template: ...:
preset.EngineParameters.EnableDebugLogging = true;
If the template is fallback or does not match your model family, the model needs a supported template. Options:
- Verify you are using the correct built-in preset (see Supported presets).
- For a custom GGUF, try a different export from the same model with richer metadata.
- For vision: see Chat templates.
2. Repetition loops
Diagnosis: the model generates the same phrase in a loop.
Raise repetition penalty:
preset.SamplerParameters.RepetitionPenalty = 1.15f; // default 1.1
If that still loops, try DRY (Don’t Repeat Yourself):
preset.SamplerParameters.DryMultiplier = 0.8f;
preset.SamplerParameters.DryBase = 1.75f;
preset.SamplerParameters.DryAllowedLength = 3;
See Sampler parameters for the full repetition knob set.
3. Truncated output (cuts off mid-sentence)
Diagnosis: MaxTokens limit hit before the model finished.
Raise the budget:
preset.ChatParameters.MaxTokens = 2048; // from default 2048; raise further for long answers
For reasoning models (Qwen3, DeepSeek-R1) that emit <think> blocks, budget 1024-2048 for the thinking plus the answer.
4. Coherent start, garbled after a while
Diagnosis: KV cache cleanup evicted critical context mid-session, or KV quantization is too aggressive at long contexts.
Change the cleanup policy:
preset.ChatParameters.CacheCleanupStrategy =
CacheCleanupStrategy.KeepSystemPromptAndFirstUserMessage;
Or upgrade KV precision:
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.F16;
F16 is the safe default; drop to Q8_0 only with memory pressure.
5. Model produces the wrong answer for clear questions
Diagnosis: custom GGUF paired with a wrong preset, or the preset’s sampler settings do not match the model’s training distribution.
- Use a built-in preset for the model family, or set up a custom preset from scratch rather than reusing an unrelated built-in.
- Start with conservative sampler settings:
Temperature = 0.3,TopP = 0.9. - Test with the reference prompt from the model’s Hugging Face page to rule out sampling issues.
6. Vision: reply describes the wrong thing
Diagnosis: the image was not delivered correctly to the model.
- Enable
MtmdContextParameters.PrintTimings = trueto verify the image was processed. - Enable debug logging and look for
[MM]lines — confirm image chunks are tokenized. - See Debugging vision.
Prevention
- Stick to built-in presets when possible.
- Test new models with simple prompts first; check the output matches the model’s reference outputs.
- Run with debug logging in staging to catch template fallbacks before production.
What’s next
- Sampler parameters — repetition penalties, DRY.
- Chat parameters —
MaxTokens,CacheCleanupStrategy. - Chat templates — vision template selection.
- Debugging vision — multimodal-specific diagnosis.