Long context tuning
Several built-in presets support very long contexts: Llama32Preset (131K), Oss20Preset (131K), DeepSeekCoder2Preset (163K), Qwen3VL2BPreset (262K), Ministral3VisionPreset (262K). Running them at their full context requires specific tuning: flash attention, KV cache dtype, and sometimes YaRN.
When to use this pattern
- Summarize or query long documents (books, code repositories, legal contracts).
- Multi-hour conversations that must not lose earlier turns.
- Search-augmented QA where the retrieved context is large.
- Vision + text with many image turns.
Prerequisites
- A preset with a long default context, or a custom preset with your target ContextSize.
- Enough memory. At 131K context, a 7B model’s KV cache alone is 8-16 GB depending on KV dtype. Budget accordingly.
Pick a long-context preset
| Preset | Default context | Notes |
|---|---|---|
| Llama32Preset | 131 072 | 3B; good fit for long-document summarization. |
| Oss20Preset | 131 072 | 20B; better reasoning at length. |
| DeepSeekCoder2Preset | 163 840 | Code-focused. |
| Qwen3VL2BPreset | 262 144 | Vision + long context. |
| Ministral3VisionPreset | 262 144 | 8B vision + 262K. |
All of these already have sensible long-context configuration baked in. You mostly need to decide how much of the context to actually use and pay the memory cost.
Enable flash attention
Flash attention is an order-of-magnitude improvement for long-context memory and speed. Enable it whenever the backend supports it:
using Aspose.LLM.Abstractions.Models;
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
Without flash attention, attention memory traffic grows quadratically with context size, which makes anything past ~32K tokens effectively unworkable.
Quantize the KV cache
F16 KV cache at 131K context is ~8 GB on a 7B model. Q8_0 halves that. Q4 halves again.
preset.ContextParameters.TypeK = GgmlType.F16; // keep K precise
preset.ContextParameters.TypeV = GgmlType.Q8_0; // quantize V aggressively
V is less sensitive than K: quantizing V to Q8_0 has a minor quality impact, while quantizing K to Q8_0 causes a visible one. Start with TypeV = Q8_0 and leave TypeK = F16 unless memory demands more.
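If memory is still tight, tighten in stages. The sketch below shows the tiers as commented alternatives; GgmlType.Q4_0 is an assumption (only F16 and Q8_0 appear on this page), so verify the exact member name in your build:
// Tier 1 (recommended starting point): precise K, quantized V.
preset.ContextParameters.TypeK = GgmlType.F16;
preset.ContextParameters.TypeV = GgmlType.Q8_0;

// Tier 2: quantize K as well; expect a visible quality drop.
// preset.ContextParameters.TypeK = GgmlType.Q8_0;

// Tier 3 (last resort): V at ~4 bits, roughly a quarter of the F16 footprint.
// Assumption: GgmlType.Q4_0 exists in this enum.
// preset.ContextParameters.TypeV = GgmlType.Q4_0;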
Use the full context window
If you want to actually use 131K tokens, ensure ContextSize matches:
preset.ContextParameters.ContextSize = 131072;
The built-in preset already sets this; you only override when trimming for memory.
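The trimming case in practice: if your inputs never come close to the full window, capping ContextSize cuts the KV cache proportionally. The 65 536 value below is illustrative, not a recommendation:
// Run a 131K-capable preset at 64K, roughly halving the KV cache.
// Prompts longer than this will no longer fit, so size it to your real inputs.
preset.ContextParameters.ContextSize = 65536;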
Memory planning
Approximate KV cache size for a model with L layers, H heads of dim D, context C, dtype T:
KV bytes ≈ 2 × L × H × D × C × T_bytes
Rough table for a 7B model (32 layers, 32 heads × 128 dim) at 131K context:
| KV dtype | Per-token bytes (2 × L × H × D × T_bytes) | Tokens | Total |
|---|---|---|---|
| F16 (2 bytes) | 2 × 32 × 32 × 128 × 2 = 524 288 | 131 072 | ~64 GB |
That number overestimates real usage: most modern 7B-class models use grouped-query attention (far fewer KV heads than attention heads), and llama.cpp stores the cache compactly, so the real-world KV cache for a 7B at 131K is more like 8-16 GB at F16. Benchmark on your hardware rather than computing from first principles.
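If you still want a number before benchmarking, the sketch below simply evaluates the formula above. The layer, head, and dim values are the illustrative 7B figures from the table and ignore grouped-query attention, so treat the result as an upper bound:
// Upper-bound KV cache estimate: 2 (K and V) × layers × heads × headDim × tokens × bytes per element.
// Real usage is typically several times lower for GQA models.
static double EstimateKvCacheGiB(int layers, int heads, int headDim, long tokens, double bytesPerElement) =>
    2.0 * layers * heads * headDim * tokens * bytesPerElement / (1024.0 * 1024.0 * 1024.0);

// 7B-class figures from the table: 32 layers, 32 heads × 128 dim, 131K tokens.
Console.WriteLine($"F16:  ~{EstimateKvCacheGiB(32, 32, 128, 131_072, 2.0):F0} GiB");  // ~64 GiB upper bound
Console.WriteLine($"Q8_0: ~{EstimateKvCacheGiB(32, 32, 128, 131_072, 1.0):F0} GiB");  // ~32 GiB, ignoring block-scale overhead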
Full example — document summarization
using Aspose.LLM;
using Aspose.LLM.Abstractions.Models;
using Aspose.LLM.Abstractions.Parameters.Presets;
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");
var preset = new Llama32Preset(); // 3B, 131K context
preset.ChatParameters.SystemPrompt =
"You summarize long documents. Produce a concise 5-bullet summary of the key points.";
preset.ChatParameters.MaxTokens = 512;
// Enable flash attention — essential at long contexts.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
// Quantize V cache to save VRAM.
preset.ContextParameters.TypeV = GgmlType.Q8_0;
// Full GPU offload.
preset.BaseModelInferenceParameters.GpuLayers = 999;
using var api = AsposeLLMApi.Create(preset);
string longDocument = File.ReadAllText("contract.txt"); // up to ~80K tokens
string summary = await api.SendMessageAsync(
$"Summarize this document:\n\n{longDocument}");
Console.WriteLine(summary);
Multi-turn long-context
When a conversation runs for a long time, combine long context with a cache cleanup strategy that preserves the original question:
preset.ChatParameters.CacheCleanupStrategy =
CacheCleanupStrategy.KeepSystemPromptAndFirstUserMessage;
The model keeps the system prompt and the initial user turn regardless of how much of the middle history is evicted.
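A minimal sketch of a long-running session, assuming the cleanup strategy above is already set on the preset, the usings from the summarization example are in scope, and SendMessageAsync continues the same session across calls (as the eviction behaviour described here implies):
using var api = AsposeLLMApi.Create(preset);

// First user turn: the content that should never be evicted.
string contract = File.ReadAllText("contract.txt");
Console.WriteLine(await api.SendMessageAsync(
    "Here is the contract; answer my follow-up questions about it:\n\n" + contract));

// Later turns: middle history may be evicted as the context fills,
// but the system prompt and that first user message are retained.
while (Console.ReadLine() is string question && question.Length > 0)
{
    Console.WriteLine(await api.SendMessageAsync(question));
}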
YaRN scaling (for stretching beyond training context)
If your custom preset uses a model whose training context is shorter than the window you want to run, YaRN can interpolate RoPE positions to stretch it. Use it only when:
- Your model’s GGUF metadata specifies a smaller training context than you need.
- You are prepared for some quality degradation at the far end of the stretched window.
preset.ContextParameters.RopeScalingType = RopeScalingType.Yarn;
preset.ContextParameters.YarnOrigCtx = 32768; // the model's original training length
preset.ContextParameters.ContextSize = 131072; // where you want to go
Built-in presets with long contexts do not need this — they are trained on or calibrated for their declared ContextSize.
Performance at long contexts
Throughput drops as context grows. A typical pattern:
- At 4K tokens of prompt: 80-100 t/s on GPU.
- At 32K: 40-60 t/s.
- At 131K: 10-25 t/s.
Budget time for the first response — filling a 131K context takes tens of seconds to minutes even on fast GPUs. Subsequent turns in the same session reuse the KV cache and are much faster.
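To see the first-turn cost on your own hardware, a plain Stopwatch around the first and second calls is enough. Nothing library-specific is assumed here beyond the api and longDocument variables from the example above:
using System.Diagnostics;

var sw = Stopwatch.StartNew();
string first = await api.SendMessageAsync(
    $"Summarize this document:\n\n{longDocument}");   // pays the full prompt-processing cost
Console.WriteLine($"First turn:  {sw.Elapsed.TotalSeconds:F1} s");

sw.Restart();
string followUp = await api.SendMessageAsync(
    "Now list the three biggest risks.");             // reuses the KV cache, much faster
Console.WriteLine($"Second turn: {sw.Elapsed.TotalSeconds:F1} s");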
What’s next
- Context parameters — full list of context knobs.
- Low-memory tuning — when long context is not worth the memory cost.
- Cache management — keep sessions fresh over time.