FlashAttentionMode
FlashAttentionMode controls flash attention — a fused-kernel optimization that reduces memory usage and speeds up attention, especially at long contexts. Prefer FlashAttentionMode over the legacy FlashAttention boolean.
Quick reference
| Property | Value |
|---|---|
| Type | FlashAttentionType? enum |
| Default | null (use model / runtime default) |
| Values | Auto (-1), Disabled (0), Enabled (1) |
| Category | Attention |
| Field on | ContextParameters.FlashAttentionMode |
What it does
Flash attention implements attention in a single fused kernel that tiles the computation. This avoids materializing the full N × N attention matrix in memory, which is a big win at long contexts.
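To make the memory difference concrete, here is a rough back-of-envelope comparison. This is illustrative arithmetic only, not library code; it assumes fp16 attention scores, a single head, and a 128 × 128 tile size.

const long n = 32_768;         // context length in tokens
const long bytesPerScore = 2;  // fp16 attention scores (assumption)
// Standard attention materializes the full N x N score matrix per head.
long fullMatrixBytes = n * n * bytesPerScore;   // 2 GiB per head at 32K tokens
// Flash attention streams the computation through small on-chip tiles
// (e.g. 128 x 128), so the working set is tiny and independent of N.
long tileBytes = 128 * 128 * bytesPerScore;     // 32 KiB per tile
Console.WriteLine($"Full matrix: {fullMatrixBytes / (1 << 30)} GiB, tile: {tileBytes / 1024} KiB");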
| Value | Behavior |
|---|---|
| Auto (-1) | Runtime picks based on backend support. Recommended. |
| Disabled (0) | Never use flash attention. Slower and uses more memory at long contexts. |
| Enabled (1) | Force flash attention. Fails on backends that do not support it. |
On long contexts (> 8K tokens), flash attention is typically 20-40% faster and meaningfully reduces peak memory. At short contexts the benefit is small.
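If you want to force flash attention but still start up on backends that lack it, a defensive pattern like the one below can help. This is only a sketch: it assumes AsposeLLMApi.Create throws when Enabled is requested on an unsupported backend (the failure mode described in the table above); check the runtime's actual error reporting before relying on it.

using Aspose.LLM.Abstractions.Models;
var preset = new Qwen25Preset();
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
try
{
    // Force flash attention; this should fail if the backend cannot honor it.
    using var api = AsposeLLMApi.Create(preset);
    // ... run inference with flash attention guaranteed on
}
catch (Exception)
{
    // Fall back to Auto and let the runtime decide.
    preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Auto;
    using var api = AsposeLLMApi.Create(preset);
    // ... run inference with the runtime's choice
}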
When to change it
| Scenario | Value |
|---|---|
| Default (recommended) | Auto |
| Explicit opt-in for benchmarking | Enabled |
| Ruling out a suspected flash-attention bug | Disabled |
| Backend without flash-attention support | Auto (the runtime falls back to Disabled automatically) |
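For the debugging row above, a runtime toggle keeps Auto as the default while letting you rule flash attention in or out without a rebuild. The environment variable name is made up for this sketch; substitute whatever your configuration system uses.

using Aspose.LLM.Abstractions.Models;
var preset = new Qwen25Preset();
// Hypothetical switch: set LLM_DISABLE_FLASH_ATTENTION=1 while chasing a
// suspected flash-attention bug; otherwise keep the recommended Auto.
bool suspectFlashAttention = Environment.GetEnvironmentVariable("LLM_DISABLE_FLASH_ATTENTION") == "1";
preset.ContextParameters.FlashAttentionMode = suspectFlashAttention
    ? FlashAttentionType.Disabled
    : FlashAttentionType.Auto;
using var api = AsposeLLMApi.Create(preset);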
Example
using Aspose.LLM.Abstractions.Models;
var preset = new Qwen25Preset();
// Explicitly force flash attention; Auto is the recommended default.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.ContextSize = 32768; // long context benefits most from FA
using var api = AsposeLLMApi.Create(preset);
Interactions
- ContextSize — larger contexts benefit more from flash attention.
- TypeK, TypeV — flash attention works with quantized KV cache.
- FlashAttention — legacy boolean; prefer FlashAttentionMode.
- Acceleration backends — CUDA, Metal, HIP, Vulkan all support flash attention on recent drivers.
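When migrating off the legacy boolean, the mapping is one-to-one for the old true/false values, with Auto as the new third option. The sketch below assumes the legacy FlashAttention flag also lives on ContextParameters; adjust if it sits elsewhere in your version.

var preset = new Qwen25Preset();
// Legacy style (boolean on/off only):
preset.ContextParameters.FlashAttention = true;
// Preferred style (adds Auto, the recommended default):
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;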
What’s next
- FlashAttention — legacy field; documented for completeness.
- ContextSize — the axis where FA matters most.
- Long context tuning — practical recipe.