FlashAttentionMode

FlashAttentionMode controls flash attention — a fused-kernel optimization that reduces memory usage and speeds up attention, especially at long contexts. Prefer FlashAttentionMode over the legacy FlashAttention boolean.

Quick reference

  • Type: FlashAttentionType? (nullable enum)
  • Default: null (use the model / runtime default)
  • Values: Auto (-1), Disabled (0), Enabled (1)
  • Category: Attention
  • Field: ContextParameters.FlashAttentionMode

What it does

Flash attention computes attention in a single fused kernel that tiles the computation. This avoids materializing the full N × N attention matrix in memory, which matters most at long contexts.

  • Auto (-1): the runtime picks based on backend support. Recommended.
  • Disabled (0): never use flash attention; slower and uses more memory at long contexts.
  • Enabled (1): force flash attention; fails on backends that do not support it.

At long contexts (more than 8K tokens), flash attention is typically 20-40% faster and meaningfully reduces peak memory; at short contexts the benefit is small.
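
If you want the runtime-decides behavior to be visible in code rather than implied by the null default (which defers to the model / runtime default), you can set Auto explicitly. A minimal sketch, using the same Qwen25Preset and AsposeLLMApi types as the Example section below:

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();

// Auto lets the runtime decide per backend;
// null (the default) defers to the model / runtime default instead.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Auto;

using var api = AsposeLLMApi.Create(preset);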

When to change it

  • Default (recommended): Auto
  • Explicit opt-in for benchmarking: Enabled
  • Debugging, or ruling out a suspected flash-attention bug: Disabled (see the sketch below)
  • Backend without flash attention support: leave it on Auto; the runtime falls back to Disabled automatically
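
For the debugging scenario, pin the mode to Disabled and re-run the same prompt; if the problem disappears, flash attention is the likely culprit. A minimal sketch along the lines of the Example section below:

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();

// Rule flash attention out by forcing the non-fused attention path.
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Disabled;

using var api = AsposeLLMApi.Create(preset);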

Example

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;
preset.ContextParameters.ContextSize = 32768;  // long context benefits most from FA

using var api = AsposeLLMApi.Create(preset);

Interactions

  • ContextSize — larger contexts benefit more from flash attention.
  • TypeK, TypeV — flash attention works with quantized KV cache.
  • FlashAttention — legacy boolean; prefer FlashAttentionMode (migration sketch after this list).
  • Acceleration backends — CUDA, Metal, HIP, Vulkan all support flash attention on recent drivers.
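
If you are migrating from the legacy boolean, the change is a one-line swap. A sketch, assuming the legacy FlashAttention property sits on the same ContextParameters object as the list above implies and that its true/false values correspond to Enabled/Disabled:

using Aspose.LLM.Abstractions.Models;

var preset = new Qwen25Preset();

// Before (legacy boolean):
// preset.ContextParameters.FlashAttention = true;

// After (preferred enum; assumed mapping: true -> Enabled, false -> Disabled):
preset.ContextParameters.FlashAttentionMode = FlashAttentionType.Enabled;

using var api = AsposeLLMApi.Create(preset);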

What’s next