Streaming-like responses
Aspose.LLM for .NET does not expose token-by-token streaming in the current release. SendMessageAsync and SendMessageToSessionAsync return the complete response as a single string after generation finishes. This page explains the limitation and the patterns that achieve a responsive user experience without streaming.
The limitation
Both methods return Task<string>, a single result rather than an async sequence:
public Task<string> SendMessageAsync(...);
public Task<string> SendMessageToSessionAsync(...);
Neither IAsyncEnumerable<string> nor a streaming callback is available. The native llama.cpp runtime can produce tokens one at a time, but the SDK’s public API wraps them into a full response before returning.
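In practice the call site looks like this; the entire reply becomes available only after the await completes (api stands for an already configured AsposeLLMApi instance):
string reply = await api.SendMessageAsync("Summarize this contract in three bullet points.");
// Nothing is visible to the user until the whole response has been generated.
Console.WriteLine(reply);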
Practical impact
For typical chat UIs, the user sees the reply after the model finishes generating. On a modern GPU:
- Short replies (64 tokens) — ~0.5-1 second.
- Medium replies (256 tokens) — ~2-3 seconds.
- Long replies (1024 tokens) — ~8-15 seconds.
On CPU:
- Short replies — 3-10 seconds.
- Long replies — 1-3 minutes.
If your UX requires tokens to appear as they are generated (typical chat-window feel), the SDK in its current form does not provide that directly.
Workarounds
1. Cap MaxTokens for shorter replies
Capping the maximum output length bounds the worst-case wait.
preset.ChatParameters.MaxTokens = 256;
A 256-token cap keeps GPU responses in the 2-3 second range, short enough that most users tolerate the wait without token-by-token feedback.
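A minimal sketch of the cap in context; the preset-to-api wiring is elided because it follows whatever setup you already use, and the prompt is illustrative:
var preset = new Phi4Preset();                     // any preset from your existing setup
preset.ChatParameters.MaxTokens = 256;             // bound the reply length, and with it the wait
// ... create the AsposeLLMApi instance from this preset as in your existing setup ...
string reply = await api.SendMessageAsync("Give me a three-sentence status update.");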
2. Break long tasks into smaller turns
Instead of one request that takes 30 seconds, design the conversation as multiple short turns. Each turn returns in 1-3 seconds; the user sees progress.
string outline = await api.SendMessageAsync("Outline a 500-word article on X.");
// Show outline to user
string section1 = await api.SendMessageAsync("Expand section 1 of that outline.");
// Show section 1
string section2 = await api.SendMessageAsync("Expand section 2.");
// Show section 2
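The same idea scales to a loop. This sketch reuses the stateless SendMessageAsync calls from above and repeats the outline in each prompt so every turn is self-contained (api and the word count are illustrative):
var sections = new List<string>();
for (int i = 1; i <= 3; i++)
{
    string prompt = $"Here is the outline:\n{outline}\n\nWrite section {i} in roughly 150 words.";
    sections.Add(await api.SendMessageAsync(prompt));
    // Render each section in the UI as soon as it arrives.
}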
3. Show a “thinking” indicator
When the wait is unavoidable, communicate that work is in progress. A spinner, progress bar, or status text (“Analyzing your question…”) is better than dead air.
For ASP.NET Core, acknowledge the request immediately (for example, 202 Accepted plus a status endpoint) or emit server-sent events with status updates before the final reply.
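A minimal sketch of the server-sent-events approach with ASP.NET Core minimal APIs; the /chat route and the way the api instance reaches the handler are illustrative, not part of the SDK:
app.MapGet("/chat", async (HttpContext ctx, string prompt) =>
{
    ctx.Response.ContentType = "text/event-stream";
    // Tell the client that work has started before blocking on generation.
    await ctx.Response.WriteAsync("event: status\ndata: generating\n\n");
    await ctx.Response.Body.FlushAsync();
    // Full response only; no token-by-token stream is available.
    string reply = await api.SendMessageAsync(prompt);
    await ctx.Response.WriteAsync($"event: reply\ndata: {reply.ReplaceLineEndings("\\n")}\n\n");
    await ctx.Response.Body.FlushAsync();
});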
4. Run ahead with a speculative small model
Some applications run a small, fast model for an initial “draft” response while a larger model generates the final version. The small model’s output appears quickly; the larger model’s response replaces it when ready. Implement at the application level:
// Fast preset for immediate feedback.
var fastPreset = new Phi4Preset();
// Slow, higher-quality preset for the real answer.
var slowPreset = new Qwen25Preset();
// ... two AsposeLLMApi instances via multi-process setup (one instance per process).
This is only possible with two processes because of the one-instance-per-process constraint.
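An application-level sketch, assuming each model runs in its own worker process behind a local HTTP endpoint; the URLs and the ShowDraft/ShowFinal helpers are placeholders:
var http = new HttpClient();
string encoded = Uri.EscapeDataString(prompt);
// Start both requests; the small model's worker answers first.
Task<string> draft = http.GetStringAsync($"http://localhost:5001/generate?prompt={encoded}");
Task<string> final = http.GetStringAsync($"http://localhost:5002/generate?prompt={encoded}");
ShowDraft(await draft);   // provisional answer from the fast model
ShowFinal(await final);   // replaces the draft once the larger model finishes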
5. Queue + background job pattern for long tasks
For tasks that would take minutes (document summarization, batch analysis), respond immediately with a job ID and deliver the result via polling or webhook:
// POST /jobs — enqueue, return job ID.
// GET /jobs/{id} — poll status and result.
// POST /webhooks/llm — deliver when ready.
The user is not blocked on a single HTTP request. This is the standard long-running-job pattern for HTTP services.
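A minimal in-memory sketch of the pattern with ASP.NET Core minimal APIs; the routes, the PromptRequest record, and the way api reaches the worker loop are illustrative:
using System.Collections.Concurrent;
using System.Threading.Channels;

var jobs = new ConcurrentDictionary<string, string?>();
var queue = Channel.CreateUnbounded<(string Id, string Prompt)>();

app.MapPost("/jobs", async (PromptRequest req) =>
{
    var id = Guid.NewGuid().ToString("N");
    jobs[id] = null;                                     // null means still running
    await queue.Writer.WriteAsync((id, req.Prompt));
    return Results.Accepted($"/jobs/{id}", new { id });
});

app.MapGet("/jobs/{id}", (string id) =>
    jobs.TryGetValue(id, out var result)
        ? Results.Ok(new { id, done = result != null, result })
        : Results.NotFound());

// Single consumer loop: one request at a time, matching the one-instance-per-process constraint.
_ = Task.Run(async () =>
{
    await foreach (var (id, prompt) in queue.Reader.ReadAllAsync())
        jobs[id] = await api.SendMessageAsync(prompt);
});

record PromptRequest(string Prompt);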
What future releases may offer
Token streaming is a commonly requested feature. A future SDK version may expose:
- IAsyncEnumerable<string> for streamed tokens.
- Event-based callbacks during generation.
- Server-sent events integration for ASP.NET Core.
Until then, design your UX around the full-response API.
Comparison with alternatives
| Feature | Aspose.LLM for .NET | Hosted APIs (OpenAI, Anthropic) |
|---|---|---|
| Token streaming | Not exposed in current release | Yes |
| On-premise | Yes | No (hosted) |
| Model control | Full | Limited |
| Data privacy | Local | Sent to provider |
| Latency floor | Model-dependent | Network round-trip |
The main reason to use Aspose.LLM is on-premise execution with no data egress. If token streaming is a hard requirement and hosted APIs are acceptable, use a hosted provider. If on-premise is a hard requirement, plan the UX around the current full-response API.
What’s next
- Architecture — what happens during generation.
- Features — full capability list and limitations.
- Integration with ASP.NET Core — queue / job patterns for responsive HTTP.