Integration with ASP.NET Core
ASP.NET Core is the idiomatic way to put Aspose.LLM for .NET behind HTTP. The SDK registers as a DI singleton; requests flow through a handler that resolves the shared Engine and routes to per-user sessions.
When to use this pattern
- Chat backend for a web or mobile application.
- Internal microservice that serves LLM inference to other services.
- API gateway that aggregates LLM and other services.
- Long-running service that benefits from a single model load across many requests.
Prerequisites
- Install the NuGet package (a sample command follows this list).
- Apply a license.
- Familiarity with dependency injection.
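Assuming the NuGet package ID matches the root namespace used in the samples below (Aspose.LLM), installation is a single command; verify the exact ID on the package page if it differs:

dotnet add package Aspose.LLM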
Minimal API — single-user demo
using Aspose.LLM;
using Aspose.LLM.Abstractions.Parameters.Presets;
using Aspose.LLM.Core.DependencyInjection;
using Aspose.LLM.Core.Services;
var builder = WebApplication.CreateBuilder(args);
// Apply license at startup.
var license = new Aspose.LLM.License();
license.SetLicense("Aspose.LLM.lic");
// Register LLM services.
builder.Services.AddLlamaServices(new Qwen25Preset());
var app = builder.Build();
// Warm up the model on the first request (eager alternative: resolve Engine at startup).
app.MapPost("/chat", async (ChatRequest req, Engine engine, CancellationToken ct) =>
{
// Start a session or reuse an existing one.
string sessionId = req.SessionId ?? await engine.InitiateNewSession();
string reply = await engine.GetChatSessionResponse(sessionId, req.Message, null, ct);
return Results.Ok(new ChatResponse(sessionId, reply));
});
app.Run();
public record ChatRequest(string Message, string? SessionId = null);
public record ChatResponse(string SessionId, string Reply);
Start the host:
dotnet run
Send a request:
curl -X POST http://localhost:5000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Hello!"}'
The response includes a session ID; pass it back on subsequent calls to continue the conversation.
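To continue the same conversation, echo the returned session ID back in the request body (the ID below is a placeholder):

curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Tell me more.", "sessionId": "<id-from-previous-response>"}'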
Serialize inference behind a channel
Concurrent HTTP requests can hit the endpoint, but native inference must run one request at a time. Use a channel-backed background worker so HTTP threads enqueue work and await the result instead of calling into native code directly:
using System.Threading.Channels;
public record InferenceJob(
string SessionId,
string Message,
TaskCompletionSource<string> Result,
CancellationToken Token);
public class InferenceQueue
{
private readonly Channel<InferenceJob> _channel = Channel.CreateUnbounded<InferenceJob>();
public ChannelWriter<InferenceJob> Writer => _channel.Writer;
public ChannelReader<InferenceJob> Reader => _channel.Reader;
}
public class InferenceWorker : BackgroundService
{
private readonly InferenceQueue _queue;
private readonly Engine _engine;
private readonly ILogger<InferenceWorker> _logger;
public InferenceWorker(InferenceQueue queue, Engine engine, ILogger<InferenceWorker> logger)
{
_queue = queue;
_engine = engine;
_logger = logger;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
await foreach (var job in _queue.Reader.ReadAllAsync(stoppingToken))
{
try
{
string reply = await _engine.GetChatSessionResponse(
job.SessionId, job.Message, null, job.Token);
job.Result.SetResult(reply);
}
catch (OperationCanceledException oce)
{
job.Result.SetCanceled(oce.CancellationToken);
}
catch (Exception ex)
{
_logger.LogError(ex, "Inference failed for session {SessionId}", job.SessionId);
job.Result.SetException(ex);
}
}
}
}
Register and use:
builder.Services.AddSingleton<InferenceQueue>();
builder.Services.AddHostedService<InferenceWorker>();
app.MapPost("/chat", async (
ChatRequest req,
Engine engine,
InferenceQueue queue,
CancellationToken ct) =>
{
string sessionId = req.SessionId ?? await engine.InitiateNewSession();
var tcs = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);
await queue.Writer.WriteAsync(new InferenceJob(sessionId, req.Message, tcs, ct), ct);
string reply = await tcs.Task;
return Results.Ok(new ChatResponse(sessionId, reply));
});
HTTP threads enqueue and await; the worker processes jobs sequentially.
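The unbounded channel accepts jobs as fast as they arrive. If requests can outpace inference, a bounded channel applies backpressure at the enqueue point; a minimal sketch of a drop-in replacement for InferenceQueue (the capacity of 64 is an arbitrary example):

public class InferenceQueue
{
    private readonly Channel<InferenceJob> _channel = Channel.CreateBounded<InferenceJob>(
        new BoundedChannelOptions(64)
        {
            // WriteAsync waits when the queue is full, so backpressure reaches the HTTP
            // handler awaiting the enqueue instead of jobs being dropped.
            FullMode = BoundedChannelFullMode.Wait
        });

    public ChannelWriter<InferenceJob> Writer => _channel.Writer;
    public ChannelReader<InferenceJob> Reader => _channel.Reader;
}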
Per-user sessions
Map external user identifiers to internal session IDs. A simple in-memory store:
using System.Collections.Concurrent;

public class UserSessionMap
{
    // ConcurrentDictionary: the singleton map is read and written by concurrent requests.
    private readonly ConcurrentDictionary<string, string> _map = new();

    public string? Get(string userId) => _map.TryGetValue(userId, out var s) ? s : null;
    public void Set(string userId, string sessionId) => _map[userId] = sessionId;
}
builder.Services.AddSingleton<UserSessionMap>();
app.MapPost("/chat", async (
ChatRequest req,
Engine engine,
UserSessionMap sessions,
InferenceQueue queue,
HttpContext ctx,
CancellationToken ct) =>
{
// Pull user ID from auth or header.
string userId = ctx.Request.Headers["X-User-Id"].ToString();
if (string.IsNullOrEmpty(userId)) return Results.BadRequest("X-User-Id required.");
string sessionId = sessions.Get(userId)
?? await engine.InitiateNewSession(sessionId: $"user:{userId}");
sessions.Set(userId, sessionId);
var tcs = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);
await queue.Writer.WriteAsync(new InferenceJob(sessionId, req.Message, tcs, ct), ct);
string reply = await tcs.Task;
return Results.Ok(new ChatResponse(sessionId, reply));
});
For production, persist the session map in Redis or a database.
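As a sketch of the Redis option using StackExchange.Redis; the key prefix, class name, and the IConnectionMultiplexer registration are illustrative choices, not part of the SDK:

using StackExchange.Redis;

public class RedisUserSessionMap
{
    private readonly IDatabase _db;

    // Assumes an IConnectionMultiplexer singleton is registered elsewhere in DI.
    public RedisUserSessionMap(IConnectionMultiplexer redis) => _db = redis.GetDatabase();

    // "llm:session:" is an arbitrary key prefix for this example.
    public async Task<string?> GetAsync(string userId) =>
        await _db.StringGetAsync($"llm:session:{userId}");

    public Task SetAsync(string userId, string sessionId) =>
        _db.StringSetAsync($"llm:session:{userId}", sessionId);
}

The endpoint would then await GetAsync/SetAsync where the in-memory version uses synchronous Get/Set.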
Warm-up at startup
The first request triggers model load (several seconds to minutes on a cold machine). Warm up during host startup so the first user does not wait:
app.Lifetime.ApplicationStarted.Register(() =>
{
    // Register takes a synchronous callback, so run the warm-up as a fire-and-forget task
    // rather than an async void lambda, which would crash the process on an unhandled exception.
    _ = Task.Run(async () =>
    {
        var engine = app.Services.GetRequiredService<Engine>(); // forces construction and model load
        _ = await engine.InitiateNewSession();                  // optional: warm session creation
        Console.WriteLine("LLM engine warmed up.");
    });
});
Graceful shutdown
The hosted service already participates in host shutdown through StopAsync. To drain the queue before the worker stops, complete the channel writer and wait for in-flight work:
public class InferenceWorker : BackgroundService
{
    // ... existing fields

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        // Stop accepting new jobs; ReadAllAsync completes once the remaining items are processed.
        _queue.Writer.Complete();

        // Wait for ExecuteAsync to drain the channel before the base class cancels the stopping token.
        if (ExecuteTask is not null)
        {
            await Task.WhenAny(ExecuteTask, Task.Delay(Timeout.Infinite, cancellationToken));
        }

        await base.StopAsync(cancellationToken);
    }
}
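A long generation can outlast the host's default shutdown timeout, in which case the drain is cut short. One way to give the worker more time (the 60-second value is an arbitrary example):

builder.Services.Configure<HostOptions>(options =>
{
    // Allow in-flight inference jobs to finish before the host gives up on shutdown.
    options.ShutdownTimeout = TimeSpan.FromSeconds(60);
});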
Constraints
- Single AsposeLLMApi/Engine per process. Stick to the DI path; do not construct AsposeLLMApi manually on the side.
- Model on disk, not RAM. Model files live in the cache; the first load takes time. Budget for cold-start seconds.
- No streaming. GetChatSessionResponse returns the full text. For user-perceived responsiveness, return HTTP 200 after the full generation completes (typically 1-5 seconds).
- Serialize inference. Do not hit Engine.GetChatSessionResponse concurrently; the channel pattern handles this.
Full project structure
MyLlmService/
├── Program.cs # Minimal API + DI registration
├── Endpoints/
│ └── ChatEndpoint.cs # /chat handler
├── Services/
│ ├── InferenceQueue.cs # Channel-backed queue
│ ├── InferenceWorker.cs # Hosted service draining the queue
│ └── UserSessionMap.cs # User → session state
├── Contracts/
│ └── ChatRequest.cs # DTOs
├── appsettings.json
└── MyLlmService.csproj
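A sketch of how the /chat handler might live in Endpoints/ChatEndpoint.cs as a mapping extension, reusing the usings from Program.cs; the class and method names are illustrative:

public static class ChatEndpoint
{
    public static IEndpointRouteBuilder MapChat(this IEndpointRouteBuilder app)
    {
        app.MapPost("/chat", async (
            ChatRequest req,
            Engine engine,
            InferenceQueue queue,
            CancellationToken ct) =>
        {
            string sessionId = req.SessionId ?? await engine.InitiateNewSession();
            var tcs = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);
            await queue.Writer.WriteAsync(new InferenceJob(sessionId, req.Message, tcs, ct), ct);
            return Results.Ok(new ChatResponse(sessionId, await tcs.Task));
        });
        return app;
    }
}

Program.cs then calls app.MapChat() after building the host.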
What’s next
- Dependency injection — full DI reference.
- Multiple concurrent sessions — the session-routing pattern.
- Cache management — keep session memory under control in long-running hosts.