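PDF Chunking and Embedding

Overview

Aspose.PDF PDF Chunking and Embedding splits PDF documents into semantic text chunks and generates vector embeddings for retrieval-augmented generation (RAG) pipelines, semantic search, and other AI-driven scenarios.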

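The API exposes two entry points as extension methods on Document:

Method	Description
`GetChunksAsync`	Converts the PDF to Markdown and splits it into `DocumentChunk` objects.
`IngestAsync`	Full pipeline: chunks the document, generates embeddings, and upserts the records into a vector store.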

Prerequisites

Aspose.PDF for .NET
Microsoft.Extensions.AI: provides IEmbeddingGenerator<string, Embedding<float>>
Microsoft.Extensions.VectorData: provides VectorStoreCollection<string, DocumentChunk>
An embedding provider (for example OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
A vector store (for example Azure AI Search, Qdrant, an in-memory store, or any VectorStoreCollection implementation)

Chunking Options

Configure chunking behavior through ChunkingOptions:

var options = new Aspose.Pdf.AI.ChunkingOptions();

var customOptions = new Aspose.Pdf.AI.ChunkingOptions
{
    MaxChunkSize = 512,
    OverlapSize  = 50
};

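Property	Default	Range	Description
`MaxChunkSize`	`1000`	50–10,000	Maximum number of tokens per chunk.
`OverlapSize`	`100`	0–(MaxChunkSize−1)	Number of tokens shared between adjacent chunks, used to preserve context across chunk boundaries.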
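
To make the relationship between the two settings concrete, the following sketch computes fixed-size token windows in which each window starts `MaxChunkSize − OverlapSize` tokens after the previous one. It is only an illustration of the overlap arithmetic and the documented ranges: the actual Aspose.PDF chunker splits Markdown semantically and may place boundaries differently, and `OverlapSketch` and `Windows` are hypothetical names that are not part of the Aspose.PDF API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.PDF: illustrates overlap arithmetic only.
public static class OverlapSketch
{
    // Returns (Start, Length) token windows. Each window begins
    // (maxChunkSize - overlapSize) tokens after the previous one,
    // so adjacent windows share overlapSize tokens.
    public static List<(int Start, int Length)> Windows(int tokenCount, int maxChunkSize, int overlapSize)
    {
        // Ranges taken from the options table above.
        if (maxChunkSize < 50 || maxChunkSize > 10000)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlapSize < 0 || overlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlapSize));

        var windows = new List<(int Start, int Length)>();
        int stride = maxChunkSize - overlapSize;

        for (int start = 0; start < tokenCount; start += stride)
        {
            int length = Math.Min(maxChunkSize, tokenCount - start);
            windows.Add((start, length));

            // Stop once a window reaches the end of the token stream.
            if (start + length >= tokenCount)
                break;
        }

        return windows;
    }
}
```

Under these assumptions, a 1,200-token text with `MaxChunkSize = 512` and `OverlapSize = 50` produces three windows starting at tokens 0, 462, and 924. A larger overlap preserves more boundary context but increases the number of chunks to embed and store.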

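Getting Chunks

Use GetChunksAsync when you only need the text chunks, for example to inspect them, post-process them, or feed them into a custom embedding pipeline.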

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

        foreach (var chunk in chunks)
        {
            Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
            Console.WriteLine(chunk.Content);
        }
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

    foreach (var chunk in chunks)
    {
        Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
        Console.WriteLine(chunk.Content);
    }
}

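Chunking, Embedding, and Vector Store Ingestion

Use IngestAsync to run the full pipeline in a single call: the document is chunked, an embedding is generated for each chunk, and all records are upserted into the vector store.

Because the API is built on the Microsoft.Extensions.AI abstractions, any embedding provider and any vector store can be used interchangeably.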

OpenAI

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

Azure OpenAI and Azure AI Search

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

DocumentChunk Reference

Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties:

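Property	Type	Description
`Id`	`string`	Unique identifier in the format `{sourceId}_chunk_{index}`.
`Index`	`int`	Zero-based position of the chunk within the document.
`Content`	`string`	Text content of the chunk. Stored in the vector store as full-text search data.
`Context`	`string`	Structural context extracted from the document (for example, the heading path). Helps a RAG pipeline identify where in the document the chunk came from.
`Embedding`	`ReadOnlyMemory<float>?`	Vector embedding, populated by `IngestAsync`. It is `null` when `GetChunksAsync` is used on its own.
`Metadata`	`IDictionary<string, string>`	Extensible key-value store for custom metadata.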
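
Because the `Id` follows the documented `{sourceId}_chunk_{index}` pattern, a retrieved record can be traced back to its source document and position. The sketch below builds and parses identifiers of that shape. `ChunkIdSketch`, `Build`, and `TryParse` are hypothetical helper names, not part of the Aspose.PDF API, and the parsing assumes the index is the run of digits after the last `_chunk_` separator.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Aspose.PDF: works only with the documented Id format.
public static class ChunkIdSketch
{
    private const string Separator = "_chunk_";

    // Builds an Id in the "{sourceId}_chunk_{index}" format.
    public static string Build(string sourceId, int index)
    {
        return sourceId + Separator + index.ToString(CultureInfo.InvariantCulture);
    }

    // Splits an Id back into sourceId and index.
    // Uses the LAST separator so a sourceId that itself contains "_chunk_" still parses.
    public static bool TryParse(string id, out string sourceId, out int index)
    {
        sourceId = null;
        index = -1;
        if (id == null)
            return false;

        int pos = id.LastIndexOf(Separator, StringComparison.Ordinal);
        if (pos < 0)
            return false;

        string tail = id.Substring(pos + Separator.Length);
        if (!int.TryParse(tail, NumberStyles.None, CultureInfo.InvariantCulture, out index))
        {
            index = -1;
            return false;
        }

        sourceId = id.Substring(0, pos);
        return true;
    }
}
```

For example, `ChunkIdSketch.Build("my-document", 3)` yields `my-document_chunk_3`, and parsing that string returns the source `my-document` and index `3`.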

Vector Store Schema

Use DocumentChunk.GetVectorDefinition(dimensions) to obtain a collection schema that matches the output size of the embedding model:

var definition = Aspose.Pdf.AI.DocumentChunk.GetVectorDefinition(dimensions: 1536);
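
Since the schema dimension has to match the embedding model's output size, it can be useful to check an embedding's length before relying on it. The following is a minimal sketch of such a guard. `EmbeddingDimensionSketch` and `EnsureDimensions` are hypothetical names, not part of the Aspose.PDF API; the parameter type simply mirrors the documented `ReadOnlyMemory<float>?` type of `DocumentChunk.Embedding`.

```csharp
using System;

// Hypothetical helper, not part of Aspose.PDF: validates an embedding's length.
public static class EmbeddingDimensionSketch
{
    // Throws if the embedding is missing or its length differs from the
    // dimension the collection schema was created with.
    public static void EnsureDimensions(ReadOnlyMemory<float>? embedding, int expectedDimensions)
    {
        if (!embedding.HasValue)
            throw new InvalidOperationException("The chunk has no embedding (it is null before ingestion).");

        int actual = embedding.Value.Length;
        if (actual != expectedDimensions)
            throw new InvalidOperationException(
                "Embedding has " + actual + " dimensions, but the schema expects " + expectedDimensions + ".");
    }
}
```

With the schema above, the call would be `EmbeddingDimensionSketch.EnsureDimensions(chunk.Embedding, 1536)`, where `chunk` is a `DocumentChunk` whose embedding has been populated.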
