PDF Chunking and Embedding

Overview

The PDF Chunking and Embedding feature of Aspose.PDF lets you split PDF documents into semantic text chunks and generate vector embeddings for use in Retrieval-Augmented Generation (RAG) pipelines, semantic search, and other AI-driven scenarios.

The API provides two entry points, implemented as extension methods on Document:

Method	Description
`GetChunksAsync`	Converts a PDF to Markdown and splits it into `DocumentChunk` objects.
`IngestAsync`	Full pipeline: chunks the document, generates embeddings, and upserts the records into a vector store.

Prerequisites

Aspose.PDF for .NET
Microsoft.Extensions.AI, for IEmbeddingGenerator<string, Embedding<float>>
Microsoft.Extensions.VectorData, for VectorStoreCollection<string, DocumentChunk>
An embedding provider (for example, OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
A vector store (for example, Azure AI Search, Qdrant, in-memory, or any VectorStoreCollection implementation)

Chunking Options

Configure the chunking behavior through ChunkingOptions. The first declaration below uses the defaults; the second overrides both properties:

var options = new Aspose.Pdf.AI.ChunkingOptions();

var customOptions = new Aspose.Pdf.AI.ChunkingOptions
{
    MaxChunkSize = 512,
    OverlapSize  = 50
};

Property	Default	Range	Description
`MaxChunkSize`	`1000`	50–10,000	Maximum number of tokens per chunk.
`OverlapSize`	`100`	0–(MaxChunkSize−1)	Number of tokens shared by consecutive chunks, which preserves context across chunk boundaries.
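
To make the relationship between the two options concrete, the sketch below computes the token ranges that a plain fixed-size sliding window would produce. It is an illustration only: ChunkRanges is a hypothetical helper, it is written as a top-level program (C# 9 or later), and it does not reproduce how Aspose.PDF actually picks chunk boundaries, which the source describes as semantic. It does show why OverlapSize must stay below MaxChunkSize: the window advances by MaxChunkSize − OverlapSize tokens, and that step has to be at least one.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: token ranges produced by a plain sliding window.
// ChunkRanges is a hypothetical helper, NOT part of Aspose.PDF.
static List<(int Start, int End)> ChunkRanges(int tokenCount, int maxChunkSize, int overlapSize)
{
    if (maxChunkSize < 1)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
    if (overlapSize < 0 || overlapSize >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(overlapSize));

    var ranges = new List<(int Start, int End)>();
    int step = maxChunkSize - overlapSize; // always >= 1 because overlapSize < maxChunkSize

    for (int start = 0; start < tokenCount; start += step)
    {
        // End is exclusive and clamped to the end of the token stream.
        int end = Math.Min(start + maxChunkSize, tokenCount);
        ranges.Add((start, end));
        if (end == tokenCount) break; // the last window already reached the end
    }

    return ranges;
}

// A 1200-token document with MaxChunkSize = 512 and OverlapSize = 50.
foreach (var (start, end) in ChunkRanges(tokenCount: 1200, maxChunkSize: 512, overlapSize: 50))
{
    Console.WriteLine($"tokens [{start}, {end})");
}
```

With these values the window advances 462 tokens at a time, so the three ranges are [0, 512), [462, 974), and [924, 1200); each consecutive pair shares 50 tokens.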

Getting Chunks

Use GetChunksAsync when you need only the text chunks, for example to inspect them, post-process them, or feed them into a custom embedding pipeline.

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

        foreach (var chunk in chunks)
        {
            Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
            Console.WriteLine(chunk.Content);
        }
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

    foreach (var chunk in chunks)
    {
        Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
        Console.WriteLine(chunk.Content);
    }
}

Chunking, Embedding, and Ingesting into a Vector Store

Use IngestAsync to run the full pipeline in a single call: the document is chunked, an embedding is generated for each chunk, and all records are upserted into the vector store.

Because the API is built on the Microsoft.Extensions.AI and Microsoft.Extensions.VectorData abstractions, any embedding provider and any vector store that implement them can be used interchangeably.

OpenAI

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

Azure OpenAI and Azure AI Search

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

DocumentChunk Reference

Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties:

Property	Type	Description
`Id`	`string`	Unique identifier in the format `{sourceId}_chunk_{index}`.
`Index`	`int`	Zero-based position of this chunk within the document.
`Content`	`string`	The text content of the chunk. Stored as full-text search data in the vector store.
`Context`	`string`	Structural context extracted from the document (for example, the heading path). Helps RAG pipelines understand where in the document the chunk came from.
`Embedding`	`ReadOnlyMemory<float>?`	Vector embedding, populated by `IngestAsync`. `null` when only `GetChunksAsync` is used.
`Metadata`	`IDictionary<string, string>`	Extensible key-value store for custom metadata.
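
Because the Id follows the documented `{sourceId}_chunk_{index}` pattern, application code can build or split identifiers itself, for example to fetch the neighbours of a retrieved chunk. The sketch below is one way to do that. BuildChunkId and TryParseChunkId are hypothetical helpers, not part of Aspose.PDF, and the code is written as a top-level program (C# 9 or later). Parsing searches for the last `_chunk_` marker, on the assumption that a sourceId may itself contain that text.

```csharp
using System;

// Hypothetical helpers, NOT part of Aspose.PDF. They rely only on the
// documented Id format: {sourceId}_chunk_{index}.
static string BuildChunkId(string sourceId, int index) => $"{sourceId}_chunk_{index}";

static bool TryParseChunkId(string id, out string sourceId, out int index)
{
    const string marker = "_chunk_";
    sourceId = string.Empty;
    index = -1;

    // Use the LAST marker so a sourceId containing "_chunk_" still parses.
    int pos = id.LastIndexOf(marker, StringComparison.Ordinal);
    if (pos < 0) return false;

    if (!int.TryParse(id.Substring(pos + marker.Length), out index) || index < 0)
    {
        index = -1;
        return false;
    }

    sourceId = id.Substring(0, pos);
    return true;
}

string id = BuildChunkId("my-document", 3);
if (TryParseChunkId(id, out string source, out int position))
{
    // Id of the chunk that follows this one in the same document.
    Console.WriteLine(BuildChunkId(source, position + 1));
}
```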

Vector Store Schema

Use DocumentChunk.GetVectorDefinition(dimensions) to obtain a collection schema that matches the output size of the embedding model:

var definition = Aspose.Pdf.AI.DocumentChunk.GetVectorDefinition(dimensions: 1536);
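
The value 1536 matches the output size of text-embedding-ada-002, the model used in the ingestion examples. If you switch models, the dimension has to change with it. The sketch below keeps the default output sizes of three OpenAI embedding models in a lookup table and checks a vector's length before it is stored. The table and the EnsureDimensions helper are illustrative assumptions, not part of Aspose.PDF, and the code is written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

// Default output sizes of OpenAI embedding models (illustrative lookup table).
var knownDimensions = new Dictionary<string, int>
{
    ["text-embedding-ada-002"] = 1536,
    ["text-embedding-3-small"] = 1536,
    ["text-embedding-3-large"] = 3072
};

// Hypothetical guard, NOT part of Aspose.PDF: fails fast when a vector's
// length does not match the dimension the collection was defined with.
static void EnsureDimensions(ReadOnlyMemory<float> embedding, int expected)
{
    if (embedding.Length != expected)
    {
        throw new InvalidOperationException(
            $"Embedding has {embedding.Length} dimensions; the collection expects {expected}.");
    }
}

int dimensions = knownDimensions["text-embedding-ada-002"];

// A zero vector of the right length stands in for a real embedding here.
EnsureDimensions(new float[dimensions], dimensions);
Console.WriteLine($"Vector definition should use {dimensions} dimensions.");
```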
