PDF Chunking and Embedding

Overview

The PDF Chunking and Embedding feature of Aspose.PDF splits PDF documents into semantic text chunks and generates vector embeddings for them. The results can be used in Retrieval-Augmented Generation (RAG) pipelines, semantic search, and other AI-driven scenarios.

The API provides two entry points as extension methods on Document:

Method	Description
`GetChunksAsync`	Converts a PDF to Markdown and splits it into `DocumentChunk` objects.
`IngestAsync`	Full pipeline: chunks the document, generates embeddings, and upserts records into a vector store.

Prerequisites

Aspose.PDF for .NET
Microsoft.Extensions.AI — for IEmbeddingGenerator<string, Embedding<float>>
Microsoft.Extensions.VectorData — for VectorStoreCollection<string, DocumentChunk>
An embedding provider (e.g. OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
A vector store (e.g. Azure AI Search, Qdrant, in-memory, or any VectorStoreCollection implementation)

Chunking Options

Configure chunking behavior via ChunkingOptions:

var options = new Aspose.Pdf.AI.ChunkingOptions();

var customOptions = new Aspose.Pdf.AI.ChunkingOptions
{
    MaxChunkSize = 512,
    OverlapSize  = 50
};

Property	Default	Range	Description
`MaxChunkSize`	`1000`	50–10 000	Maximum number of tokens per chunk.
`OverlapSize`	`100`	0–(MaxChunkSize−1)	Number of overlapping tokens between consecutive chunks, preserving context at boundaries.
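
The relationship between the two properties can be pictured with a small sliding-window sketch. It is not the library's algorithm (the library chunks semantically from Markdown), and the `OverlapSketch.Split` helper and its treatment of pre-split strings as "tokens" are assumptions made purely for illustration. It shows why `OverlapSize` must stay below `MaxChunkSize`: each new chunk starts `MaxChunkSize - OverlapSize` tokens after the previous one, so that step has to be at least 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OverlapSketch
{
    // Hypothetical helper, not part of Aspose.PDF: splits a token list into
    // windows of at most maxChunkSize tokens, where each window repeats the
    // last overlapSize tokens of the previous window.
    public static List<string[]> Split(IReadOnlyList<string> tokens, int maxChunkSize, int overlapSize)
    {
        if (maxChunkSize < 1)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlapSize < 0 || overlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlapSize));

        var chunks = new List<string[]>();
        int step = maxChunkSize - overlapSize; // always >= 1 after the checks above

        for (int start = 0; start < tokens.Count; start += step)
        {
            int length = Math.Min(maxChunkSize, tokens.Count - start);
            chunks.Add(tokens.Skip(start).Take(length).ToArray());

            // Stop once the end is reached, so no trailing chunk
            // consists of overlap tokens only.
            if (start + length >= tokens.Count)
                break;
        }

        return chunks;
    }
}
```

Under this simplified model, the defaults (`MaxChunkSize = 1000`, `OverlapSize = 100`) advance 900 tokens per chunk, and the custom values above (512 and 50) advance 462.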

Getting Chunks

Use GetChunksAsync when you only need the text chunks — for example, to inspect them, post-process them, or feed them into a custom embedding pipeline.

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
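        // Limit each chunk to 512 tokens, with 50 tokens shared between neighboring chunks.
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        // sourceId becomes the prefix of every chunk Id: {sourceId}_chunk_{index}.
        var chunks = await document.GetChunksAsync(options, sourceId: "my-document");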

        foreach (var chunk in chunks)
        {
            Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
            Console.WriteLine(chunk.Content);
        }
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

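    // Limit each chunk to 512 tokens, with 50 tokens shared between neighboring chunks.
    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    // sourceId becomes the prefix of every chunk Id: {sourceId}_chunk_{index}.
    var chunks = await document.GetChunksAsync(options, sourceId: "my-document");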

    foreach (var chunk in chunks)
    {
        Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
        Console.WriteLine(chunk.Content);
    }
}

Chunking, Embedding, and Vector Store Ingestion

Use IngestAsync to run the complete pipeline in a single call: the document is chunked, embeddings are generated for each chunk, and all records are upserted into the vector store.

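Because the API is built on the Microsoft.Extensions.AI and Microsoft.Extensions.VectorData abstractions, any IEmbeddingGenerator<string, Embedding<float>> implementation can be combined with any VectorStoreCollection<string, DocumentChunk> implementation. The examples below differ only in how the embedding generator and the collection are created.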
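
The three steps (chunk, embed, upsert) can be pictured with a self-contained sketch that uses stand-ins only. The `ChunkRecord` type, the `IngestSketch.Ingest` helper, the `embed` delegate, and the dictionary acting as a vector store are all hypothetical; none of them are Aspose.PDF or Microsoft.Extensions types. The sketch mainly illustrates what "upsert" implies when record ids are derived from `sourceId` and the chunk index: ingesting the same source again overwrites the earlier records instead of duplicating them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a stored chunk record.
public sealed class ChunkRecord
{
    public string Id { get; set; } = "";
    public string Content { get; set; } = "";
    public float[] Embedding { get; set; } = Array.Empty<float>();
}

public static class IngestSketch
{
    // Hypothetical helper: takes already-chunked text, embeds each chunk,
    // and upserts one record per chunk into a dictionary keyed by record id.
    public static int Ingest(
        string sourceId,
        IReadOnlyList<string> chunkTexts,
        Func<string, float[]> embed,
        IDictionary<string, ChunkRecord> store)
    {
        for (int i = 0; i < chunkTexts.Count; i++)
        {
            var record = new ChunkRecord
            {
                Id = $"{sourceId}_chunk_{i}",      // deterministic id per source and position
                Content = chunkTexts[i],
                Embedding = embed(chunkTexts[i])   // step 2: one vector per chunk
            };

            // Step 3, upsert: insert the record, or overwrite one with the same id.
            store[record.Id] = record;
        }

        return chunkTexts.Count;
    }
}
```

One consequence of this model is that if a later version of a document produces fewer chunks, records with higher indexes from the earlier run remain in the store until they are deleted explicitly. Whether that applies to a particular vector store connector should be verified against its documentation.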

OpenAI

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

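    // ApiKey is a placeholder for your OpenAI API key.
    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    // In-memory store for local experiments; its contents are lost when the process exits.
    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    // Create the collection if it does not exist yet.
    await collection.EnsureCollectionExistsAsync();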

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

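    // ApiKey is a placeholder for your OpenAI API key.
    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    // In-memory store for local experiments; its contents are lost when the process exits.
    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    // Create the collection if it does not exist yet.
    await collection.EnsureCollectionExistsAsync();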

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

Azure OpenAI and Azure AI Search

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

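    // AzureOpenAIEndpoint and AzureOpenAIKey are placeholders for your Azure OpenAI resource settings.
    // With Azure OpenAI, the name passed below refers to your embedding model deployment.
    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    // AzureSearchEndpoint and AzureSearchKey are placeholders for your Azure AI Search service settings.
    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    // Create the collection (search index) if it does not exist yet.
    await collection.EnsureCollectionExistsAsync();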

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

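    // AzureOpenAIEndpoint and AzureOpenAIKey are placeholders for your Azure OpenAI resource settings.
    // With Azure OpenAI, the name passed below refers to your embedding model deployment.
    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    // AzureSearchEndpoint and AzureSearchKey are placeholders for your Azure AI Search service settings.
    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    // Create the collection (search index) if it does not exist yet.
    await collection.EnsureCollectionExistsAsync();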

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

DocumentChunk Reference

Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties:

Property	Type	Description
`Id`	`string`	Unique identifier in the format `{sourceId}_chunk_{index}`.
`Index`	`int`	Zero-based position of this chunk in the document.
`Content`	`string`	The text content of the chunk. Stored as full-text-search data in the vector store.
`Context`	`string`	Structural context extracted from the document (e.g., heading path). Helps RAG pipelines understand where in the document the chunk came from.
`Embedding`	`ReadOnlyMemory<float>?`	Vector embedding, populated by `IngestAsync`. `null` when using `GetChunksAsync` alone.
`Metadata`	`IDictionary<string, string>`	Extensible key-value store for custom metadata.
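
Because `Id` follows the `{sourceId}_chunk_{index}` pattern, it can be composed and taken apart with plain string handling, for example to look up the neighbors of a retrieved chunk. The sketch below assumes only that documented format; the `ChunkIdSketch` class and its methods are hypothetical helpers, not part of the library.

```csharp
using System;
using System.Globalization;

public static class ChunkIdSketch
{
    private const string Separator = "_chunk_";

    // Builds an id in the documented {sourceId}_chunk_{index} format.
    public static string Compose(string sourceId, int index)
    {
        return sourceId + Separator + index.ToString(CultureInfo.InvariantCulture);
    }

    // Splits an id back into its parts. The last occurrence of "_chunk_" is used,
    // so a sourceId that itself contains "_chunk_" is still parsed correctly.
    public static bool TryParse(string id, out string sourceId, out int index)
    {
        sourceId = string.Empty;
        index = -1;
        if (string.IsNullOrEmpty(id))
            return false;

        int pos = id.LastIndexOf(Separator, StringComparison.Ordinal);
        if (pos <= 0)
            return false; // separator missing, or sourceId would be empty

        string indexPart = id.Substring(pos + Separator.Length);
        if (!int.TryParse(indexPart, NumberStyles.None, CultureInfo.InvariantCulture, out index))
        {
            index = -1; // digits only; signs and whitespace are rejected
            return false;
        }

        sourceId = id.Substring(0, pos);
        return true;
    }
}
```

For a chunk with the id `my-document_chunk_3`, the preceding chunk would be `ChunkIdSketch.Compose("my-document", 2)`.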

Vector Store Schema

Use DocumentChunk.GetVectorDefinition(dimensions) to get a collection schema whose vector dimension matches the embedding model’s output size. For example, text-embedding-ada-002 produces 1536-dimensional vectors:

var definition = Aspose.Pdf.AI.DocumentChunk.GetVectorDefinition(dimensions: 1536);
