PDF Chunking and Embedding

Overview

The Aspose.PDF chunking and embedding API splits PDF documents into semantic text chunks and generates vector embeddings for them, for use in Retrieval-Augmented Generation (RAG) pipelines, semantic search, and other AI-driven scenarios.

The API provides two entry points as extension methods on Document:

| Method | Description |
|---|---|
| GetChunksAsync | Converts a PDF to Markdown and splits it into DocumentChunk objects. |
| IngestAsync | Full pipeline: chunks the document, generates embeddings, and upserts records into a vector store. |

Prerequisites

  • Aspose.PDF for .NET
  • Microsoft.Extensions.AI — for IEmbeddingGenerator<string, Embedding<float>>
  • Microsoft.Extensions.VectorData — for VectorStoreCollection<string, DocumentChunk>
  • An embedding provider (e.g. OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
  • A vector store (e.g. Azure AI Search, Qdrant, in-memory, or any VectorStoreCollection implementation)

Chunking Options

Configure chunking behavior via ChunkingOptions:

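// Default settings: MaxChunkSize = 1000, OverlapSize = 100.
var options = new Aspose.Pdf.AI.ChunkingOptions();

// Custom settings: smaller chunks with a 50-token overlap.
var customOptions = new Aspose.Pdf.AI.ChunkingOptions
{
    MaxChunkSize = 512,
    OverlapSize  = 50
};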

| Property | Default | Range | Description |
|---|---|---|---|
| MaxChunkSize | 1000 | 50–10,000 | Maximum number of tokens per chunk. |
| OverlapSize | 100 | 0–(MaxChunkSize − 1) | Number of overlapping tokens between consecutive chunks, preserving context at boundaries. |
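
To make the interaction between MaxChunkSize and OverlapSize concrete, the sketch below splits a token list with a sliding window whose stride is MaxChunkSize − OverlapSize. It illustrates the overlap arithmetic only and is not the splitter Aspose.PDF uses internally, which works on Markdown and semantic boundaries. The OverlapWindow class and its Split method are hypothetical names, and tokens are modeled as plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: illustrates the overlap arithmetic only,
// not the library's actual chunking algorithm.
public static class OverlapWindow
{
    // Splits tokens into windows of at most maxChunkSize tokens, where each
    // window starts with the last overlapSize tokens of the previous one.
    public static List<string[]> Split(
        IReadOnlyList<string> tokens, int maxChunkSize, int overlapSize)
    {
        if (maxChunkSize < 1)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlapSize < 0 || overlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlapSize));

        int stride = maxChunkSize - overlapSize; // always >= 1, so the loop advances
        var chunks = new List<string[]>();

        for (int start = 0; start < tokens.Count; start += stride)
        {
            int length = Math.Min(maxChunkSize, tokens.Count - start);
            chunks.Add(tokens.Skip(start).Take(length).ToArray());

            // Stop once a window reaches the end, so that no trailing
            // chunk consists solely of already-seen overlap tokens.
            if (start + length >= tokens.Count) break;
        }

        return chunks;
    }
}
```

Under this simplified model, MaxChunkSize = 512 and OverlapSize = 50 give a stride of 462, so a 1,000-token input yields windows starting at tokens 0, 462, and 924, with lengths 512, 512, and 76. It also shows why OverlapSize has to stay below MaxChunkSize: an overlap equal to the chunk size would make the stride zero and the window would never advance.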

Getting Chunks

Use GetChunksAsync when you only need the text chunks — for example, to inspect them, post-process them, or feed them into a custom embedding pipeline.

Chunking, Embedding, and Vector Store Ingestion

Use IngestAsync to run the complete pipeline in a single call: the document is chunked, embeddings are generated for each chunk, and all records are upserted into the vector store.

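Because the API is built on the Microsoft.Extensions.AI and Microsoft.Extensions.VectorData abstractions, any IEmbeddingGenerator implementation and any VectorStoreCollection implementation can be plugged in without changing the ingestion code.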
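
The embed-and-upsert stages described above can be pictured with a small in-memory model. This is a rough sketch under stated assumptions, not the IngestAsync implementation: ChunkRecord is a hypothetical stand-in for DocumentChunk, the embed delegate stands in for an IEmbeddingGenerator, and a dictionary stands in for the vector store collection. Chunking is assumed to have already happened.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for DocumentChunk, reduced to the fields used here.
public sealed record ChunkRecord(string Id, int Index, string Content)
{
    // Null until the embedding stage has run.
    public float[]? Embedding { get; init; }
}

// Hypothetical model of the pipeline stages; not the library's code.
public static class PipelineSketch
{
    public static void Ingest(
        IEnumerable<ChunkRecord> chunks,          // output of the chunking stage
        Func<string, float[]> embed,              // stand-in for an embedding provider
        IDictionary<string, ChunkRecord> store)   // stand-in for a vector store collection
    {
        foreach (var chunk in chunks)
        {
            // Embedding stage: attach a vector computed from the chunk text.
            var embedded = chunk with { Embedding = embed(chunk.Content) };

            // Upsert stage: insert the record, or overwrite one with the same Id.
            store[embedded.Id] = embedded;
        }
    }
}
```

In this model, running Ingest twice over the same chunks leaves the store with one record per Id rather than duplicates, which is the behavior the word "upsert" implies. How a real store handles repeated keys depends on the VectorStoreCollection implementation in use.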

OpenAI

DocumentChunk Reference

Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties:

| Property | Type | Description |
|---|---|---|
| Id | string | Unique identifier in the format {sourceId}_chunk_{index}. |
| Index | int | Zero-based position of this chunk in the document. |
| Content | string | The text content of the chunk. Stored as full-text-search data in the vector store. |
| Context | string | Structural context extracted from the document (e.g., heading path). Helps RAG pipelines understand where in the document the chunk came from. |
| Embedding | ReadOnlyMemory<float>? | Vector embedding, populated by IngestAsync. null when using GetChunksAsync alone. |
| Metadata | IDictionary<string, string> | Extensible key-value store for custom metadata. |
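
Because Id follows the fixed pattern {sourceId}_chunk_{index}, the source ID and position can be recovered from it, for example to group retrieved chunks by document. The sketch below is one possible way to do that; the ChunkId class is a hypothetical helper written for this page, not part of Aspose.PDF, and it assumes the index is written as plain decimal digits.

```csharp
using System;
using System.Globalization;

// Hypothetical helper built from the documented pattern {sourceId}_chunk_{index}.
public static class ChunkId
{
    private const string Separator = "_chunk_";

    // Builds an identifier in the documented format.
    public static string Format(string sourceId, int index) =>
        sourceId + Separator + index.ToString(CultureInfo.InvariantCulture);

    // Splits on the LAST "_chunk_" so that a sourceId which itself
    // contains "_chunk_" is still recovered intact.
    public static bool TryParse(string id, out string sourceId, out int index)
    {
        sourceId = string.Empty;
        index = -1;

        int pos = id.LastIndexOf(Separator, StringComparison.Ordinal);
        if (pos <= 0) return false; // separator missing, or empty sourceId

        string suffix = id.Substring(pos + Separator.Length);

        // NumberStyles.None accepts decimal digits only (no sign, no whitespace).
        if (!int.TryParse(suffix, NumberStyles.None,
                CultureInfo.InvariantCulture, out int parsed))
            return false;

        sourceId = id.Substring(0, pos);
        index = parsed;
        return true;
    }
}
```

For instance, ChunkId.Format("report", 3) produces "report_chunk_3", and parsing that string gives back "report" and 3.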

Vector Store Schema

Use DocumentChunk.GetVectorDefinition(dimensions) to get the collection schema matching the embedding model’s output size:

// dimensions must equal the length of the vectors the embedding model produces (1536 here).
var definition = Aspose.Pdf.AI.DocumentChunk.GetVectorDefinition(dimensions: 1536);