The PDF Chunking and Embedding feature of Aspose.PDF splits PDF documents into semantic text chunks and generates vector embeddings for use in Retrieval-Augmented Generation (RAG) pipelines, semantic search, and other AI-driven scenarios.
The API provides two entry points as extension methods on Document:
| Method | Description |
| --- | --- |
| GetChunksAsync | Converts a PDF to Markdown and splits it into DocumentChunk objects. |
| IngestAsync | Full pipeline: chunks the document, generates embeddings, and upserts records into a vector store. |
Prerequisites
Aspose.PDF for .NET
Microsoft.Extensions.AI — for IEmbeddingGenerator<string, Embedding<float>>
Microsoft.Extensions.VectorData — for VectorStoreCollection<string, DocumentChunk>
An embedding provider (e.g. OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
A vector store (e.g. Azure AI Search, Qdrant, in-memory, or any VectorStoreCollection implementation)
Number of overlapping tokens between consecutive chunks, preserving context at boundaries.
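The overlap setting described above can be pictured as a sliding window over the token sequence. The sketch below is not Aspose.PDF's implementation: it approximates tokens by whitespace-separated words, and the names SplitWithOverlap, maxTokens, and overlapTokens are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an Aspose.PDF API): splits text into windows of at
// most maxTokens "tokens"; each window repeats the last overlapTokens tokens of
// the previous one. Tokens are approximated by whitespace-separated words.
static List<string> SplitWithOverlap(string text, int maxTokens, int overlapTokens)
{
    if (maxTokens <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxTokens));
    if (overlapTokens < 0 || overlapTokens >= maxTokens)
        throw new ArgumentOutOfRangeException(nameof(overlapTokens));

    string[] tokens = text.Split(
        new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var result = new List<string>();
    int step = maxTokens - overlapTokens; // how far the window advances each time

    for (int start = 0; start < tokens.Length; start += step)
    {
        int length = Math.Min(maxTokens, tokens.Length - start);
        result.Add(string.Join(" ", tokens, start, length));
        if (start + length >= tokens.Length)
            break; // the window already reached the end of the text
    }
    return result;
}

List<string> windows = SplitWithOverlap("a b c d e f g h", maxTokens: 4, overlapTokens: 1);
foreach (string window in windows)
    Console.WriteLine(window);
// prints "a b c d", "d e f g", "g h" on separate lines
```

With an overlap of 1, the token that ends one chunk also opens the next, so a sentence cut at a boundary is still partly visible in both chunks. Larger overlaps preserve more context at the cost of more chunks to embed and store.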
Getting Chunks
Use GetChunksAsync when you only need the text chunks — for example, to inspect them, post-process them, or feed them into a custom embedding pipeline.
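When chunks feed a custom embedding pipeline, a typical post-processing step is to drop near-empty chunks and prepend the structural context to the text before embedding. The following is a minimal sketch under stated assumptions: value tuples stand in for DocumentChunk and mirror only its Index, Content, and Context properties, no Aspose.PDF call is made, and the helper name ToEmbeddingInputs and the minLength threshold are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an Aspose.PDF API): builds one embedding input
// string per chunk from tuple stand-ins for DocumentChunk.
static List<string> ToEmbeddingInputs(
    IEnumerable<(int Index, string Content, string Context)> chunks, int minLength)
{
    return chunks
        .Where(c => c.Content.Trim().Length >= minLength)   // skip near-empty chunks
        .OrderBy(c => c.Index)                               // keep document order
        .Select(c => string.IsNullOrEmpty(c.Context)
            ? c.Content
            : $"{c.Context}\n\n{c.Content}")                 // prepend the heading path
        .ToList();
}

var sample = new List<(int Index, string Content, string Context)>
{
    (1, "Install the package from NuGet.", "Guide > Setup"),
    (0, "This guide covers setup and usage.", ""),
    (2, " ", "Guide > Setup"),
};

foreach (string input in ToEmbeddingInputs(sample, minLength: 5))
    Console.WriteLine(input + "\n---");
```

Whether prepending the context improves retrieval depends on the embedding model and the documents, so it is worth comparing both variants on real queries.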
Chunking, Embedding, and Vector Store Ingestion
Use IngestAsync to run the complete pipeline in a single call: the document is chunked, embeddings are generated for each chunk, and all records are upserted into the vector store.
Because the API is built on Microsoft.Extensions.AI abstractions, any embedding provider and any vector store can be used interchangeably.
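The interchangeability comes from programming against abstractions rather than concrete providers. The sketch below imitates the chunk, embed, and upsert flow with a plain delegate and a dictionary. It does not call IngestAsync, IEmbeddingGenerator, or VectorStoreCollection, and every name in it (IngestSketchAsync, FakeEmbedAsync, memoryStore) is a hypothetical stand-in.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical pipeline (not the Aspose.PDF implementation): embeds each chunk
// and upserts it by id. 'embed' stands in for an embedding provider,
// 'store' for a vector store collection.
static async Task<int> IngestSketchAsync(
    string sourceId,
    IReadOnlyList<string> chunkTexts,
    Func<string, Task<float[]>> embed,
    IDictionary<string, (string Content, float[] Embedding)> store)
{
    for (int i = 0; i < chunkTexts.Count; i++)
    {
        float[] vector = await embed(chunkTexts[i]);
        string id = $"{sourceId}_chunk_{i}";   // id format documented for DocumentChunk.Id
        store[id] = (chunkTexts[i], vector);   // upsert: insert or overwrite
    }
    return chunkTexts.Count;
}

// Toy embedding provider (assumption): two numbers derived from the text.
static Task<float[]> FakeEmbedAsync(string text) =>
    Task.FromResult(new float[] { text.Length, text.Split(' ').Length });

var memoryStore = new Dictionary<string, (string Content, float[] Embedding)>();
string[] texts = { "First chunk of text.", "Second chunk." };

int written = await IngestSketchAsync("report", texts, FakeEmbedAsync, memoryStore);
Console.WriteLine($"{written} records, keys: {string.Join(", ", memoryStore.Keys)}");
```

Swapping the provider or the store means passing a different delegate or dictionary-like object. The pipeline itself does not change, which is the property the Microsoft.Extensions.AI abstractions give the real API.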
OpenAI
Azure OpenAI and Azure AI Search
DocumentChunk Reference
Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties:
| Property | Type | Description |
| --- | --- | --- |
| Id | string | Unique identifier in the format {sourceId}_chunk_{index}. |
| Index | int | Zero-based position of this chunk in the document. |
| Content | string | The text content of the chunk. Stored as full-text-search data in the vector store. |
| Context | string | Structural context extracted from the document (e.g., heading path). Helps RAG pipelines understand where in the document the chunk came from. |
| Embedding | ReadOnlyMemory<float>? | Vector embedding, populated by IngestAsync. null when using GetChunksAsync alone. |
| Metadata | IDictionary<string, string> | Extensible key-value store for custom metadata. |
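Two details of these properties are easy to get wrong in consuming code: the Id encodes both the source and the chunk position, and Embedding may be null. The sketch below shows one way to handle both. ParseChunkId and DescribeEmbedding are hypothetical helpers, not part of Aspose.PDF, and they rely only on the documented id format and property type.

```csharp
using System;

// Hypothetical helper: splits an id of the form {sourceId}_chunk_{index}.
// LastIndexOf is used so a sourceId that itself contains "_chunk_" still parses.
static (string SourceId, int Index) ParseChunkId(string id)
{
    const string marker = "_chunk_";
    int pos = id.LastIndexOf(marker, StringComparison.Ordinal);
    if (pos < 0 || !int.TryParse(id.Substring(pos + marker.Length), out int chunkIndex))
        throw new FormatException($"Unexpected chunk id: {id}");
    return (id.Substring(0, pos), chunkIndex);
}

// Hypothetical helper: handles the nullable type used by DocumentChunk.Embedding.
static string DescribeEmbedding(ReadOnlyMemory<float>? embedding) =>
    embedding.HasValue
        ? $"{embedding.Value.Length} dimensions"
        : "no embedding (chunk was not ingested)";

var (source, position) = ParseChunkId("annual_report_chunk_12");
Console.WriteLine($"{source} / {position}");
Console.WriteLine(DescribeEmbedding(null));
Console.WriteLine(DescribeEmbedding(new ReadOnlyMemory<float>(new float[] { 0.1f, 0.2f, 0.3f })));
```

Parsing the id this way makes it possible to group search hits by source document or to fetch neighbouring chunks by index, without storing extra fields.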
Vector Store Schema
Use DocumentChunk.GetVectorDefinition(dimensions) to get the collection schema matching the embedding model’s output size. The dimensions value must equal the length of the vectors that the embedding model produces.
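A mismatch between the schema dimensions and the model's actual vector length usually surfaces late, as a failed upsert or search. The guard below is a hypothetical helper, not part of Aspose.PDF, that checks a sample vector before a collection is created. The value 1536 is only an example: it is the default output size of OpenAI's text-embedding-3-small.

```csharp
using System;

// Hypothetical guard: verifies that a sample embedding matches the dimension
// count the collection schema will be created with.
static void EnsureDimensions(ReadOnlyMemory<float> sampleEmbedding, int expectedDimensions)
{
    if (sampleEmbedding.Length != expectedDimensions)
        throw new InvalidOperationException(
            $"Embedding has {sampleEmbedding.Length} dimensions, schema expects {expectedDimensions}.");
}

const int dimensions = 1536;                          // example model output size
EnsureDimensions(new float[dimensions], dimensions);  // matches, so nothing happens

try
{
    EnsureDimensions(new float[768], dimensions);     // deliberately wrong size
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}
```

In practice the sample vector would come from embedding a short test string with the configured provider, so the check also confirms that the provider is reachable.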