PDF Chunking and Embeddings

Overview

The PDF chunking and embeddings feature of Aspose.PDF splits PDF documents into semantic text chunks and generates vector embeddings for use in Retrieval-Augmented Generation (RAG) pipelines, semantic search, and other AI-driven scenarios.

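The API exposes two entry points as extension methods on Document.

Method	Description
`GetChunksAsync`	Converts the PDF to Markdown and splits it into `DocumentChunk` objects.
`IngestAsync`	Runs the full pipeline: chunks the document, generates embeddings, and upserts the records into a vector store.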

Prerequisites

Aspose.PDF for .NET
Microsoft.Extensions.AI, for IEmbeddingGenerator<string, Embedding<float>>
Microsoft.Extensions.VectorData, for VectorStoreCollection<string, DocumentChunk>
An embedding provider (for example OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
A vector store (for example Azure AI Search, Qdrant, in-memory, or any VectorStoreCollection implementation)

Chunking Options

Use ChunkingOptions to configure chunking behavior.

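// Default options: MaxChunkSize = 1000, OverlapSize = 100.
var options = new Aspose.Pdf.AI.ChunkingOptions();

// Custom options: smaller chunks with a 50-token overlap.
var customOptions = new Aspose.Pdf.AI.ChunkingOptions
{
    MaxChunkSize = 512,
    OverlapSize  = 50
};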

Property	Default	Range	Description
`MaxChunkSize`	`1000`	50–10,000	Maximum number of tokens per chunk.
`OverlapSize`	`100`	0–(MaxChunkSize−1)	Number of tokens shared between consecutive chunks. Preserves context across chunk boundaries.
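
To make the relationship between the two settings concrete, the following sketch computes fixed-size overlapping token windows. It is an illustration only: `ChunkWindowSketch` and `Windows` are hypothetical names, not part of Aspose.PDF, and the library chunks on the semantic structure of the converted Markdown, so its real chunk boundaries will generally differ. The sketch assumes a plain sliding window in which each chunk starts `MaxChunkSize - OverlapSize` tokens after the previous one, and it applies the same ranges as the table above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class ChunkWindowSketch
{
    // Returns (Start, Length) token windows covering tokenCount tokens.
    public static List<(int Start, int Length)> Windows(
        int tokenCount, int maxChunkSize, int overlapSize)
    {
        // Same ranges as the options table: 50..10,000 and 0..MaxChunkSize-1.
        if (maxChunkSize < 50 || maxChunkSize > 10000)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlapSize < 0 || overlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlapSize));

        var windows = new List<(int Start, int Length)>();

        // Distance between the starts of consecutive windows; always >= 1.
        int stride = maxChunkSize - overlapSize;

        for (int start = 0; start < tokenCount; start += stride)
        {
            int length = Math.Min(maxChunkSize, tokenCount - start);
            windows.Add((start, length));

            // Stop once a window reaches the end of the token sequence.
            if (start + length >= tokenCount)
                break;
        }

        return windows;
    }
}
```

Under this assumption, a 1,200-token document with MaxChunkSize = 512 and OverlapSize = 50 yields three windows starting at tokens 0, 462, and 924. A larger overlap repeats more context at each boundary at the cost of more chunks to embed and store.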

Getting Chunks

Use GetChunksAsync when you need only the text chunks, for example to inspect them, post-process them, or feed them into a custom embedding pipeline.

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

        foreach (var chunk in chunks)
        {
            Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
            Console.WriteLine(chunk.Content);
        }
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

    foreach (var chunk in chunks)
    {
        Console.WriteLine(string.Format("Chunk {0}: {1}", chunk.Index, chunk.Id));
        Console.WriteLine(chunk.Content);
    }
}

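Chunk, Embed, and Ingest into a Vector Store

Use IngestAsync to run the full pipeline in a single call: the document is chunked, an embedding is generated for each chunk, and all records are upserted into the vector store.

Because the API is built on the Microsoft.Extensions.AI abstractions, any embedding provider and any vector store can be used interchangeably.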

OpenAI

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

Azure OpenAI and Azure AI Search

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

DocumentChunk Reference

Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties.

Property	Type	Description
`Id`	`string`	Unique identifier in the form `{sourceId}_chunk_{index}`.
`Index`	`int`	Zero-based position of the chunk within the document.
`Content`	`string`	Text content of the chunk. Stored as full-text search data in the vector store.
`Context`	`string`	Structural context extracted from the document (for example, the heading path). Helps a RAG pipeline understand where in the document the chunk came from.
`Embedding`	`ReadOnlyMemory<float>?`	Vector embedding populated by `IngestAsync`. `null` when only `GetChunksAsync` is used.
`Metadata`	`IDictionary<string, string>`	Extensible key-value store for custom metadata.
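
The `Id` format documented above can be built and parsed with ordinary string handling, which is useful when mapping a search hit back to its source document. The sketch below is an illustration only: `ChunkIdSketch`, `Build`, and `TryParse` are hypothetical names, not part of Aspose.PDF, and the only thing taken from the documentation is the `{sourceId}_chunk_{index}` pattern.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class ChunkIdSketch
{
    private const string Marker = "_chunk_";

    // Builds an ID in the documented {sourceId}_chunk_{index} form.
    public static string Build(string sourceId, int index)
    {
        return sourceId + Marker + index;
    }

    // Splits an ID back into its source ID and zero-based index.
    // The last "_chunk_" is used so that source IDs which themselves
    // contain the marker still parse.
    public static bool TryParse(string id, out string sourceId, out int index)
    {
        sourceId = null;
        index = -1;
        if (id == null)
            return false;

        int pos = id.LastIndexOf(Marker, StringComparison.Ordinal);
        if (pos < 0)
            return false;

        int parsed;
        if (!int.TryParse(id.Substring(pos + Marker.Length), out parsed) || parsed < 0)
            return false;

        sourceId = id.Substring(0, pos);
        index = parsed;
        return true;
    }
}
```

For the `sourceId` used in the samples on this page, `ChunkIdSketch.Build("my-document", 0)` returns `my-document_chunk_0`.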

Vector Store Schema

Use DocumentChunk.GetVectorDefinition(dimensions) to obtain a collection schema that matches the output size of your embedding model.

var definition = Aspose.Pdf.AI.DocumentChunk.GetVectorDefinition(dimensions: 1536);
