PDF Chunking and Embedding

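Overview

The PDF chunking and embedding feature in Aspose.PDF splits PDF documents into semantic text chunks and generates vector embeddings for use in Retrieval-Augmented Generation (RAG) pipelines, semantic search, and other AI-driven scenarios.

The API exposes two entry points as extension methods on Document.

Method	Description
`GetChunksAsync`	Converts the PDF to Markdown and splits it into `DocumentChunk` objects.
`IngestAsync`	Runs the full pipeline: splits the document into chunks, generates embeddings, and upserts the records into a vector store.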

Prerequisites

Aspose.PDF for .NET
Microsoft.Extensions.AI, which provides IEmbeddingGenerator<string, Embedding<float>>
Microsoft.Extensions.VectorData, which provides VectorStoreCollection<string, DocumentChunk>
An embedding provider (for example OpenAI, Azure OpenAI, or any IEmbeddingGenerator implementation)
A vector store (for example Azure AI Search, Qdrant, in-memory, or any VectorStoreCollection implementation)

Chunking Options

Configure chunking behavior through ChunkingOptions.

var options = new Aspose.Pdf.AI.ChunkingOptions();

var customOptions = new Aspose.Pdf.AI.ChunkingOptions
{
    MaxChunkSize = 512,
    OverlapSize  = 50
};

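Property	Default	Range	Description
`MaxChunkSize`	`1000`	50–10 000	Maximum number of tokens per chunk.
`OverlapSize`	`100`	0–(MaxChunkSize−1)	Number of tokens shared between consecutive chunks, which preserves context across chunk boundaries.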
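
To get a feel for how the two settings interact, the sketch below models chunking as a plain sliding window in which each new chunk starts `MaxChunkSize - OverlapSize` tokens after the previous one. This model is an assumption made for illustration only: the library splits Markdown derived from the PDF, so its real chunk boundaries and counts may differ. `ChunkMath` and `EstimateChunkCount` are hypothetical names and are not part of Aspose.PDF.

```csharp
using System;

// Hypothetical helper, not part of Aspose.PDF.
// Models chunking as a simple sliding window over a token sequence.
public static class ChunkMath
{
    // Estimates how many chunks a text of totalTokens tokens produces when each
    // chunk holds at most maxChunkSize tokens and repeats the last overlapSize
    // tokens of the previous chunk.
    public static int EstimateChunkCount(int totalTokens, int maxChunkSize, int overlapSize)
    {
        if (totalTokens < 0)
            throw new ArgumentOutOfRangeException(nameof(totalTokens));
        if (maxChunkSize < 1)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlapSize < 0 || overlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlapSize));

        if (totalTokens == 0) return 0;
        if (totalTokens <= maxChunkSize) return 1;

        // Each chunk after the first contributes (maxChunkSize - overlapSize) new tokens.
        int stride = maxChunkSize - overlapSize;
        int remaining = totalTokens - maxChunkSize;

        // Integer ceiling division of remaining by stride, plus the first chunk.
        return 1 + (remaining + stride - 1) / stride;
    }
}
```

Under this model, a 10 000-token document yields 11 chunks with the defaults (1000/100) and 22 chunks with `MaxChunkSize = 512` and `OverlapSize = 50`. A larger overlap keeps more context at each boundary but increases the number of chunks to embed and store.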

Getting Chunks

Use GetChunksAsync when you need only the text chunks, for example to inspect them, post-process them, or feed them into a custom embedding pipeline.

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

        foreach (var chunk in chunks)
        {
            Console.WriteLine($"Chunk {chunk.Index}: {chunk.Id}");
            Console.WriteLine(chunk.Content);
        }
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task GetDocumentChunks()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    var chunks = await document.GetChunksAsync(options, sourceId: "my-document");

    foreach (var chunk in chunks)
    {
        Console.WriteLine($"Chunk {chunk.Index}: {chunk.Id}");
        Console.WriteLine(chunk.Content);
    }
}

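Chunking, Embedding, and Ingesting into a Vector Store

Use IngestAsync to run the full pipeline in a single call: the document is split into chunks, an embedding is generated for each chunk, and all records are upserted into the vector store.

Because the API is built on the Microsoft.Extensions.AI and Microsoft.Extensions.VectorData abstractions, any embedding provider and any vector store that implement them can be used interchangeably.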

OpenAI

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new OpenAIClient(ApiKey)
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var collection = new InMemoryVectorStore()
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

Azure OpenAI and Azure AI Search

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using (var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf"))
    {
        var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

        await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static async Task IngestDocument()
{
    var dataDir = RunExamples.GetDataDir_AsposePdf_AI();

    var embeddingGenerator = new AzureOpenAIClient(new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(AzureOpenAIKey))
        .AsEmbeddingGenerator("text-embedding-ada-002");

    var searchClient = new SearchIndexClient(new Uri(AzureSearchEndpoint), new AzureKeyCredential(AzureSearchKey));
    var collection = new AzureAISearchVectorStore(searchClient)
        .GetCollection<string, Aspose.Pdf.AI.DocumentChunk>("pdf-chunks");

    await collection.EnsureCollectionExistsAsync();

    using var document = new Aspose.Pdf.Document(dataDir + "RagGuide.pdf");

    var options = new Aspose.Pdf.AI.ChunkingOptions { MaxChunkSize = 512, OverlapSize = 50 };

    await document.IngestAsync(options, "my-document", embeddingGenerator, collection);
}

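DocumentChunk Reference

Each DocumentChunk returned by GetChunksAsync or stored by IngestAsync has the following properties.

Property	Type	Description
`Id`	`string`	Unique identifier in the format `{sourceId}_chunk_{index}`.
`Index`	`int`	Zero-based position of the chunk in the document.
`Content`	`string`	The text content of the chunk. Stored in the vector store as full-text search data.
`Context`	`string`	Structural context extracted from the document, such as the heading path. Helps a RAG pipeline understand where in the document the chunk came from.
`Embedding`	`ReadOnlyMemory<float>?`	The vector embedding, populated by `IngestAsync`. It is `null` when only `GetChunksAsync` is used.
`Metadata`	`IDictionary<string, string>`	An extensible key-value store for custom metadata.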
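
Because `Id` follows the `{sourceId}_chunk_{index}` pattern, application code can build or take apart identifiers without loading the chunk itself, for example to group search hits by source document. The following is a minimal sketch under that assumption; `ChunkIds`, `Build`, and `TryParse` are hypothetical helper names and are not part of Aspose.PDF.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Aspose.PDF.
// Works with identifiers of the form "{sourceId}_chunk_{index}".
public static class ChunkIds
{
    private const string Separator = "_chunk_";

    // Builds an identifier from a source id and a zero-based chunk index.
    public static string Build(string sourceId, int index)
    {
        return sourceId + Separator + index.ToString(CultureInfo.InvariantCulture);
    }

    // Splits an identifier back into its source id and index.
    // Uses the last separator so that a source id containing "_chunk_" still parses.
    public static bool TryParse(string id, out string sourceId, out int index)
    {
        sourceId = null;
        index = -1;
        if (id == null) return false;

        int pos = id.LastIndexOf(Separator, StringComparison.Ordinal);
        if (pos <= 0) return false; // separator missing or source id empty

        string tail = id.Substring(pos + Separator.Length);
        int parsed;
        // NumberStyles.None accepts decimal digits only (no sign, no whitespace).
        if (!int.TryParse(tail, NumberStyles.None, CultureInfo.InvariantCulture, out parsed))
            return false;

        sourceId = id.Substring(0, pos);
        index = parsed;
        return true;
    }
}
```

With the `sourceId` used in the samples on this page, `ChunkIds.Build("my-document", 3)` returns `my-document_chunk_3`, and parsing that string gives back `my-document` and `3`.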

Vector Store Schema

Use DocumentChunk.GetVectorDefinition(dimensions) to obtain a collection schema that matches the output size of your embedding model.

var definition = Aspose.Pdf.AI.DocumentChunk.GetVectorDefinition(dimensions: 1536);
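
The value 1536 is the output size of `text-embedding-ada-002`, the model used in the samples on this page. A mismatch between the schema dimensions and the model output is an easy mistake to make when switching models, so a small guard can help. The sketch below is an illustration only: `EmbeddingDimensions` is a hypothetical helper, not part of Aspose.PDF, and its table lists the default output sizes of three OpenAI embedding models. With Azure OpenAI the string passed to the client is a deployment name, so a lookup by model name may not apply there.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.PDF.
public static class EmbeddingDimensions
{
    // Default output sizes of some OpenAI embedding models.
    private static readonly Dictionary<string, int> Known =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            { "text-embedding-ada-002", 1536 },
            { "text-embedding-3-small", 1536 },
            { "text-embedding-3-large", 3072 },
        };

    // Looks up the default vector size for a model name; returns false if unknown.
    public static bool TryGet(string modelName, out int dimensions)
    {
        return Known.TryGetValue(modelName, out dimensions);
    }

    // Throws if an embedding does not have the size the collection schema expects.
    public static void EnsureMatches(ReadOnlyMemory<float> embedding, int expectedDimensions)
    {
        if (embedding.Length != expectedDimensions)
        {
            throw new InvalidOperationException(
                "Embedding has " + embedding.Length + " dimensions, but the schema expects "
                + expectedDimensions + ".");
        }
    }
}
```

The looked-up value can then be passed as the `dimensions` argument of `GetVectorDefinition`, and `EnsureMatches` can be called on a sample embedding before ingesting a large batch.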
