Extract Text from PDF Using OCR in C#

Overview

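Regular text extraction reads the text layer of a PDF document. When a page is a scanned image or has no selectable text, classes such as TextFragmentAbsorber return nothing because there is no text to read.

For these cases, Aspose.PDF for .NET provides the OcrTextAbsorber class (namespace Aspose.Pdf.Ocr). It uses OCR (optical character recognition) to recognize plain text on the pages of any PDF document and returns it as a string. It follows the standard Aspose.PDF absorber/visitor pattern, so it plugs into the same Accept entry points as the other absorbers.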
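The choice between the text layer and OCR can be made at run time by trying the cheaper extractor first. The following is a minimal sketch of that fallback, not part of Aspose.PDF: the class name TextExtractionFallback and the method FirstNonBlank are hypothetical, and the extractors are passed in as plain delegates so the logic does not depend on any particular library call.

```csharp
using System;

public static class TextExtractionFallback
{
    // Hypothetical helper (not an Aspose.PDF API).
    // Runs each extractor in order and returns the first result that contains
    // non-whitespace text. Later extractors are not invoked once one succeeds.
    // Returns string.Empty when no extractor produces any text.
    public static string FirstNonBlank(params Func<string>[] extractors)
    {
        foreach (var extract in extractors)
        {
            string text = extract();
            if (!string.IsNullOrWhiteSpace(text))
            {
                return text;
            }
        }

        return string.Empty;
    }
}
```

In practice the first delegate could wrap a regular text-layer absorber and the second an OcrTextAbsorber, so that the more expensive OCR pass runs only when the text layer yields nothing.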

Recognize Text on a Single PDF Page

Create an OcrTextAbsorber, call the page's Accept method, and then read the result from the Text property. Calling absorber.Visit(page) is directly equivalent to calling page.Accept(absorber).

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextOnPage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Create OCR text absorber
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

        // Recognize text on the first page
        document.Pages[1].Accept(absorber);

        // Get the recognized text
        string pageText = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextOnPage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Create OCR text absorber
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

    // Recognize text on the first page
    document.Pages[1].Accept(absorber);

    // Get the recognized text
    string pageText = absorber.Text;
}

Recognize Text in an Entire PDF Document

To recognize all pages, call the Accept method of the Pages collection. The recognized text of each page is joined using the page separator from the options.

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextInDocument()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Create OCR text absorber
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

        // Recognize text on every page; page texts are joined with the page separator
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextInDocument()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Create OCR text absorber
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

    // Recognize text on every page; page texts are joined with the page separator
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}

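Configure Recognition Options

Recognition is configured with an OcrTextRecognitionOptions object passed to the constructor. The same options are also available after construction through the absorber's Options property, and changing them affects the next recognition call.

Member	Default	Meaning	Validation
`Language`	`OcrLanguage.English`	Recognition language.	None
`Resolution`	`300`	Recognition resolution (DPI). The practical range is about 200 to 600. Higher values use more memory and CPU with only marginal accuracy gains.	A value `<= 0` throws `ArgumentOutOfRangeException`.
`PageSeparator`	`"\n\n"`	Inserted between the recognized text of consecutive pages (not before the first page). `string.Empty` concatenates pages with no separator.	Setting it to `null` throws `ArgumentNullException`.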

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithOptions()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Configure recognition options
        var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
        options.Language = Aspose.Pdf.Ocr.OcrLanguage.Russian;
        options.Resolution = 400;          // higher DPI for small or low-quality text
        options.PageSeparator = "\n---\n"; // custom separator between pages

        // Create OCR text absorber with the options
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithOptions()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Configure recognition options
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Russian;
    options.Resolution = 400;          // higher DPI for small or low-quality text
    options.PageSeparator = "\n---\n"; // custom separator between pages

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}
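When Resolution and PageSeparator come from user input or a configuration file, it can help to check them before they reach the options object. The sketch below mirrors the validation rules listed in the table of this section. It is an illustration only: OcrSettingsGuard and its members are hypothetical names, not part of Aspose.PDF, and the 200 to 600 bounds are the approximate practical range quoted in the text rather than a hard limit.

```csharp
using System;

public static class OcrSettingsGuard
{
    // Hypothetical pre-check (not an Aspose.PDF API) that applies the same
    // rules the options table documents: a non-positive resolution and a
    // null page separator are rejected.
    public static void Validate(int resolution, string pageSeparator)
    {
        if (resolution <= 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(resolution), resolution, "Resolution must be greater than zero.");
        }

        if (pageSeparator == null)
        {
            throw new ArgumentNullException(nameof(pageSeparator));
        }
    }

    // Returns true when the value falls inside the approximate 200-600 DPI
    // range the article describes as practical. Values outside it are still
    // valid as long as they are positive.
    public static bool IsPracticalResolution(int resolution)
    {
        return resolution >= 200 && resolution <= 600;
    }
}
```

Calling Validate before assigning the values lets an application report bad configuration with its own message instead of surfacing the exception from the property setter.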

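Automatic Language Detection

When the document language is unknown, set Language to OcrLanguage.Auto to detect it automatically. The recognition language is selected with the OcrLanguage enumeration, which supports English (the default), Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Kazakh, Korean, Polish, Portuguese, Russian, Spanish, Ukrainian, and Auto.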

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithAutoLanguage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Configure recognition options with automatic language detection
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Auto; // detect the language automatically

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithAutoLanguage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Configure recognition options with automatic language detection
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Auto; // detect the language automatically

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}
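If the language name arrives as a string (for example from a command-line argument), one way to fall back to automatic detection is to parse the name against the enumeration and use a default when parsing fails. The helper below is a generic sketch using only standard .NET calls; EnumParsing and ParseOrDefault are hypothetical names, and the helper is written for any enum type so it makes no assumption about member names beyond those shown in this article.

```csharp
using System;

public static class EnumParsing
{
    // Hypothetical helper (not an Aspose.PDF API).
    // Parses 'name' case-insensitively as a member of TEnum. Returns
    // 'fallback' when the name is null, blank, unknown, or parses to a
    // value that does not correspond to a named member of the enum.
    public static TEnum ParseOrDefault<TEnum>(string name, TEnum fallback)
        where TEnum : struct, Enum
    {
        if (!string.IsNullOrWhiteSpace(name)
            && Enum.TryParse(name.Trim(), ignoreCase: true, out TEnum parsed)
            && Enum.IsDefined(typeof(TEnum), parsed))
        {
            return parsed;
        }

        return fallback;
    }
}
```

With the API described in this article, a call might look like EnumParsing.ParseOrDefault(userInput, Aspose.Pdf.Ocr.OcrLanguage.Auto), so that any unrecognized name results in automatic detection. Only the English, Russian, and Auto member names appear in the samples here, so the exact spelling of the other members should be checked against the enumeration itself.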

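How Recognition Results Are Returned

Text is replaced, not accumulated. Each Accept/Visit call overwrites Text with the result of that call. To keep several results, read Text after each call. It is string.Empty before the first call and for a document with no pages.
Joining multiple pages. The text of each page is concatenated using Options.PageSeparator (default "\n\n"). No separator is added before the first page. string.Empty joins pages with no separator.
Resolution. 300 DPI is the default and a practical optimum; roughly 200 to 600 is the useful range.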
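The joining rule can be pictured with plain strings. The sketch below is not the library's implementation; it only reproduces the documented shape of the output (separator between consecutive pages, none before the first, empty result for no pages) using string.Join. PageTextJoiner and JoinPages are hypothetical names, and the sample page texts are made up.

```csharp
using System;
using System.Collections.Generic;

public static class PageTextJoiner
{
    // Hypothetical illustration (not an Aspose.PDF API) of the documented
    // joining behavior: the separator is placed only between consecutive
    // page texts, and an empty sequence yields string.Empty.
    public static string JoinPages(IEnumerable<string> pageTexts, string pageSeparator = "\n\n")
    {
        if (pageTexts == null)
        {
            throw new ArgumentNullException(nameof(pageTexts));
        }

        if (pageSeparator == null)
        {
            throw new ArgumentNullException(nameof(pageSeparator));
        }

        return string.Join(pageSeparator, pageTexts);
    }
}
```

For example, joining "Page 1" and "Page 2" with the default separator gives the two texts with one blank line between them, while passing string.Empty runs them together with nothing in between.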

Extract SuperScripts and SubScripts Text from PDF