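Extract Text from PDF Using OCR in C#

Overview

Regular text extraction reads the text layer of a PDF document. When a page is a scanned image or has no selectable text, classes such as TextFragmentAbsorber return nothing, because there is no text to read.

For these cases, Aspose.PDF for .NET provides the OcrTextAbsorber class (namespace Aspose.Pdf.Ocr). It uses OCR (optical character recognition) to recognize plain text on any page of a PDF document and returns it as a string. It follows the standard Aspose.PDF absorber/visitor pattern, so it plugs into the same Accept entry points as the other absorbers.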

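Recognize Text on a Single PDF Page

Create an OcrTextAbsorber, call the page's Accept method, then read the result from the Text property. Calling absorber.Visit(page) is directly equivalent to calling page.Accept(absorber).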

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextOnPage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Create OCR text absorber
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

        // Recognize text on the first page
        document.Pages[1].Accept(absorber);

        // Get the recognized text
        string pageText = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextOnPage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Create OCR text absorber
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

    // Recognize text on the first page
    document.Pages[1].Accept(absorber);

    // Get the recognized text
    string pageText = absorber.Text;
}
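
The visitor-style call mentioned above can be sketched as follows. This is a minimal illustration that assumes Visit accepts a Page exactly as the text describes; the class and method names (OcrVisitExample, RecognizeFirstPage) are hypothetical and not part of Aspose.PDF.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

internal static class OcrVisitExample
{
    // Hypothetical helper: recognizes the first page through Visit instead of Accept.
    internal static string RecognizeFirstPage(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            var absorber = new OcrTextAbsorber();

            // Per the description above, this is equivalent to
            // document.Pages[1].Accept(absorber)
            absorber.Visit(document.Pages[1]);

            // Return the recognized text of that page
            return absorber.Text;
        }
    }
}
```

Both forms should leave the same value in Text, so the choice between them is a matter of which reads better at the call site.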

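Recognize Text in an Entire PDF Document

Call the Accept method of the Pages collection to recognize every page. The recognized text of each page is joined using the page separator from the options.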

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextInDocument()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Create OCR text absorber
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

        // Recognize text on every page; page texts are joined with the page separator
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextInDocument()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Create OCR text absorber
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

    // Recognize text on every page; page texts are joined with the page separator
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}
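
Because the pages are joined with a separator, the combined string can be split back into per-page strings afterwards. The sketch below uses only the standard library; SplitPages is a hypothetical helper, not part of Aspose.PDF. It assumes the separator does not also occur inside a page's recognized text, which is unlikely to hold for a plain blank-line separator, so a distinctive separator is the safer choice.

```csharp
using System;

internal static class PageSplitExample
{
    // Hypothetical helper: splits combined OCR text back into per-page strings.
    internal static string[] SplitPages(string combinedText, string pageSeparator)
    {
        if (combinedText == null)
            throw new ArgumentNullException(nameof(combinedText));

        // An empty separator leaves no boundary to split on
        if (string.IsNullOrEmpty(pageSeparator))
            throw new ArgumentException(
                "A non-empty separator is required to split pages.",
                nameof(pageSeparator));

        // Keep empty entries so that blank pages still occupy a slot
        return combinedText.Split(new[] { pageSeparator }, StringSplitOptions.None);
    }
}
```

For example, SplitPages("page one\n---\npage two", "\n---\n") yields two entries, "page one" and "page two".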

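Configure Recognition Options

Recognition is configured through an OcrTextRecognitionOptions object passed to the constructor. After creation, the same options are available through the absorber's Options property; changing them affects the next recognition call.

Member	Default	Meaning	Validation
`Language`	`OcrLanguage.English`	Recognition language.	None
`Resolution`	`300`	Recognition resolution in DPI. The practical range is roughly 200–600. Higher values use more memory and CPU but improve accuracy only marginally.	Throws `ArgumentOutOfRangeException` if `<= 0`.
`PageSeparator`	`"\n\n"`	Inserted between the recognized text of consecutive pages (not before the first page). `string.Empty` joins pages with no gap.	Throws `ArgumentNullException` if set to `null`.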

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithOptions()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Configure recognition options
        var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
        options.Language = Aspose.Pdf.Ocr.OcrLanguage.Russian;
        options.Resolution = 400;          // higher DPI for small or low-quality text
        options.PageSeparator = "\n---\n"; // custom separator between pages

        // Create OCR text absorber with the options
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithOptions()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Configure recognition options
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Russian;
    options.Resolution = 400;          // higher DPI for small or low-quality text
    options.PageSeparator = "\n---\n"; // custom separator between pages

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}
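
Options can also be adjusted on an existing absorber between calls. The sketch below assumes the Options property and the validation behavior behave as described in the table above; the class and method names (OcrOptionsExample, RecognizeTwice) are hypothetical.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

internal static class OcrOptionsExample
{
    // Hypothetical helper: runs two recognition passes with different options.
    internal static void RecognizeTwice(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            var absorber = new OcrTextAbsorber(new OcrTextRecognitionOptions());

            // First pass: first page only, default options
            document.Pages[1].Accept(absorber);
            string firstPageText = absorber.Text;

            // Change the options on the same absorber; they apply to the next call
            absorber.Options.Resolution = 400;
            absorber.Options.PageSeparator = string.Empty;
            document.Pages.Accept(absorber);
            string wholeDocumentText = absorber.Text;

            Console.WriteLine(firstPageText.Length);
            Console.WriteLine(wholeDocumentText.Length);

            try
            {
                // Assumption from the table above: non-positive values are rejected
                absorber.Options.Resolution = 0;
            }
            catch (ArgumentOutOfRangeException ex)
            {
                Console.WriteLine("Invalid resolution: " + ex.Message);
            }
        }
    }
}
```

Reusing one absorber avoids rebuilding the options object, but each result has to be copied out of Text before the next call runs.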

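Automatic Language Detection

When the document language is unknown, set Language to OcrLanguage.Auto to detect it automatically. The recognition language is selected through the OcrLanguage enumeration, which supports English (the default), Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Kazakh, Korean, Polish, Portuguese, Russian, Spanish, Ukrainian, and Auto.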

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithAutoLanguage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Configure recognition options with automatic language detection
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Auto; // detect the language automatically

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithAutoLanguage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Configure recognition options with automatic language detection
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Auto; // detect the language automatically

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}
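
When the language is known only for some documents, the choice between an explicit language and Auto can be wrapped in a small helper. This is a sketch under the assumption that OcrLanguage is an ordinary enumeration, as described above; OcrLanguageHintExample and CreateOptions are hypothetical names.

```csharp
using Aspose.Pdf.Ocr;

internal static class OcrLanguageHintExample
{
    // Hypothetical helper: uses the caller's language when it is known,
    // otherwise falls back to automatic detection.
    internal static OcrTextRecognitionOptions CreateOptions(OcrLanguage? knownLanguage)
    {
        var options = new OcrTextRecognitionOptions();
        options.Language = knownLanguage ?? OcrLanguage.Auto;
        return options;
    }
}
```

A caller would pass OcrLanguage.Russian for a document known to be in Russian and null otherwise, then hand the returned options to the OcrTextAbsorber constructor.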

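How Recognition Results Are Returned

Text is replaced, not accumulated. Every Accept/Visit call overwrites Text with the result of that call; to keep several results, read Text after each call. Before the first call, and for a document with no pages, it is string.Empty.
Multi-page merging. The text of each page is joined with Options.PageSeparator (default "\n\n"); no separator is added before the first page. string.Empty merges pages with no gap.
Resolution. 300 DPI is the default and a practical sweet spot; roughly 200–600 is the useful range.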
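
Since each call overwrites Text, keeping one result per page means copying the value out after every call. A minimal sketch, with hypothetical names (PerPageOcrExample, RecognizeEachPage) and relying on the 1-based Pages indexer used in the examples above:

```csharp
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

internal static class PerPageOcrExample
{
    // Hypothetical helper: returns one recognized string per page, in page order.
    internal static List<string> RecognizeEachPage(string pdfPath)
    {
        var pageTexts = new List<string>();

        using (var document = new Document(pdfPath))
        {
            var absorber = new OcrTextAbsorber();

            // Page numbers in Aspose.PDF start at 1
            for (int pageNumber = 1; pageNumber <= document.Pages.Count; pageNumber++)
            {
                document.Pages[pageNumber].Accept(absorber);

                // Copy the value now; the next Accept call overwrites Text
                pageTexts.Add(absorber.Text);
            }
        }

        return pageTexts;
    }
}
```

This keeps the page boundaries exact without depending on a separator string, at the cost of one Accept call per page.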

Extract Superscript and Subscript Text from PDF