Extract text from a PDF using OCR in C#

Overview

Ordinary text extraction reads the text layer of a PDF document. When a page is a scanned image or has no selectable text, classes such as TextFragmentAbsorber return nothing because there is no text to read.

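For these cases, Aspose.PDF for .NET provides the OcrTextAbsorber class (namespace Aspose.Pdf.Ocr). It uses OCR (optical character recognition) to recognize plain text on the pages of any PDF document and returns it as a string. It follows the standard Aspose.PDF absorber/visitor pattern, so it plugs into the same Accept entry points as the other absorbers.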

Recognize text on a single page of a PDF

Create an OcrTextAbsorber, call the Accept method of the page, and read the result from the Text property. Calling absorber.Visit(page) is directly equivalent to page.Accept(absorber).

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextOnPage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Create OCR text absorber
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

        // Recognize text on the first page
        document.Pages[1].Accept(absorber);

        // Get the recognized text
        string pageText = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextOnPage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Create OCR text absorber
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

    // Recognize text on the first page
    document.Pages[1].Accept(absorber);

    // Get the recognized text
    string pageText = absorber.Text;
}

Recognize text in an entire PDF document

Call the Accept method of the Pages collection to recognize every page. The recognized text of each page is joined using a configurable page separator.

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextInDocument()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Create OCR text absorber
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

        // Recognize text on every page; page texts are joined with the page separator
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextInDocument()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Create OCR text absorber
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber();

    // Recognize text on every page; page texts are joined with the page separator
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}

Configure recognition options

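Recognition is configured through an OcrTextRecognitionOptions object passed to the constructor. The same options remain available after creation through the absorber's Options property, and changing them affects the next recognition call.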

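| Member | Default | Meaning | Validation |
| --- | --- | --- | --- |
| `Language` | `OcrLanguage.English` | Recognition language. | None |
| `Resolution` | `300` | Recognition resolution in DPI. The practical range is about 200 to 600. Higher values consume more memory and CPU for only a marginal gain in accuracy. | Throws `ArgumentOutOfRangeException` if the value is `<= 0`. |
| `PageSeparator` | `"\n\n"` | Inserted between the recognized text of consecutive pages (not before the first page). `string.Empty` concatenates pages with no separator. | Throws `ArgumentNullException` if set to `null`. |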

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithOptions()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Configure recognition options
        var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
        options.Language = Aspose.Pdf.Ocr.OcrLanguage.Russian;
        options.Resolution = 400;          // higher DPI for small or low-quality text
        options.PageSeparator = "\n---\n"; // custom separator between pages

        // Create OCR text absorber with the options
        var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithOptions()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");

    // Configure recognition options
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Russian;
    options.Resolution = 400;          // higher DPI for small or low-quality text
    options.PageSeparator = "\n---\n"; // custom separator between pages

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}
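
The validation rules in the table and the note about the Options property can be combined in a small sketch. It assumes only the members described above (Resolution, PageSeparator, Options, Text). The class name, the helper BuildOptions, and the method RecognizeTwice are hypothetical names introduced for illustration and are not part of Aspose.PDF.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

internal static class OcrOptionsSketch
{
    // Hypothetical helper (not part of Aspose.PDF): applies caller-supplied values
    // only when they pass the documented validation, so no exception is raised.
    internal static OcrTextRecognitionOptions BuildOptions(int requestedDpi, string separator)
    {
        var options = new OcrTextRecognitionOptions();

        // Resolution must be greater than 0; otherwise keep the default (300 DPI)
        if (requestedDpi > 0)
        {
            options.Resolution = requestedDpi;
        }

        // PageSeparator must not be null; string.Empty joins pages with no separator
        options.PageSeparator = separator ?? string.Empty;

        return options;
    }

    // Hypothetical method: runs recognition twice with one absorber,
    // changing an option through the Options property between the calls.
    internal static void RecognizeTwice(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            var absorber = new OcrTextAbsorber(BuildOptions(0, null));

            // First pass uses the default resolution because 0 was rejected by the guard
            document.Pages.Accept(absorber);
            string firstPass = absorber.Text;

            // Per the text above, this change applies to the next recognition call
            absorber.Options.Resolution = 400;
            document.Pages.Accept(absorber);
            string secondPass = absorber.Text;

            Console.WriteLine($"First pass: {firstPass.Length} chars, second pass: {secondPass.Length} chars");
        }
    }
}
```

Checking values before assignment is one design choice. Catching ArgumentOutOfRangeException and ArgumentNullException around the assignments would be an alternative that relies on the same documented validation.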

Automatic language detection

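If the language of the document is unknown, set Language to OcrLanguage.Auto to detect it automatically. The recognition language is selected with the OcrLanguage enumeration, which supports English (the default), Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Kazakh, Korean, Polish, Portuguese, Russian, Spanish, Ukrainian, and Auto.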

.NET Core 3.1

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithAutoLanguage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Configure recognition options with automatic language detection
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Auto; // detect the language automatically

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        document.Pages.Accept(absorber);

        // Get the recognized text
        string text = absorber.Text;
    }
}

.NET 8

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void RecognizeTextWithAutoLanguage()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Text();

    // Configure recognition options with automatic language detection
    var options = new Aspose.Pdf.Ocr.OcrTextRecognitionOptions();
    options.Language = Aspose.Pdf.Ocr.OcrLanguage.Auto; // detect the language automatically

    // Create OCR text absorber with the options
    var absorber = new Aspose.Pdf.Ocr.OcrTextAbsorber(options);

    // Open PDF document
    using var document = new Aspose.Pdf.Document(dataDir + "input.pdf");
    document.Pages.Accept(absorber);

    // Get the recognized text
    string text = absorber.Text;
}

How recognition results are returned

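Text is replaced, not accumulated. Each call to Accept/Visit overwrites Text with the result of that call. To keep several results, read Text after each call. Before the first call, and for a document with no pages, Text is string.Empty.
Joining multiple pages. The text of each page is concatenated using Options.PageSeparator (default "\n\n"). No separator is added before the first page. string.Empty joins the pages with no separator.
Resolution. 300 DPI is the default and a practical optimum. About 200 to 600 is the useful range.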
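
Because each call overwrites Text, keeping per-page results means copying the value after every call. The following is a minimal sketch of that pattern, using only the Accept method and Text property described above. The class OcrPerPageSketch and the helper RecognizeEachPage are hypothetical names, not part of Aspose.PDF.

```csharp
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

internal static class OcrPerPageSketch
{
    // Hypothetical helper (not part of Aspose.PDF): returns one string per page,
    // in page order, by reading Text immediately after each Accept call.
    internal static List<string> RecognizeEachPage(string pdfPath)
    {
        var pageTexts = new List<string>();

        using (var document = new Document(pdfPath))
        {
            // One absorber is reused for all pages
            var absorber = new OcrTextAbsorber();

            foreach (Page page in document.Pages)
            {
                // Text now holds only this page's result; the previous value is gone
                page.Accept(absorber);
                pageTexts.Add(absorber.Text);
            }
        }

        return pageTexts;
    }
}
```

With the per-page strings in a list, the caller can join them with any separator later, for example with string.Join, instead of relying on PageSeparator.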

Extract superscript and subscript text from a PDF