Extract Text from PDF using OCR in C#

Overview

Regular text extraction reads the text layer of a PDF document. When a page is a scanned image or otherwise contains no selectable text, classes such as TextFragmentAbsorber return nothing because there is no text to extract.

For these cases, Aspose.PDF for .NET provides the OcrTextAbsorber class (namespace Aspose.Pdf.Ocr). It recognizes plain text on the pages of any PDF document using OCR (Optical Character Recognition) and returns it as a string. It follows the standard Aspose.PDF absorber/visitor pattern, so it plugs into the same Accept entry points as other absorbers.

Recognize Text on a Single PDF Page

Create an OcrTextAbsorber, call the Accept method of the page, and read the result from the Text property. The absorber.Visit(page) call is a direct equivalent of page.Accept(absorber).
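The steps above can be sketched as follows. This is a minimal example rather than a definitive implementation: the file name scanned.pdf is a placeholder, and it assumes OcrTextAbsorber has a parameterless constructor that applies the default options.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

class SinglePageOcr
{
    static void Main()
    {
        // "scanned.pdf" is a placeholder path for a PDF whose pages are scanned images.
        using (var document = new Document("scanned.pdf"))
        {
            // Assumption: a parameterless constructor uses the default recognition options.
            var absorber = new OcrTextAbsorber();

            // Page indexes in Aspose.PDF are 1-based, so this is the first page.
            document.Pages[1].Accept(absorber);
            // Equivalent form described above: absorber.Visit(document.Pages[1]);

            // Text holds the recognized plain text of that page.
            Console.WriteLine(absorber.Text);
        }
    }
}
```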

Recognize Text in a Whole PDF Document

Call the Accept method of the Pages collection to recognize every page. The recognized text of each page is joined using the page separator from the options.
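A whole-document run might look like the sketch below. The input and output paths are placeholders, and the parameterless OcrTextAbsorber constructor is again an assumption.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

class WholeDocumentOcr
{
    static void Main()
    {
        // Placeholder paths; replace them with real locations.
        using (var document = new Document("scanned.pdf"))
        {
            // Assumption: default options (English, 300 DPI, "\n\n" between pages).
            var absorber = new OcrTextAbsorber();

            // Recognize every page in the collection in one call.
            document.Pages.Accept(absorber);

            // Text now contains all pages joined with the page separator.
            File.WriteAllText("recognized.txt", absorber.Text);
        }
    }
}
```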

Configure Recognition Options

Recognition is configured with an OcrTextRecognitionOptions object passed to the constructor. The same options are also available after construction through the absorber’s Options property, and changing them affects the next recognition call.

| Member | Default | Meaning | Validation |
| --- | --- | --- | --- |
| Language | OcrLanguage.English | Recognition language. | |
| Resolution | 300 | Recognition resolution, in DPI. Practical range ~200–600. Higher values cost memory/CPU with little accuracy gain. | Throws ArgumentOutOfRangeException if <= 0. |
| PageSeparator | "\n\n" | Inserted between consecutive pages’ recognized text (not before the first page). string.Empty concatenates pages with no break. | Throws ArgumentNullException if set to null. |
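As an illustration of both configuration routes, the sketch below passes options to the constructor and then adjusts one of them through the Options property. It assumes OcrTextRecognitionOptions has a parameterless constructor and settable properties that can be used in an object initializer. The file name and the chosen values are placeholders.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

class ConfiguredOcr
{
    static void Main()
    {
        // Assumption: the options class can be filled with an object initializer.
        var options = new OcrTextRecognitionOptions
        {
            Language = OcrLanguage.German,   // recognition language
            Resolution = 400,                // DPI; must be greater than 0
            PageSeparator = "\n----\n"       // must not be null
        };

        using (var document = new Document("scanned.pdf")) // placeholder path
        {
            var absorber = new OcrTextAbsorber(options);
            document.Pages.Accept(absorber);
            Console.WriteLine(absorber.Text);

            // Changing an option after construction affects the next call only.
            absorber.Options.Resolution = 200;
            document.Pages[1].Accept(absorber);
            Console.WriteLine(absorber.Text);
        }
    }
}
```

Lowering the resolution for the second call trades some accuracy for speed, which can be useful for a quick preview of a single page.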

Automatic Language Detection

When the document language is unknown, set Language to OcrLanguage.Auto to detect it automatically. The recognition language is selected with the OcrLanguage enumeration, which supports English (default), Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Kazakh, Korean, Polish, Portuguese, Russian, Spanish, Ukrainian, and Auto.
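A minimal sketch of automatic detection, under the same assumptions as the other examples (placeholder file name, options set through an object initializer):

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

class AutoLanguageOcr
{
    static void Main()
    {
        // Let the recognizer pick the language instead of assuming English.
        var options = new OcrTextRecognitionOptions { Language = OcrLanguage.Auto };

        using (var document = new Document("unknown-language.pdf")) // placeholder path
        {
            var absorber = new OcrTextAbsorber(options);
            document.Pages.Accept(absorber);
            Console.WriteLine(absorber.Text);
        }
    }
}
```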

How Recognition Results Are Returned

  • Text is replaced, not accumulated. Every Accept/Visit call overwrites Text with the result of that call, so copy the value after each call if you need to keep several results. Text is string.Empty before the first call and for a document with no pages.
  • Multi-page join. Per-page texts are concatenated using Options.PageSeparator (default "\n\n"); no separator is added before the first page. string.Empty joins pages with no break.
  • Resolution. 300 DPI is the default and practical sweet spot; ~200–600 is the useful range.
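Because Text is overwritten on every call, keeping per-page results means copying the value after each Accept. The sketch below shows one way to do that. The file name is a placeholder and the parameterless OcrTextAbsorber constructor is an assumption.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

class PerPageResults
{
    static void Main()
    {
        using (var document = new Document("scanned.pdf")) // placeholder path
        {
            var absorber = new OcrTextAbsorber(); // assumption: default options
            var pageTexts = new List<string>();

            foreach (Page page in document.Pages)
            {
                // Each call replaces absorber.Text with this page's result...
                page.Accept(absorber);
                // ...so store a copy before recognizing the next page.
                pageTexts.Add(absorber.Text);
            }

            for (int i = 0; i < pageTexts.Count; i++)
            {
                Console.WriteLine($"Page {i + 1}: {pageTexts[i].Length} characters");
            }
        }
    }
}
```

Recognizing pages one at a time also avoids the page separator entirely, which is convenient when each page's text is stored or indexed separately.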