Regular text extraction reads the text layer of a PDF document. When a page is a scanned image or otherwise contains no selectable text, classes such as TextFragmentAbsorber return nothing because there is no text to extract.
For these cases, Aspose.PDF for .NET provides the OcrTextAbsorber class (namespace Aspose.Pdf.Ocr). It recognizes plain text on the pages of any PDF document using OCR (Optical Character Recognition) and returns it as a string. It follows the standard Aspose.PDF absorber/visitor pattern, so it plugs into the same Accept entry points as other absorbers.
Create an OcrTextAbsorber, call the Accept method of the page, and read the result from the Text property. The absorber.Visit(page) call is a direct equivalent of page.Accept(absorber).
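The single-page case might look like the sketch below. It is written as top-level statements (C# 9 or later). The file name `scanned.pdf` is a placeholder, and the parameterless `OcrTextAbsorber` constructor is an assumption, since the text only confirms a constructor that takes options.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

// "scanned.pdf" is a placeholder path for a PDF whose pages are scanned images.
using (var document = new Document("scanned.pdf"))
{
    // Assumption: a parameterless constructor exists and uses the default options.
    var absorber = new OcrTextAbsorber();

    // Page indexes in Aspose.PDF start at 1.
    Page page = document.Pages[1];

    // Run OCR on this page; absorber.Visit(page) is described as equivalent.
    page.Accept(absorber);

    // Text holds the recognized text of the page as a string.
    Console.WriteLine(absorber.Text);
}
```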
Call the Accept method of the Pages collection to recognize every page. The recognized text of each page is joined using the page separator from the options.
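A minimal sketch of whole-document recognition, under the same assumptions as any other use of this class (placeholder file name `scanned.pdf`, and a parameterless `OcrTextAbsorber` constructor that is not confirmed by the text):

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

// "scanned.pdf" is a placeholder path.
using (var document = new Document("scanned.pdf"))
{
    // Assumption: parameterless constructor with default options.
    var absorber = new OcrTextAbsorber();

    // Recognize every page in the document in one call.
    document.Pages.Accept(absorber);

    // Text contains all pages, joined with the page separator from the options.
    Console.WriteLine(absorber.Text);
}
```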
Recognition is configured with an OcrTextRecognitionOptions object passed to the constructor. The same options are also available after construction through the absorber’s Options property, and changing them affects the next recognition call.
| Member | Default | Meaning | Validation |
|---|---|---|---|
| Language | OcrLanguage.English | Recognition language. | — |
| Resolution | 300 | Recognition resolution, in DPI. Practical range ~200–600. Higher values cost memory/CPU with little accuracy gain. | Throws ArgumentOutOfRangeException if <= 0. |
| PageSeparator | "\n\n" | Inserted between consecutive pages’ recognized text (not before the first page). string.Empty concatenates pages with no break. | Throws ArgumentNullException if set to null. |
The recognition language is selected with the OcrLanguage enumeration, which supports English (default), Arabic, Chinese, French, German, Indonesian, Italian, Japanese, Kazakh, Korean, Polish, Portuguese, Russian, Spanish, Ukrainian, and Auto. When the document language is unknown, set Language to OcrLanguage.Auto to detect it automatically.
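Configuring recognition could be sketched as follows. The member names come from the options table; the object-initializer form assumes that `OcrTextRecognitionOptions` has a public parameterless constructor and settable properties, and the file name and the specific values (400 DPI, a dashed separator) are illustrative only.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

// Assumption: public parameterless constructor and settable properties.
var options = new OcrTextRecognitionOptions
{
    Language = OcrLanguage.Auto,   // detect the language automatically
    Resolution = 400,              // DPI; must be greater than 0
    PageSeparator = "\n-----\n"    // must not be null
};

// "scanned.pdf" is a placeholder path.
using (var document = new Document("scanned.pdf"))
{
    var absorber = new OcrTextAbsorber(options);
    document.Pages.Accept(absorber);
    Console.WriteLine(absorber.Text);

    // The same options are reachable through the Options property;
    // a change here applies to the next recognition call.
    absorber.Options.Language = OcrLanguage.German;
    document.Pages[1].Accept(absorber);
    Console.WriteLine(absorber.Text);
}
```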
Text is replaced, not accumulated. Every Accept/Visit call overwrites Text with the result of that call; to keep multiple results, read it after each call. It is string.Empty before the first call and for a document with no pages. When several pages are recognized, their text is joined with Options.PageSeparator (default "\n\n"); no separator is added before the first page. string.Empty joins pages with no break.
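Because each call overwrites Text, per-page results have to be copied out as they are produced. One way to do that is sketched below; the list name `perPage`, the file name, and the parameterless `OcrTextAbsorber` constructor are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Ocr;

// "scanned.pdf" is a placeholder path.
using (var document = new Document("scanned.pdf"))
{
    // Assumption: parameterless constructor with default options.
    var absorber = new OcrTextAbsorber();

    // Hypothetical container for the text of each page.
    var perPage = new List<string>();

    foreach (Page page in document.Pages)
    {
        page.Accept(absorber);

        // Copy the result now; the next Accept call replaces absorber.Text.
        perPage.Add(absorber.Text);
    }

    Console.WriteLine($"Recognized {perPage.Count} page(s).");
}
```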