Basic Text Extraction using Java

Basic text extraction is the starting point for reading PDF content in Java. Aspose.PDF provides two common approaches:

Use TextAbsorber when you need a plain text result from a document or page.
Use ParagraphAbsorber when you need to preserve page, section, paragraph, line, and fragment grouping.

Unlike a word-processing document, a PDF page does not store text as a continuous flow, so the order of the extracted text depends on the page content stream and layout. For region-specific extraction, geometry details, multi-column layouts, annotations, highlighted text, or superscript and subscript detection, use the related extraction articles in this section.
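The effect of content stream order can be illustrated with a toy model. The sketch below is plain Java and does not use Aspose.PDF. The names ReadingOrderSketch and Chunk are hypothetical, and the sorting rule (top to bottom, then left to right) is a simplifying assumption, not a description of how any absorber orders text.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are not part of Aspose.PDF.
final class ReadingOrderSketch {

    // A piece of text and the position of its lower-left corner.
    // As in PDF user space, y grows upward from the bottom of the page.
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the chunks in the order given, which models content stream order.
    static String inStreamOrder(List<Chunk> chunks) {
        List<String> parts = new ArrayList<>();
        for (Chunk chunk : chunks) {
            parts.add(chunk.text);
        }
        return String.join(" ", parts);
    }

    // Joins the chunks top to bottom, then left to right (assumed reading order).
    static String inVisualOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> {
            int byRow = Double.compare(b.y, a.y); // higher y comes first
            return byRow != 0 ? byRow : Double.compare(a.x, b.x);
        });
        return inStreamOrder(sorted);
    }
}
```

If a page draws "footer" at (50, 40), then "World" at (120, 700), then "Hello" at (50, 700), inStreamOrder returns "footer World Hello" while inVisualOrder returns "Hello World footer". Real pages are more complicated (columns, rotation, overlapping text), which is why the related articles cover layout-aware extraction.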

Extract text from all pages

Use TextAbsorber to collect a flat text stream from the whole document and write it to a file. This is the simplest option when you only need the readable text content and do not need paragraph boundaries or coordinates.

Open the source PDF in a Document instance.
Create a TextAbsorber to accumulate text across the whole document.
Call document.getPages().accept(textAbsorber) so every Page is visited by the absorber.
Write the extracted text buffer to the output file.

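import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractTextFromAllPages(Path inputFile, Path outputFile) throws Exception {
    // try-with-resources closes the document when extraction is finished.
    try (Document document = new Document(inputFile.toString())) {
        TextAbsorber textAbsorber = new TextAbsorber();
        // Visit every page so the absorber accumulates text from the whole document.
        document.getPages().accept(textAbsorber);
        Files.writeString(outputFile, textAbsorber.getText());
    }
}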

Extract text from a specific page

Apply the absorber only to the page you need. Page numbers in the Document pages collection are 1-based, so get_Item(1) reads the first page.

Open the source PDF in a Document instance.
Create a TextAbsorber for single-page extraction.
Call accept(textAbsorber) on the target Page selected by page number.
Write the extracted text buffer to the output file.

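import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractTextFromPage(Path inputFile, Path outputFile, int pageNumber) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        TextAbsorber textAbsorber = new TextAbsorber();
        // pageNumber is 1-based: 1 selects the first page of the document.
        document.getPages().get_Item(pageNumber).accept(textAbsorber);
        Files.writeString(outputFile, textAbsorber.getText());
    }
}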
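Because page numbers are 1-based, values such as 0 or a number past the last page are invalid. The following is a minimal sketch of a guard that could run before the page lookup. PageNumbers and requireValid are hypothetical names, not part of Aspose.PDF, and the sketch assumes the caller passes in the page count (for example, the value of document.getPages().size()).

```java
// Hypothetical helper for illustration; not part of Aspose.PDF.
final class PageNumbers {

    private PageNumbers() {
        // Utility class, not meant to be instantiated.
    }

    // Returns pageNumber unchanged when it lies in 1..pageCount (inclusive);
    // otherwise throws IllegalArgumentException with a descriptive message.
    static int requireValid(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page number " + pageNumber + " is outside the range 1.." + pageCount);
        }
        return pageNumber;
    }
}
```

A caller could wrap the lookup argument in this check, so that an out-of-range pageNumber fails with a clear message before any page is requested.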

Extract text by paragraph structure

Use ParagraphAbsorber when you need structural grouping instead of a single plain text stream. It returns page markups with sections, paragraphs, lines, and TextFragment objects, which is useful when the output must preserve logical blocks of text.

Open the source PDF in a Document instance.
Create a ParagraphAbsorber and visit the whole document to build page markup results.
Iterate through the page markups, sections, paragraphs, lines, and TextFragment objects exposed by the absorber.
Build the output text with explicit page, section, and paragraph numbering so structural grouping is preserved.
Write the extracted paragraph text to the output file.

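import com.aspose.pdf.Document;
import com.aspose.pdf.MarkupParagraph;
import com.aspose.pdf.MarkupSection;
import com.aspose.pdf.PageMarkup;
import com.aspose.pdf.ParagraphAbsorber;
import com.aspose.pdf.TextFragment;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public static void extractParagraphsFromPdf(Path inputFile, Path outputFile) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        // Visit the whole document to build one PageMarkup per page.
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.visit(document);

        StringBuilder text = new StringBuilder();
        for (PageMarkup pageMarkup : absorber.getPageMarkups()) {
            // Section and paragraph counters are 1-based and restart on each page and section.
            int sectionIndex = 1;
            for (MarkupSection section : pageMarkup.getSections()) {
                int paragraphIndex = 1;
                for (MarkupParagraph paragraph : section.getParagraphs()) {
                    // Join the fragments of each line, one output line per paragraph line.
                    StringBuilder paragraphText = new StringBuilder();
                    for (List<TextFragment> line : paragraph.getLines()) {
                        for (TextFragment fragment : line) {
                            paragraphText.append(fragment.getText());
                        }
                        // Use "\n" throughout so the output has consistent line endings.
                        paragraphText.append("\n");
                    }
                    // Label each paragraph with its page, section, and paragraph number.
                    text.append("Page ").append(pageMarkup.getNumber())
                            .append(", Section ").append(sectionIndex)
                            .append(", Paragraph ").append(paragraphIndex)
                            .append(":\n");
                    text.append(paragraphText).append("\n");
                    paragraphIndex++;
                }
                sectionIndex++;
            }
        }

        Files.writeString(outputFile, text.toString());
    }
}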

Region-Based Extraction using Java