Improving Text Extraction from Multi-Column PDFs

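Text extracted from multi-column PDFs often comes out with columns merged or in the wrong reading order. The two techniques below add a processing step to improve reading order and extraction quality.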

Extract text after reducing font size

This technique updates the text fragment font sizes, saves the adjusted document to memory, and then extracts text from the transformed result.

Open the source PDF in a Document instance.
Create a TextFragmentAbsorber and visit all document pages to collect TextFragment objects.
Iterate through the fragments and multiply each font size by the requested ratio (for example, a ratio of 0.7 keeps 70% of the original size) so that the dense column layout is normalized before extraction.
Save the adjusted Document to an in-memory byte stream.
Reopen a second Document from that memory buffer.
Create a TextAbsorber, visit all pages of the transformed document, and write the extracted text to the output file.

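import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractTextReduceFont(Path inputFile, Path outputFile, double reduceRatio) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        // Collect every text fragment in the document.
        TextFragmentAbsorber fragmentAbsorber = new TextFragmentAbsorber();
        document.getPages().accept(fragmentAbsorber);
        // Scale each fragment's font size by the ratio.
        for (TextFragment fragment : fragmentAbsorber.getTextFragments()) {
            fragment.getTextState().setFontSize((float) (fragment.getTextState().getFontSize() * reduceRatio));
        }

        // Save the adjusted document to memory and reopen it for extraction.
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        document.save(stream);
        try (Document document2 = new Document(new ByteArrayInputStream(stream.toByteArray()))) {
            TextAbsorber textAbsorber = new TextAbsorber();
            document2.getPages().accept(textAbsorber);
            Files.writeString(outputFile, textAbsorber.getText());
        }
    }
}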
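
The font-size arithmetic in the loop above can be checked in isolation. The following is a minimal sketch, not part of the library: the class name FontScaling and the method reducedFontSize are hypothetical, and the guard that restricts the ratio to the range (0, 1] is an assumption based on the word "reduce", not a documented requirement.

```java
public class FontScaling {

    /**
     * Returns the font size after applying the reduction ratio,
     * using the same multiplication and float cast as the extraction method.
     * Assumption: a "reduce" ratio is expected to lie in (0, 1].
     */
    public static float reducedFontSize(float originalSize, double reduceRatio) {
        if (reduceRatio <= 0.0 || reduceRatio > 1.0) {
            throw new IllegalArgumentException("reduceRatio must be in (0, 1], got " + reduceRatio);
        }
        return (float) (originalSize * reduceRatio);
    }

    public static void main(String[] args) {
        // A 12-point fragment with a ratio of 0.7 ends up at roughly 8.4 points.
        System.out.println(reducedFontSize(12f, 0.7));
    }
}
```

Validating the ratio before touching the document can help catch a value such as 1.5, which would enlarge the text instead of reducing it.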

Extract text with a scale factor

Use TextExtractionOptions in pure formatting mode and tune the scale factor for column-heavy layouts.

Open the source PDF in a Document instance.
Create a TextAbsorber for full-document extraction.
Create TextExtractionOptions in pure formatting mode so layout-sensitive extraction behavior is used.
Set the scale factor and apply the extraction options to the absorber before visiting the pages.
Visit all document pages and write the extracted text to the output file.

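import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextExtractionOptions;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractTextScaleFactor(Path inputFile, Path outputFile, double scaleFactor) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        TextAbsorber textAbsorber = new TextAbsorber();
        // Pure formatting mode enables layout-sensitive extraction.
        TextExtractionOptions extractionOptions =
                new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        extractionOptions.setScaleFactor(scaleFactor);
        // Apply the options before visiting the pages.
        textAbsorber.setExtractionOptions(extractionOptions);
        document.getPages().accept(textAbsorber);
        Files.writeString(outputFile, textAbsorber.getText());
    }
}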
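
Because the scale factor usually has to be tuned per document, it can help to run the extraction with several candidate values and compare the results side by side. The sketch below shows one possible way to do that using only the standard library. The class ScaleFactorSweep, the method sweep, and the extractor parameter are hypothetical names; the extractor is assumed to wrap an extraction call like the one above and return the extracted text, and the candidate values are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleFunction;

public class ScaleFactorSweep {

    /**
     * Runs the extractor once per candidate scale factor and returns
     * the extracted text keyed by factor, in the order given.
     * The extractor is a stand-in for a real extraction call.
     */
    public static Map<Double, String> sweep(double[] candidates, DoubleFunction<String> extractor) {
        Map<Double, String> results = new LinkedHashMap<>();
        for (double factor : candidates) {
            results.put(factor, extractor.apply(factor));
        }
        return results;
    }

    public static void main(String[] args) {
        // Placeholder extractor; replace the lambda with a call that returns real extracted text.
        Map<Double, String> results = sweep(new double[] {0.25, 0.5, 1.0}, f -> "text at factor " + f);
        results.forEach((factor, text) -> System.out.println(factor + " -> " + text.length() + " chars"));
    }
}
```

Keeping the extractor as a function parameter lets the comparison logic be tried out without a PDF at hand; in practice the lambda would open the document, apply the options with the given factor, and return the absorber's text.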
