Annotations and Special Text using Java

Extract highlighted text

Iterate through a page's annotations and read the marked text from each HighlightAnnotation.

1. Open the source PDF in a Document instance.
2. Iterate through the Annotation objects on the target Page (page 1 in this example; page numbering starts at 1).
3. Check whether each annotation is a HighlightAnnotation before casting it to that class.
4. Read the marked text from each highlight annotation and print it to the console.

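import com.aspose.pdf.Annotation;
import com.aspose.pdf.Document;
import com.aspose.pdf.HighlightAnnotation;
import java.nio.file.Path;

public static void extractHighlightedText(Path inputFile) {
    // try-with-resources closes the document when the block exits.
    try (Document document = new Document(inputFile.toString())) {
        // Pages are 1-based; this example reads the first page only.
        for (Annotation annotation : document.getPages().get_Item(1).getAnnotations()) {
            // Only highlight annotations expose the marked text.
            if (annotation instanceof HighlightAnnotation) {
                HighlightAnnotation highlightAnnotation = (HighlightAnnotation) annotation;
                System.out.println(highlightAnnotation.getMarkedText());
            }
        }
    }
}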
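
The check-then-cast step above can also be factored into a small reusable helper. The following is a minimal sketch that uses only the Java standard library; `TypeFilter` and `filterByType` are hypothetical names chosen for illustration and are not part of Aspose.PDF.

```java
import java.util.ArrayList;
import java.util.List;

final class TypeFilter {
    private TypeFilter() {}

    // Returns the elements of 'items' that are instances of 'type', in their original order.
    static <T> List<T> filterByType(Iterable<?> items, Class<T> type) {
        List<T> matches = new ArrayList<>();
        for (Object item : items) {
            if (type.isInstance(item)) {
                matches.add(type.cast(item));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Plain objects stand in for annotations of different kinds.
        List<Object> mixed = List.of("note", 42, "highlight", 3.14);
        System.out.println(filterByType(mixed, String.class)); // prints [note, highlight]
    }
}
```

Because the loop above already iterates over the annotation collection, a call such as `filterByType(page.getAnnotations(), HighlightAnnotation.class)` should work the same way, assuming `page` refers to the target Page.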

Extract text from stamp annotations

Read the normal appearance stream from a stamp annotation and pass it through TextAbsorber.

1. Open the source PDF in a Document instance.
2. Iterate through the Annotation objects on the target Page (page 1 in this example).
3. Filter the annotations to those whose type is Stamp.
4. Create a TextAbsorber and request the normal appearance entry (key "N") from the stamp annotation's appearance dictionary.
5. If that entry is an XForm, visit it with the absorber and print the extracted text.

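import com.aspose.pdf.Annotation;
import com.aspose.pdf.AnnotationType;
import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.XForm;
import java.nio.file.Path;

public static void extractStampText(Path inputFile) {
    try (Document document = new Document(inputFile.toString())) {
        // Pages are 1-based; this example reads the first page only.
        for (Annotation annotation : document.getPages().get_Item(1).getAnnotations()) {
            if (annotation.getAnnotationType() == AnnotationType.Stamp) {
                TextAbsorber absorber = new TextAbsorber();
                // One-element array that receives the looked-up appearance entry.
                Object[] xforms = new Object[1];
                // "N" is the key of the normal appearance in the appearance dictionary.
                if (annotation.getAppearance().tryGetValue("N", xforms) && xforms[0] instanceof XForm) {
                    absorber.visit((XForm) xforms[0]);
                    System.out.println(absorber.getText());
                }
            }
        }
    }
}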
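
The one-element `Object[]` passed to `tryGetValue` acts as an out-parameter: the method reports success through its return value and writes the entry it found into slot 0. The sketch below models that pattern with a plain `Map`. It is an illustration only, not Aspose's implementation, and `AppearanceLookupSketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

final class AppearanceLookupSketch {
    private AppearanceLookupSketch() {}

    // Hypothetical stand-in for a dictionary lookup with an out-parameter:
    // on success the value is written to out[0] and true is returned.
    static boolean tryGetValue(Map<String, Object> dictionary, String key, Object[] out) {
        if (!dictionary.containsKey(key)) {
            return false;
        }
        out[0] = dictionary.get(key);
        return true;
    }

    public static void main(String[] args) {
        // In a PDF appearance dictionary, "N", "R" and "D" are the keys of the
        // normal, rollover and down appearances. Only "N" is present here.
        Map<String, Object> appearance = new HashMap<>();
        appearance.put("N", "normal appearance stream");

        Object[] result = new Object[1];
        if (tryGetValue(appearance, "N", result) && result[0] instanceof String) {
            System.out.println(result[0]); // prints normal appearance stream
        }
        System.out.println(tryGetValue(appearance, "R", result)); // prints false
    }
}
```

Checking the boolean result before reading slot 0, as the stamp example does, avoids acting on a stale or empty slot when the key is missing.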

Extract superscript and subscript text details

Use TextFragmentAbsorber when you need both the extracted text and the superscript or subscript flags on each fragment.

1. Open the source PDF in a Document instance.
2. Create a TextFragmentAbsorber for fragment-level text analysis.
3. Visit the target Page and collect its TextFragment objects.
4. Iterate through those fragments and read the text together with the superscript and subscript flags from fragment.getTextState().
5. Write the extracted details to the output file.

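import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractSuperSubDetails(Path inputFile, Path outputFile, int pageNumber) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        // Collect the text fragments of the requested page (page numbers are 1-based).
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        document.getPages().get_Item(pageNumber).accept(absorber);
        StringBuilder details = new StringBuilder();
        // One output line per fragment: its text plus both flags.
        for (TextFragment fragment : absorber.getTextFragments()) {
            details.append("Text: '").append(fragment.getText())
                    .append("' | Superscript: ").append(fragment.getTextState().isSuperscript())
                    .append(" | Subscript: ").append(fragment.getTextState().isSubscript())
                    .append(System.lineSeparator());
        }
        // Files.writeString requires Java 11 or later.
        Files.writeString(outputFile, details.toString());
    }
}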
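
Once the flags are available, they can drive how the text is rendered downstream. As a minimal sketch, assuming the text and both flags of each fragment have been copied into a simple value object, the fragments could be joined into one string with `^{...}` and `_{...}` markers. `SuperSubMarkup` and its nested `Fragment` class are hypothetical names used for illustration and are not Aspose.PDF types.

```java
import java.util.ArrayList;
import java.util.List;

final class SuperSubMarkup {
    private SuperSubMarkup() {}

    // Hypothetical holder for the values read from one text fragment.
    static final class Fragment {
        final String text;
        final boolean superscript;
        final boolean subscript;

        Fragment(String text, boolean superscript, boolean subscript) {
            this.text = text;
            this.superscript = superscript;
            this.subscript = subscript;
        }
    }

    // Joins the fragments in order, wrapping flagged ones in ^{...} or _{...}.
    static String render(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (f.superscript) {
                out.append("^{").append(f.text).append('}');
            } else if (f.subscript) {
                out.append("_{").append(f.text).append('}');
            } else {
                out.append(f.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("H", false, false));
        fragments.add(new Fragment("2", false, true));
        fragments.add(new Fragment("O", false, false));
        System.out.println(render(fragments)); // prints H_{2}O
    }
}
```

The marker syntax is an arbitrary choice for this sketch; the same loop could emit HTML `<sup>` and `<sub>` tags or any other notation.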

Improving Text Extraction from Multi-Column PDFs