Annotations and Special Text using Java

Extract highlighted text

Iterate through the annotations on a page and read the marked text from each HighlightAnnotation.

  1. Open the source PDF Document.
  2. Iterate through the Annotation objects on the target Page.
  3. Check whether each annotation is a HighlightAnnotation.
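  4. Read and print the marked text from each highlight annotation.

import com.aspose.pdf.Annotation;
import com.aspose.pdf.Document;
import com.aspose.pdf.HighlightAnnotation;
import java.nio.file.Path;

public static void extractHighlightedText(Path inputFile) {
    // try-with-resources releases the document when the block exits.
    try (Document document = new Document(inputFile.toString())) {
        // Page numbering is 1-based, so get_Item(1) is the first page.
        for (Annotation annotation : document.getPages().get_Item(1).getAnnotations()) {
            if (annotation instanceof HighlightAnnotation) {
                HighlightAnnotation highlightAnnotation = (HighlightAnnotation) annotation;
                System.out.println(highlightAnnotation.getMarkedText());
            }
        }
    }
}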
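
The filtering in steps 2 to 4 can also return the marked text as a list instead of printing it, which tends to be easier to reuse and test. The following is only a sketch of that pattern: StubAnnotation, StubHighlight, StubStamp, and collectMarkedText are hypothetical stand-ins invented here so the example runs without the PDF library; they are not part of its API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the library's annotation types (assumption:
// they exist only so this sketch compiles and runs on its own).
interface StubAnnotation {}

final class StubHighlight implements StubAnnotation {
    private final String markedText;

    StubHighlight(String markedText) {
        this.markedText = markedText;
    }

    String getMarkedText() {
        return markedText;
    }
}

final class StubStamp implements StubAnnotation {}

final class HighlightFilterSketch {
    // Keep only highlight annotations and return their marked text in iteration order.
    static List<String> collectMarkedText(Iterable<? extends StubAnnotation> annotations) {
        List<String> result = new ArrayList<>();
        for (StubAnnotation annotation : annotations) {
            if (annotation instanceof StubHighlight) {
                result.add(((StubHighlight) annotation).getMarkedText());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<StubAnnotation> page = List.of(
                new StubHighlight("first passage"),
                new StubStamp(),
                new StubHighlight("second passage"));
        System.out.println(collectMarkedText(page)); // prints [first passage, second passage]
    }
}
```

With the real types, the same loop body would presumably test for HighlightAnnotation and call getMarkedText(), as in the method above.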

Extract text from stamp annotations

Read the normal appearance stream of a stamp annotation and pass it to a TextAbsorber.

  1. Open the source PDF Document.
  2. Iterate through the Annotation objects on the target Page.
  3. Check whether each annotation is a stamp annotation.
  4. Create a TextAbsorber and get the normal appearance stream from the stamp annotation.
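  5. Visit the appearance XForm and print the extracted text.

import com.aspose.pdf.Annotation;
import com.aspose.pdf.AnnotationType;
import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.XForm;
import java.nio.file.Path;

public static void extractStampText(Path inputFile) {
    try (Document document = new Document(inputFile.toString())) {
        // Page numbering is 1-based, so get_Item(1) is the first page.
        for (Annotation annotation : document.getPages().get_Item(1).getAnnotations()) {
            if (annotation.getAnnotationType() == AnnotationType.Stamp) {
                TextAbsorber absorber = new TextAbsorber();
                // One-element array that receives the looked-up value.
                Object[] xforms = new Object[1];
                // "N" is the key of the normal appearance stream.
                if (annotation.getAppearance().tryGetValue("N", xforms) && xforms[0] instanceof XForm) {
                    absorber.visit((XForm) xforms[0]);
                    System.out.println(absorber.getText());
                }
            }
        }
    }
}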
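
The one-element Object[] passed to tryGetValue may look odd at first. Java has no out parameters, so a method that needs to report both "was the key found" and "what is the value" can return a boolean and write the value into a caller-supplied array. The sketch below imitates that shape with a plain java.util.Map; OutParameterSketch and its tryGetValue are hypothetical helpers written for illustration and say nothing about how the library implements its own method.

```java
import java.util.HashMap;
import java.util.Map;

final class OutParameterSketch {
    // Returns true and stores the value in out[0] when the key exists;
    // returns false and leaves out untouched otherwise.
    static boolean tryGetValue(Map<String, Object> dictionary, String key, Object[] out) {
        if (!dictionary.containsKey(key)) {
            return false;
        }
        out[0] = dictionary.get(key);
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for an appearance dictionary; the value is a placeholder string.
        Map<String, Object> appearance = new HashMap<>();
        appearance.put("N", "normal appearance placeholder");

        Object[] holder = new Object[1];
        // Check both the lookup result and the runtime type before casting.
        if (tryGetValue(appearance, "N", holder) && holder[0] instanceof String) {
            System.out.println((String) holder[0]); // prints normal appearance placeholder
        }
        System.out.println(tryGetValue(appearance, "D", holder)); // prints false
    }
}
```

Checking the boolean result and the instanceof test together, as the stamp example does, avoids a ClassCastException when the entry is missing or holds a different type.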

Extract superscript and subscript text details

Use TextFragmentAbsorber when you need both the extracted text and the superscript or subscript flags on each fragment.

  1. Open the source PDF Document.
  2. Create a TextFragmentAbsorber.
  3. Visit the target Page and collect the text fragments.
  4. Iterate through the TextFragment objects and read the text, superscript flag, and subscript flag.
  5. Write the extracted details to the output file.

import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractSuperSubDetails(Path inputFile, Path outputFile, int pageNumber) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        // pageNumber is 1-based.
        document.getPages().get_Item(pageNumber).accept(absorber);
        StringBuilder details = new StringBuilder();
        // One output line per fragment: its text plus both flags.
        for (TextFragment fragment : absorber.getTextFragments()) {
            details.append("Text: '").append(fragment.getText())
                    .append("' | Superscript: ").append(fragment.getTextState().isSuperscript())
                    .append(" | Subscript: ").append(fragment.getTextState().isSubscript())
                    .append(System.lineSeparator());
        }
        // Files.writeString requires Java 11 or later.
        Files.writeString(outputFile, details.toString());
    }
}
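
Once the text and the two flags are available for each fragment, they can be folded back into a single readable line rather than a flag report. The sketch below shows one possible way to do that with a LaTeX-like ^{...} and _{...} notation. ScriptMarkupSketch and render are hypothetical names, the notation is an arbitrary choice made here, and the three arguments are assumed to come from getText(), isSuperscript(), and isSubscript() as read above.

```java
final class ScriptMarkupSketch {
    // Wrap one fragment's text according to its flags; plain text passes through unchanged.
    // If both flags were set, superscript wins in this sketch.
    static String render(String text, boolean superscript, boolean subscript) {
        if (superscript) {
            return "^{" + text + "}";
        }
        if (subscript) {
            return "_{" + text + "}";
        }
        return text;
    }

    public static void main(String[] args) {
        // Three fragments as they might be reported for the formula of water.
        StringBuilder line = new StringBuilder();
        line.append(render("H", false, false));
        line.append(render("2", false, true));
        line.append(render("O", false, false));
        System.out.println(line); // prints H_{2}O
    }
}
```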