Extract Tagged Content from PDFs in Java
Use these APIs when you need to inspect the logical structure tree of a tagged PDF and examine or update structure element metadata.
Get tagged content metadata
Use this example when you need access to the tagged content container and want to define basic document metadata such as title and language.
- Create a new PDF Document.
- Get the ITaggedContent object from the document.
- Set the tagged content metadata and save the output file.
    public static void getTaggedContent(Path outputFile) {
        try (Document document = new Document()) {
            ITaggedContent taggedContent = document.getTaggedContent();
            taggedContent.setTitle("Simple Tagged Pdf Document");
            taggedContent.setLanguage("en-US");
            document.save(outputFile.toString());
        }
    }
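The value passed to setLanguage above ("en-US") has the shape of a BCP 47 language tag. If the tag comes from user input or configuration, it may be worth checking it before writing it into the document. The following is a minimal sketch that uses only the Java standard library. `LanguageTags` and `normalize` are hypothetical names introduced here, not part of the PDF API, and the examples on this page do not show whether the PDF library validates the value itself.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of any PDF library. */
final class LanguageTags {
    private LanguageTags() {}

    /**
     * Returns the canonical form of a well-formed BCP 47 language tag.
     * Throws IllegalArgumentException if the tag is null, blank, or ill-formed.
     */
    static String normalize(String tag) {
        if (tag == null || tag.isBlank()) {
            throw new IllegalArgumentException("Language tag must not be empty");
        }
        try {
            // Locale.Builder rejects ill-formed tags instead of silently
            // falling back to an undetermined locale.
            return new Locale.Builder().setLanguageTag(tag).build().toLanguageTag();
        } catch (IllformedLocaleException e) {
            throw new IllegalArgumentException("Ill-formed language tag: " + tag, e);
        }
    }
}
```

A call such as `taggedContent.setLanguage(LanguageTags.normalize(userTag))` would then fail early on input like "en_US" (underscore instead of hyphen) rather than storing it in the output file.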
Get the root structure of a tagged PDF
This example shows how to inspect the root objects that represent the structure tree of a tagged PDF.
- Create a new PDF Document and get its tagged content.
- Set the required document metadata.
- Read and print the structure tree root and logical root element, then save the file.
    public static void getRootStructure(Path outputFile) {
        try (Document document = new Document()) {
            ITaggedContent taggedContent = document.getTaggedContent();
            taggedContent.setTitle("Tagged Pdf Document");
            taggedContent.setLanguage("en-US");
            System.out.println("StructTreeRootElement: " + taggedContent.getStructTreeRootElement());
            System.out.println("RootElement: " + taggedContent.getRootElement());
            document.save(outputFile.toString());
        }
    }
Access and update child structure elements
Use this example when you need to iterate through child elements in the structure tree, inspect their properties, and update selected metadata.
- Open the source tagged PDF Document.
- Read the child elements from the structure tree root and print the available properties.
- Get the first child of the logical root element, update the metadata of its child elements, and save the document.
    public static void accessChildElements(Path inputFile, Path outputFile) {
        try (Document document = new Document(inputFile.toString())) {
            ITaggedContent taggedContent = document.getTaggedContent();

            // Print the properties of each direct child of the structure tree root.
            ElementList elementList = taggedContent.getStructTreeRootElement().getChildElements();
            for (Object element : elementList) {
                if (element instanceof StructureElement structureElement) {
                    System.out.println("StructureElement properties - "
                            + "title: " + structureElement.getTitle()
                            + ", language: " + structureElement.getLanguage()
                            + ", actual_text: " + structureElement.getActualText()
                            + ", expansion_text: " + structureElement.getExpansionText()
                            + ", alternative_text: " + structureElement.getAlternativeText());
                }
            }

            // Take the first child of the logical root element (index 1 here)
            // and update the metadata of each of its children.
            Element firstChild = taggedContent.getRootElement().getChildElements().get_Item(1);
            for (Object element : firstChild.getChildElements()) {
                if (element instanceof StructureElement structureElement) {
                    structureElement.setTitle("title");
                    structureElement.setLanguage("fr-FR");
                    structureElement.setActualText("actual text");
                    structureElement.setExpansionText("exp");
                    structureElement.setAlternativeText("alt");
                }
            }

            document.save(outputFile.toString());
        }
    }
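The example above visits only one level of children at a time. To inspect a whole structure tree, the same loop can be applied recursively. The following is a minimal sketch of that depth-first walk, written against a stand-in `Node` record so that it runs with the Java standard library alone. `TreeWalk`, `outline`, and `Node` are hypothetical names introduced here, not part of the PDF API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of any PDF library. */
final class TreeWalk {
    private TreeWalk() {}

    /** Stand-in for a structure element: a tag name plus child nodes. */
    record Node(String tag, List<Node> kids) {}

    /**
     * Returns one line per node, parents before children,
     * indented by two spaces per nesting level.
     */
    static <T> List<String> outline(T root,
                                    Function<T, Iterable<? extends T>> children,
                                    Function<T, String> label) {
        List<String> lines = new ArrayList<>();
        visit(root, children, label, 0, lines);
        return lines;
    }

    private static <T> void visit(T node,
                                  Function<T, Iterable<? extends T>> children,
                                  Function<T, String> label,
                                  int depth,
                                  List<String> out) {
        out.add("  ".repeat(depth) + label.apply(node));
        // Recurse into each child one level deeper.
        for (T child : children.apply(node)) {
            visit(child, children, label, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node doc = new Node("Document", List.of(
                new Node("H1", List.of()),
                new Node("P", List.of(new Node("Span", List.of())))));
        outline(doc, Node::kids, Node::tag).forEach(System.out::println);
    }
}
```

Because the walker takes the child accessor as a function, it does not depend on any particular element type. Applying it to a real tagged PDF would mean supplying a function built on getChildElements() and a label built on getters such as getTitle(); how the returned list is typed is an assumption this sketch deliberately avoids.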