Extract Tagged Content from PDF
In this article, you will learn how to extract tagged content from a PDF document using Python.
Getting Tagged PDF Content
To work with the content of a tagged PDF document, Aspose.PDF provides the tagged_content property of the Document class.
The following example creates a new PDF document, accesses its tagged content, and sets the document title and language:
- Create a new Document object.
- Access the tagged_content property.
- Set the document title using set_title().
- Set the document language using set_language().
- Save the document.
import aspose.pdf as ap

def get_tagged_content(outfile):
    # Create a new PDF document
    with ap.Document() as document:
        # Get the tagged content object used to work with tagged PDF
        tagged_content = document.tagged_content
        # Set the title and language of the document
        tagged_content.set_title("Simple Tagged Pdf Document")
        tagged_content.set_language("en-US")
        # Save the tagged PDF document
        document.save(outfile)
Getting Root Structure
Tagged PDFs contain a logical structure tree that defines the semantic structure of the document. The StructTreeRoot represents the root of this logical tree, while the RootElement provides an interface to interact with the top-level structure element of the document.
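To make the idea of a logical structure tree concrete, here is a minimal conceptual sketch in plain Python. It does not use Aspose.PDF: the Node class and the walk function are hypothetical stand-ins introduced only for illustration, and the tag names (Document, H1, P, L, LI) are standard PDF structure types chosen as sample data.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a structure element; not an Aspose.PDF class."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, tag) pairs depth-first, parents before their children."""
    yield depth, node.tag
    for child in node.children:
        yield from walk(child, depth + 1)


# The tree root holds the top-level Document element, which holds the content.
tree = Node("StructTreeRoot", [
    Node("Document", [
        Node("H1"),
        Node("P"),
        Node("L", [Node("LI"), Node("LI")]),
    ]),
])

# Print the tree with two spaces of indentation per nesting level.
for depth, tag in walk(tree):
    print("  " * depth + tag)
```

In this model, the first node plays the role of StructTreeRoot and its single child plays the role of RootElement; everything below that is the document's semantic content.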
The following code snippet shows how to get the root structure of a tagged PDF document:
- Create a new tagged PDF document.
- Access tagged content and set metadata.
- Access StructTreeRoot and RootElement.
- Save the tagged PDF.
import aspose.pdf as ap

def get_root_structure(outfile):
    # Create a new PDF document
    with ap.Document() as document:
        # Get the tagged content object used to work with tagged PDF
        tagged_content = document.tagged_content
        # Set the title and language of the document
        tagged_content.set_title("Tagged Pdf Document")
        tagged_content.set_language("en-US")
        # struct_tree_root_element gives access to the StructTreeRoot object of the
        # PDF document; root_element gives access to the root structure element
        # (the Document structure element).
        struct_tree_root_element = tagged_content.struct_tree_root_element
        root_element = tagged_content.root_element
        print(f"StructTreeRootElement: {struct_tree_root_element}")
        print(f"RootElement: {root_element}")
        # Save the tagged PDF document
        document.save(outfile)
Accessing Child Elements
Tagged PDFs contain a logical structure tree that defines the semantic hierarchy of the document (headings, paragraphs, forms, lists, etc.). Accessing and modifying these structure elements allows you to:
- Inspect metadata such as title, language, actual_text, and accessibility-related properties
- Update properties for improved accessibility or localization
- Programmatically adjust the logical document structure for PDF/UA compliance
The following code snippet shows how to access child elements of a tagged PDF document:
import aspose.pdf as ap
from aspose.pycore import cast

def access_child_elements(infile, outfile):
    # Open the existing PDF document
    with ap.Document(infile) as document:
        # Get the tagged content object used to work with tagged PDF
        tagged_content = document.tagged_content
        # Access the child elements of the structure tree root
        element_list = tagged_content.struct_tree_root_element.child_elements
        for element in element_list:
            if isinstance(element, ap.logicalstructure.StructureElement):
                structure_element = cast(ap.logicalstructure.StructureElement, element)
                # Read the properties of the structure element
                print(
                    "StructureElement properties - "
                    f"title: {structure_element.title}, "
                    f"language: {structure_element.language}, "
                    f"actual_text: {structure_element.actual_text}, "
                    f"expansion_text: {structure_element.expansion_text}, "
                    f"alternative_text: {structure_element.alternative_text}"
                )
        # Access the child elements of the element at index 1 of the root element
        element_list = tagged_content.root_element.child_elements[1].child_elements
        for element in element_list:
            if isinstance(element, ap.logicalstructure.StructureElement):
                structure_element = element
                # Set the properties of the structure element
                structure_element.title = "title"
                structure_element.language = "fr-FR"
                structure_element.actual_text = "actual text"
                structure_element.expansion_text = "exp"
                structure_element.alternative_text = "alt"
        # Save the tagged PDF document
        document.save(outfile)
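The snippet above inspects only fixed levels of the tree. To visit every level, a recursive helper is one option. The following is a minimal sketch, assuming only that each element exposes an iterable child_elements attribute, as the objects in the snippet above do. The iter_structure and make names are hypothetical, and the demonstration uses plain stand-in objects rather than real Aspose.PDF elements so that it runs without a PDF file.

```python
from types import SimpleNamespace


def iter_structure(element, depth=0):
    """Yield (depth, element) for an element and all of its descendants.

    Relies only on an iterable `child_elements` attribute; elements
    without that attribute are treated as leaves.
    """
    yield depth, element
    children = getattr(element, "child_elements", None)
    if children is not None:
        for child in children:
            yield from iter_structure(child, depth + 1)


def make(title, *children):
    """Build a hypothetical stand-in element for demonstration only."""
    return SimpleNamespace(title=title, child_elements=list(children))


# Stand-in tree: a root with a heading and a section containing a paragraph.
root = make("root", make("heading"), make("section", make("paragraph")))

titles = [(depth, element.title) for depth, element in iter_structure(root)]
print(titles)
```

With a real document, the starting point would presumably be tagged_content.root_element, and each yielded element may still need the isinstance check and cast shown above before its StructureElement properties are read or set.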