Extract Tagged Content from PDF

In this article, you will learn how to extract tagged content from a PDF document using Python.

Getting Tagged PDF Content

To work with the content of a tagged PDF document, Aspose.PDF provides the tagged_content property of the Document class.

The following example creates a new PDF document, obtains its tagged content, and sets the document title and language:

  1. Create a new Document object.
  2. Access the tagged_content property.
  3. Set the document title using ‘set_title()’.
  4. Set the document language using ‘set_language()’.
  5. Save the document.

    import aspose.pdf as ap

    def get_tagged_content(outfile):

        # Create PDF Document
        with ap.Document() as document:
            # Get Content for work with Tagged PDF
            tagged_content = document.tagged_content

            # Work with Tagged PDF content
            # Set Title and Language for Document
            tagged_content.set_title("Simple Tagged Pdf Document")
            tagged_content.set_language("en-US")

            # Save Tagged PDF Document
            document.save(outfile)
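
The value passed to set_language() is a language tag such as "en-US" or "fr-FR". As a minimal sketch, a small pre-check can catch obvious typos before the value is written to the document. The helper name is_simple_language_tag is hypothetical and not part of Aspose.PDF. The pattern is deliberately conservative: it accepts only a 2 to 3 letter language subtag with an optional 2-letter region, which is a subset of the full language-tag syntax (BCP 47).

```python
import re

# Hypothetical helper, not part of Aspose.PDF.
# Accepts only "ll" / "lll" optionally followed by "-RR" (for example "en" or "en-US").
_SIMPLE_TAG = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{2})?")


def is_simple_language_tag(tag):
    """Return True for tags shaped like 'en' or 'en-US', False otherwise."""
    return isinstance(tag, str) and _SIMPLE_TAG.fullmatch(tag) is not None


for candidate in ("en-US", "fr-FR", "en_US", "english"):
    print(candidate, is_simple_language_tag(candidate))
```

A caller could run this check on the tag before passing it to set_language() and raise an error or fall back to a default when it fails.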

Getting Root Structure

Tagged PDFs contain a logical structure tree that defines the semantic structure of the document. The struct_tree_root_element property represents the root of this logical tree (the StructTreeRoot object), while the root_element property gives access to the top-level structure element of the document.
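
To make the hierarchy concrete, here is a toy model of a logical structure tree in plain Python. The Node class and the outline function are illustrative assumptions and are not Aspose.PDF types. The sketch only shows how the tree root contains a single Document element, which in turn contains headings, paragraphs, and lists.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a structure element; not an Aspose.PDF class."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per element."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# The tree root holds the top-level Document element,
# which holds the heading, paragraph, and list elements.
tree = Node("StructTreeRoot", [
    Node("Document", [
        Node("H1"),
        Node("P"),
        Node("L", [Node("LI"), Node("LI")]),
    ]),
])

print("\n".join(outline(tree)))
```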

The following code snippet shows how to get the root structure of a tagged PDF document:

  1. Create a new tagged PDF document.
  2. Access tagged content and set metadata.
  3. Access StructTreeRoot and RootElement.
  4. Save the tagged PDF.

    import aspose.pdf as ap
    from aspose.pycore import cast

    def get_root_structure(outfile):

        # Create PDF Document
        with ap.Document() as document:
            # Get Content for work with Tagged PDF
            tagged_content = document.tagged_content

            # Set Title and Language for Document
            tagged_content.set_title("Tagged Pdf Document")
            tagged_content.set_language("en-US")

            # struct_tree_root_element gives access to the StructTreeRoot object of the PDF document;
            # root_element gives access to the root structure element (the Document structure element).
            struct_tree_root_element = tagged_content.struct_tree_root_element
            root_element = tagged_content.root_element

            print(f"StructTreeRootElement: {struct_tree_root_element}")
            print(f"RootElement: {root_element}")

            # Save Tagged PDF Document
            document.save(outfile)

Accessing Child Elements

Tagged PDFs contain a logical structure tree that defines the semantic hierarchy of the document (headings, paragraphs, forms, lists, etc.). Accessing and modifying these structure elements allows you to:

  • Inspect metadata such as title, language, actual_text, and accessibility-related properties
  • Update properties for improved accessibility or localization
  • Programmatically adjust the logical document structure for PDF/UA compliance

The following code snippet shows how to access child elements of a tagged PDF document:


    import aspose.pdf as ap
    from aspose.pycore import cast

    def access_child_elements(infile, outfile):

        # Open PDF Document
        with ap.Document(infile) as document:
            # Get Content for work with Tagged PDF
            tagged_content = document.tagged_content

            # Access to root element(s)
            element_list = tagged_content.struct_tree_root_element.child_elements

            for element in element_list:
                if isinstance(element, ap.logicalstructure.StructureElement):
                    structure_element = cast(ap.logicalstructure.StructureElement, element)
                    # Get properties
                    print(
                        "StructureElement properties - "
                        f"title: {structure_element.title}, "
                        f"language: {structure_element.language}, "
                        f"actual_text: {structure_element.actual_text}, "
                        f"expansion_text: {structure_element.expansion_text}, "
                        f"alternative_text: {structure_element.alternative_text}"
                    )

            # Access to child elements of first element in root element
            element_list = tagged_content.root_element.child_elements[1].child_elements
            for element in element_list:
                if isinstance(element, ap.logicalstructure.StructureElement):
                    structure_element = element

                    # Set properties
                    structure_element.title = "title"
                    structure_element.language = "fr-FR"
                    structure_element.actual_text = "actual text"
                    structure_element.expansion_text = "exp"
                    structure_element.alternative_text = "alt"

            # Save Tagged PDF Document
            document.save(outfile)
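
The snippet above inspects a single level of the tree at a time. To visit every level, a depth-first walk is one option. The sketch below is an assumption-laden illustration: the walk function is hypothetical, it relies only on each element exposing an iterable child_elements attribute as in the code above, and it is exercised here with plain stand-in objects rather than real Aspose.PDF elements. With real elements, a cast like the one shown above may still be needed before reading properties.

```python
from types import SimpleNamespace


def walk(element, depth=0):
    """Yield (depth, element) for the element and all descendants, depth first."""
    yield depth, element
    # Elements without a child_elements attribute are treated as leaves.
    children = getattr(element, "child_elements", None)
    if children is not None:
        for child in children:
            yield from walk(child, depth + 1)


# Stand-in objects that mimic only the attributes used here (title, child_elements).
intro = SimpleNamespace(title="Intro", child_elements=[])
body = SimpleNamespace(title="Body", child_elements=[])
root = SimpleNamespace(title="Document", child_elements=[intro, body])

for depth, element in walk(root):
    print("  " * depth + element.title)
```

With a real document, the walk would presumably start from tagged_content.root_element. That usage has not been verified against the library here.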