Navigate HTML in Python

The aspose.html.dom namespace provides an API for representing and interacting with HTML, XML, and SVG documents. It is based entirely on the WHATWG DOM specification, which many modern browsers support.

This article explains how to programmatically extract data from HTML documents with Aspose.HTML for Python via .NET. You will learn:

how to navigate an HTML document and inspect its elements in detail using the Python API;
how to navigate a document using CSS selectors and XPath queries.

Navigating HTML means accessing elements and their relationships within a document. Aspose.HTML for Python via .NET lets you navigate and inspect HTML through the Document Object Model (DOM) that the library provides. The following shortlist summarizes the simplest ways to access DOM elements:

Document Object Model (DOM). The DOM represents the HTML document as a tree of nodes. Each node represents a part of the document, such as an element, a piece of text, or a comment.

The Document class represents the entire HTML, XML, or SVG document and serves as the root of the document tree.
The Element class represents an element in an HTML or XML document.
The Node class represents a single node in the document tree.
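
To make the "tree of nodes" idea concrete, here is a minimal sketch that prints such a tree. It deliberately uses Python's standard-library xml.dom.minidom rather than Aspose.HTML, so the property names are camelCase (childNodes, nodeType) instead of the snake_case names used in this article. The sample markup and the describe helper are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample markup; minidom needs well-formed XML with one root.
MARKUP = "<body><h1>Title</h1><!-- note --><p>Some <b>bold</b> text</p></body>"

# Map a few DOM node-type constants to readable names.
NODE_KINDS = {
    minidom.Node.ELEMENT_NODE: "element",
    minidom.Node.TEXT_NODE: "text",
    minidom.Node.COMMENT_NODE: "comment",
}


def describe(node, depth=0):
    """Return one indented line per node, visiting the tree depth-first."""
    kind = NODE_KINDS.get(node.nodeType, "other")
    if kind == "element":
        label = node.tagName
    else:
        # Text and comment nodes carry their content in .data
        label = getattr(node, "data", "").strip()
    lines = ["  " * depth + f"{kind}: {label}"]
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


document = minidom.parseString(MARKUP)
tree_lines = describe(document.documentElement)
print("\n".join(tree_lines))
```

The output lists elements, text, and the comment as separate nodes, which is the same distinction the Document, Element, and Node classes draw.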

Accessing Elements

Use methods like get_elements_by_tag_name(tagname) to retrieve elements by their tag name.
Use the get_element_by_id() method to access a specific element with a unique ID.
Use get_elements_by_class_name(class_names) to retrieve elements by their class names.
Use the query_selector(selector) method for a single element or query_selector_all(selector) for a list of elements that match a CSS selector.
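
The three lookup styles (by tag name, by ID, by class) can be sketched without Aspose.HTML using the standard-library xml.etree.ElementTree module. This is an analogy under stated assumptions, not the library's API: the markup is hypothetical, and the class lookup here is an exact string match on the attribute, which may differ from how get_elements_by_class_name treats multiple class names.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample markup.
MARKUP = """
<body>
  <h1 id="title">Report</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
  <p>Plain paragraph</p>
</body>
"""

root = ET.fromstring(MARKUP)

# By tag name: every <p> at any depth, in document order.
paragraphs = [p.text for p in root.iter("p")]

# By ID: the first element whose id attribute is "title".
title = root.find(".//*[@id='title']")

# By class: elements whose class attribute is exactly "note".
notes = [n.text for n in root.findall(".//*[@class='note']")]

print(paragraphs)
print(title.text)
print(notes)
```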

Navigating the DOM Tree

Access children of an element using child_nodes or children properties.
Use the first_child or last_child property to return the first or last child node of the current node, which could be any type of node, such as an element, text, or comment.
Use the parent_node property to access the parent of a given element.
Access siblings using properties like previous_sibling or next_sibling.
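
The parent, child, and sibling relationships listed above can be sketched with the standard-library xml.dom.minidom module, which exposes the analogous DOM properties in camelCase (firstChild, nextSibling, previousSibling, parentNode). This is not Aspose.HTML code. The wrapping body tag is added by hand because minidom needs a single root element.

```python
from xml.dom import minidom

# Two spans separated by a single space, wrapped in a hand-written root.
document = minidom.parseString(
    "<body><span>Hello</span> <span>World!</span></body>"
)
body = document.documentElement

first = body.firstChild       # first <span>
gap = first.nextSibling       # whitespace text node between the spans
second = gap.nextSibling      # second <span>

# Sibling and parent links lead back the way we came.
print(second.previousSibling is gap)
print(first.parentNode is body)
print(second is body.lastChild)

# childNodes includes every node; filtering by type keeps elements only.
all_children = list(body.childNodes)
element_children = [n for n in all_children if n.nodeType == n.ELEMENT_NODE]
print(len(all_children), len(element_children))
```

The whitespace between the two spans is a node of its own, which is why the list of all child nodes is longer than the list of element children.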

Manipulating Elements

Use properties of the Element class like inner_html and text_content to modify element content.
Get or set attributes using methods like get_attribute(qualified_name) and set_attribute(qualified_name, value).
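
As a rough illustration of reading and writing attributes and content, the sketch below uses the standard-library xml.dom.minidom module, whose getAttribute and setAttribute methods mirror the snake_case methods named above. It is not Aspose.HTML code. minidom has no text_content property, so the sketch edits the text node directly, and the markup is hypothetical.

```python
from xml.dom import minidom

# Hypothetical one-element document.
document = minidom.parseString('<a href="/old">Old link</a>')
link = document.documentElement

# Read an attribute, then overwrite it and add a second one.
old_href = link.getAttribute("href")
link.setAttribute("href", "/new")
link.setAttribute("title", "Updated")

# Replace the element's text by changing the data of its text node.
link.firstChild.data = "New link"

serialized = link.toxml()
print(old_href)
print(serialized)
```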

The API Reference Source provides a comprehensive list of classes and methods in the aspose.html.dom namespace.

Navigating the DOM Tree

This section looks at how the DOM represents an HTML document in memory and how to use the API to navigate through HTML files. Four properties of the Node class (first_child, last_child, previous_sibling, and next_sibling) each provide a live reference to another node that has the named relationship to the current node, if such a node exists.

Using these properties, you can navigate through an HTML document as follows:

from aspose.html import *

# Prepare HTML code
html_code = "<span>Hello</span> <span>World!</span>"

# Initialize a document from the prepared code
with HTMLDocument(html_code, ".") as document:
    # Get the reference to the first child (first SPAN) of the BODY
    element = document.body.first_child
    print(element.text_content)  # output: Hello

    # Get the reference to the whitespace between html elements
    element = element.next_sibling
    print(element.text_content)  # output: " "

    # Get the reference to the second SPAN element
    element = element.next_sibling
    print(element.text_content)  # output: World!

Inspecting HTML

Aspose.HTML provides properties based on the Element Traversal specification, such as first_element_child and last_element_child, which let you inspect the document and its elements in detail. The following Python code demonstrates how to navigate an HTML document and extract specific elements and their properties using Aspose.HTML for Python via .NET.

import os
from aspose.html import *

# Load a document from a file
data_dir = "data"
document_path = os.path.join(data_dir, "html_file.html")
with HTMLDocument(document_path) as document:
    # Get the <html> element of the document
    element = document.document_element
    print(element.tag_name)  # HTML

    # Get the last child element of the <html> element
    element = element.last_element_child
    print(element.tag_name)  # BODY

    # Get the first child element of the <body> element
    element = element.first_element_child
    print(element.tag_name)  # H1
    print(element.text_content)  # Header 1

The provided Python code begins by defining the path to the HTML file located in the “data” directory.

Use HTMLDocument to load the document, and the document.document_element property to access the root HTML element. Print the tag name of this element, which is “HTML”.
Next, retrieve the last child element of the HTML element using the last_element_child property, which is the “BODY” element, and print its tag name.
Finally, use the first_element_child property to access the first child element of the BODY element, which is an “H1” element, and print both its tag name and its text content, “Header 1”.
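
To show why element traversal differs from plain node traversal, the following sketch reimplements the idea on top of the standard-library xml.dom.minidom module. It is not Aspose.HTML code. The two helper functions are hypothetical stand-ins for the first_element_child and last_element_child properties, and the markup is an assumed stand-in for html_file.html.

```python
from xml.dom import minidom

# Assumed stand-in for the contents of html_file.html.
MARKUP = """<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Header 1</h1>
    <p>Text</p>
  </body>
</html>"""


def first_element_child(node):
    """Return the first child that is an element, skipping text and comments."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            return child
    return None


def last_element_child(node):
    """Return the last child that is an element, skipping text and comments."""
    for child in reversed(node.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            return child
    return None


root = minidom.parseString(MARKUP).documentElement

# The plain last child is only the line break before </html>.
print(root.lastChild.nodeType == root.TEXT_NODE)

body = last_element_child(root)
heading = first_element_child(body)

# minidom keeps tag names as written; an HTML DOM reports them in upper case.
print(body.tagName, heading.tagName, heading.firstChild.data)
```

Because indentation and line breaks become text nodes, the plain first or last child is often whitespace, while the element-only variants go straight to the neighbouring element.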

XPath Query

An alternative to DOM navigation is the XPath query language (XML Path Language), often referred to simply as XPath. It can be used to query data from HTML documents. It works on a DOM representation of the HTML document and selects nodes by various criteria. The syntax of XPath expressions is quite simple and, more importantly, easy to read and maintain.

The following example shows how to use XPath queries with the Aspose.HTML Python API:

from aspose.html import *
from aspose.html.dom.xpath import *

# Prepare HTML code
code = """
    <div class='happy'>
        <div>
            <span>Hello,</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
    </p>
"""

# Initialize a document based on the prepared code
with HTMLDocument(code, ".") as document:
    # Evaluate an XPath expression that selects all descendant SPAN elements
    # of elements whose 'class' attribute equals 'happy'
    result = document.evaluate("//*[@class='happy']//span",
                               document,
                               None,
                               XPathResultType.ANY,
                               None)

    # Iterate over the resulting nodes
    node = result.iterate_next()
    while node is not None:
        print(node.text_content)
        node = result.iterate_next()
    # output: Hello,
    # output: World!

The evaluate() method in the Aspose.HTML Python library allows you to execute XPath queries against HTML or XML documents, enabling detailed data extraction and navigation. It takes an XPath expression as its primary parameter, specifying the query to be executed, and returns an XPathResult object based on the defined result type.
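
If you want to try the shape of such an expression without the library, the standard-library xml.etree.ElementTree module supports a limited XPath subset. The sketch below is not Aspose.HTML code: it uses a relative path (.//) because ElementTree does not accept absolute paths on an element, and it wraps the markup in a hand-written body root so that it is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The markup from the example above, wrapped in a single root element.
MARKUP = """
<body>
    <div class='happy'>
        <div>
            <span>Hello,</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
    </p>
</body>
"""

root = ET.fromstring(MARKUP)

# Elements whose class attribute equals 'happy', at any depth.
happy = root.findall(".//*[@class='happy']")

# All <span> descendants of those elements, in document order.
spans = root.findall(".//*[@class='happy']//span")
texts = [span.text for span in spans]

print(len(happy))
print(texts)
```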

CSS Selector

In addition to HTML navigation and XPath, the Aspose.HTML Python API supports the CSS Selector API. This API allows you to formulate search patterns using CSS Selectors syntax to identify and select elements within an HTML document. For instance, the query_selector_all(selector) method can be used to traverse an HTML document and retrieve elements that match a specified CSS selector. This method accepts a CSS selector string as its argument and returns a NodeList containing all elements that conform to the selector criteria. Using CSS selectors, you can efficiently find and manipulate elements based on their attributes, classes, IDs, and other criteria, making it a versatile tool for both simple and complex document parsing tasks. This functionality is particularly useful for tasks such as styling, data extraction, and content manipulation within an HTML document.

from aspose.html import *

# Prepare HTML code
code = """
    <div class='happy'>
        <div>
            <span>Hello,</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
        <p>I use CSS Selector.</p>
    </p>
"""

# Initialize a document based on the prepared code
with HTMLDocument(code, ".") as document:
    # Use a CSS selector that matches all <span> descendants of elements with the class "happy"
    elements = document.query_selector_all(".happy span")

    # Iterate over the resulting list of elements
    for element in elements:
        print(element.text_content)
    # output: Hello,
    # output: World!

Conclusion

The Aspose.HTML for Python via .NET library offers a robust set of tools for working with HTML, XML, and SVG documents, adhering to modern browsers’ widely supported WHATWG DOM specification. Using the HTMLDocument class and its various navigation properties and methods, you can effectively interact with and manipulate HTML content, avoiding the complexities of manual data extraction and focusing on more strategic aspects of your projects.

