Navigate HTML in Python

HTML Navigation

The Aspose.Html.Dom namespace provides an API for representing and interacting with HTML, XML, and SVG documents. It is based entirely on the WHATWG DOM specification, which many modern browsers support.

This article explains how to programmatically extract data from HTML documents with Aspose.HTML for Python via .NET. You will find out:

  • how to navigate through an HTML document and perform a detailed inspection of its elements using the Python API;
  • how to navigate the document using CSS selectors and XPath queries.

Navigating HTML means accessing elements and following the relationships between them within a document. Aspose.HTML for Python via .NET lets you navigate and inspect HTML through the Document Object Model (DOM) that the library provides. The following list outlines the main concepts involved:

  1. Document Object Model (DOM). The DOM represents the HTML document as a tree of nodes. Each node represents a part of the document, such as an element, text, or a comment.
  2. Accessing Elements
  3. Navigating the DOM Tree
  4. Manipulating Elements
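
To make the "tree of nodes" idea concrete, here is a minimal sketch that prints the node tree of a small fragment. It uses Python's standard xml.dom.minidom module rather than Aspose.HTML, so the attribute names (nodeType, childNodes, tagName) are minidom's, and the helper name outline is purely illustrative.

```python
from xml.dom import minidom

# Numeric DOM node type constants mapped to readable names
NODE_KINDS = {
    minidom.Node.ELEMENT_NODE: "element",
    minidom.Node.TEXT_NODE: "text",
    minidom.Node.COMMENT_NODE: "comment",
}


def outline(node, depth=0):
    """Return one indented line per node, depth-first (illustrative helper)."""
    kind = NODE_KINDS.get(node.nodeType, "other")
    if kind == "element":
        label = node.tagName
    else:
        label = repr(getattr(node, "data", ""))
    lines = ["  " * depth + kind + ": " + label]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


doc = minidom.parseString("<p>Hi<!-- note --><b>there</b></p>")
tree = outline(doc.documentElement)
print("\n".join(tree))
```

Elements, text, and comments all appear as separate nodes in the tree, which is why node-level navigation can land on text or comment nodes as well as on elements.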

The API Reference Source provides a comprehensive list of classes and methods in the aspose.html.dom namespace.

This section looks at how the DOM represents an HTML document in memory and how to use the API to navigate HTML files. Four properties of the Node class (first_child, last_child, previous_sibling, and next_sibling) each provide a live reference to another node that has the defined relationship to the current node, if such a node exists.

Using these properties, you can navigate through an HTML document as follows:

# Navigate the HTML DOM using Python

import aspose.html as ah

# Prepare HTML code
html_code = "<span>Hello</span> <span>World!</span>"

# Initialize a document from the prepared code
with ah.HTMLDocument(html_code, ".") as document:
    # Get the reference to the first child (first SPAN) of the BODY
    element = document.body.first_child
    print(element.text_content)  # output: Hello

    # Get the reference to the whitespace text node between the SPAN elements
    element = element.next_sibling
    print(element.text_content)  # output: " "

    # Get the reference to the second SPAN element
    element = element.next_sibling
    print(element.text_content)  # output: World!
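
The remaining two relationships, last child and previous sibling, work the same way in the opposite direction. The sketch below walks the same markup backwards. It is written against Python's standard xml.dom.minidom (lastChild, previousSibling) so that it runs without Aspose.HTML; the wrapping body element and the helper name walk_backwards are assumptions made for the illustration.

```python
from xml.dom import minidom


def walk_backwards(markup):
    """Visit the children of the root element from last to first."""
    doc = minidom.parseString(markup)
    node = doc.documentElement.lastChild
    seen = []
    while node is not None:
        if node.nodeType == node.TEXT_NODE:
            # Whitespace between elements is a text node of its own
            seen.append(("text", node.data))
        else:
            text = node.firstChild.data if node.firstChild else ""
            seen.append((node.tagName, text))
        node = node.previousSibling  # None once the first child is passed
    return seen


steps = walk_backwards("<body><span>Hello</span> <span>World!</span></body>")
for kind, text in steps:
    print(kind, repr(text))
```

As in the forward walk, the whitespace between the two SPAN elements shows up as a separate text node, so three steps are needed to move between two elements.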

Inspecting HTML

Aspose.HTML provides a set of properties based on the Element Traversal specification. Unlike the node properties described above, they skip text and comment nodes and return elements only, which makes them convenient for a detailed inspection of the document. The following Python code demonstrates how to navigate to specific elements and read their properties using Aspose.HTML for Python via .NET.

# Navigate and inspect HTML document using Python

import os
import aspose.html as ah

# Load a document from a file
data_dir = "data"  # Change this to your actual data directory
document_path = os.path.join(data_dir, "html_file.html")
with ah.HTMLDocument(document_path) as document:
    # Get the html element of the document
    element = document.document_element
    print(element.tag_name)  # HTML

    # Get the last element of the html element
    element = element.last_element_child
    print(element.tag_name)  # BODY

    # Get the first element of the body element
    element = element.first_element_child
    print(element.tag_name)  # H1
    print(element.text_content)  # Header 1

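The code above first builds the path to an HTML file in the “data” directory and loads it into an HTMLDocument. It then reads document_element (the HTML element), moves to its last element child (BODY), and from there to the first element child of BODY, printing each tag name and, for the heading, its text content. The H1 and “Header 1” output assumes that the body of the sample file starts with a matching heading.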
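
The difference between node navigation and element traversal can be sketched without the sample file. The example below uses Python's standard xml.dom.minidom, which has no element-only properties, so a small hypothetical helper element_children filters child nodes by type. Note that minidom keeps tag names as written (lowercase here), whereas an HTML DOM reports them in uppercase.

```python
from xml.dom import minidom


def element_children(node):
    """Return only the element children, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


markup = "<html>\n  <head/>\n  <body>\n    <h1>Header 1</h1>\n  </body>\n</html>"
doc = minidom.parseString(markup)
root = doc.documentElement

# Plain node navigation lands on the whitespace before <head/>
first_node = root.firstChild

# Element-only navigation skips that whitespace
body = element_children(root)[-1]  # last element child of <html>
h1 = element_children(body)[0]     # first element child of <body>

print(first_node.nodeType == first_node.TEXT_NODE)
print(body.tagName, h1.tagName, h1.firstChild.data)
```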

XPath Query

An alternative to HTML navigation is XPath Query (XML Path Language), often referred to simply as XPath. It is a query language for extracting data from HTML documents. It works on the DOM representation of the HTML document and selects nodes by various criteria. The syntax of XPath expressions is fairly simple and, more importantly, easy to read and maintain.

The following example shows how to use XPath queries with the Aspose.HTML Python API:

# How to use XPath to select nodes using Python

import aspose.html as ah
import aspose.html.dom.xpath as hxpath

# Prepare HTML code
code = """
    <div class='happy'>
        <div>
            <span>Hello,</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
    </p>
"""

# Initialize a document based on the prepared code
with ah.HTMLDocument(code, ".") as document:
    # Evaluate an XPath expression that selects all descendant SPAN elements
    # of elements whose 'class' attribute equals 'happy'
    result = document.evaluate("//*[@class='happy']//span",
                               document,
                               None,
                               hxpath.XPathResultType.ANY,
                               None)

    # Iterate over the resulting nodes
    node = result.iterate_next()
    while node is not None:
        print(node.text_content)
        node = result.iterate_next()
    # output:
    # Hello,
    # World!

The evaluate() method in the Aspose.HTML Python library allows you to execute XPath queries against HTML or XML documents, enabling detailed data extraction and navigation. It takes an XPath expression as its primary parameter, specifying the query to be executed, and returns an XPathResult object based on the defined result type.
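
To experiment with the same expression without Aspose.HTML, the sketch below runs an equivalent query with Python's standard xml.etree.ElementTree. This is a different library that supports only a limited subset of XPath, and its findall() method returns a list rather than an XPathResult, so treat it as an illustration of the query itself and not of the evaluate() API. The wrapping body element is an assumption needed to make the fragment well-formed XML.

```python
import xml.etree.ElementTree as ET

markup = """
<body>
    <div class='happy'>
        <div>
            <span>Hello,</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
    </p>
</body>
"""

root = ET.fromstring(markup)

# ElementTree paths are relative, so the expression starts with "."
spans = root.findall(".//*[@class='happy']//span")
texts = [span.text for span in spans]
print(texts)
```

The predicate [@class='happy'] compares the whole attribute value, and the second // step matches SPAN elements at any depth below the matched element, not only direct children.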

CSS Selector

In addition to HTML navigation and XPath, the Aspose.HTML Python API supports the CSS Selector API. This API allows you to formulate search patterns using CSS Selectors syntax to identify and select elements within an HTML document. For instance, the query_selector_all(selector) method can be used to traverse an HTML document and retrieve elements that match a specified CSS selector. This method accepts a CSS selector string as its argument and returns a NodeList containing all elements that conform to the selector criteria. Using CSS selectors, you can efficiently find and manipulate elements based on their attributes, classes, IDs, and other criteria, making it a versatile tool for both simple and complex document parsing tasks. This functionality is particularly useful for tasks such as styling, data extraction, and content manipulation within an HTML document.

# Extract nodes using a CSS selector in Python

import aspose.html as ah

# Prepare HTML code
code = """
    <div class='happy'>
        <div>
            <span>Hello,</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
        <p>I use CSS Selector.</p>
    </p>
"""

# Initialize a document based on the prepared code
with ah.HTMLDocument(code, ".") as document:
    # Use a CSS selector that matches all <span> descendants of elements
    # whose "class" attribute contains "happy"
    elements = document.query_selector_all(".happy span")

    # Iterate over the resulting list of elements
    for element in elements:
        print(element.text_content)
    # output:
    # Hello,
    # World!
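
One practical difference from the XPath example is how the class is matched. The CSS class selector .happy matches any element whose class list contains "happy", while the XPath predicate [@class='happy'] requires the whole attribute value to be exactly "happy". The sketch below illustrates this with Python's standard xml.etree.ElementTree. The function select_class_descendants is a hand-written approximation of the ".happy span" selector for this one pattern, not a CSS engine and not part of Aspose.HTML.

```python
import xml.etree.ElementTree as ET


def select_class_descendants(root, class_name, tag):
    """Approximate the CSS selector '.<class_name> <tag>' (illustrative only)."""
    matches = []
    for ancestor in root.iter():
        # CSS treats the class attribute as a whitespace-separated list
        if class_name in ancestor.get("class", "").split():
            for element in ancestor.iter(tag):
                if element is not ancestor and element not in matches:
                    matches.append(element)
    return matches


markup = (
    "<body>"
    "<div class='happy big'><div><span>Hello,</span></div></div>"
    "<p class='happy'><span>World!</span></p>"
    "<span>ignored</span>"
    "</body>"
)
root = ET.fromstring(markup)

css_like = [e.text for e in select_class_descendants(root, "happy", "span")]
xpath_exact = [e.text for e in root.findall(".//*[@class='happy']//span")]
print(css_like)
print(xpath_exact)
```

With the extra "big" class on the DIV, the CSS-style match still finds both SPAN elements, whereas the exact-value XPath predicate finds only the one inside the P element.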

Conclusion

The Aspose.HTML for Python via .NET library offers a set of tools for working with HTML, XML, and SVG documents that follows the WHATWG DOM specification supported by modern browsers. Using the HTMLDocument class together with the node properties, Element Traversal properties, XPath queries, and CSS selectors described above, you can locate and extract HTML content without parsing the markup by hand.

Aspose.HTML also offers free online HTML Web Applications: a collection of converters, mergers, SEO tools, HTML code generators, URL tools, web accessibility checks, and more. The applications work on any operating system with a web browser and do not require any additional software installation.
