HTML Navigation in C#

Quick summary:

Aspose.HTML for .NET lets you explore an HTML document in three ways—direct DOM navigation, XPath queries, and CSS selectors. Use the FirstChild / NextSibling properties for step‑by‑step traversal, XPath for powerful pattern matching, or QuerySelectorAll() for familiar CSS‑style queries. Jump to the section you need with the links above.

Using the Aspose.HTML for .NET library, you can easily create your own application, since our API provides a powerful toolset to analyze and collect information from HTML documents.

This article provides information on how to programmatically extract data from HTML documents with the Aspose.HTML for .NET API. You will learn:

DOM Navigation

The Aspose.Html.Dom namespace provides API that represents and interacts with any HTML, XML or SVG documents and is entirely based on the WHATWG DOM specification supported in many modern browsers. The DOM is a document model loaded in the browser and representing the document as a node tree, where each node represents part of the document (e.g. an element, text string, or comment).

We consider how the DOM represents an HTML document in memory and how to use the API for navigation through HTML files. Many ways can be used to make HTML navigation. The following shortlist shows the simplest way to access all DOM elements:

PropertyDescription
FirstChildAccessing this property of an element must return a reference to the first child node.
LastChildAccessing this property of an element must return a reference to the last child node
NextSiblingAccessing this property of an element must return a reference to the sibling node of that element which most immediately follows that element.
PreviousSiblingAccessing this property of an element must return a reference to the sibling node of that element which most immediately precedes that element.
ChildNodesReturns a list that contains all children of that element.

Four of the Node class properties – FirstChild, LastChild, NextSibling, and PreviousSibling – each provide a live reference to another element with the defined relationship to the current element if the related element exists. For a complete list of classes and methods represented in the Aspose.Html.Dom Namespace, please visit the API Reference Source.

Direct Node Navigation

Using the properties above, you can walk through an HTML document as shown below.

 1// Navigate the HTML DOM using C#
 2
 3// Prepare HTML code
 4string html_code = "<span>Hello,</span> <span>World!</span>";
 5
 6// Initialize a document from the prepared code
 7using (HTMLDocument document = new HTMLDocument(html_code, "."))
 8{
 9    // Get the reference to the first child (first <span>) of the <body>
10    Node element = document.Body.FirstChild;
11    Console.WriteLine(element.TextContent); // output: Hello,
12
13    // Get the reference to the whitespace between html elements
14    element = element.NextSibling;
15    Console.WriteLine(element.TextContent); // output: ' '
16
17    // Get the reference to the second <span> element
18    element = element.NextSibling;
19    Console.WriteLine(element.TextContent); // output: World!
20
21    // Set an html variable for the document 
22    string html = document.DocumentElement.OuterHTML;
23
24    Console.WriteLine(html); // output: <html><head></head><body><span>Hello,</span> <span>World!</span></body></html>
25}

Inspect HTML

Aspose.HTML contains a list of methods that are based on the Element Traversal Specifications. You can perform a detailed inspection of the document and its elements using the API. The following code sample shows the generalized usage of Element Traversal features.

 1// Access and navigate HTML elements in a document using C#
 2
 3// Load a document from a file
 4string documentPath = Path.Combine(DataDir, "html_file.html");
 5
 6using (HTMLDocument document = new HTMLDocument(documentPath))
 7{
 8    // Get the html element of the document
 9    Element element = document.DocumentElement;
10    Console.WriteLine(element.TagName); // HTML
11
12    // Get the last element of the html element
13    element = element.LastElementChild;
14    Console.WriteLine(element.TagName); // BODY
15
16    // Get the first element in the body element
17    element = element.FirstElementChild;
18    Console.WriteLine(element.TagName); // H1
19    Console.WriteLine(element.TextContent); // Header 1
20}

Note: You need to specify the path to the source HTML file in your local file system (documentPath).

The DocumentElement property of the Document class gives direct access to the <html> element of the document ( html_file.html). The LastElementChild property of the Document class returns the last child element of the <html> element. It is the <body> element. According to the code snippet above, the variable element is overloaded again, and the FirstElementChild property returns the first child of the <body> element. It is the <h1> element.

Implementing a NodeFilter for Images

For the more complicated scenarios, when you need to find a node based on a specific pattern (e.g., get the list of headers, links, etc.), you can use a specialized TreeWalker or NodeIterator object with a custom Filter implementation.

The following example shows how to implement your own NodeFilter to skip all elements except images:

 1// Filter only <img> elements in HTML tree using C#
 2
 3class OnlyImageFilter : Aspose.Html.Dom.Traversal.Filters.NodeFilter
 4{
 5    public override short AcceptNode(Node n)
 6    {
 7        // The current filter skips all elements, except IMG elements
 8        return string.Equals("img", n.LocalName)
 9            ? FILTER_ACCEPT
10            : FILTER_SKIP;
11    }
12}

Once you implement a filter, you can use HTML navigation as it follows:

 1// Implement NodeFilter to skip all elements except images
 2
 3// Prepare HTML code
 4string code = @"
 5    <p>Hello,</p>
 6    <img src='image1.png'>
 7    <img src='image2.png'>
 8    <p>World!</p>";
 9
10// Initialize a document based on the prepared code
11using (HTMLDocument document = new HTMLDocument(code, "."))
12{
13    // To start HTML navigation, we need to create an instance of TreeWalker
14    // The specified parameters mean that it starts walking from the root of the document, iterating all nodes and using our custom implementation of the filter
15    using (ITreeWalker iterator = document.CreateTreeWalker(document, NodeFilter.SHOW_ALL, new OnlyImageFilter()))
16    {
17        while (iterator.NextNode() != null)
18        {
19            // Since we are using our own filter, the current node will always be an instance of the HTMLImageElement
20            // So, we don't need the additional validations here
21            HTMLImageElement image = (HTMLImageElement)iterator.CurrentNode;
22
23            Console.WriteLine(image.Src);
24            // output: image1.png
25            // output: image2.png
26
27            // Set an html variable for the document 
28            string html = document.DocumentElement.OuterHTML;
29        }
30    }
31}

XPath Query

The alternative to the HTML Navigation is XPath Query ( XML Path Language) that often referred to simply as an XPath. It is a query language that can be used to query data from HTML documents. It is based on a DOM representation of the HTML document, and selects nodes by various criteria. The syntax of the XPath expressions is quite simple, and what is more important, it is easy to read and support.

The following example shows how to use XPath queries within Aspose.HTML API:

 1// How to use XPath to select nodes using C#
 2
 3// Prepare HTML code
 4string code = @"
 5    <div class='happy'>
 6        <div>
 7            <span>Hello,</span>
 8        </div>
 9    </div>
10    <p class='happy'>
11        <span>World!</span>
12    </p>
13";
14
15// Initialize a document based on the prepared code
16using (HTMLDocument document = new HTMLDocument(code, "."))
17{
18    // Here we evaluate the XPath expression where we select all child <span> elements from elements whose 'class' attribute equals to 'happy':
19    IXPathResult result = document.Evaluate("//*[@class='happy']//span",
20        document,
21        null,
22        XPathResultType.Any,
23        null);
24
25    // Iterate over the resulted nodes
26    for (Node node; (node = result.IterateNext()) != null;)
27    {
28        Console.WriteLine(node.TextContent);
29        // output: Hello,
30        // output: World!
31    }
32}

The article How to use XPath Query in HTML – Evaluate() method introduces how to use Evaluate() method to navigate through an HTML document and select nodes by various criteria. With C# examples, you will learn how to select all Nodes with specified Name using XPath query.

CSS Selector

Along with HTML Navigation and XPath, you can use CSS Selector API that is also supported by our library. This API is designed to create a search pattern to match elements in a document tree based on CSS Selectors syntax.

In the following example, we use the QuerySelectorAll() method for navigation through an HTML document and search the needed elements. This method takes as a parameter the query selector and returns a NodeList of all the elements, which match the specified selector.

 1// Extract nodes Using CSS selector in C#
 2
 3// Prepare HTML code
 4string code = @"
 5    <div class='happy'>
 6        <div>
 7            <span>Hello,</span>
 8        </div>
 9    </div>
10    <p class='happy'>
11        <span>World!</span>
12    </p>
13";
14
15// Initialize a document based on the prepared code
16using (HTMLDocument document = new HTMLDocument(code, "."))
17{
18    // Here we create a CSS Selector that extracts all elements whose 'class' attribute equals 'happy' and their child <span> elements
19    NodeList elements = document.QuerySelectorAll(".happy span");
20
21    // Iterate over the resulted list of elements
22    foreach (HTMLElement element in elements)
23    {
24        Console.WriteLine(element.InnerHTML);
25        // output: Hello,
26        // output: World!
27    }
28}

Common Mistakes and Fixes

ProblemCauseSolution
XPath returns no nodesIncorrect XPath syntax or missing namespace handling.Test the expression in an XPath tester and ensure the document is loaded correctly.
QuerySelectorAll returns an empty listSelector does not match due to incorrect syntax, missing class/id, or wrong element type.Use the exact CSS selector syntax and verify the HTML structure. Note: HTML element names are case-insensitive, but class and ID selectors are case-sensitive.
Custom NodeFilter never firesFilter not attached to the walker/iterator.Pass the filter instance when creating TreeWalker or NodeIterator.

Conclusion

Aspose.HTML for .NET offers three powerful ways to locate and inspect elements—direct DOM navigation, XPath queries, and CSS selectors. Choose the approach that best fits your scenario, combine them when needed, and leverage custom filters for fine‑grained control.

See Also

  • For more information on how to effectively apply selectors to select the elements you want to style, please visit the article CSS Selectors. You will cover Basic Selectors, Combinator Selectors, Attribute Selectors, Group Selectors and Pseudo Selectors.
  • In the article How to use CSS Selector, you will learn how to effectively apply selectors to select the elements you want to style. QuerySelector() and QuerySelectorAll() are methods that are used to query DOM elements that match a CSS selector.
  • You can download the complete C# examples and data files from GitHub.

Text “HTML Web Applications”

Close
Loading

Analyzing your prompt, please hold on...

An error occurred while retrieving the results. Please refresh the page and try again.

Subscribe to Aspose Product Updates

Get monthly newsletters & offers directly delivered to your mailbox.