HTML Navigation – C# Examples

Using the Aspose.HTML for .NET library, you can easily create your own application, since our API provides a powerful toolset to analyze and collect information from HTML documents.

This article provides information on how to programmatically extract data from HTML documents with the Aspose.HTML API. You will find out:

how to navigate through an HTML document and perform a detailed inspection of its elements using the API;
how to use custom filters to iterate over the document elements;
how to navigate the document using a CSS Selector or an XPath Query.

The Aspose.Html.Dom namespace provides an API that represents and interacts with HTML, XML, and SVG documents. It is entirely based on the WHATWG DOM specification, which is supported in many modern browsers. The DOM is a document model loaded in the browser that represents the document as a node tree, where each node represents part of the document (e.g. an element, a text string, or a comment).

This section considers how the DOM represents an HTML document in memory and how to use the API to navigate through HTML files. There are many ways to navigate an HTML document. The following table lists the simplest properties for accessing nodes in the DOM tree:

Property	Description
FirstChild	Accessing this property of an element must return a reference to the first child node.
LastChild	Accessing this property of an element must return a reference to the last child node.
NextSibling	Accessing this property of an element must return a reference to the sibling node of that element which most immediately follows that element.
PreviousSibling	Accessing this property of an element must return a reference to the sibling node of that element which most immediately precedes that element.
ChildNodes	Returns a list that contains all children of that element.

Four of the Node class properties (FirstChild, LastChild, NextSibling, and PreviousSibling) each provide a live reference to another node with the defined relationship to the current node, if the related node exists. For a complete list of classes and methods in the Aspose.Html.Dom namespace, please visit the API Reference Source.

Using these properties, you can navigate through an HTML document as follows:

// For complete examples and data files, please go to https://github.com/aspose-html/Aspose.HTML-for-.NET
// Prepare HTML code
var html_code = "<span>Hello</span> <span>World!</span>";

// Initialize a document from the prepared code
using (var document = new Aspose.Html.HTMLDocument(html_code, "."))
{
    // Get the reference to the first child (first SPAN) of the BODY
    var element = document.Body.FirstChild;
    Console.WriteLine(element.TextContent); // output: Hello

    // Get the reference to the whitespace between html elements
    element = element.NextSibling;
    Console.WriteLine(element.TextContent); // output: ' '

    // Get the reference to the second SPAN element
    element = element.NextSibling;
    Console.WriteLine(element.TextContent); // output: World!
}

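The same sibling-by-sibling walk can also be written as a loop. The following is a minimal sketch rather than one of the official examples: it assumes that the NodeName property and the Length property of the ChildNodes list behave like the corresponding WHATWG DOM members, and the class name SiblingWalkExample is hypothetical.

```csharp
using System;
using Aspose.Html;
using Aspose.Html.Dom;

// Hypothetical example class; not part of the Aspose.HTML samples
class SiblingWalkExample
{
    static void Main()
    {
        // Same markup as in the navigation example
        var html = "<span>Hello</span> <span>World!</span>";

        using (var document = new HTMLDocument(html, "."))
        {
            // Start at the first child of BODY and follow NextSibling
            // until there is no further sibling
            for (Node node = document.Body.FirstChild; node != null; node = node.NextSibling)
            {
                // NodeName identifies the kind of node
                // (assumed to follow the DOM specification)
                Console.WriteLine("{0}: '{1}'", node.NodeName, node.TextContent);
            }

            // ChildNodes exposes the same children as a list
            Console.WriteLine(document.Body.ChildNodes.Length);
        }
    }
}
```

Because the whitespace between the two SPAN elements is itself a text node, the loop visits it as well, just as the step-by-step navigation does.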

Inspect HTML

Aspose.HTML provides a set of members based on the Element Traversal specification. You can perform a detailed inspection of the document and its elements using the API. The following code sample shows the generalized usage of the Element Traversal features.

// Load a document from a file
string documentPath = System.IO.Path.Combine(DataDir, "html_file.html");
using (var document = new Aspose.Html.HTMLDocument(documentPath))
{
    // Get the html element of the document
    var element = document.DocumentElement;
    Console.WriteLine(element.TagName); // HTML

    // Get the last child element of the html element
    element = element.LastElementChild;
    Console.WriteLine(element.TagName); // BODY

    // Get the first child element of the body element
    element = element.FirstElementChild;
    Console.WriteLine(element.TagName); // H1
    Console.WriteLine(element.TextContent); // Header 1
}


Note: You need to specify the path to the source HTML file in your local file system (documentPath).

The DocumentElement property of the Document class gives direct access to the <html> element of the document (html_file.html). The LastElementChild property of that element returns its last child element, which is the <body> element. In the code snippet above, the variable "element" is then reassigned once more: the FirstElementChild property of the <body> element returns its first child element, which is the <h1> element.
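Element Traversal properties can also be combined into a loop that visits only element children and skips text nodes. The sketch below is illustrative only: it assumes that Element exposes NextElementSibling as defined by the Element Traversal specification, and the markup and the class name ElementWalkExample are hypothetical.

```csharp
using System;
using Aspose.Html;
using Aspose.Html.Dom;

// Hypothetical example class; not part of the Aspose.HTML samples
class ElementWalkExample
{
    static void Main()
    {
        // Hypothetical markup: elements mixed with plain text
        var html = "<h1>Header 1</h1> some text <p>First</p> <p>Second</p>";

        using (var document = new HTMLDocument(html, "."))
        {
            // FirstElementChild and NextElementSibling (assumed available,
            // per the Element Traversal specification) return elements only,
            // so text nodes between the elements are never visited
            for (Element child = document.Body.FirstElementChild;
                 child != null;
                 child = child.NextElementSibling)
            {
                Console.WriteLine(child.TagName);
            }
        }
    }
}
```

Compared with FirstChild and NextSibling, this variant needs no node-type checks when only elements are of interest.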

Custom Filter Usage

For more complicated scenarios, when you need to find a node based on a specific pattern (e.g., get the list of headers, links, etc.), you can use a specialized TreeWalker or NodeIterator object with a custom filter implementation.

The following example shows how to implement your own NodeFilter to skip all elements except images:

// For complete examples and data files, please go to https://github.com/aspose-html/Aspose.HTML-for-.NET
class OnlyImageFilter : Aspose.Html.Dom.Traversal.Filters.NodeFilter
{
    public override short AcceptNode(Aspose.Html.Dom.Node n)
    {
        // The current filter skips all elements, except IMG elements.
        return string.Equals("img", n.LocalName)
            ? FILTER_ACCEPT
            : FILTER_SKIP;
    }
}


Once you have implemented a filter, you can use it for HTML navigation as follows:

// For complete examples and data files, please go to https://github.com/aspose-html/Aspose.HTML-for-.NET
// Prepare HTML code
var code = @"
    <p>Hello</p>
    <img src='image1.png'>
    <img src='image2.png'>
    <p>World!</p>";

// Initialize a document based on the prepared code
using (var document = new Aspose.Html.HTMLDocument(code, "."))
{
    // To start HTML navigation we need to create an instance of TreeWalker.
    // The specified parameters mean that it starts walking from the root of the document, iterating all nodes and use our custom implementation of the filter
    using (var iterator = document.CreateTreeWalker(document, Aspose.Html.Dom.Traversal.Filters.NodeFilter.SHOW_ALL, new OnlyImageFilter()))
    {
        while (iterator.NextNode() != null)
        {
            // Since we are using our own filter, the current node will always be an instance of the HTMLImageElement.
            // So, we don't need the additional validations here.
            var image = (Aspose.Html.HTMLImageElement)iterator.CurrentNode;

            System.Console.WriteLine(image.Src);
            // output: image1.png
            // output: image2.png
        }
    }
}

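The text above also mentions NodeIterator as an alternative to TreeWalker. The following sketch shows how the same filter might be used with it. It assumes that the document exposes a CreateNodeIterator method with the same parameters as CreateTreeWalker, mirroring the DOM Traversal specification; please check the API Reference before relying on it. The class name NodeIteratorExample is hypothetical, and the filter is repeated so that the sample is self-contained.

```csharp
using System;
using Aspose.Html;
using Aspose.Html.Dom;
using Aspose.Html.Dom.Traversal.Filters;

// Same filter as above, repeated to keep the sketch self-contained
class OnlyImageFilter : NodeFilter
{
    public override short AcceptNode(Node n)
    {
        // Accept IMG elements and skip every other node
        return string.Equals("img", n.LocalName)
            ? FILTER_ACCEPT
            : FILTER_SKIP;
    }
}

// Hypothetical example class; not part of the Aspose.HTML samples
class NodeIteratorExample
{
    static void Main()
    {
        var code = @"
            <p>Hello</p>
            <img src='image1.png'>
            <img src='image2.png'>
            <p>World!</p>";

        using (var document = new HTMLDocument(code, "."))
        {
            // Assumption: CreateNodeIterator takes the root node, the node-type
            // mask, and the filter, like CreateTreeWalker does
            var iterator = document.CreateNodeIterator(document, NodeFilter.SHOW_ALL, new OnlyImageFilter());

            Node node;
            while ((node = iterator.NextNode()) != null)
            {
                // A defensive type check instead of a direct cast
                var image = node as HTMLImageElement;
                if (image != null)
                {
                    Console.WriteLine(image.Src);
                }
            }
        }
    }
}
```

A NodeIterator presents the filtered nodes as a flat sequence in document order, which is usually sufficient when you only need to list matches; a TreeWalker is the better fit when you also need to move to parents or children of the current node.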

XPath Query

An alternative to HTML Navigation is XPath Query (XML Path Language), often referred to simply as XPath. It is a query language that can be used to query data from HTML documents. It is based on a DOM representation of the HTML document and selects nodes by various criteria. The syntax of XPath expressions is quite simple and, more importantly, easy to read and maintain.

The following example shows how to use XPath queries with the Aspose.HTML API:

// For complete examples and data files, please go to https://github.com/aspose-html/Aspose.HTML-for-.NET
// Prepare HTML code
var code = @"
    <div class='happy'>
        <div>
            <span>Hello!</span>
        </div>
    </div>
    <p class='happy'>
        <span>World</span>
    </p>
";

// Initialize a document based on the prepared code
using (var document = new Aspose.Html.HTMLDocument(code, "."))
{
    // Here we evaluate the XPath expression that selects all descendant SPAN elements of elements whose 'class' attribute equals 'happy':
    var result = document.Evaluate("//*[@class='happy']//span",
        document,
        null,
        Aspose.Html.Dom.XPath.XPathResultType.Any,
        null);

    // Iterate over the resulting nodes
    for (Aspose.Html.Dom.Node node; (node = result.IterateNext()) != null;)
    {
        System.Console.WriteLine(node.TextContent);
        // output: Hello!
        // output: World
    }
}


CSS Selector

Along with HTML Navigation and XPath, you can use the CSS Selector API, which is also supported by our library. This API is designed to create a search pattern to match elements in a document tree based on CSS Selectors syntax.

In the following example, we use the QuerySelectorAll() method to navigate through an HTML document and search for the needed elements. This method takes the query selector as a parameter and returns a NodeList of all the elements that match the specified selector.

// For complete examples and data files, please go to https://github.com/aspose-html/Aspose.HTML-for-.NET
// Prepare HTML code
var code = @"
    <div class='happy'>
        <div>
            <span>Hello</span>
        </div>
    </div>
    <p class='happy'>
        <span>World!</span>
    </p>
";

// Initialize a document based on the prepared code
using (var document = new Aspose.Html.HTMLDocument(code, "."))
{
    // Here we create a CSS Selector that extracts all SPAN elements nested inside elements whose 'class' attribute contains 'happy'
    var elements = document.QuerySelectorAll(".happy span");

    // Iterate over the resulting list of elements
    foreach (Aspose.Html.HTMLElement element in elements)
    {
        System.Console.WriteLine(element.InnerHTML);
        // output: Hello
        // output: World!
    }
}

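When only the first match is needed, a single-element query may be more convenient than iterating over a list. The sketch below assumes that the document also provides a QuerySelector() method that, as in the W3C Selectors API, returns the first matching element or null when nothing matches. The class name QuerySelectorExample is hypothetical.

```csharp
using System;
using Aspose.Html;

// Hypothetical example class; not part of the Aspose.HTML samples
class QuerySelectorExample
{
    static void Main()
    {
        var code = @"
            <div class='happy'>
                <div>
                    <span>Hello</span>
                </div>
            </div>
            <p class='happy'>
                <span>World!</span>
            </p>
        ";

        using (var document = new HTMLDocument(code, "."))
        {
            // Assumption: QuerySelector returns the first element that matches
            // the selector, or null if there is no match
            var first = document.QuerySelector(".happy span");

            if (first != null)
            {
                Console.WriteLine(first.TextContent);
            }
            else
            {
                Console.WriteLine("No matching element found.");
            }
        }
    }
}
```

The null check matters here: unlike QuerySelectorAll(), which returns an empty list when nothing matches, a single-element query has no element to return in that case.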

You can download the complete C# examples and data files from GitHub.

Aspose.HTML offers free online HTML Web Applications: a collection of converters, mergers, SEO tools, HTML code generators, URL tools, and more. The applications work on any operating system with a web browser and do not require any additional software installation. Easily convert, merge, encode, generate HTML code, extract data from the web, or analyze web pages in terms of SEO wherever you are. Use our collection of HTML Web Applications to handle your everyday tasks and keep your workflow seamless!
