Web scraping, also known as web harvesting, web data extraction, or web crawling, is the process of extracting data from websites. Web scraping software helps you automate data extraction according to your requirements. However, configuring such software can be a challenging task. With the Aspose.HTML class library, you can easily create your own application, since our API provides a powerful toolset for analyzing and collecting information from HTML documents.
There are many ways to navigate an HTML document. The following list shows the properties that give the simplest access to any DOM element:
|Property|Description|
|---|---|
|FirstChild|Returns a reference to the first child node of the element.|
|LastChild|Returns a reference to the last child node of the element.|
|NextSibling|Returns a reference to the sibling node that immediately follows the element.|
|PreviousSibling|Returns a reference to the sibling node that immediately precedes the element.|
|ChildNodes|Returns a list that contains all children of the element.|
Using these properties, you can navigate through an HTML document as follows:
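These properties mirror the standard DOM node interface, so the navigation pattern can be sketched with any DOM implementation. The minimal sketch below uses Python's standard `xml.dom.minidom` rather than Aspose.HTML, so the member names are camelCase (`firstChild`, `nextSibling`, and so on). The sample markup and variable names are assumptions made for illustration.

```python
from xml.dom.minidom import parseString

# Illustrative, well-formed markup (an assumption, not taken from the library docs)
html_code = "<html><body><span>Hello</span> <span>World!</span></body></html>"
document = parseString(html_code)
body = document.getElementsByTagName("body")[0]

# firstChild: the first SPAN element inside BODY
node = body.firstChild
first_text = node.firstChild.data

# nextSibling: the whitespace text node between the two SPAN elements
node = node.nextSibling
whitespace = node.data

# nextSibling again: the second SPAN, which is also BODY's lastChild
node = node.nextSibling
second_text = node.firstChild.data
is_last = node is body.lastChild

# previousSibling steps back to the whitespace text node
back_to_whitespace = node.previousSibling.data == whitespace

# childNodes: SPAN, text node, SPAN
child_count = len(body.childNodes)

print(first_text, second_text, child_count)
```

Note that the whitespace between the two elements is itself a node, so sibling navigation has to step over it.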
For more complex scenarios, when you need to find nodes that match a specific pattern (e.g., get a list of headers, links, etc.), you can use a specialized TreeWalker or NodeIterator object with a custom NodeFilter implementation.
The next example shows how to implement your own NodeFilter to skip all elements except images:
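As a minimal sketch of the idea, the filter below is written against Python's standard library rather than Aspose.HTML. It subclasses `xml.dom.NodeFilter.NodeFilter`, a small module that ships with CPython but is not described in the official library documentation; it defines the same `FILTER_ACCEPT` and `FILTER_SKIP` constants as the DOM Traversal specification. The class name `OnlyImageFilter` and the sample markup are assumptions.

```python
from xml.dom import Node
from xml.dom.NodeFilter import NodeFilter
from xml.dom.minidom import parseString


class OnlyImageFilter(NodeFilter):
    """Accept IMG elements and skip every other node."""

    def acceptNode(self, node):
        is_image = (
            node.nodeType == Node.ELEMENT_NODE
            and node.tagName.lower() == "img"
        )
        # FILTER_SKIP ignores the node itself but still lets its children be visited
        return self.FILTER_ACCEPT if is_image else self.FILTER_SKIP


# Try the filter on individual nodes of a tiny document
document = parseString('<p>Logo: <img src="logo.png"/></p>')
paragraph = document.documentElement
image_filter = OnlyImageFilter()

verdicts = [
    image_filter.acceptNode(paragraph),             # P element
    image_filter.acceptNode(paragraph.firstChild),  # text node
    image_filter.acceptNode(paragraph.lastChild),   # IMG element
]
print(verdicts)
```

Returning `FILTER_SKIP` instead of `FILTER_REJECT` matters here: a rejected node would also hide any images nested inside it.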
Once you have implemented a filter, you can use it for HTML navigation as follows:
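The sketch below shows a filtered traversal in Python's standard library, which has no TreeWalker of its own. The hypothetical `walk` generator stands in for repeatedly asking a tree walker for its next node: it visits nodes in document order and yields only those the filter accepts. The `walk` helper, the filter class, and the sample markup are all assumptions, not part of Aspose.HTML.

```python
from xml.dom import Node
from xml.dom.NodeFilter import NodeFilter
from xml.dom.minidom import parseString


class OnlyImageFilter(NodeFilter):
    """Accept IMG elements and skip every other node."""

    def acceptNode(self, node):
        is_image = (
            node.nodeType == Node.ELEMENT_NODE
            and node.tagName.lower() == "img"
        )
        return self.FILTER_ACCEPT if is_image else self.FILTER_SKIP


def walk(root, node_filter):
    """Yield the accepted descendants of root in document order."""
    for child in root.childNodes:
        verdict = node_filter.acceptNode(child)
        if verdict == NodeFilter.FILTER_REJECT:
            continue  # a rejected node hides its whole subtree
        if verdict == NodeFilter.FILTER_ACCEPT:
            yield child
        # accepted and skipped nodes both keep their children in play
        yield from walk(child, node_filter)


html_code = (
    "<html><body>"
    "<p>Hello, World!</p>"
    '<img src="image1.png"/>'
    '<div><img src="image2.png"/></div>'
    "</body></html>"
)
document = parseString(html_code)

sources = [image.getAttribute("src") for image in walk(document, OnlyImageFilter())]
print(sources)
```

The second image is found even though it sits inside a `div`, because the filter skips the `div` rather than rejecting it.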
An alternative to HTML navigation is the XML Path Language (XPath). The syntax of XPath expressions is quite simple and, more importantly, easy to read and maintain.
The following example shows how to use XPath queries with the Aspose.HTML API:
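To illustrate the kind of query involved, here is a minimal sketch that uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath; it is not the Aspose.HTML API. The expression selects every `span` nested anywhere inside an element whose `class` attribute equals `happy`. The markup and the class names are assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed markup with a single root element
html_code = """
<body>
  <div class="happy">
    <div><span>Hello</span></div>
  </div>
  <p class="sad"><span>Oops!</span></p>
  <p class="happy"><span>World!</span></p>
</body>
"""
root = ET.fromstring(html_code)

# Any element with class="happy", then any SPAN below it at any depth
spans = root.findall(".//*[@class='happy']//span")
texts = [span.text for span in spans]
print(texts)
```

The query reads much like a file path with conditions, which is why XPath expressions tend to stay readable as the selection rules grow.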
Along with HTML navigation and XPath, you can use the CSS Selector API, which is also supported by our library. This API lets you create a search pattern that matches elements in a document tree based on CSS selector syntax.