HTML Navigation Using Aspose.HTML for Java

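In this article, you will learn how to navigate an HTML document and inspect its elements in detail using the Aspose.HTML for Java API. The API provides a powerful set of tools for navigating a document with CSS selectors, XPath queries, or custom filters, so you can easily build your own applications that analyze, collect, or extract information from HTML documents.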

HTML Navigation

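There are many ways to navigate HTML. The table below lists the simplest way to reach every DOM element, using the properties of the Node class. In Java, each property is exposed through a getter, for example getFirstChild():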

Property	Description
FirstChild	Returns a reference to the first child node of the element.
LastChild	Returns a reference to the last child node of the element.
NextSibling	Returns a reference to the sibling node that immediately follows the element.
PreviousSibling	Returns a reference to the sibling node that immediately precedes the element.
ChildNodes	Returns a list that contains all child nodes of the element.

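Using these properties, you can navigate an HTML document as shown below. The example uses the element-only counterparts getFirstElementChild() and getNextElementSibling(), which skip text nodes such as the whitespace between the two <span> elements: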

// Navigate the HTML DOM using Java

// Prepare HTML code
String htmlCode = "<span>Hello,</span> <span>World!</span>";

// Initialize a document from the prepared code
HTMLDocument document = new HTMLDocument(htmlCode, ".");

// Get the reference to the first child element (first <span>) of the document body
Element element = document.getBody().getFirstElementChild();
System.out.println(element.getTextContent());
// @output: Hello,

// Get the reference to the second <span> element
element = element.getNextElementSibling();
System.out.println(element.getTextContent());
// @output: World!

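To make the difference between node-level and element-level navigation concrete, here is a minimal sketch that uses the JDK's built-in org.w3c.dom API rather than Aspose.HTML, on a well-formed fragment. The class name SiblingNavigationSketch and the helper nextElementSibling are hypothetical names introduced for illustration. The same W3C DOM properties from the table are used, but the exact Aspose.HTML behavior should be checked against its API reference.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch only: JDK org.w3c.dom (not Aspose.HTML), input must be well-formed markup.
class SiblingNavigationSketch {

    // Parses a well-formed fragment and returns its root element.
    static Element parseRoot(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Returns the first sibling after 'node' that is an element, or null if there is none.
    static Node nextElementSibling(Node node) {
        Node next = node.getNextSibling();
        while (next != null && next.getNodeType() != Node.ELEMENT_NODE) {
            next = next.getNextSibling();
        }
        return next;
    }

    public static void main(String[] args) {
        Element body = parseRoot("<body><span>Hello,</span> <span>World!</span></body>");

        Node first = body.getFirstChild();        // first <span>
        Node afterFirst = first.getNextSibling(); // whitespace text node between the spans
        Node second = nextElementSibling(first);  // second <span>

        System.out.println(body.getChildNodes().getLength());           // 3
        System.out.println(afterFirst.getNodeType() == Node.TEXT_NODE); // true
        System.out.println(second.getTextContent());                    // World!
        System.out.println(body.getLastChild().isSameNode(second));     // true
    }
}
```

The plain NextSibling property lands on the whitespace text node, which is why element-only accessors are usually more convenient when you only care about tags.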

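For more complex scenarios, where you need to find nodes that match a specific pattern (for example, a list of headings or links), you can use the specialized TreeWalker or NodeIterator objects together with a custom filter implementation.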

The following example shows how to implement your own NodeFilter that skips every element except images:

// Create custom NodeFilter to accept only image elements in Java

public static class OnlyImageFilter extends NodeFilter {
    @Override
    public short acceptNode(Node n) {
        // The current filter skips all elements, except IMG elements
        return "img".equals(n.getLocalName())
                ? FILTER_ACCEPT
                : FILTER_SKIP;
    }
}


Once the filter is in place, you can use HTML navigation as follows:

// Filter HTML elements using TreeWalker and custom NodeFilter in Aspose.HTML for Java

// Prepare HTML code
String code = "<p>Hello,</p>\n" +
        "<img src='image1.png'>\n" +
        "<img src='image2.png'>\n" +
        "<p>World!</p>\n";

// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");

// To start HTML navigation, we need to create an instance of TreeWalker
// The specified parameters mean that it starts walking from the root of the document, iterating all nodes, and using our custom implementation of the filter
ITreeWalker iterator = document.createTreeWalker(document, NodeFilter.SHOW_ALL, new NodeFilterUsageExample.OnlyImageFilter());
// Walk through every node accepted by the filter
while (iterator.nextNode() != null) {
    // Since we are using our own filter, the current node will always be an instance of the HTMLImageElement
    // So, we don't need the additional validations here
    HTMLImageElement image = (HTMLImageElement) iterator.getCurrentNode();

    System.out.println(image.getSrc());
    // @output: image1.png
    // @output: image2.png
}

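The filter above returns FILTER_SKIP rather than FILTER_REJECT for non-image nodes, and with a TreeWalker that choice matters: a skipped node is passed over but its children are still visited, while a rejected node is dropped together with its whole subtree. The sketch below illustrates this with the JDK's built-in org.w3c.dom.traversal API instead of Aspose.HTML. The class name SkipVersusRejectSketch and its method are hypothetical, and the cast to DocumentTraversal assumes the JDK's default DOM implementation, which supports the Traversal feature.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Sketch only: JDK org.w3c.dom.traversal (not Aspose.HTML), input must be well-formed markup.
class SkipVersusRejectSketch {

    // Collects the src attribute of every <img> the walker reaches when all
    // other elements receive 'verdictForOtherNodes' (FILTER_SKIP or FILTER_REJECT).
    static List<String> imageSources(String markup, short verdictForOtherNodes) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }

        NodeFilter onlyImages = node -> "img".equals(node.getNodeName())
                ? NodeFilter.FILTER_ACCEPT
                : verdictForOtherNodes;

        // Assumption: the default JDK document implements DocumentTraversal.
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, onlyImages, false);

        List<String> sources = new ArrayList<>();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            sources.add(((Element) n).getAttribute("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        String markup = "<body><p>Hello,</p><div><img src=\"image1.png\"/></div>"
                + "<img src=\"image2.png\"/></body>";
        // SKIP still descends into the <div>, so both images are found
        System.out.println(imageSources(markup, NodeFilter.FILTER_SKIP));   // [image1.png, image2.png]
        // REJECT discards the <div> and everything inside it
        System.out.println(imageSources(markup, NodeFilter.FILTER_REJECT)); // [image2.png]
    }
}
```

If images may be nested inside other elements, as they usually are in real pages, FILTER_SKIP is the safer default for this kind of filter.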

XPath

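An alternative to HTML navigation is the XML Path Language (XPath). XPath expressions have a fairly simple syntax and, more importantly, are easy to read and maintain.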

The following example shows how to use XPath queries with the Aspose.HTML for Java API:

// Select HTML elements using XPath expression in Aspose.HTML for Java

// Prepare HTML code
String code = "<div class='happy'>\n" +
        "    <div>\n" +
        "        <span>Hello!</span>\n" +
        "    </div>\n" +
        "</div>\n" +
        "<p class='happy'>\n" +
        "    <span>World!</span>\n" +
        "</p>\n";

// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");

// Here, we evaluate the XPath expression that selects all descendant <span> elements of elements whose 'class' attribute equals 'happy'
IXPathResult result = document.evaluate("//*[@class='happy']//span",
        document,
        null,
        XPathResultType.Any,
        null
);

// Iterate over the resulting nodes
for (Node node; (node = result.iterateNext()) != null; ) {
    System.out.println(node.getTextContent());
    // @output: Hello!
    // @output: World!
}

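The same XPath expression can be tried without any third-party library, which is handy when experimenting with queries. The following sketch uses the JDK's javax.xml.xpath package rather than Aspose.HTML, so it needs well-formed input with a single root element. The class name XPathQuerySketch and the method selectText are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: JDK javax.xml.xpath (not Aspose.HTML), input must be well-formed markup.
class XPathQuerySketch {

    // Evaluates 'expression' against 'markup' and returns the text of each matched node.
    static List<String> selectText(String markup, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    expression,
                    new InputSource(new StringReader(markup)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String markup = "<body>"
                + "<div class='happy'><div><span>Hello!</span></div></div>"
                + "<p class='happy'><span>World!</span></p>"
                + "<span>ignored</span>"
                + "</body>";

        // '//' before span matches descendants at any depth, not only direct children
        System.out.println(selectText(markup, "//*[@class='happy']//span")); // [Hello!, World!]
        // A single '/' restricts the match to direct children of <p class='happy'>
        System.out.println(selectText(markup, "//p[@class='happy']/span"));  // [World!]
    }
}
```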

CSS Selector

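In addition to HTML navigation and XPath, you can use the CSS Selector API, which our library also supports. This API lets you create a search pattern based on CSS selector syntax to match elements in a document tree.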

// Select HTML elements using CSS selector querySelectorAll method in Aspose.HTML for Java

// Prepare HTML code
String code = "<div class='happy'>\n" +
        "    <div>\n" +
        "        <span>Hello,</span>\n" +
        "    </div>\n" +
        "</div>\n" +
        "<p class='happy'>\n" +
        "    <span>World!</span>\n" +
        "</p>\n";

// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");

// Here, we create a CSS selector that matches all <span> descendants of elements whose 'class' attribute contains 'happy'
NodeList elements = document.querySelectorAll(".happy span");

// Iterate over the resulting list of elements
elements.forEach(element -> {
    System.out.println(((HTMLElement) element).getInnerHTML());
    // @output: Hello,
    // @output: World!
});

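It can help to spell out what a descendant selector such as ".happy span" means: a <span> matches if any of its ancestors carries the class token "happy". The sketch below hand-rolls that rule with the JDK's org.w3c.dom API purely as an illustration. It is not how Aspose.HTML implements querySelectorAll, and the class name DescendantSelectorSketch and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: hand-written equivalent of the selector ".<cls> span" on well-formed markup.
class DescendantSelectorSketch {

    // True if the element's class attribute contains 'cls' as a whitespace-separated token.
    static boolean hasClass(Element element, String cls) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    // Returns the text of every <span> that has an ancestor with class token 'cls'.
    static List<String> spansInside(String markup, String cls) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        List<String> texts = new ArrayList<>();
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            // Walk up the ancestor chain; the loop ends at the document node
            for (Node a = spans.item(i).getParentNode(); a instanceof Element; a = a.getParentNode()) {
                if (hasClass((Element) a, cls)) {
                    texts.add(spans.item(i).getTextContent());
                    break;
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String markup = "<body>"
                + "<div class='happy big'><div><span>Hello,</span></div></div>"
                + "<p class='happy'><span>World!</span></p>"
                + "<span>ignored</span>"
                + "</body>";
        System.out.println(spansInside(markup, "happy")); // [Hello,, World!]
    }
}
```

Note one difference from the XPath test [@class='happy']: a CSS class selector matches individual class tokens, so class='happy big' still matches here, whereas the XPath equality compares the whole attribute value.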

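Aspose.HTML offers the AI Keyword Extractor, an AI-powered tool for extracting keywords from web pages, plain text, or files. The application helps you quickly identify key topics and trends for website optimization, competitor analysis, or summarizing large documents. Paste text or a URL, choose your settings, and click "Extract" to get accurate, meaningful keywords within seconds. It is well suited to improving search engine visibility, content targeting, and data-driven decision making.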
