HTML Navigation with Aspose.HTML for Java

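In this article, you will learn how to navigate an HTML document and inspect its elements in detail using the Aspose.HTML for Java API. The API offers a powerful toolset for navigating a document with CSS selectors, XPath queries, or custom filters, so you can readily build your own applications to analyze, collect, or extract information from HTML documents.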

HTML Navigation

There are many ways to navigate HTML. The table below lists the simplest way to reach all DOM elements, using the properties of the Node class:

| Property | Description |
|---|---|
| FirstChild | Accessing this attribute of an element must return a reference to the first child node. |
| LastChild | Accessing this attribute of an element must return a reference to the last child node. |
| NextSibling | Accessing this attribute of an element must return a reference to the sibling node of that element which most immediately follows that element. |
| PreviousSibling | Accessing this attribute of an element must return a reference to the sibling node of that element which most immediately precedes that element. |
| ChildNodes | Returns a list that contains all children of that element. |

Using the properties above, you can navigate an HTML document as follows:

// Navigate the HTML DOM using Java

// Prepare HTML code
String htmlCode = "<span>Hello,</span> <span>World!</span>";

// Initialize a document from the prepared code
HTMLDocument document = new HTMLDocument(htmlCode, ".");

// Get the reference to the first element child (first <span>) of the document body
Element element = document.getBody().getFirstElementChild();
System.out.println(element.getTextContent());
// @output: Hello,

// Get the reference to the next element sibling (second <span>)
element = element.getNextElementSibling();
System.out.println(element.getTextContent());
// @output: World!
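
The example above uses the element-only accessors, which step over text nodes, whereas the FirstChild and NextSibling properties in the table return nodes of any type. The following is a minimal sketch of that difference. It deliberately uses the JDK's built-in `org.w3c.dom` API instead of Aspose.HTML so that it runs without the library; the class name `SiblingWalkSketch` and both helper methods are hypothetical, and the markup is wrapped in a `<body>` root because the JDK parser expects well-formed XML. Aspose.HTML follows the same W3C DOM model, so similar behavior is likely there, but that is an assumption rather than something this sketch verifies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SiblingWalkSketch {

    // Parses well-formed markup with the JDK's XML parser (an assumption: not Aspose.HTML).
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Visits every child of 'parent' by following the firstChild / nextSibling links.
    static List<String> describeChildren(Node parent) {
        List<String> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            String kind;
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                kind = "element";
            } else if (n.getNodeType() == Node.TEXT_NODE) {
                kind = "text";
            } else {
                kind = "other";
            }
            out.add(kind + ":" + n.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = parse("<body><span>Hello,</span> <span>World!</span></body>");
        // The single space between the two spans shows up as its own text node.
        describeChildren(doc.getDocumentElement()).forEach(System.out::println);
    }
}
```

Walking with the plain sibling links therefore yields three nodes for this markup (element, text, element), which is why element-only accessors are often more convenient when whitespace is irrelevant.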

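For more complex scenarios, where you need to find nodes that match a specific pattern (for example, to get a list of headings, links, and so on), you can use a specialized TreeWalker or NodeIterator object together with a custom Filter implementation.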
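
As a rough illustration of the "list of headings" case, the sketch below collects heading text with a NodeIterator, which returns matching nodes as a flat sequence in document order. It is written against the JDK's `org.w3c.dom.traversal` package rather than Aspose.HTML, and it assumes the JDK's default DOM implementation supports `DocumentTraversal`; the class name `HeadingIteratorSketch` and the method `headingTexts` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class HeadingIteratorSketch {

    // Returns the text of every h1..h6 element, in document order.
    static List<String> headingTexts(String markup) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }

        // Accept heading elements only; everything else is skipped.
        NodeFilter headingsOnly = n -> n.getNodeName().matches("h[1-6]")
                ? NodeFilter.FILTER_ACCEPT
                : NodeFilter.FILTER_SKIP;

        // Assumption: the JDK's default Document implements DocumentTraversal.
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, headingsOnly, false);

        List<String> texts = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            texts.add(n.getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) {
        String markup = "<body><h1>Title</h1><p>Intro</p><div><h2>Section</h2></div></body>";
        System.out.println(headingTexts(markup)); // prints [Title, Section]
    }
}
```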

The next example shows how to implement your own NodeFilter that skips all elements except images:

// Create custom NodeFilter to accept only image elements in Java

public static class OnlyImageFilter extends NodeFilter {
    @Override
    public short acceptNode(Node n) {
        // The current filter skips all elements, except IMG elements
        return "img".equals(n.getLocalName())
                ? FILTER_ACCEPT
                : FILTER_SKIP;
    }
}

Once the filter is in place, you can navigate the HTML as follows:

// Filter HTML elements using TreeWalker and custom NodeFilter in Aspose.HTML for Java

// Prepare HTML code
String code = "<p>Hello,</p>\n" +
        "<img src='image1.png'>\n" +
        "<img src='image2.png'>\n" +
        "<p>World!</p>\n";

// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");

// To start HTML navigation, we need to create an instance of TreeWalker
// The specified parameters mean that it starts walking from the root of the document, iterating all nodes, and using our custom implementation of the filter
ITreeWalker iterator = document.createTreeWalker(document, NodeFilter.SHOW_ALL, new NodeFilterUsageExample.OnlyImageFilter());
// Visit each node that the filter accepts
while (iterator.nextNode() != null) {
    // Since we are using our own filter, the current node will always be an instance of the HTMLImageElement
    // So, we don't need the additional validations here
    HTMLImageElement image = (HTMLImageElement) iterator.getCurrentNode();

    System.out.println(image.getSrc());
    // @output: image1.png
    // @output: image2.png
}
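
The filter above returns FILTER_SKIP for non-image nodes, which matters once images are nested: in the W3C traversal model, a TreeWalker still descends into the children of a skipped node, while FILTER_REJECT discards the whole subtree. The sketch below contrasts the two verdicts. It uses the JDK's `org.w3c.dom.traversal` API in place of Aspose.HTML, assumes the JDK's default DOM implementation supports `DocumentTraversal`, and compares `getNodeName()` because the JDK parser is not namespace-aware by default; `SkipVersusRejectSketch` and `imageSources` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

class SkipVersusRejectSketch {

    // Collects the src attribute of every <img> the walker reaches.
    // 'otherwise' is the verdict for non-image elements: FILTER_SKIP or FILTER_REJECT.
    static List<String> imageSources(String markup, short otherwise) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }

        NodeFilter filter = n -> "img".equals(n.getNodeName())
                ? NodeFilter.FILTER_ACCEPT
                : otherwise;

        // Assumption: the JDK's default Document implements DocumentTraversal.
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, false);

        List<String> sources = new ArrayList<>();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            sources.add(((Element) n).getAttribute("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        String markup = "<body><p>Hello,</p>"
                + "<div><img src='image1.png'/></div>"
                + "<img src='image2.png'/></body>";
        // Skipping the <div> still visits the image inside it.
        System.out.println(imageSources(markup, NodeFilter.FILTER_SKIP));   // prints [image1.png, image2.png]
        // Rejecting the <div> drops its whole subtree, including image1.png.
        System.out.println(imageSources(markup, NodeFilter.FILTER_REJECT)); // prints [image2.png]
    }
}
```

For a filter meant to find every image in a document, FILTER_SKIP is usually the appropriate default, since FILTER_REJECT would hide images nested inside other elements.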

XPath

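An alternative to HTML navigation is the XML Path Language (XPath). The syntax of XPath expressions is fairly simple and, more importantly, easy to read and maintain.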

The following example shows how to use XPath queries with the Aspose.HTML for Java API:

// Select HTML elements using XPath expression in Aspose.HTML for Java

// Prepare HTML code
String code = "<div class='happy'>\n" +
        "    <div>\n" +
        "        <span>Hello!</span>\n" +
        "    </div>\n" +
        "</div>\n" +
        "<p class='happy'>\n" +
        "    <span>World!</span>\n" +
        "</p>\n";

// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");

// Here, we evaluate the XPath expression that selects all descendant <span> elements of elements whose 'class' attribute equals 'happy'
IXPathResult result = document.evaluate("//*[@class='happy']//span",
        document,
        null,
        XPathResultType.Any,
        null
);

// Iterate over the resulting nodes
for (Node node; (node = result.iterateNext()) != null; ) {
    System.out.println(node.getTextContent());
    // @output: Hello!
    // @output: World!
}
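
To experiment with the same expression outside Aspose.HTML, the following sketch evaluates it with the JDK's built-in `javax.xml.xpath` API. This is a stand-in for illustration, not the Aspose.HTML API: the markup is wrapped in a `<body>` root so that it is well-formed XML, and `XPathSketch` and `textOfMatches` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Evaluates 'expression' against 'markup' and returns the text of each matched node.
    static List<String> textOfMatches(String markup, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(markup)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String markup = "<body>"
                + "<div class='happy'><div><span>Hello!</span></div></div>"
                + "<p class='happy'><span>World!</span></p>"
                + "</body>";
        // '//span' after the predicate matches descendants at any depth, not only direct children.
        System.out.println(textOfMatches(markup, "//*[@class='happy']//span")); // prints [Hello!, World!]
        // A single slash restricts the match to direct children of the 'happy' elements.
        System.out.println(textOfMatches(markup, "//*[@class='happy']/span"));  // prints [World!]
    }
}
```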

CSS Selectors

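In addition to HTML navigation and XPath, you can use the CSS Selector API, which the library also supports. This API lets you build a search pattern based on CSS selector syntax to match elements in the document tree.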

// Select HTML elements using CSS selector querySelectorAll method in Aspose.HTML for Java

// Prepare HTML code
String code = "<div class='happy'>\n" +
        "    <div>\n" +
        "        <span>Hello,</span>\n" +
        "    </div>\n" +
        "</div>\n" +
        "<p class='happy'>\n" +
        "    <span>World!</span>\n" +
        "</p>\n";

// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");

// Here, we create a CSS selector that matches all SPAN elements nested inside elements whose 'class' attribute contains 'happy'
NodeList elements = document.querySelectorAll(".happy span");

// Iterate over the resulting list of elements
elements.forEach(element -> {
    System.out.println(((HTMLElement) element).getInnerHTML());
    // @output: Hello,
    // @output: World!
});

