Create HTML Document – Create, Load HTML in Java
HTML Document
The HTMLDocument class is a starting point for the Aspose.HTML for Java class library, allowing developers to work with HTML content programmatically. The HTMLDocument class represents an HTML page as rendered in a browser, serving as the root of the Document Object Model (DOM).
Some HTMLDocument features
- Flexible constructors support creating an HTML document from a file, URL, stream, or string.
- HTMLDocument provides an in-memory representation that ensures the full DOM structure for manipulation.
- Event handling includes support for DOM events for asynchronous operations.
The HTMLDocument class provides an in-memory representation of an HTML DOM and is based entirely on the W3C DOM and WHATWG DOM specifications supported by many modern browsers. If you are familiar with the WHATWG DOM, WHATWG HTML, and JavaScript standards, you will find Aspose.HTML for Java comfortable to use. Otherwise, you can visit www.w3schools.com, where you can find many examples and tutorials on how to work with HTML documents.
Create an Empty HTML Document
The following code snippet shows the usage of the default HTMLDocument() constructor to create an empty HTML document and save it to a file.
// Initialize an empty HTML Document.
HTMLDocument document = new HTMLDocument();

// Save the document to disk.
document.save("create-empty-document.html");
After creation, the file create-empty-document.html appears with the initial document structure: the empty document includes the <html>, <head>, and <body> elements. For more details about saving HTML files, see the Save HTML Document article.
Resulting File Structure:
<html>
    <head></head>
    <body></body>
</html>
Create a New HTML Document
To generate a document programmatically from scratch, use the HTMLDocument() constructor with no parameters, as shown in the code snippet above. Once the document object is created, you can populate it with content, for example by creating a text node and adding it to the body of the document:
// Initialize an empty HTML Document.
HTMLDocument document = new HTMLDocument();

// Create a text node and add it to the document body
Text text = document.createTextNode("Hello World!");
document.getBody().appendChild(text);

// Save the document to disk
document.save("create-new-document.html");
Load HTML from a File
The following code snippet shows how to load an HTMLDocument from an existing file:
// Prepare a 'load-from-file.html' file.
try (java.io.FileWriter fileWriter = new java.io.FileWriter("load-from-file.html")) {
    fileWriter.write("Hello World!");
}

// Load from a 'load-from-file.html' file.
HTMLDocument document = new HTMLDocument("load-from-file.html");

// Write the document content to the output stream.
System.out.println(document.getDocumentElement().getOuterHTML());
Load HTML from a URL
The HTMLDocument class can fetch and load HTML content from a web page. The next code snippet shows how to load a web page into an HTMLDocument.
If you pass a URL that cannot be reached at that moment, the library throws a DOMException with the specialized code NetworkError to inform you that the selected resource cannot be found.
// Load a document from 'https://docs.aspose.com/html/net/creating-a-document/document.html' web page
HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/creating-a-document/document.html");

System.out.println(document.getDocumentElement().getOuterHTML());
Load HTML from HTML Code
If you have prepared HTML code as an in-memory String or InputStream object, you don't need to save it to a file; simply pass the HTML code to the specialized constructors. To create a document from a string, use the constructor HTMLDocument(content, baseUri) with the HTML content and a baseUri:
// Prepare HTML code
String html_code = "<p>Hello World!</p>";

// Initialize a document from the string variable
HTMLDocument document = new HTMLDocument(html_code, ".");

// Save the document to disk
document.save("create-from-string.html");
If your HTML code has linked resources (styles, scripts, images, etc.), you need to pass a valid baseUri parameter to the constructor of the document. It is used to resolve the location of each resource during document loading.
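To illustrate what resolving a resource against a base URI means, here is a minimal sketch that uses only the standard java.net.URI class. It shows the general relative-reference rules, not the library's internal logic, which may differ in detail. The class name BaseUriSketch and the example.com addresses are assumptions made for illustration.

```java
import java.net.URI;

class BaseUriSketch {

    // Resolve a resource reference against a base URI using the standard URI rules.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Hypothetical base URI of the document being loaded
        String base = "https://example.com/docs/page.html";

        // A relative path is resolved against the directory of the base URI
        System.out.println(resolve(base, "css/style.css"));
        // A root-relative path keeps only the scheme and host of the base URI
        System.out.println(resolve(base, "/images/logo.png"));
        // An absolute reference ignores the base URI entirely
        System.out.println(resolve(base, "https://cdn.example.com/app.js"));
    }
}
```

With a wrong or missing base, relative references such as css/style.css have nothing to resolve against, which is why a valid base matters when the markup links to external resources.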
Load HTML from a Stream
To create an HTML document from a stream, you can use the HTMLDocument(stream, string) constructor:
// Create an in-memory stream from the HTML code (explicit UTF-8 avoids platform-dependent encoding)
String code = "<p>Hello World! I love HTML!</p>";
java.io.InputStream inputStream = new java.io.ByteArrayInputStream(code.getBytes(java.nio.charset.StandardCharsets.UTF_8));

// Initialize a document from the stream variable
HTMLDocument document = new HTMLDocument(inputStream, ".");

// Save the document to disk
document.save("load-from-stream.html");
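When a stream is prepared from a string, the resulting bytes depend on the charset used for encoding. The following sketch is independent of the library and uses only standard Java classes. It wraps markup in an in-memory stream with an explicit UTF-8 encoding and reads it back to confirm the round trip. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamCharsetSketch {

    // Wrap HTML markup in an in-memory stream, encoded as UTF-8.
    static InputStream toStream(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    // Read the whole stream back into a string, decoding as UTF-8.
    static String readBack(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample markup with non-ASCII characters, written as Unicode escapes
        String html = "<p>Gr\u00fc\u00dfe, HTML!</p>";
        // Encoding and decoding with the same charset preserves the text
        System.out.println(readBack(toStream(html)).equals(html));
    }
}
```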
Working with SVG, MHTML, and EPUB Documents
SVG Document
Since Scalable Vector Graphics (SVG) is part of the W3C standards and can be embedded into an HTMLDocument, we implemented SVGDocument and all its functionality. Our implementation is based on the official SVG 2 specification, so you can load, read, and manipulate SVG documents as the specification describes.
Since SVGDocument and HTMLDocument are based on the same WHATWG DOM standard, all operations such as loading, reading, editing, converting, and saving are similar for both documents. So, all examples that show manipulation with HTMLDocument are applicable to SVGDocument as well.
The example below shows how to load an SVG document from an in-memory String variable:
// Initialize an SVG document from a string object
SVGDocument document = new SVGDocument("<svg xmlns='http://www.w3.org/2000/svg'><circle cx='50' cy='50' r='40'/></svg>", ".");

// Write the document content to the output stream
System.out.println(document.getDocumentElement().getOuterHTML());
MHTML Document
MHTML (MIME encapsulation of aggregate HTML documents) is a specialized format for creating web page archives. The Aspose.HTML for Java library supports MHTML, but its functionality is currently limited to converting and rendering operations from MHTML to other supported output formats. For more information, refer to the Converting Between Formats article.
EPUB Document
EPUB, an electronic publication format widely used for eBooks, has similar limitations in the Aspose.HTML for Java library as MHTML. The library supports only rendering operations from EPUB to supported output formats. For additional details, visit the Converting Between Formats article.
Asynchronous Operations
We realize that loading a document can be a resource-intensive operation, since it requires loading not only the document itself but also all linked resources and processing all scripts. So, the following code snippets show how to use asynchronous operations and load an HTMLDocument without blocking the main thread:
// Create an instance of an HTML document
HTMLDocument document = new HTMLDocument();

// Create a StringBuilder to hold the outerHTML value
StringBuilder outerHTML = new StringBuilder();

// Subscribe to the 'ReadyStateChange' event.
// This event will be fired during the document loading process.
document.OnReadyStateChange.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // Check the value of the 'ReadyState' property.
        // This property represents the status of the document. For details, see https://www.w3schools.com/jsref/prop_doc_readystate.asp
        if (document.getReadyState().equals("complete")) {
            // Fill the outerHTML variable with the markup of the loaded document
            outerHTML.append(document.getDocumentElement().getOuterHTML());
        }
    }
});

// Navigate asynchronously to the specified URI (the same page as in the next example)
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

// Give the loading process up to 5 seconds to complete
Thread.sleep(5000);

System.out.println("outerHTML = " + outerHTML);
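ReadyStateChange is not the only event that can be used to handle an asynchronous loading operation; you can also subscribe to the Load event, as shown at the end of this section. The next snippet still listens for ReadyStateChange but replaces the fixed sleep with a wait/notify handshake: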
// Create the instance of HTML Document
HTMLDocument document = new HTMLDocument();

// Lock object shared by the event handler and the waiting thread
final Object lock = new Object();

// Subscribe to the 'ReadyStateChange' event.
// This event will be fired during the document loading process.
document.OnReadyStateChange.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // Check the value of the 'ReadyState' property.
        // This property represents the status of the document. For details, see https://www.w3schools.com/jsref/prop_doc_readystate.asp
        if (document.getReadyState().equals("complete")) {
            System.out.println(document.getDocumentElement().getOuterHTML());
            // notifyAll() must be called on the shared lock while holding its monitor
            synchronized (lock) {
                lock.notifyAll();
            }
        }
    }
});

// Navigate asynchronously to the specified URI
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

// Wait up to 10 seconds for the handler's notification
synchronized (lock) {
    lock.wait(10000);
}
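A wait/notify handshake can miss a notification that arrives before the waiting thread starts to wait. One common alternative is java.util.concurrent.CountDownLatch. The sketch below is a library-independent illustration: a plain background thread stands in for the asynchronous loader and its completion handler, and the class name LatchWaitSketch is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class LatchWaitSketch {

    // Starts a simulated asynchronous load and waits for its completion callback.
    // Returns the loaded content, or null on timeout or interruption.
    static String loadAndWait(long timeoutMillis) {
        CountDownLatch loaded = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();

        // Stand-in for an asynchronous loader that invokes a handler when done
        Thread loader = new Thread(() -> {
            result.set("<html><head></head><body>Hello World!</body></html>");
            loaded.countDown(); // plays the role of the completion event handler
        });
        loader.start();

        try {
            // A latch remembers the signal, so it is not lost if it fires before await()
            boolean finished = loaded.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return finished ? result.get() : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(loadAndWait(10_000));
    }
}
```

In real code, the countDown() call would sit inside the event handler in place of the notifyAll() call.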
The following Java code example uses the HTMLDocumentWaiter class in the context of working with HTML documents asynchronously in the Aspose.HTML for Java library. The HTMLDocumentWaiter class provides a constructor and a method that execute the asynchronous loading operation and wait in a separate thread until either the loading is finished or the current thread is interrupted. Let's see what the code does:
public class HTMLDocumentWaiter implements Runnable {

    private final Examples_Java_WorkingWithDocuments_CreatingADocument_HTMLDocumentAsynchronouslyOnLoad html;

    public HTMLDocumentWaiter(Examples_Java_WorkingWithDocuments_CreatingADocument_HTMLDocumentAsynchronouslyOnLoad html) throws Exception {
        this.html = html;
        this.html.execute();
    }

    @Override
    public void run() {
        System.out.println("Current Thread: " + Thread.currentThread().getName() + "; " + Thread.currentThread().getId());

        while (!Thread.currentThread().isInterrupted() && html.getMsg() == null) {
            try {
                Thread.sleep(60000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        Thread.currentThread().interrupt();
    }
}
The SimpleWait class contains the main() method that serves as the entry point for the Java application. Inside the main() method, an html instance of the class Examples_Java_WorkingWithDocuments_CreatingADocument_HTMLDocumentAsynchronouslyOnLoad is created; it is responsible for loading the HTML document asynchronously. The main() method then creates an HTMLDocumentWaiter object to wait for the loading to complete and starts a new thread to execute the waiting process. The following code snippet shows the OnLoad subscription used for the asynchronous loading; the handler stores the document markup in msg, which HTMLDocumentWaiter polls through getMsg():
// Create the instance of HTML Document
HTMLDocument document = new HTMLDocument();

// Subscribe to the 'OnLoad' event.
// This event will be fired once the document is fully loaded.
document.OnLoad.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // 'msg' is a field of the enclosing class; the waiter reads it via getMsg()
        msg = document.getDocumentElement().getOuterHTML();
        System.out.println(msg);
    }
});

// Navigate asynchronously to the specified URI
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");
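The waiter flow described in this section (start the load, poll for a message in a separate thread, stop on completion or interruption) can be sketched without the library. This is an illustrative stand-in, not the code from the examples repository. The class PollingWaiterSketch, its Supplier-based accessor (playing the role of getMsg()), and the short polling interval are all assumptions.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class PollingWaiterSketch implements Runnable {

    private final Supplier<String> message; // hypothetical accessor, similar in role to getMsg()
    private final long pollMillis;

    PollingWaiterSketch(Supplier<String> message, long pollMillis) {
        this.message = message;
        this.pollMillis = pollMillis;
    }

    @Override
    public void run() {
        // Poll until a message arrives or the thread is interrupted
        while (message.get() == null) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return;
            }
        }
    }

    // Drives the waiter against a simulated loader; returns the message,
    // or null if the waiter did not finish within two seconds.
    static String demo() {
        AtomicReference<String> msg = new AtomicReference<>();
        Thread waiter = new Thread(new PollingWaiterSketch(msg::get, 10));
        waiter.start();

        // Stands in for the assignment made by the OnLoad handler
        msg.set("<html>loaded</html>");

        try {
            waiter.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        return waiter.isAlive() ? null : msg.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Restoring the interrupt flag and returning, rather than wrapping the InterruptedException in a RuntimeException, lets the waiting thread end quietly when it is interrupted. Whether that behavior is preferable depends on how the caller wants to observe cancellation.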
Conclusions
- Comprehensive DOM manipulation: The HTMLDocument class provides a robust and standards-compliant way to create, modify, and manipulate HTML documents programmatically, following W3C and WHATWG specifications.
- Flexible document creation and loading: Using constructors, developers can create documents from scratch, load HTML from a variety of sources (files, URLs, streams, strings), or dynamically generate content.
- Advanced operations support: Features such as asynchronous loading and event handling enable integration of resource-intensive operations without blocking the main application thread.
- Cross-format compatibility: The library extends some HTML functionality to other document formats such as SVG, MHTML, and EPUB, offering a unified approach to handling diverse web content.
You can download the complete examples and data files from GitHub.