Create HTML Document – Create, Load HTML in Java

HTML Document

The HTMLDocument class is a starting point for the Aspose.HTML for Java library, allowing developers to work with HTML content programmatically. The HTMLDocument class represents an HTML page as rendered in a browser, serving as the root of the Document Object Model (DOM).

Some HTMLDocument features

The HTMLDocument provides an in-memory representation of an HTML DOM and is based entirely on the W3C DOM and WHATWG DOM specifications supported by many modern browsers. If you are familiar with the WHATWG DOM, WHATWG HTML, and JavaScript standards, you will find Aspose.HTML for Java comfortable to use. Otherwise, you can visit www.w3schools.com, where you can find many examples and tutorials on how to work with HTML documents.

Create an Empty HTML Document

The following code snippet shows the usage of the default HTMLDocument() constructor to create an empty HTML document and save it to a file.

// Create an empty HTML document using Java

// Initialize an empty HTML Document
HTMLDocument document = new HTMLDocument();

// Save the document to disk
document.save("create-empty-document.html");

After creation, the file create-empty-document.html contains the initial document structure: the empty document includes the <html>, <head>, and <body> elements. For more details about saving HTML files, see the Save HTML Document article.

Resulting File Structure:

<html>
    <head></head>
    <body></body>
</html>

Create a New HTML Document

To generate a document programmatically from scratch, use the HTMLDocument() constructor with no parameters, as shown in the code snippet above. Once the document object is created, you can populate it with content, for example by creating a text node and adding it to the body of the document:

// Create an HTML document using Java

// Initialize an empty HTML document
HTMLDocument document = new HTMLDocument();

// Create a text node and add it to the document
Text text = document.createTextNode("Hello, World!");
document.getBody().appendChild(text);

// Save the document to disk
document.save("create-new-document.html");
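The createTextNode and appendChild calls above follow W3C DOM naming. As a rough illustration of the same pattern without the Aspose.HTML dependency, the sketch below builds a small tree with the XML DOM implementation bundled in the JDK (javax.xml.parsers and org.w3c.dom). It is not the Aspose.HTML API, and the DomSketch class and buildHelloDocument method are names invented for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Hypothetical helper class; it uses the JDK's built-in W3C DOM, not Aspose.HTML
class DomSketch {

    // Build an html/head/body tree in memory and put the message into the body
    static Document buildHelloDocument(String message) {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }

        Element html = document.createElement("html");
        Element head = document.createElement("head");
        Element body = document.createElement("body");
        document.appendChild(html);
        html.appendChild(head);
        html.appendChild(body);

        // The same two steps as in the snippet above: create a text node, append it to the body
        Text text = document.createTextNode(message);
        body.appendChild(text);
        return document;
    }

    public static void main(String[] args) {
        Document document = buildHelloDocument("Hello, World!");
        Element body = (Element) document.getDocumentElement().getLastChild();
        System.out.println(body.getTagName() + ": " + body.getTextContent());
    }
}
```

Unlike the empty HTMLDocument described earlier, a fresh JDK Document has no elements at all, so this sketch creates the html, head, and body elements by hand.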

Load HTML from a File

The following code snippet shows how to load an HTMLDocument from an existing file:

// Load HTML from a file using Java

// Prepare the "load-from-file.html" file
try (java.io.FileWriter fileWriter = new java.io.FileWriter("load-from-file.html")) {
    fileWriter.write("Hello, World!");
}

// Load HTML from the file
HTMLDocument document = new HTMLDocument("load-from-file.html");

// Write the document content to the output stream
System.out.println(document.getDocumentElement().getOuterHTML());

Load HTML from a URL

The HTMLDocument class can fetch and load HTML content from a web page. In the next code snippet you can see how to load a web page into HTMLDocument.

If you pass a URL that cannot be reached at the moment, the library throws a DOMException with the specialized code NetworkError to inform you that the selected resource cannot be found.

// Load HTML from a URL using Java

// Load a document from https://docs.aspose.com/html/files/document.html web page
HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/files/document.html");

System.out.println(document.getDocumentElement().getOuterHTML());

Load HTML from HTML Code

If your HTML code is already in memory as a String or InputStream object, you do not need to save it to a file first; simply pass the HTML code to the specialized constructors. To create a document from a string, use the constructor HTMLDocument(content, baseUri) with the HTML content and a baseUri:

// Create HTML from a string using Java

// Prepare HTML code
String htmlCode = "<p>Hello, World!</p>";

// Initialize a document from a string variable
HTMLDocument document = new HTMLDocument(htmlCode, ".");

// Save the document to disk
document.save("create-from-string.html");

If your HTML code has linked resources (styles, scripts, images, etc.), you need to pass a valid baseUri parameter to the document constructor. It is used to resolve the location of each resource during document loading.
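To make the role of the base URI more concrete, here is a minimal sketch of how relative references resolve against a base, using java.net.URI from the JDK. It only illustrates general URI resolution and does not show how Aspose.HTML resolves resources internally. The example.com address and the BaseUriSketch class are placeholders chosen for this example.

```java
import java.net.URI;

// Hypothetical helper class illustrating general URI resolution with the JDK
class BaseUriSketch {

    // Resolve a resource reference against a base URI
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "https://example.com/docs/page.html"; // placeholder base URI

        // A path relative to the folder of the base document
        System.out.println(resolve(base, "styles/main.css"));
        // A path that climbs one folder up
        System.out.println(resolve(base, "../images/logo.png"));
        // A path relative to the site root
        System.out.println(resolve(base, "/favicon.ico"));
    }
}
```

With this base, the three references resolve to https://example.com/docs/styles/main.css, https://example.com/images/logo.png, and https://example.com/favicon.ico. A base that points to the wrong place would make every relative link in the document point to the wrong place as well.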

Load HTML from a Stream

To create an HTML document from a stream, you can use the HTMLDocument(stream, string) constructor:

// Load HTML from a stream using Java

// Create an in-memory stream from the HTML code (encoded as UTF-8)
String code = "<p>Hello, World! I love HTML!</p>";
java.io.InputStream inputStream = new java.io.ByteArrayInputStream(code.getBytes(java.nio.charset.StandardCharsets.UTF_8));

// Initialize a document from the stream variable
HTMLDocument document = new HTMLDocument(inputStream, ".");

// Save the document to disk
document.save("load-from-stream.html");
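Encoding deserves some attention when a stream is built from a string, because String.getBytes() without arguments uses the platform default charset. The sketch below uses only the JDK and shows one way to pin the encoding to UTF-8 and verify the round trip. StreamSketch and its methods are hypothetical names, and how Aspose.HTML detects the encoding of a stream is not covered here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; plain JDK code, independent of Aspose.HTML
class StreamSketch {

    // Encode the markup as UTF-8 bytes and expose them as a stream
    static InputStream toUtf8Stream(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    // Read the whole stream back and decode it as UTF-8
    static String readUtf8(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>Gr\u00fc\u00dfe</p>"; // contains two non-ASCII letters
        String roundTrip = readUtf8(toUtf8Stream(html));
        System.out.println(html.equals(roundTrip));
    }
}
```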

Working with SVG, MHTML, and EPUB Documents

SVG Document

Since Scalable Vector Graphics (SVG) is part of the W3C standards and can be embedded into an HTMLDocument, we implemented SVGDocument and all its functionality. Our implementation is based on the official SVG 2 specification, so you can load, read, and manipulate SVG documents as the specification describes.

Since SVGDocument and HTMLDocument are based on the same WHATWG DOM standard, all operations such as loading, reading, editing, converting, and saving are similar for both documents. So, all examples that manipulate an HTMLDocument are applicable to SVGDocument as well.

The example below shows how to load an SVG document from an in-memory String variable:

// Load SVG from a string using Java

// Initialize an SVG document from a string object
SVGDocument document = new SVGDocument("<svg xmlns='http://www.w3.org/2000/svg'><circle cx='50' cy='50' r='40'/></svg>", ".");

// Write the document content to the output stream
System.out.println(document.getDocumentElement().getOuterHTML());
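Because SVG is XML markup with a DOM, the same markup can also be inspected with generic DOM calls. As an illustration only, the sketch below parses the SVG string from the snippet above with the XML parser bundled in the JDK and reads the radius of the circle. It does not use SVGDocument, and the SvgDomSketch class and circleRadius method are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; it uses the JDK XML parser, not Aspose.HTML
class SvgDomSketch {

    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Return the "r" attribute of the first circle element, or null if there is no circle
    static String circleRadius(String svg) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look elements up by namespace
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(svg)));
            Element circle = (Element) document.getElementsByTagNameNS(SVG_NS, "circle").item(0);
            return circle == null ? null : circle.getAttribute("r");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse SVG markup", e);
        }
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns='http://www.w3.org/2000/svg'><circle cx='50' cy='50' r='40'/></svg>";
        System.out.println("radius = " + circleRadius(svg));
    }
}
```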

MHTML Document

MHTML (MIME encapsulation of aggregate HTML documents) is a specialized format for creating web page archives. The Aspose.HTML for Java library supports MHTML, but its functionality is currently limited to converting and rendering operations from MHTML to other supported output formats. For more information, refer to the Converting Between Formats article.

EPUB Document

EPUB, an electronic publication format widely used for eBooks, has similar limitations in the Aspose.HTML for Java library as MHTML. The library supports only rendering operations from EPUB to supported output formats. For additional details, visit the Converting Between Formats article.

Asynchronous Operations

We realize that loading a document can be a resource-intensive operation since it requires loading not only the document itself but all linked resources and processing all scripts. In the following code snippets, we demonstrate how to utilize asynchronous operations and load an HTMLDocument without blocking the main thread.

The following code demonstrates how to work with an HTMLDocument in Java by subscribing to the OnReadyStateChange event, which monitors the document loading process. When the document reaches the “complete” state, it retrieves the full HTML markup of the document’s root element using the getOuterHTML() method and stores it in a StringBuilder. To ensure the document has enough time to load and the event handler can execute, the program pauses execution for 5 seconds using Thread.sleep(5000). Finally, it prints the captured HTML to the console. This approach can be helpful for programmatically loading, monitoring, and extracting the full HTML structure of a web page or document, which can then be processed, parsed, or saved for later use.

// Load HTML asynchronously using Java

// Create an instance of the HTMLDocument class
HTMLDocument document = new HTMLDocument();

// Create a StringBuilder to collect the outer HTML of the loaded document
StringBuilder outerHTML = new StringBuilder();

// Subscribe to the 'ReadyStateChange' event
// This event is fired during the document loading process
document.OnReadyStateChange.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // Check the value of the 'ReadyState' property
        // This property represents the loading status of the document. For details, see https://www.w3schools.com/jsref/prop_doc_readystate.asp
        if (document.getReadyState().equals("complete")) {
            // Store the outer HTML of the loaded document
            outerHTML.append(document.getDocumentElement().getOuterHTML());
        }
    }
});

// Navigate asynchronously to the specified URI so that there is something to load
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

// Give the document time to load before reading the result
Thread.sleep(5000);

System.out.println("outerHTML = " + outerHTML);

Unlike the first example, the following one demonstrates asynchronous navigation by loading a document from a given URL and using wait/notify instead of a fixed delay. This approach is more reliable because it reacts precisely when the document reaches the “complete” state, avoiding unnecessary waiting or premature execution.

// Create an instance of the HTMLDocument class

HTMLDocument document = new HTMLDocument();

// Shared monitor object used for wait/notify between the event handler and the calling thread
final Object lock = new Object();

// Subscribe to the 'ReadyStateChange' event. This event will be fired during the document loading process
document.OnReadyStateChange.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // Check the value of the 'ReadyState' property
        // This property represents the loading status of the document. For details, see https://www.w3schools.com/jsref/prop_doc_readystate.asp
        if (document.getReadyState().equals("complete")) {
            System.out.println(document.getDocumentElement().getOuterHTML());
            // notifyAll() must be called while holding the monitor of the object being waited on
            synchronized (lock) {
                lock.notifyAll();
            }
        }
    }
});

// Navigate asynchronously to the specified URI
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

// Wait for the notification, but no longer than 10 seconds
synchronized (lock) {
    lock.wait(10000);
}
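A related option from the JDK is java.util.concurrent.CountDownLatch, which is swapped in here for wait/notify. A latch remembers that it has been counted down, so a completion signal that fires before the waiting thread starts to wait is not lost. The sketch below is a hedged illustration with a simulated loader: simulateLoad is a hypothetical stand-in for an asynchronous document load and does not call Aspose.HTML.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper class; the loader is simulated with a plain thread
class LatchSketch {

    // Stand-in for an asynchronous loader: reports "loading" first, then "complete"
    static void simulateLoad(Consumer<String> onReadyStateChange) {
        Thread loader = new Thread(() -> {
            onReadyStateChange.accept("loading");
            try {
                Thread.sleep(200); // pretend to fetch the document and its resources
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            onReadyStateChange.accept("complete");
        });
        loader.setDaemon(true);
        loader.start();
    }

    // Return true if the "complete" state is reported before the timeout expires
    static boolean loadAndWait(long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        simulateLoad(state -> {
            if ("complete".equals(state)) {
                done.countDown();
            }
        });
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("completed in time: " + loadAndWait(10_000));
    }
}
```

In a real handler, the countDown() call would sit where the "complete" state is detected, and await would replace the timed wait.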

This Java example defines a custom HTMLDocumentWaiter class that implements Runnable and waits for an asynchronous operation in the Aspose.HTML for Java library. The constructor accepts an instance of HTMLDocumentAsynchronouslyOnLoad (a class that is not shown in this article) and calls its execute() method. In the run() method, which is intended to run in a separate thread, the code repeatedly checks whether a message has been received from the asynchronous operation, pausing the thread with Thread.sleep(60000) (60 seconds) between checks. Once the message is available or the thread is interrupted, the waiter stops. This approach allows monitoring the asynchronous loading process in parallel with the main program flow.

// Create async waiter thread for HTML document loading using Java

public class HTMLDocumentWaiter implements Runnable {

    private final HTMLDocumentAsynchronouslyOnLoad html;

    public HTMLDocumentWaiter(HTMLDocumentAsynchronouslyOnLoad html) throws Exception {
        this.html = html;
        this.html.execute();
    }

    @Override
    public void run() {
        System.out.println("Current Thread: " + Thread.currentThread().getName() + "; " + Thread.currentThread().getId());

        // Poll until a message arrives or the thread is interrupted
        while (!Thread.currentThread().isInterrupted() && html.getMsg() == null) {
            try {
                Thread.sleep(60000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so that the loop condition ends the wait
                Thread.currentThread().interrupt();
            }
        }
    }
}
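The polling idea can be tried out in isolation with JDK classes only. The following sketch is an assumption-laden illustration rather than part of the library: the message source is any Supplier of String, the poll interval and overall timeout are parameters, and PollingSketch and pollForMessage are hypothetical names. A short interval keeps the delay between the message arriving and the waiter noticing it small.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper class; plain JDK code, independent of Aspose.HTML
class PollingSketch {

    // Poll the source until it yields a non-null message, the timeout expires,
    // or the current thread is interrupted. Returns null if no message arrived.
    static String pollForMessage(Supplier<String> source, long intervalMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() - deadline < 0) {
            String msg = source.get();
            if (msg != null) {
                return msg;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return null;
            }
        }
        return source.get(); // one last check at the deadline
    }

    public static void main(String[] args) {
        AtomicReference<String> result = new AtomicReference<>();

        // Simulated background task that publishes a message after 300 ms
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(300);
                result.set("loaded");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        System.out.println(pollForMessage(result::get, 50, 5_000));
    }
}
```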

In this example, the code uses the OnLoad event rather than relying on ReadyState checks or manual waiting. The OnLoad event is automatically triggered once the document has finished loading, making this approach simpler and more efficient. When the event is fired, the program retrieves the outer HTML of the document and prints it to the console. This method avoids unnecessary delays and synchronization issues, providing a cleaner way to work with asynchronously loaded HTML content.

// Handle HTML document onLoad event when navigating to URL using Java

// Create an instance of the HTMLDocument class
HTMLDocument document = new HTMLDocument();

// Subscribe to the 'OnLoad' event. This event will be fired once the document is fully loaded
document.OnLoad.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // 'msg' is not declared in this snippet; it is assumed to be a String field of the enclosing class
        msg = document.getDocumentElement().getOuterHTML();
        System.out.println(msg);
    }
});

// Navigate asynchronously to the specified URI
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");
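When the calling code needs the markup itself rather than a side effect in the handler, one possible pattern is to hand the value over through a java.util.concurrent.CompletableFuture. The sketch below shows that pattern with JDK classes only. fakeNavigate is a hypothetical stand-in for a navigation that later invokes a load callback, and nothing here calls Aspose.HTML.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical helper class; the navigation is simulated
class OnLoadSketch {

    // Stand-in for an asynchronous navigation: invokes the load callback from another thread
    static void fakeNavigate(String markup, Consumer<String> onLoad) {
        CompletableFuture.runAsync(() -> onLoad.accept(markup));
    }

    // Return the markup delivered to the load callback, or null if nothing arrives in time
    static String awaitMarkup(String markup, long timeoutMillis) {
        CompletableFuture<String> loaded = new CompletableFuture<>();

        // The callback completes the future instead of writing to a shared field
        fakeNavigate(markup, loaded::complete);

        try {
            return loaded.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException | TimeoutException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitMarkup("<p>Hello, World!</p>", 5_000));
    }
}
```

In a real OnLoad handler, the call to complete would receive the outer HTML of the document, and the calling thread would read it with a bounded get.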

Conclusions

  1. Comprehensive DOM manipulation: The HTMLDocument class provides a robust and standards-compliant way to create, modify, and manipulate HTML documents programmatically, following W3C and WHATWG specifications.

  2. Flexible document creation and loading: Using constructors, developers can create documents from scratch, load HTML from a variety of sources (files, URLs, streams), or dynamically generate content.

  3. Advanced operations support: Features such as asynchronous loading and event handling enable seamless integration of resource-intensive operations without blocking the main application thread.

  4. Cross-format compatibility: The library extends some HTML functionality to other document formats such as SVG, MHTML, and EPUB, offering a unified approach to handling diverse web content.

You can download the complete examples and data files from GitHub.
