Create HTML Document – Create, Load HTML in Java

HTML Document

The HTMLDocument class is a starting point for the Aspose.HTML for Java class library, allowing developers to work with HTML content programmatically. The HTMLDocument class represents an HTML page as rendered in a browser, serving as the root of the Document Object Model (DOM).

Some HTMLDocument features

The HTMLDocument provides an in-memory representation of an HTML DOM and is entirely based on the W3C DOM and WHATWG DOM specifications supported by many modern browsers. If you are familiar with the WHATWG DOM, WHATWG HTML, and JavaScript standards, you will find Aspose.HTML for Java easy to use. Otherwise, you can visit www.w3schools.com, where you can find many examples and tutorials on how to work with HTML documents.

Create an Empty HTML Document

The following code snippet shows the usage of the default HTMLDocument() constructor to create an empty HTML document and save it to a file.

// Initialize an empty HTML Document.
HTMLDocument document = new HTMLDocument();

// Save the document to disk.
document.save("create-empty-document.html");

After creation, the file create-empty-document.html contains the initial document structure: the <html>, <head>, and <body> elements. For more details about saving HTML files, see the Save HTML Document article.

Resulting File Structure:

<html>
    <head></head>
    <body></body>
</html>

Create a New HTML Document

To generate a document programmatically from scratch, use the HTMLDocument() constructor with no parameters, as shown in the code snippet above. Once the document object is created, you can populate it with content, for example by creating a text node and appending it to the body of the document:

// Initialize an empty HTML Document.
HTMLDocument document = new HTMLDocument();

// Create a text node and append it to the document body
Text text = document.createTextNode("Hello World!");
document.getBody().appendChild(text);

// Save the document to disk
document.save("create-new-document.html");

Load HTML from a File

The following code snippet shows how to load an HTMLDocument from an existing file:

// Prepare a 'load-from-file.html' file.
try (java.io.FileWriter fileWriter = new java.io.FileWriter("load-from-file.html")) {
    fileWriter.write("Hello World!");
}

// Load from a 'load-from-file.html' file.
HTMLDocument document = new HTMLDocument("load-from-file.html");

// Write the document content to the output stream.
System.out.println(document.getDocumentElement().getOuterHTML());

Load HTML from a URL

The HTMLDocument class can fetch and load HTML content from a web page. The next code snippet shows how to load a web page into an HTMLDocument.

If you pass a URL that cannot be reached at that moment, the library throws a DOMException with the specialized code NetworkError to inform you that the selected resource cannot be found.

// Load a document from 'https://docs.aspose.com/html/net/creating-a-document/document.html' web page
HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/creating-a-document/document.html");

System.out.println(document.getDocumentElement().getOuterHTML());

Load HTML from HTML Code

If you have HTML code prepared in memory as a String or an InputStream object, you don’t need to save it to a file first; simply pass it to one of the specialized constructors. To create a document from a string, use the constructor HTMLDocument(content, baseUri) with the HTML content and a baseUri:

// Prepare HTML code
String htmlCode = "<p>Hello World!</p>";

// Initialize a document from the string variable
HTMLDocument document = new HTMLDocument(htmlCode, ".");

// Save the document to disk
document.save("create-from-string.html");

If your HTML code has linked resources (styles, scripts, images, etc.), you need to pass a valid baseUri parameter to the document constructor. It is used to resolve the location of each resource while the document is loading.
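To see why the base URI matters, the following minimal sketch uses only java.net.URI from the JDK, not the Aspose.HTML API, to show how relative resource references resolve against a base. The class name BaseUriSketch and the example.com addresses are hypothetical, and the library's own resolver may behave differently in edge cases.

```java
import java.net.URI;

class BaseUriSketch {
    // Resolve a resource reference against a base URI using the JDK's URI rules.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Hypothetical base address of the document
        String base = "https://example.com/docs/page.html";

        // A relative reference is resolved next to the document
        System.out.println(resolve(base, "css/site.css"));
        // A parent-directory reference climbs one level up
        System.out.println(resolve(base, "../img/logo.png"));
        // An absolute reference ignores the base entirely
        System.out.println(resolve(base, "https://cdn.example.org/app.js"));
    }
}
```

With a wrong base, the first two references would point to locations that do not exist, which is why linked styles or images can silently fail to load.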

Load HTML from a Stream

To create an HTML document from a stream, you can use the HTMLDocument(stream, string) constructor:

// Wrap the HTML code in an in-memory input stream (explicit UTF-8 avoids platform-dependent encoding)
String code = "<p>Hello World! I love HTML!</p>";
java.io.InputStream inputStream = new java.io.ByteArrayInputStream(code.getBytes(java.nio.charset.StandardCharsets.UTF_8));

// Initialize a document from the stream variable
HTMLDocument document = new HTMLDocument(inputStream, ".");

// Save the document to disk
document.save("load-from-stream.html");

Working with SVG, MHTML, and EPUB Documents

SVG Document

Since Scalable Vector Graphics (SVG) is part of the W3C standards and can be embedded into an HTMLDocument, we implemented SVGDocument and all its functionality. Our implementation is based on the official SVG2 specification, so you can load, read, and manipulate SVG documents as the specification describes.

Since SVGDocument and HTMLDocument are based on the same WHATWG DOM standard, all operations such as loading, reading, editing, converting, and saving are similar for both document types. All examples that manipulate an HTMLDocument are therefore applicable to SVGDocument as well.

The example below shows how to load an SVG document from an in-memory String variable:

// Initialize an SVG document from a string object
SVGDocument document = new SVGDocument("<svg xmlns='http://www.w3.org/2000/svg'><circle cx='50' cy='50' r='40'/></svg>", ".");

// Write the document content to the output stream
System.out.println(document.getDocumentElement().getOuterHTML());

MHTML Document

MHTML (MIME encapsulation of aggregate HTML documents) is a specialized format for creating web page archives. The Aspose.HTML for Java library supports MHTML, but its functionality is currently limited to converting and rendering operations from MHTML to other supported output formats. For more information, refer to the Converting Between Formats article.

EPUB Document

EPUB, an electronic publication format widely used for eBooks, has similar limitations in the Aspose.HTML for Java library as MHTML. The library supports only rendering operations from EPUB to supported output formats. For additional details, visit the Converting Between Formats article.

Asynchronous Operations

Loading a document can be a resource-intensive operation, since it requires loading not only the document itself but also all linked resources, and processing all scripts. The following code snippets show how to use asynchronous operations to load an HTMLDocument without blocking the main thread:

// Create an instance of an HTML document
HTMLDocument document = new HTMLDocument();

// Create a buffer for reading the outer HTML of the document
StringBuilder outerHTML = new StringBuilder();

// Subscribe to the 'ReadyStateChange' event.
// This event is fired during the document loading process.
document.OnReadyStateChange.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // Check the value of the 'ReadyState' property, which represents the status of the document.
        // For details, see https://www.w3schools.com/jsref/prop_doc_readystate.asp
        if (document.getReadyState().equals("complete")) {
            // Store the markup of the loaded document
            outerHTML.append(document.getDocumentElement().getOuterHTML());
        }
    }
});

// Navigate asynchronously to the specified URI (without this call nothing is loaded)
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

// Give the loading a fixed amount of time to finish
Thread.sleep(5000);

System.out.println("outerHTML = " + outerHTML);

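ReadyStateChange is not the only event that can be used to handle an asynchronous loading operation; you can also subscribe to the Load event, as the OnLoad example later in this section shows. The following snippet handles ReadyStateChange again, but waits for a notification from the handler instead of sleeping for a fixed time: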

// Create the instance of HTML Document
HTMLDocument document = new HTMLDocument();

// Lock object shared by the event handler and the waiting thread
final Object lock = new Object();

// Subscribe to the 'ReadyStateChange' event.
// This event will be fired during the document loading process.
document.OnReadyStateChange.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // Check the value of the 'ReadyState' property, which represents the status of the document.
        // For details, see https://www.w3schools.com/jsref/prop_doc_readystate.asp
        if (document.getReadyState().equals("complete")) {
            System.out.println(document.getDocumentElement().getOuterHTML());
            // Wake up the waiting thread; notifyAll() must be called while holding the lock
            synchronized (lock) {
                lock.notifyAll();
            }
        }
    }
});

// Navigate asynchronously to the specified URI
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

// Wait up to 10 seconds for the handler's notification
synchronized (lock) {
    lock.wait(10000);
}
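A fixed sleep or a wait()/notifyAll() pair can miss a notification that arrives before the waiting starts. One possible alternative is java.util.concurrent.CountDownLatch, which remembers that the event already happened. The sketch below is not Aspose.HTML code: loadAsync is a hypothetical stand-in that completes on a background thread, and in real code the countDown() call would sit inside the DOMEventHandler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class LatchWaitSketch {
    // Hypothetical stand-in for an asynchronous load: calls back from another thread.
    static void loadAsync(String content, Consumer<String> onComplete) {
        Thread worker = new Thread(() -> onComplete.accept(content), "loader");
        worker.setDaemon(true);
        worker.start();
    }

    // Starts the load and blocks until the callback fires or the timeout elapses.
    // Returns the loaded markup, or null on timeout or interruption.
    static String loadAndWait(String content, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();

        loadAsync(content, html -> {
            result.set(html);
            done.countDown(); // safe even if nobody is waiting yet
        });

        try {
            boolean finished = done.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return finished ? result.get() : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interruption for callers
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(loadAndWait("<html><head></head><body></body></html>", 1000));
    }
}
```

Because the latch keeps its count, await returns immediately when the callback has already run, so the order of "notify" and "wait" no longer matters.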

The following Java code example uses the HTMLDocumentWaiter class to work with HTML documents asynchronously in the Aspose.HTML for Java library. The class provides a constructor that starts the asynchronous loading operation and a run() method that, when executed in a separate thread, waits until either the loading is finished or the current thread is interrupted. Let’s see what the code does:

public class HTMLDocumentWaiter implements Runnable {

    private final Examples_Java_WorkingWithDocuments_CreatingADocument_HTMLDocumentAsynchronouslyOnLoad html;

    public HTMLDocumentWaiter(Examples_Java_WorkingWithDocuments_CreatingADocument_HTMLDocumentAsynchronouslyOnLoad html) throws Exception {
        this.html = html;
        this.html.execute();
    }

    @Override
    public void run() {
        System.out.println("Current Thread: " + Thread.currentThread().getName() + "; " + Thread.currentThread().getId());

        while (!Thread.currentThread().isInterrupted() && html.getMsg() == null) {
            try {
                Thread.sleep(60000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        Thread.currentThread().interrupt();
    }
}
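The polling pattern used by such a waiter can be tried without the Aspose classes. The following sketch is a simplified stand-in under stated assumptions: an AtomicReference plays the role of the message returned by getMsg(), the names PollingWaiterSketch and waitForMessage are hypothetical, and the poll interval is shortened to 20 ms so the example finishes quickly.

```java
import java.util.concurrent.atomic.AtomicReference;

class PollingWaiterSketch implements Runnable {
    // Stand-in for the loader's message; a real OnLoad handler would set it.
    private final AtomicReference<String> msg;
    private final long pollMillis;

    PollingWaiterSketch(AtomicReference<String> msg, long pollMillis) {
        this.msg = msg;
        this.pollMillis = pollMillis;
    }

    @Override
    public void run() {
        // Poll until the message arrives or this thread is interrupted.
        while (!Thread.currentThread().isInterrupted() && msg.get() == null) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the flag so the loop condition ends the wait.
                Thread.currentThread().interrupt();
            }
        }
    }

    // Starts the waiter on its own thread, simulates a finished load, then joins.
    // Returns the message, or null if the waiter did not finish within two seconds.
    static String waitForMessage(String simulatedHtml) {
        AtomicReference<String> msg = new AtomicReference<>();
        Thread waiter = new Thread(new PollingWaiterSketch(msg, 20), "waiter");
        waiter.start();

        msg.set(simulatedHtml); // simulated "document loaded" signal

        try {
            waiter.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return waiter.isAlive() ? null : msg.get();
    }

    public static void main(String[] args) {
        System.out.println(waitForMessage("<html><head></head><body></body></html>"));
    }
}
```

A short poll interval keeps the delay between the load finishing and the waiter noticing it small; a long interval, such as one minute, means the waiter may react up to that much later.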

The SimpleWait class contains the main() method that serves as the entry point of the Java application. Inside main(), an html instance of the class Examples_Java_WorkingWithDocuments_CreatingADocument_HTMLDocumentAsynchronouslyOnLoad is created; it is responsible for loading the HTML document asynchronously. An HTMLDocumentWaiter object is then created to wait for the loading to complete, and a new thread is started to execute the waiting process. The following snippet shows the loading part: it subscribes to the OnLoad event, stores the document markup in msg, and navigates to the page asynchronously:

// Create the instance of HTML Document
HTMLDocument document = new HTMLDocument();

// Subscribe to the 'OnLoad' event.
// This event will be fired once the document is fully loaded.
document.OnLoad.add(new DOMEventHandler() {
    @Override
    public void invoke(Object sender, Event e) {
        // 'msg' is a field declared outside this snippet, in the enclosing example class
        msg = document.getDocumentElement().getOuterHTML();
        System.out.println(msg);
    }
});

// Navigate asynchronously to the specified URI
document.navigate("https://html.spec.whatwg.org/multipage/introduction.html");

Conclusions

  1. Comprehensive DOM manipulation: The HTMLDocument class provides a robust and standards-compliant way to create, modify, and manipulate HTML documents programmatically, following W3C and WHATWG specifications.

  2. Flexible document creation and loading: Using constructors, developers can create documents from scratch, load HTML from a variety of sources (files, URLs, streams), or dynamically generate content.

  3. Advanced operations support: Features such as asynchronous loading and event handling enable seamless integration of resource-intensive operations without blocking the main application thread.

  4. Cross-format compatibility: The library extends some HTML functionality to other document formats such as SVG, MHTML, and EPUB, offering a unified approach to handling diverse web content.

You can download the complete examples and data files from GitHub.
