How to Safely Load and Convert Untrusted HTML in Python – Sandboxing

Why Sandboxing Matters

When dealing with external HTML content, there is always a risk of unpredictable behavior from scripts or media within the page. Aspose.HTML for Python via .NET provides a sandboxing mechanism that allows you to control which elements of the document can be executed or loaded during processing. This ensures safe, predictable, and efficient HTML rendering and conversion.

Most HTML processing scenarios, such as converting HTML to PDF or images, don’t require JavaScript execution or the use of remote content. Allowing access to such resources can slow down the process or even cause unwanted behavior. Using sandbox flags, you can explicitly disable or restrict resource types such as scripts, images, plugins, forms, etc. This gives you fine-grained control over what your document can do during rendering or conversion.
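The examples in this article add flags with the `|=` operator, which suggests the flags behave as a combinable bitmask. The following is a minimal sketch of that idea using Python's standard `enum.IntFlag`. `DemoSandbox` and its member values are hypothetical stand-ins for illustration and are not the library's actual `Sandbox` type.

```python
from enum import IntFlag


class DemoSandbox(IntFlag):
    """Stand-in for illustration only; names and values are not Aspose.HTML's."""
    NONE = 0
    SCRIPTS = 1
    IMAGES = 2
    PLUGINS = 4
    FORMS = 8


# Start with no restrictions, then add them step by step
security = DemoSandbox.NONE
security |= DemoSandbox.SCRIPTS                      # restrict one resource type
security |= DemoSandbox.IMAGES | DemoSandbox.FORMS   # add two more at once

# Membership tests show which restrictions are currently active
print(DemoSandbox.SCRIPTS in security)
print(DemoSandbox.PLUGINS in security)
```

Because each flag occupies its own bit, adding one restriction never removes another, which is why `|=` is preferred over plain assignment when a configuration may already carry restrictions.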

Blocking JavaScript Execution

The following example shows how to disable JavaScript when converting HTML to PDF. This ensures that no scripts are executed during rendering, which is ideal for safe conversions to PDF or other formats.

  1. Initialize an instance of the Configuration class.
  2. Add the Sandbox.SCRIPTS value to the security property of the configuration instance. This marks scripts as untrusted resources within the sandbox environment, which matters because scripts in untrusted HTML can execute malicious code.
  3. Create an instance of the HTMLDocument class using the HTMLDocument(address, configuration) constructor, which takes the HTML file path and the configuration instance.
  4. Call the Converter.convert_html() method to convert HTML to PDF.

# How to disable scripts for HTML to PDF conversion using Python

import os
import aspose.html as ah
import aspose.html.converters as conv
import aspose.html.saving as sav

# Define input and output directories
data_dir = "data"
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

# Create an instance of the Configuration class
with ah.Configuration() as config:
    # Mark "scripts" as an untrusted resource
    config.security |= ah.Sandbox.SCRIPTS

    # Initialize an HTML document with the specified configuration
    html_path = os.path.join(data_dir, "document-with-scripts.html")
    with ah.HTMLDocument(html_path, config) as doc:
        # Convert HTML to PDF
        output_pdf = os.path.join(output_dir, "document-sandbox.pdf")
        conv.Converter.convert_html(doc, sav.PdfSaveOptions(), output_pdf)

Result: Any JavaScript within the HTML document will be ignored, producing a static and secure output.
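Before converting an untrusted file, it can be useful to see whether it contains any JavaScript at all. The sketch below is not part of Aspose.HTML; it uses only Python's standard `html.parser` module, and the names `ScriptFinder` and `find_scripts` are hypothetical helpers introduced for illustration. It counts `<script>` elements and inline `on*` event-handler attributes, and it is a rough check rather than a security control.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Counts <script> elements and inline on* event-handler attributes."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.inline_handlers = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "script":
            self.script_tags += 1
        # Attributes such as onclick or onload also carry JavaScript
        self.inline_handlers += sum(1 for name, _ in attrs if name.startswith("on"))


def find_scripts(html_text):
    """Return (script_tag_count, inline_handler_count) for an HTML string."""
    finder = ScriptFinder()
    finder.feed(html_text)
    finder.close()
    return finder.script_tags, finder.inline_handlers


sample = (
    "<p onclick=\"alert(1)\">Hello</p>"
    "<script>document.write('Have a nice day!');</script>"
)
print(find_scripts(sample))
```

A non-zero count only tells you that the sandbox flag is relevant for this input; the restriction itself is still applied through the configuration shown above.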

Blocking Untrusted Images

Follow these simple steps to disable image loading and safely convert HTML to PDF using Aspose.HTML for Python via .NET:

  1. Prepare HTML code and save it to a file. The sample contains a <span> element whose inline style sets a background image from a URL, followed by a <script> element.
  2. Create a configuration instance. To do this, initialize the Configuration class to define custom security settings for your HTML document.
  3. Set the Sandbox.IMAGES flag in the configuration to mark all images as untrusted resources.
  4. Initialize the HTML document. Load the saved HTML file using the custom configuration. The sandbox restrictions will automatically apply during loading.
  5. Use the Converter.convert_html() method to render the document and save it as a PDF file.

# Disable loading images in HTML with sandbox configuration using Python

import os
import aspose.html as ah
import aspose.html.converters as conv
import aspose.html.saving as sav

# Prepare HTML code and save it to a file
code = "<span style=\"background-image:url('https://docs.aspose.com/html/images/work/lioness.jpg')\">Hello, World!!</span> " \
       "<script>document.write('Have a nice day!');</script>"

output_dir = "output"
os.makedirs(output_dir, exist_ok=True)
html_path = os.path.join(output_dir, "sandboxing.html")
output_pdf = os.path.join(output_dir, "sandboxing-out.pdf")

with open(html_path, "w", encoding="utf-8") as file:
    file.write(code)

# Create an instance of Configuration
with ah.Configuration() as configuration:
    # Mark 'IMAGES' as an untrusted resource
    configuration.security |= ah.Sandbox.IMAGES

    # Initialize an HTML document with the specified configuration
    with ah.HTMLDocument(html_path, configuration) as document:
        # Convert HTML to PDF
        conv.Converter.convert_html(document, sav.PdfSaveOptions(), output_pdf)

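Result: Images from external URLs will not be loaded, so the PDF is rendered without the background image. Note that this configuration restricts only images; the <script> element in the sample is not covered by Sandbox.IMAGES.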
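To see which image references a document would try to load, you can scan the markup before conversion. The following sketch is independent of Aspose.HTML and uses only the standard library; `ImageRefFinder`, `find_image_refs`, and the `CSS_URL` pattern are hypothetical names introduced here. It collects `src` values of `<img>` elements and `url(...)` references in inline `style` attributes, and it does not inspect external stylesheets.

```python
import re
from html.parser import HTMLParser

# Matches url(...) in inline CSS, with optional single or double quotes
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


class ImageRefFinder(HTMLParser):
    """Collects <img src> values and url() references from inline styles."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue  # attribute without a value, e.g. <img src>
            if tag == "img" and name == "src":
                self.urls.append(value)
            elif name == "style":
                # Inline styles typically reference background images
                self.urls.extend(CSS_URL.findall(value))


def find_image_refs(html_text):
    """Return image references in document order."""
    finder = ImageRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.urls


sample = (
    "<span style=\"background-image:url('https://docs.aspose.com/html/images/work/lioness.jpg')\">"
    "Hello, World!!</span><img src=\"a.png\">"
)
print(find_image_refs(sample))
```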

FAQ

1. Which sandbox options are available?

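Aspose.HTML provides several sandbox flags you can combine. The examples in this article use Sandbox.SCRIPTS and Sandbox.IMAGES; restrictions for other resource types, such as plugins and forms, are also available.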

2. How does sandboxing improve performance?

By skipping the execution of scripts and preventing network requests for external resources, sandboxing can reduce processing time and resource usage during conversion. The gain depends on how many scripts and remote resources the document references.
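One way to check the effect on your own documents is to time the conversion with and without restrictions. The sketch below is a generic timing helper built on `time.perf_counter`; `time_call` and `placeholder_conversion` are hypothetical names, and the placeholder workload exists only so the snippet runs on its own. No timing results are implied here.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Run func(*args, **kwargs) `repeat` times; return the best time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder workload so the sketch is runnable without Aspose.HTML;
# replace it with a function that performs your actual conversion.
def placeholder_conversion(n):
    return sum(i * i for i in range(n))


elapsed = time_call(placeholder_conversion, 100_000)
print(f"best of 3: {elapsed:.4f} s")
```

Taking the best of several runs reduces noise from caching and other processes, which matters when remote resources are involved.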

3. How can I combine multiple sandbox restrictions?

You can enable multiple restrictions at once, for example:

configuration.security |= ah.Sandbox.SCRIPTS | ah.Sandbox.IMAGES

This line disables both JavaScript execution and image loading in a single configuration.
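Putting the combined flags into a reusable function might look like the sketch below. It uses only the calls already shown in this article (`Configuration`, `security`, `HTMLDocument`, `Converter.convert_html`, `PdfSaveOptions`); the function name `convert_sandboxed` and the lazy imports are choices made for this illustration, not something the library prescribes.

```python
import os


def convert_sandboxed(html_path, pdf_path):
    """Convert an HTML file to PDF with scripts and images marked as untrusted."""
    # Imported inside the function so this module loads even where
    # Aspose.HTML is not installed; the imports run on first call.
    import aspose.html as ah
    import aspose.html.converters as conv
    import aspose.html.saving as sav

    # Create the output directory if the path includes one
    out_dir = os.path.dirname(pdf_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    with ah.Configuration() as configuration:
        # Restrict both resource types in a single statement
        configuration.security |= ah.Sandbox.SCRIPTS | ah.Sandbox.IMAGES
        with ah.HTMLDocument(html_path, configuration) as document:
            conv.Converter.convert_html(document, sav.PdfSaveOptions(), pdf_path)
    return pdf_path
```

A call such as `convert_sandboxed("output/sandboxing.html", "output/sandboxing-combined.pdf")` would then apply both restrictions; the output file name here is only an example.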

4. Is sandboxing available for other languages?

Yes. The same functionality is available in Aspose.HTML for .NET and Aspose.HTML for Java, with similar APIs for managing sandbox restrictions.

5. What are common use cases for sandboxing?

You can download the complete examples and data files from GitHub.
