How to Safely Load and Convert Untrusted HTML in Python – Sandboxing
Why Sandboxing Matters
When dealing with external HTML content, there is always a risk of unpredictable behavior from scripts or media within the page. Aspose.HTML for Python via .NET provides a sandboxing mechanism that allows you to control which elements of the document can be executed or loaded during processing. This ensures safe, predictable, and efficient HTML rendering and conversion.
Most HTML processing scenarios, such as converting HTML to PDF or images, don’t require JavaScript execution or the use of remote content. Allowing access to such resources can slow down the process or even cause unwanted behavior. Using sandbox flags, you can explicitly disable or restrict resource types such as scripts, images, plugins, forms, etc. This gives you fine-grained control over what your document can do during rendering or conversion.
Blocking JavaScript Execution
The following example shows how to disable JavaScript when converting HTML to PDF. This ensures that no scripts are executed during rendering, which is ideal for safe conversions to PDF or other formats.
- Initialize an instance of the Configuration class.
- Set the sandbox flag of the configuration instance to include the Sandbox.SCRIPTS value. This marks scripts as untrusted resources within the sandbox environment. This step is crucial because scripts can execute malicious code.
- Create an instance of the HTMLDocument class using the HTMLDocument(address, configuration) constructor, which takes the HTML file path and the configuration instance.
- Call the Converter.convert_html() method to convert HTML to PDF.
# How to disable scripts for HTML to PDF conversion using Python

import os
import aspose.html as ah
import aspose.html.converters as conv
import aspose.html.saving as sav

# Define input and output directories
data_dir = "data"
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

# Create an instance of the Configuration class
with ah.Configuration() as config:
    # Mark "scripts" as an untrusted resource
    config.security |= ah.Sandbox.SCRIPTS

    # Initialize an HTML document with the specified configuration
    html_path = os.path.join(data_dir, "document-with-scripts.html")
    with ah.HTMLDocument(html_path, config) as doc:
        # Convert HTML to PDF
        output_pdf = os.path.join(output_dir, "document-sandbox.pdf")
        conv.Converter.convert_html(doc, sav.PdfSaveOptions(), output_pdf)

Result: Any JavaScript within the HTML document will be ignored, producing a static and secure output.
Blocking Untrusted Images
Follow these simple steps to disable image loading and safely convert HTML to PDF using Aspose.HTML for Python via .NET:
- Prepare HTML code and save it to a file. The HTML code contains a <span> element with an inline style that sets a background image from a URL.
- Create a configuration instance. To do this, initialize the Configuration class to define custom security settings for your HTML document.
- Set the Sandbox.IMAGES flag in the configuration to mark all images as untrusted resources.
- Initialize the HTML document. Load the saved HTML file using the custom configuration. The sandbox restrictions will automatically apply during loading.
- Use the Converter.convert_html() method to render the document and save it as a PDF file.
# Disable loading images in HTML with sandbox configuration using Python

import os
import aspose.html as ah
import aspose.html.converters as conv
import aspose.html.saving as sav

# Prepare HTML code and save it to a file
code = "<span style=\"background-image:url('https://docs.aspose.com/html/images/work/lioness.jpg')\">Hello, World!!</span> " \
       "<script>document.write('Have a nice day!');</script>"

output_dir = "output"
os.makedirs(output_dir, exist_ok=True)
html_path = os.path.join(output_dir, "sandboxing.html")
output_pdf = os.path.join(output_dir, "sandboxing-out.pdf")

with open(html_path, "w", encoding="utf-8") as file:
    file.write(code)

# Create an instance of Configuration
with ah.Configuration() as configuration:
    # Mark 'IMAGES' as an untrusted resource
    configuration.security |= ah.Sandbox.IMAGES

    # Initialize an HTML document with the specified configuration
    with ah.HTMLDocument(html_path, configuration) as document:
        # Convert HTML to PDF
        conv.Converter.convert_html(document, sav.PdfSaveOptions(), output_pdf)

Result: Images from external URLs will not be loaded, and the output PDF will contain only static text content.
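Before choosing sandbox flags, it can help to see what kind of active content an untrusted document actually contains. The sketch below is a rough pre-flight scan that uses only the Python standard library. It is not part of Aspose.HTML: the ActiveContentScanner class and the suggest_flags function are hypothetical helpers, and the strings they return merely mirror the Sandbox flag names used in this article. It is a heuristic (it does not inspect <style> blocks or external stylesheets, for example), so treat it as a hint and not as a security control.

```python
from html.parser import HTMLParser


class ActiveContentScanner(HTMLParser):
    """Hypothetical helper: collect sandbox flag names that look relevant."""

    def __init__(self):
        super().__init__()
        self.flags = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self.flags.add("SCRIPTS")
        # <img> tags and inline styles with url(...) both trigger image loads
        if tag == "img" or "url(" in (attrs.get("style") or ""):
            self.flags.add("IMAGES")
        if tag == "form":
            self.flags.add("FORMS")
        if tag in ("object", "embed"):
            self.flags.add("PLUGINS")
        # Inline event handlers such as onclick also run JavaScript
        if any(name.startswith("on") for name in attrs):
            self.flags.add("SCRIPTS")


def suggest_flags(html):
    """Return a sorted list of flag names suggested by the markup."""
    scanner = ActiveContentScanner()
    scanner.feed(html)
    scanner.close()
    return sorted(scanner.flags)


# The same markup that the image-blocking example writes to disk
sample = (
    "<span style=\"background-image:url('https://docs.aspose.com/html/images/work/lioness.jpg')\">"
    "Hello, World!!</span> <script>document.write('Have a nice day!');</script>"
)
print(suggest_flags(sample))
```

For the sample markup, the scan reports both IMAGES and SCRIPTS, which suggests that blocking images alone leaves the embedded script active.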
FAQ
1. Which sandbox options are available?
Aspose.HTML provides several sandbox flags you can combine, such as:
- Sandbox.SCRIPTS – Disables JavaScript execution.
- Sandbox.IMAGES – Blocks loading of external images.
- Sandbox.NAVIGATION – Prevents redirections or navigation attempts.
- Sandbox.FORMS – Disables form submission and related activity.
- Sandbox.PLUGINS – Prevents content from instantiating plugins.
2. How does sandboxing improve performance?
By skipping script execution and preventing network requests for external resources, sandboxing can reduce processing time and resource usage during conversion. The size of the gain depends on how much active or remote content the document contains.
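The effect on performance is easiest to judge by measuring it on your own documents. The following is a minimal timing sketch that uses only the standard library. The time_call helper and the fake_conversion stand-in are hypothetical names introduced for illustration; in practice you would pass your own function that runs a conversion, once with sandbox flags set and once without.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeat` runs of func."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def fake_conversion(delay):
    # Stand-in for a real conversion call; replace with your own function
    time.sleep(delay)


print(f"best of 3 runs: {time_call(fake_conversion, 0.01):.3f} s")
```

Taking the best of several runs reduces noise from caching and other processes. Remote resources add network variability, so a few repetitions per configuration give a fairer comparison.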
3. How can I combine multiple sandbox restrictions?
You can enable multiple restrictions at once, for example:
configuration.security |= ah.Sandbox.SCRIPTS | ah.Sandbox.IMAGES
This line disables both JavaScript execution and image loading in a single configuration.
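When the set of restrictions comes from settings or a command line, the flags can be combined by name. The sketch below models this with a stand-in enum so that it runs without the library installed. DemoSandbox and combine_flags are hypothetical, and the numeric values are made up; they are not the real values of ah.Sandbox. The helper relies only on the | operator between flag members, which the line above shows for ah.Sandbox.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class DemoSandbox(IntFlag):
    """Stand-in for ah.Sandbox; the numeric values are made up."""
    SCRIPTS = 1
    IMAGES = 2
    FORMS = 4


def combine_flags(flag_enum, names):
    """OR together the members of flag_enum listed in names (at least one)."""
    members = [getattr(flag_enum, name) for name in names]
    return reduce(or_, members)


security = combine_flags(DemoSandbox, ["SCRIPTS", "IMAGES"])
print(DemoSandbox.SCRIPTS in security)  # True
print(DemoSandbox.FORMS in security)    # False
```

With the real library, the same helper would presumably be used as configuration.security |= combine_flags(ah.Sandbox, ["SCRIPTS", "IMAGES"]). The membership test with in is a feature of Python's IntFlag and is not confirmed here for the Aspose enum.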
4. Is sandboxing available for other languages?
Yes. The same functionality is available in Aspose.HTML for .NET and Aspose.HTML for Java, with similar APIs for managing sandbox restrictions.
5. What are common use cases for sandboxing?
- Converting HTML emails or web pages from external sources;
- Processing scraped web data without executing active content;
- Running conversions in secure or offline environments;
- Improving conversion performance by skipping dynamic resources.
You can download the complete examples and data files from GitHub.