Convert PDF to HTML in Python

Convert PDF to HTML

Aspose.PDF for Python via .NET can convert many file formats to PDF and convert PDF files to many output formats. This article shows how to convert a PDF file to HTML, which is useful when you want to publish PDF content on a website or post it to an online forum. The conversion takes only a few lines of Python code.

Steps: Convert PDF to HTML in Python

  1. Create a Document object from the source PDF file.
  2. Create an HtmlSaveOptions object and pass it to the save() method together with the output path.

    from os import path
    import aspose.pdf as apdf

    # self.data_dir, infile, and outfile come from the surrounding class and method.
    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    # Load the source PDF.
    document = apdf.Document(path_infile)
    # Save as HTML using the default HTML save options.
    save_options = apdf.HtmlSaveOptions()
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)
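
The snippet above depends on `self.data_dir`, `infile`, and `outfile` from its surrounding class. As a minimal standalone sketch, the same two steps can be wrapped in a function. The names `build_paths` and `convert_pdf_to_html` are hypothetical helpers introduced here, the `python` output subfolder simply mirrors the snippet, and creating that folder up front is an assumption rather than something the library is known to require.

```python
from pathlib import Path


def build_paths(data_dir, infile, outfile):
    """Return (input path, output path) using the layout from the snippet."""
    base = Path(data_dir)
    return base / infile, base / "python" / outfile


def convert_pdf_to_html(data_dir, infile, outfile):
    """Convert one PDF to HTML with default HtmlSaveOptions."""
    # Imported here so the path helper works without the library installed.
    import aspose.pdf as apdf

    path_infile, path_outfile = build_paths(data_dir, infile, outfile)
    # Assumption: the output folder may not exist yet, so create it.
    path_outfile.parent.mkdir(parents=True, exist_ok=True)

    document = apdf.Document(str(path_infile))
    document.save(str(path_outfile), apdf.HtmlSaveOptions())
    return path_outfile
```

Returning the output path lets the caller log or post-process the result instead of relying on a print statement.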

Convert PDF to HTML and save images in a specified folder

This snippet converts a PDF file into HTML format using Aspose.PDF for Python via .NET. All extracted images are stored in a specified folder instead of being embedded in the HTML file.

  1. Load the source PDF using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and set ‘special_folder_for_all_images’ to the target folder.
  3. Save the document as HTML with external images.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.special_folder_for_all_images = self.data_dir
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)
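
To check whether the generated HTML really references external image files, one option is to scan it with the standard library. The sketch below is not part of Aspose.PDF. `ImageSourceCollector` and `split_image_sources` are hypothetical names, and it only inspects `<img>` tags, on the assumption that images appear there; the exact markup the converter emits is not described in this article.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def split_image_sources(html_text):
    """Return (external, embedded) lists of <img> sources.

    A source counts as embedded when it is a data: URI.
    """
    collector = ImageSourceCollector()
    collector.feed(html_text)
    external = [s for s in collector.sources if not s.startswith("data:")]
    embedded = [s for s in collector.sources if s.startswith("data:")]
    return external, embedded
```

After a conversion with `special_folder_for_all_images` set, the embedded list would be expected to be empty if the option behaves as described.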

Convert PDF to Multi-Page HTML

This snippet converts a PDF file into multi-page HTML, where each PDF page is exported as a separate HTML file. This makes the output easier to navigate and reduces loading time for large PDFs.

  1. Load the source PDF using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and set ‘split_into_pages’ to True.
  3. Save the document as HTML with pages split into separate files.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.split_into_pages = True
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)
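
Once the pages are split, it is often useful to enumerate the generated files, for example to build an index page. The naming scheme of the per-page files is not stated in this article, so the sketch below assumes only that they are `.html` files in one output folder and sorts them in natural order. `natural_key` and `list_html_pages` are hypothetical helper names.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders a name ending in 10 after one ending in 2."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", path.name)]


def list_html_pages(output_dir):
    """Return all .html files in output_dir in natural order."""
    return sorted(Path(output_dir).glob("*.html"), key=natural_key)
```

Natural ordering matters here because a plain string sort would place a file numbered 10 before a file numbered 2.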

Convert PDF to HTML and save SVG images in a specified folder

This snippet converts a PDF into HTML format while storing SVG images in a specified folder instead of embedding them directly in the HTML.

  1. Load the source PDF using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and set ‘special_folder_for_svg_images’ to the target folder.
  3. Save the document as HTML with external SVG images.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.special_folder_for_svg_images = self.data_dir
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)

Convert PDF to HTML and save compressed SVG images

This snippet converts a PDF into HTML format, storing SVG images in a specified folder and compressing them to reduce file size.

  1. Load the PDF document using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and:
    • Set ‘special_folder_for_svg_images’ to store SVG images externally.
    • Enable ‘compress_svg_graphics_if_any’ to compress SVG images.
  3. Save the document as HTML with compressed external SVG images.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.special_folder_for_svg_images = self.data_dir
    save_options.compress_svg_graphics_if_any = True
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)
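
To see whether compression took effect, the SVG folder can be inspected after the conversion. The sketch below assumes that compressed SVG output is gzip data (the usual SVGZ convention); this article does not state the exact format, so treat that as an assumption. `is_gzip_compressed` and `summarize_svg_folder` are hypothetical helper names.

```python
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip_compressed(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def summarize_svg_folder(folder):
    """Map each .svg/.svgz file name to (size in bytes, gzip-compressed?)."""
    summary = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in {".svg", ".svgz"}:
            summary[path.name] = (path.stat().st_size, is_gzip_compressed(path))
    return summary
```

Comparing the summary from a run with `compress_svg_graphics_if_any` enabled against a run without it gives a direct view of the size difference.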

Convert PDF to HTML with control of Embedded Raster Images

This snippet converts a PDF into HTML format, embedding raster images as PNG page backgrounds. This approach preserves image quality and page layout within the HTML.

  1. Load the PDF document using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and set ‘raster_images_saving_mode’ to ‘AS_EMBEDDED_PARTS_OF_PNG_PAGE_BACKGROUND’.
  3. Save the document as HTML with embedded raster images.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.raster_images_saving_mode = apdf.HtmlSaveOptions.RasterImagesSavingModes.AS_EMBEDDED_PARTS_OF_PNG_PAGE_BACKGROUND
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)

Convert PDF to Body-Only content HTML page

This snippet converts a PDF into HTML format, generating ‘body-only’ content without extra ‘html’ or ‘head’ tags, and splits the output into separate pages.

  1. Load the PDF document using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and configure:
    • ‘html_markup_generation_mode = WRITE_ONLY_BODY_CONTENT’ to generate only the ‘body’ content.
    • ‘split_into_pages’ to create separate HTML files for each PDF page.
  3. Save the document as HTML with the specified options.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.html_markup_generation_mode = apdf.HtmlSaveOptions.HtmlMarkupGenerationModes.WRITE_ONLY_BODY_CONTENT
    save_options.split_into_pages = True
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)
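
Body-only output is typically meant to be placed inside a page you control. As a minimal sketch of that step, the hypothetical helper `wrap_body_fragment` below wraps a fragment in a simple HTML shell. It assumes the fragment is text that can be placed inside ‘body’ unchanged and that UTF-8 is an acceptable charset for the final page.

```python
from html import escape


def wrap_body_fragment(fragment, title="Converted page"):
    """Wrap body-only HTML in a minimal document shell."""
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        # The title is escaped; the fragment is inserted as-is.
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{fragment}\n"
        "</body>\n</html>\n"
    )
```

In practice the shell would more likely be a site template with its own styles and navigation; the point is that the fragment needs no stripping of ‘html’ or ‘head’ tags first.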

Convert PDF to HTML with Transparent Text Rendering

This snippet converts a PDF into HTML format while keeping transparent text in the output and saving shadowed text (text covered by other elements, such as images) as transparent text. This preserves the visual appearance of the page while keeping that text available in the HTML.

  1. Load the PDF document using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and configure:
    • Set ‘save_transparent_texts’ to True to keep transparent text in the HTML output.
    • Set ‘save_shadowed_texts_as_transparent_texts’ to True to save shadowed text as transparent text.
  3. Save the document as HTML with transparent text rendering.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.save_transparent_texts = True
    save_options.save_shadowed_texts_as_transparent_texts = True
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)

Convert PDF to HTML with Document Layers Rendering

This snippet converts a PDF into HTML format, preserving document layers by converting marked content into separate layers in the output HTML. This allows layered elements (like annotations, backgrounds, and overlays) to be rendered accurately.

  1. Load the PDF document using ‘apdf.Document’.
  2. Create ‘HtmlSaveOptions’ and enable ‘convert_marked_content_to_layers’ to preserve layers.
  3. Save the document as HTML with layered content.
  4. Print a confirmation message.

    from os import path
    import aspose.pdf as apdf

    path_infile = path.join(self.data_dir, infile)
    path_outfile = path.join(self.data_dir, "python", outfile)
    document = apdf.Document(path_infile)
    save_options = apdf.HtmlSaveOptions()
    save_options.convert_marked_content_to_layers = True
    document.save(path_outfile, save_options)

    print(infile + " converted into " + outfile)
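
The snippets in this article differ only in which ‘HtmlSaveOptions’ attributes they set. As a sketch of how to factor that out, the hypothetical helper `apply_options` below sets several options from keyword arguments and rejects misspelled names. It assumes the options object exposes its settings as ordinary Python attributes, as the snippets suggest; it is not part of the library.

```python
def apply_options(save_options, **settings):
    """Set each named option on save_options and return the object.

    Unknown names raise AttributeError so that a typo is not silently
    stored as a new, unused attribute.
    """
    for name, value in settings.items():
        if not hasattr(save_options, name):
            raise AttributeError(f"unknown save option: {name!r}")
        setattr(save_options, name, value)
    return save_options


# Possible usage with the library (not executed here):
#   import aspose.pdf as apdf
#   options = apply_options(apdf.HtmlSaveOptions(), split_into_pages=True)
#   document.save(path_outfile, options)
```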