Convert PDF to HTML in .NET

Overview

This article explains how to convert PDF to HTML using C#. It covers these topics.

Convert PDF to HTML

The following code snippet also works with the Aspose.PDF.Drawing library.

Convert PDF to HTML

Aspose.PDF for .NET provides many features for converting various file formats into PDF documents and converting PDF files into various output formats. This article discusses how to convert a PDF file into HTML. Aspose.PDF for .NET provides the capability to convert HTML files into PDF format using the InLineHtml approach. We have had many requests for functionality that converts a PDF file into HTML format and have provided this feature. Please note that this feature also supports XHTML 1.0.

Aspose.PDF for .NET supports the features to convert a PDF file into HTML. The main tasks you can accomplish with the Aspose.PDF library are listed:

Convert PDF to HTML.
Splitting Output to Multi-page HTML.
Specify Folder for Storing SVG Files.
Compressing SVG Images During Conversion.
Save images as PNG background.
Specifying the Images Folder.
Create Subsequent Files with Body Contents Only.
Transparent Text rendering.
PDF document layers rendering.

Try to convert PDF to HTML online

Aspose.PDF for .NET presents you with an online free application “PDF to HTML”, where you may try to investigate the functionality and quality it works.

Aspose.PDF for .NET provides a two-line code for transforming a source PDF file to HTML. The SaveFormat enumeration contains the value Html which lets you save the source file to HTML. The following code snippet shows the process of converting a PDF file into HTML.

Convert PDF to HTML

Create an instance of the Document object with the source PDF document.
Save it to SaveFormat.Html format by calling Document.Save() method.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFtoHTML()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Save the output HTML
        document.Save(dataDir + "output_out.html", Aspose.Pdf.SaveFormat.Html);
    }
}

Splitting Output to Multi-page HTML

When converting a large PDF file with several pages to HTML format, the output appears as a single HTML page. It can end up being very long. To control page size, it is possible to split the output into several pages during PDF to HTML conversion. Please try using the following code snippet.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFtoMultiPageHTML()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Instantiate HTML SaveOptions object
        var htmlOptions = new Aspose.Pdf.HtmlSaveOptions
        {
            // Specify to split the output into multiple pages
            SplitIntoPages = true
        };

        // Save the output HTML
        document.Save(dataDir + "MultiPageHTML_out.html", htmlOptions);
    }
}

Specify Folder for Storing SVG Files

During PDF to HTML conversion, it is possible to specify the folder that SVG images should be saved to. Use the HtmlSaveOption class SpecialFolderForSvgImages property to specify a special SVG image directory. This property gets or sets the path to the directory to which SVG images must be saved to when encountered during conversion. If the parameter is empty or null, then any SVG files are saved together with other image files.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void SavePDFtoHTMLWithSVG()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Instantiate HTML save options object
        var newOptions = new Aspose.Pdf.HtmlSaveOptions
        {
            // Specify the folder where SVG images are saved during PDF to HTML conversion
            SpecialFolderForSvgImages = dataDir
        };

        // Save the output HTML
        document.Save(dataDir + "SaveSVGFiles_out.html", newOptions);
    }
}

Compressing SVG Images During Conversion

To compress SVG images during PDF to HTML conversion, please try using the following code:

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void SavePDFtoCompressedHTMLWithSVG()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Create HtmlSaveOptions with tested feature
        var newOptions = new Aspose.Pdf.HtmlSaveOptions
        {
            // Compress the SVG images if there are any
            CompressSvgGraphicsIfAny = true
        };

        // Save the output HTML
        document.Save(dataDir + "CompressedSVGHTML_out.html", newOptions);
    }
}

Save images as PNG background

The default output format for saving images is SVG. During conversion, some images from PDF are transformed into SVG vector images. This could be slow. Instead, the images could be transformed into one PNG background file for each page.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void PdfToHtmlSaveImagesAsPngBackground()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_DocumentConversion_PDFToHTMLFormat();
           
    // Create HtmlSaveOption with tested feature
    var htmlSaveOptions = new HtmlSaveOptions();
           
    // Option to save images in PNG format as background for each page.
    htmlSaveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;

    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
       document.Save(dataDir + "imagesAsPngBackground_out.html", htmlSaveOptions);         
    }
}

Specifying the Images Folder

We can also specify the folder that images will be saved to during PDF to HTML conversion:

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void SavePDFtoHTMLWithSeparateImageFolder()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Create HtmlSaveOptions with tested feature
        var newOptions = new Aspose.Pdf.HtmlSaveOptions
        {
            // Specify the separate folder to save images
            SpecialFolderForAllImages = dataDir
        };

        // Save the output HTML
        document.Save(dataDir + "HTMLWithSeparateImageFolder_out.html", newOptions);
    }
}

Create Subsequent Files with Body Contents Only

Recently, we were asked to introduce a feature where PDF files are converted to HTML and the user can get only the contents of the <body> tag for each page. This would produce one file with CSS, <html>, <head> details and all pages in other files just with <body> contents.

To meet this requirement, a new property, HtmlMarkupGenerationMode, was introduced to the HtmlSaveOptions class.

With the following simple code snippet, you can split the output HTML into pages. In the output pages, all HTML objects must go exactly where they go now (fonts processing and output, CSS creation and output, images creation and output), except that the output HTML will contain contents currently placed inside the tags (now “body” tags will be omitted). However, when using this approach, the link to the CSS is the responsibility of your code, because things like will be stripped out. For this purpose, you may read the CSS via File.ReadAllText() and send it via AJAX to a web page where it will be applied by jQuery.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFToHTMLWithBodyContent()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Initialize HtmlSaveOptions
        var options = new Aspose.Pdf.HtmlSaveOptions
        {
            // Set HtmlMarkupGenerationMode to generate only body content
            HtmlMarkupGenerationMode =
                Aspose.Pdf.HtmlSaveOptions.HtmlMarkupGenerationModes.WriteOnlyBodyContent,

            // Specify to split the output into multiple pages
            SplitIntoPages = true
        };

        // Save the output HTML
        document.Save(dataDir + "CreateSubsequentFiles_out.html", options);
    }
}

Transparent Text rendering

In case the source/input PDF file contains transparent texts shadowed by foreground images, then there might be text rendering issues. So in order to cater to such scenarios, SaveShadowedTextsAsTransparentTexts and SaveTransparentTexts properties can be used.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFToHTMLWithTransparentTextRendering()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Initialize HtmlSaveOptions
        var htmlOptions = new Aspose.Pdf.HtmlSaveOptions
        {
            // Enable transparent text rendering
            SaveShadowedTextsAsTransparentTexts = true,
            SaveTransparentTexts = true
        };

        // Save the output HTML
        document.Save(dataDir + "TransparentTextRendering_out.html", htmlOptions);
    }
}

PDF document layers rendering

We can render PDF document layers in a separate layer type element during PDF to HTML conversion:

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFToHTMLWithLayersRendering()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        // Instantiate HTML SaveOptions object
        var htmlOptions = new Aspose.Pdf.HtmlSaveOptions
        {
            // Enable rendering of PDF document layers separately in the output HTML
            ConvertMarkedContentToLayers = true
        };

        // Save the output HTML
        document.Save(dataDir + "LayersRendering_out.html", htmlOptions);
    }
}

Saving output HTML into Stream - PDF to HTML

This code shows an example of saving the received file to a stream. Any type of stream can be used for saving. This code fragment also describes how custom saving strategies work.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static string _folderForReferencedResources;

private static void ConvertPDFtoHTMLWithStream()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var doc = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
    {
        var streamFilePath = dataDir + "saveToStream.html";
        _folderForReferencedResources = dataDir + @"/saveFolder/";

        // Cleaning existing files
        if (Directory.Exists(_folderForReferencedResources))
        {
            Directory.Delete(_folderForReferencedResources, true);
        }
        // Cleaning existing files
        if (File.Exists(streamFilePath))
        {
            File.Delete(streamFilePath);
        }

        var saveOptions = new Aspose.Pdf.HtmlSaveOptions();
        // Setting up custom strategies for conversion
        saveOptions.CustomResourceSavingStrategy = new Aspose.Pdf.HtmlSaveOptions.ResourceSavingStrategy(StrategyCustomSaveResources);
        saveOptions.CustomCssSavingStrategy = new Aspose.Pdf.HtmlSaveOptions.CssSavingStrategy(StrategyCssSaving);
        saveOptions.CustomStrategyOfCssUrlCreation = new Aspose.Pdf.HtmlSaveOptions.CssUrlMakingStrategy(StrategyCssNaming);

        using (Stream saveStream = File.OpenWrite(streamFilePath))
        {
            // Save document to the stream
            doc.Save(saveStream, saveOptions);
        }
    }
}

private static void StrategyCssSaving(Aspose.Pdf.HtmlSaveOptions.CssSavingInfo resourceInfo)
{
    // Cleaning existing files
    if (!System.IO.Directory.Exists(_folderForReferencedResources))
    {
        System.IO.Directory.CreateDirectory(_folderForReferencedResources);
    }
    // Create a path for a css
    string path = _folderForReferencedResources + Path.GetFileName(resourceInfo.SupposedURL);
    var reader = new System.IO.BinaryReader(resourceInfo.ContentStream);
    // Recording a css at the created path
    System.IO.File.WriteAllBytes(path, reader.ReadBytes((int)resourceInfo.ContentStream.Length));
}

private static string StrategyCssNaming(Aspose.Pdf.HtmlSaveOptions.CssUrlRequestInfo requestInfo)
{
    return _folderForReferencedResources + "css_style{0}.css";
}

private static string StrategyCustomSaveResources(Aspose.Pdf.SaveOptions.ResourceSavingInfo resourceSavingInfo)
{
    // Cleaning existing files
    if (!System.IO.Directory.Exists(_folderForReferencedResources))
    {
        System.IO.Directory.CreateDirectory(_folderForReferencedResources);
    }
    // Create a path for a resource
    var resourcePath = _folderForReferencedResources + System.IO.Path.GetFileName(resourceSavingInfo.SupposedFileName);
    var binaryReader = new System.IO.BinaryReader(resourceSavingInfo.ContentStream);
    // Recording a resource at the created path
    System.IO.File.WriteAllBytes(resourcePath, binaryReader.ReadBytes((int)resourceSavingInfo.ContentStream.Length));
    var urlInHtml = _folderForReferencedResources.Replace(@"\", "/") + System.IO.Path.GetFileName(resourceSavingInfo.SupposedFileName);
    // Return the resource name to insert into html
    return urlInHtml;
}

Convert HTML to PDF in .NET Convert various Images formats to PDF in .NET