Convert PDF to HTML in .NET
Overview
This article explains how to convert PDF to HTML using C#. It covers these topics.
Format: HTML
The following code snippet also work with Aspose.PDF.Drawing library.
Convert PDF to HTML
Aspose.PDF for .NET provides many features for converting various file formats into PDF documents and converting PDF files into various output formats. This article discusses how to convert a PDF file into HTML. Aspose.PDF for .NET provides the capability to convert HTML files into PDF format using the InLineHtml approach. We have had many requests for functionality that converts a PDF file into HTML format and have provided this feature. Please note that this feature also supports XHTML 1.0.
Aspose.PDF for .NET support the features to convert a PDF file into HTML. The main tasks you can accomplish with the Aspose.PDF library are listed:
- Convert PDF to HTML.
- Splitting Output to Multi-page HTML.
- Specify Folder for Storing SVG Files.
- Compressing SVG Images During Conversion.
- Save images as PNG background.
- Specifying the Images Folder.
- Create Subsequent Files with Body Contents Only.
- Transparent Text rendering.
- PDF document layers rendering.
Try to convert PDF to HTML online
Aspose.PDF for .NET presents you online free application “PDF to HTML”, where you may try to investigate the functionality and quality it works.
Aspose.PDF for .NET provides a two-line code for transforming a source PDF file to HTML. The SaveFormat enumeration
contains the value Html which lets you save the source file to HTML. The following code snippet shows the process of converting a PDF file into HTML.
Steps: Convert PDF to HTML in C#
- Create an instance of Document object with the source PDF document.
- Save it to SaveFormat.Html format by calling Document.Save() method.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFtoHTML()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Save the output HTML
document.Save(dataDir + "output_out.html", Aspose.Pdf.SaveFormat.Html);
}
}
Splitting Output to Multi-page HTML
When converting large PDF file with several pages to HTML format, the output appears as a single HTML page. It can end up being very long. To control page size, it is possible to split the output into several pages during PDF to HTML conversion. Please try using the following code snippet.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFtoMultiPageHTML()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Instantiate HTML SaveOptions object
var htmlOptions = new Aspose.Pdf.HtmlSaveOptions
{
// Specify to split the output into multiple pages
SplitIntoPages = true
};
// Save the output HTML
document.Save(dataDir + "MultiPageHTML_out.html", htmlOptions);
}
}
Specify Folder for Storing SVG Files
During PDF to HTML conversion, it is possible to specify the folder that SVG images should be saved to. Use the HtmlSaveOption class
SpecialFolderForSvgImages property
to specify a special SVG image directory. This property gets or sets the path to the directory to which SVG images must be saved to when encountered during conversion. If the parameter is empty or null, then any SVG files are saved together with other image files.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void SavePDFtoHTMLWithSVG()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Instantiate HTML save options object
var newOptions = new Aspose.Pdf.HtmlSaveOptions
{
// Specify the folder where SVG images are saved during PDF to HTML conversion
SpecialFolderForSvgImages = dataDir
};
// Save the output HTML
document.Save(dataDir + "SaveSVGFiles_out.html", newOptions);
}
}
Compressing SVG Images During Conversion
To compress SVG images during PDF to HTML conversion, please try using the following code:
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void SavePDFtoCompressedHTMLWithSVG()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Create HtmlSaveOptions with tested feature
var newOptions = new Aspose.Pdf.HtmlSaveOptions
{
// Compress the SVG images if there are any
CompressSvgGraphicsIfAny = true
};
// Save the output HTML
document.Save(dataDir + "CompressedSVGHTML_out.html", newOptions);
}
}
Save images as PNG background
The default output format for saving images is SVG. During conversion, some images from PDF are transformed into SVG vector images. This could be slow. Instead, the images could be transformed into one PNG background file for each page.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void PdfToHtmlSaveImagesAsPngBackground()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf_DocumentConversion_PDFToHTMLFormat();
// Create HtmlSaveOption with tested feature
var htmlSaveOptions = new HtmlSaveOptions();
// Option to save images in PNG format as background for each page.
htmlSaveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
{
document.Save(dataDir + "imagesAsPngBackground_out.html", htmlSaveOptions);
}
}
Specifying the Images Folder
We can also specify the folder that images will be saved to during PDF to HTML conversion:
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void SavePDFtoHTMLWithSeparateImageFolder()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Create HtmlSaveOptions with tested feature
var newOptions = new Aspose.Pdf.HtmlSaveOptions
{
// Specify the separate folder to save images
SpecialFolderForAllImages = dataDir
};
// Save the output HTML
document.Save(dataDir + "HTMLWithSeparateImageFolder_out.html", newOptions);
}
}
Create Subsequent Files with Body Contents Only
Recently, we were asked to introduce a feature where PDF files are converted to HTML and the user can get only the contents of the <body>
tag for each page. This would produce one file with CSS, <html>
, <head>
details and all pages in other files just with <body>
contents.
To meet this requirement, a new property, HtmlMarkupGenerationMode, was introduced to the HtmlSaveOptions class.
With the following simple code snippet, you can split the output HTML into pages. In the output pages, all HTML objects must go exactly where they go now (fonts processing and output, CSS creation and output, images creation and output), except that the output HTML will contain contents currently placed inside thetags (now “body” tags will be omitted). However, when using this approach, the link to the CSS is the responsibility of your code, because things like will be stripped out. For this purpose, you may read the CSS via File.ReadAllText() and send it via AJAX to to a web page where it will be applied by jQuery.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFToHTMLWithBodyContent()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Initialize HtmlSaveOptions
var options = new Aspose.Pdf.HtmlSaveOptions
{
// Set HtmlMarkupGenerationMode to generate only body content
HtmlMarkupGenerationMode =
Aspose.Pdf.HtmlSaveOptions.HtmlMarkupGenerationModes.WriteOnlyBodyContent,
// Specify to split the output into multiple pages
SplitIntoPages = true
};
// Save the output HTML
document.Save(dataDir + "CreateSubsequentFiles_out.html", options);
}
}
Transparent Text rendering
In case the source/input PDF file contains transparent texts shadowed by foreground images, then there might be text rendering issues. So in order to cater such scenarios, SaveShadowedTextsAsTransparentTexts and SaveTransparentTexts properties can be used.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFToHTMLWithTransparentTextRendering()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Initialize HtmlSaveOptions
var htmlOptions = new Aspose.Pdf.HtmlSaveOptions
{
// Enable transparent text rendering
SaveShadowedTextsAsTransparentTexts = true,
SaveTransparentTexts = true
};
// Save the output HTML
document.Save(dataDir + "TransparentTextRendering_out.html", htmlOptions);
}
}
PDF document layers rendering
We can render PDF document layers in separate layer type element during PDF to HTML conversion:
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ConvertPDFToHTMLWithLayersRendering()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "PDFToHTML.pdf"))
{
// Instantiate HTML SaveOptions object
var htmlOptions = new Aspose.Pdf.HtmlSaveOptions
{
// Enable rendering of PDF document layers separately in the output HTML
ConvertMarkedContentToLayers = true
};
// Save the output HTML
document.Save(dataDir + "LayersRendering_out.html", htmlOptions);
}
}
See Also
This article also covers these topics. The codes are same as above.
Format: HTML