Convert HTML to Markdown in Java
Markdown (MD) is a lightweight markup language with a plain-text formatting syntax. It is commonly used for documentation and readme files because it is easy to read and easy to write. Today it can be converted to many output formats, although it was originally designed to convert only to HTML. Aspose.HTML for Java provides a class library for the reverse conversion, from HTML to Markdown, in Java applications.
In this article, you will find information on how to convert HTML to Markdown using the convertHTML() methods of the Converter class, and how to apply MarkdownSaveOptions.
HTML to Markdown with a few lines of Java code
You can convert HTML to Markdown in Java with very little code. The static methods of the Converter class are the easiest way to convert HTML into various formats. The following code snippet shows how to convert HTML to Markdown in just a few lines:
// Prepare HTML code and save it to a file
String code = "<h1>Convert HTML to Markdown Using Java</h1>" +
    "<h2>How to Convert HTML to MD in Java</h2>" +
    "<p>The Aspose.HTML for Java library allows you to convert HTML to Markdown.</p>";
FileHelper.writeAllText($o("conversion.html"), code);

// Call the convertHTML() method to convert HTML to Markdown
Converter.convertHTML($o("conversion.html"), new MarkdownSaveOptions(), $o("conversion.md"));
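The FileHelper.writeAllText() and $o() calls in the snippet above appear to be helpers from the example project rather than part of the conversion API. If they are not available in your project, a minimal stand-in based on java.nio.file might look like the sketch below. The class name HtmlSourceWriter, both method names, and the "output" directory are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for the FileHelper.writeAllText(...) and $o(...) helpers
// used in the example above; these names are not part of Aspose.HTML.
final class HtmlSourceWriter {

    // Resolves a file name against an output directory, creating the directory if needed.
    static Path outputPath(Path outputDir, String fileName) {
        try {
            Files.createDirectories(outputDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outputDir.resolve(fileName);
    }

    // Writes the text to the target file as UTF-8, replacing any existing content.
    static void writeAllText(Path target, String text) {
        try {
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String code = "<h1>Convert HTML to Markdown Using Java</h1>"
                + "<p>The Aspose.HTML for Java library allows you to convert HTML to Markdown.</p>";
        Path html = outputPath(Paths.get("output"), "conversion.html");
        writeAllText(html, code);
        // The file written here can then serve as the source for the conversion call.
        System.out.println("Wrote " + html.toAbsolutePath());
    }
}
```

Wrapping IOException in UncheckedIOException keeps the call sites as short as in the original snippet; whether that trade-off suits your project is a design choice.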
Save Options – MarkdownSaveOptions Class
MarkdownSaveOptions has a number of properties that give you control over the conversion process. The most important option is MarkdownFeatures, which lets you enable or disable the conversion of particular elements.
Method | Description |
---|---|
getDefault() | This method returns a set of options that are compatible with default Markdown documentation. |
setFeatures(value) | Sets the flag set that controls which HTML elements are converted to Markdown. |
setFormatter(value) | Sets the Markdown formatting style. |
getGit() | This method returns a set of options that are compatible with GitLab Flavored Markdown. |
getResourceHandlingOptions() | Gets a ResourceHandlingOptions object which is used for configuration of resources handling. |
To learn more about MarkdownSaveOptions, please read the Fine-Tuning Converters article.
In the Markdown Syntax article, you will find information on the main Markdown elements, details, and examples of the Markdown syntax.
Convert HTML to Markdown using MarkdownSaveOptions
setFeatures() method
To convert HTML to Markdown with custom MarkdownSaveOptions, follow these steps:
- One popular scenario is to load an HTML file using one of the HTMLDocument() constructors of the HTMLDocument class. But in this example, we create an HTML source from scratch by preparing HTML code and saving it to a file.
- Create a new MarkdownSaveOptions object. The MarkdownSaveOptions object can be customized to specify different settings for the conversion process.
- Use the convertHTML(sourcePath, options, outputPath) method of the Converter class to save HTML as a Markdown file.
The following example shows how to convert only links and paragraphs; other HTML elements remain as is:
// Prepare HTML code and save it to the file
String code = "<h1>Header 1</h1>" +
    "<h2>Header 2</h2>" +
    "<p>Hello, World!!</p>" +
    "<a href='aspose.com'>aspose</a>";
FileHelper.writeAllText($o("options.html"), code);

// Create an instance of MarkdownSaveOptions and set up the rule:
// - only <a> and <p> elements will be converted to Markdown
MarkdownSaveOptions options = new MarkdownSaveOptions();
options.setFeatures(MarkdownFeatures.Link | MarkdownFeatures.AutomaticParagraph);

// Call the convertHTML() method to convert the HTML to Markdown
Converter.convertHTML($o("options.html"), options, $o("options-output.md"));
In the Java code above, the options object is created, and two features are enabled using the setFeatures() method. The Link feature specifies that HTML <a> elements will be converted to Markdown, while the AutomaticParagraph feature specifies that HTML <p> elements will be converted to Markdown. Any other elements in the HTML document will not be converted.
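The | operator in the setFeatures() call suggests that the features behave as bit flags, so several of them can be packed into one value. The sketch below illustrates that general technique with hypothetical constants. The numeric values and the isEnabled() helper are assumptions for illustration and are not taken from the library.

```java
// Hypothetical flag values for illustration only; the real constants are defined
// by the library's MarkdownFeatures type and may differ.
final class FeatureFlagsSketch {
    static final int HEADER = 1;                    // binary 0001
    static final int LINK = 1 << 1;                 // binary 0010
    static final int AUTOMATIC_PARAGRAPH = 1 << 2;  // binary 0100
    static final int LIST = 1 << 3;                 // binary 1000

    // Returns true when every bit of 'feature' is present in 'enabled'.
    static boolean isEnabled(int enabled, int feature) {
        return (enabled & feature) == feature;
    }

    public static void main(String[] args) {
        // Combine two features into one value, as in the setFeatures() call above.
        int enabled = LINK | AUTOMATIC_PARAGRAPH;
        System.out.println("Link enabled: " + isEnabled(enabled, LINK));
        System.out.println("Header enabled: " + isEnabled(enabled, HEADER));
    }
}
```

Under this scheme, adding another feature is a matter of OR-ing one more constant into the value passed to setFeatures().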
getGit() method
GitHub Flavored Markdown is the GitHub.com version of the Markdown syntax that provides an additional set of helpful features that make it easier to work with content on GitHub.com. It is an extension of the standard Markdown syntax and adds many additional features, including code highlighting, task lists, tables, and more.
To convert HTML to Markdown, you can define your own set of rules or use one of the predefined templates. For instance, you can use the template based on GitLab Flavored Markdown syntax:
// Prepare HTML code and save it to the file
String code = "<h1>Header 1</h1>" +
    "<h2>Header 2</h2>" +
    "<p>Hello World!!</p>";
FileHelper.writeAllText($o("document.html"), code);

// Call the convertHTML() method to convert HTML to Markdown
Converter.convertHTML($o("document.html"), MarkdownSaveOptions.getGit(), $o("output-git.md"));
In the Java example above, the convertHTML(sourcePath, options, outputPath) method performs the conversion. It takes three arguments: sourcePath, options, and outputPath. The second argument is an instance of MarkdownSaveOptions. We use the getGit() method of MarkdownSaveOptions, which returns a predefined set of options for Git-flavored Markdown output. The generated Markdown file therefore follows that flavor's extended syntax rather than only the basic Markdown rules.
Limitations
Markdown is a lightweight and easy-to-use syntax, so not every HTML element can be converted: some elements have no equivalent in Markdown. Elements such as STYLE, SCRIPT, LINK, and EMBED are discarded during conversion.
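Before converting, it can help to know whether a document contains elements that will be dropped. The sketch below is a rough, regex-based check that covers only the four elements named above. The class and method names are hypothetical, and a real HTML parser would be more reliable than pattern matching on markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-check, not part of Aspose.HTML.
final class DiscardedElementCheck {
    // Only the elements named in the text; the actual list of discarded elements may be longer.
    static final String[] DISCARDED = {"style", "script", "link", "embed"};

    // Returns the names of listed elements that occur as opening tags in the HTML.
    static List<String> findDiscarded(String html) {
        List<String> found = new ArrayList<>();
        for (String tag : DISCARDED) {
            // Matches e.g. <script>, <link rel="...">, or <embed/>, ignoring case.
            Pattern opening = Pattern.compile("<" + tag + "(\\s|>|/)", Pattern.CASE_INSENSITIVE);
            if (opening.matcher(html).find()) {
                found.add(tag);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<style>p { color: red; }</style><p>Hello</p><script>alert(1)</script>";
        System.out.println("Will be discarded: " + findDiscarded(html));
    }
}
```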
Inline HTML
Markdown allows you to include raw HTML code, which is rendered as is. This feature is called “Inline HTML”. To use it, place one of the elements supported by this feature at the beginning of a new line. Alternatively, you can mark such an element as inline HTML by adding the markdown attribute with the value inline to it. Here is a small example that demonstrates how to use this attribute:
// Prepare HTML code and save it to the file.
String code = "text<div markdown='inline'><code>text</code></div>";
FileHelper.writeAllText($o("inline.html"), code);

// Call the convertHTML() method to convert HTML to Markdown.
Converter.convertHTML($o("inline.html"), new MarkdownSaveOptions(), $o("inline-html.md"));

// Output file will contain: text\r\n<div markdown="inline"><code>text</code></div>
As you can see, the content of the <div> element is not converted to Markdown and is treated by the Markdown processor as is. The list of elements that support this feature differs between Markdown processors.
The original Markdown specification supports these tags: BLOCKQUOTE, H1, H2, H3, H4, H5, H6, P, PRE, OL, UL, DL, DIV, INS, DEL, IFRAME, FIELDSET, NOSCRIPT, FORM, MATH.
The GitLab Flavored Markdown extends this list with the following tags: ARTICLE, FOOTER, NAV, ASIDE, HEADER, ADDRESS, HR, DD, FIGURE, FIGCAPTION, ABBR, VIDEO, AUDIO, OUTPUT, CANVAS, SECTION, DETAILS, HGROUP, SUMMARY.
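The two tag lists above can be captured in a small lookup, which may be handy when deciding whether an element can be kept as inline HTML for a given Markdown flavor. This is a minimal sketch that only mirrors the lists in this article. The class and method names are hypothetical and do not come from the library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical lookup built from the tag lists in this article.
final class InlineHtmlTags {
    static final Set<String> ORIGINAL = new HashSet<>(Arrays.asList(
            "BLOCKQUOTE", "H1", "H2", "H3", "H4", "H5", "H6", "P", "PRE", "OL", "UL", "DL",
            "DIV", "INS", "DEL", "IFRAME", "FIELDSET", "NOSCRIPT", "FORM", "MATH"));

    static final Set<String> GITLAB_EXTRA = new HashSet<>(Arrays.asList(
            "ARTICLE", "FOOTER", "NAV", "ASIDE", "HEADER", "ADDRESS", "HR", "DD", "FIGURE",
            "FIGCAPTION", "ABBR", "VIDEO", "AUDIO", "OUTPUT", "CANVAS", "SECTION", "DETAILS",
            "HGROUP", "SUMMARY"));

    // Returns true if the tag is in the original list, or in the extended list
    // when the GitLab-flavored set is requested. Tag names are case-insensitive.
    static boolean supportsInlineHtml(String tag, boolean gitLabFlavored) {
        String name = tag.toUpperCase(Locale.ROOT);
        return ORIGINAL.contains(name) || (gitLabFlavored && GITLAB_EXTRA.contains(name));
    }

    public static void main(String[] args) {
        System.out.println("div (original): " + supportsInlineHtml("div", false));
        System.out.println("section (original): " + supportsInlineHtml("section", false));
        System.out.println("section (GitLab): " + supportsInlineHtml("section", true));
    }
}
```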
Feature nesting
Although Markdown supports a wide range of features, not all of them can be combined. For example, list elements inside table elements will not be converted. The table below shows which features can be nested. Each feature is a member of the MarkdownFeatures enumeration.
Parent feature | Features which can be processed inside |
---|---|
Header | Link, Emphasis, Strong, InlineCode, Image, Strikethrough, Video |
Blockquote | Any |
List | AutomaticParagraph, Link, Emphasis, Strong, InlineCode, Image, LineBreak, Strikethrough, Video, TaskList, List |
Link | Emphasis, Strong, InlineCode, Image, LineBreak, Strikethrough |
AutomaticParagraph | Link, Emphasis, Strong, InlineCode, Image, LineBreak, Strikethrough |
Strikethrough | Link, Emphasis, Strong, InlineCode, Image, LineBreak |
Table | Video, Strikethrough, Image, InlineCode, Emphasis, Strong, Link |
Emphasis | Link, InlineCode, Image, LineBreak, Strikethrough, Video |
Strong | Link, InlineCode, Image, LineBreak, Strikethrough, Video |
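The nesting table can also be expressed as a lookup, for example to check in advance whether a feature will be processed inside a given parent. The sketch below encodes the rows above using the feature names as plain strings. The class and method names are hypothetical, and treating parents that are not listed in the table as allowing nothing is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical lookup that mirrors the nesting table in this article.
final class FeatureNestingSketch {
    private static final Map<String, Set<String>> ALLOWED = new HashMap<>();

    private static void allow(String parent, String... children) {
        ALLOWED.put(parent, new HashSet<>(Arrays.asList(children)));
    }

    static {
        allow("Header", "Link", "Emphasis", "Strong", "InlineCode", "Image", "Strikethrough", "Video");
        allow("List", "AutomaticParagraph", "Link", "Emphasis", "Strong", "InlineCode", "Image",
                "LineBreak", "Strikethrough", "Video", "TaskList", "List");
        allow("Link", "Emphasis", "Strong", "InlineCode", "Image", "LineBreak", "Strikethrough");
        allow("AutomaticParagraph", "Link", "Emphasis", "Strong", "InlineCode", "Image",
                "LineBreak", "Strikethrough");
        allow("Strikethrough", "Link", "Emphasis", "Strong", "InlineCode", "Image", "LineBreak");
        allow("Table", "Video", "Strikethrough", "Image", "InlineCode", "Emphasis", "Strong", "Link");
        allow("Emphasis", "Link", "InlineCode", "Image", "LineBreak", "Strikethrough", "Video");
        allow("Strong", "Link", "InlineCode", "Image", "LineBreak", "Strikethrough", "Video");
    }

    // Returns true if 'child' can be processed inside 'parent' according to the table.
    static boolean canNest(String parent, String child) {
        if ("Blockquote".equals(parent)) {
            return true; // the table lists "Any" for Blockquote
        }
        Set<String> children = ALLOWED.get(parent);
        // Assumption: a parent that is absent from the table allows no nested features.
        return children != null && children.contains(child);
    }

    public static void main(String[] args) {
        System.out.println("List inside Table: " + canNest("Table", "List"));
        System.out.println("List inside List: " + canNest("List", "List"));
        System.out.println("Table inside Blockquote: " + canNest("Blockquote", "Table"));
    }
}
```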
Conclusion
Aspose.HTML for Java provides powerful tools for HTML-to-Markdown conversion, leveraging the convertHTML() method and customizable MarkdownSaveOptions.
The MarkdownSaveOptions class gives developers fine-grained control over the conversion process. It includes features to enable or disable specific HTML elements, set formatting styles, and handle resources efficiently. For advanced scenarios, predefined options like GitHub Flavored Markdown (GFM) are available, allowing seamless integration with platforms that support extended Markdown syntax.
However, limitations exist due to Markdown’s lightweight nature. Certain HTML elements, such as <style> and <script>, lack direct Markdown equivalents and are omitted during conversion. Nonetheless, Markdown’s support for “inline HTML” provides a workaround for including unsupported elements.
You can download the complete examples and data files from GitHub.
Aspose.HTML offers a free online HTML to Markdown Converter that converts HTML to Markdown quickly, easily, and with high quality. Just upload your files, convert them, and get the results in a few seconds!