Parse and Process Markdown in C#

Parse Markdown to a Syntax Tree (Markdown AST) in C#

Aspose.HTML for .NET provides a dedicated Markdown parsing API in the Aspose.Html.Toolkit.Markdown.Syntax namespace. The MarkdownParser class (from Aspose.Html.Toolkit.Markdown.Syntax.Parser) converts a .md file into a strongly typed MarkdownSyntaxTree – a Markdown Abstract Syntax Tree (AST) that represents the full document structure.

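Each Markdown element is mapped to a specific node type. For example, ATX headings become AtxHeadingSyntaxNode, GFM tables become TableSyntaxNode, and emphasis becomes EmphasisSyntaxNode.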

Unlike text-based Markdown converters, this API gives you direct programmatic access to the document structure before rendering. Instead of converting Markdown to HTML first, you can analyze, validate, filter, or modify the syntax tree at the structural level.

To build a syntax tree:

using Aspose.Html.Toolkit.Markdown.Syntax.Parser;

var parser = new MarkdownParser();
var syntaxTree = parser.ParseFile("document.md");

syntaxTree is the root node of the Markdown AST and provides access to all child nodes in the document. In the following sections, you will learn how to traverse the syntax tree, extract headings and tables, and filter nodes by type.

Traverse the Markdown Syntax Tree in C#

Once the Markdown file is parsed into a MarkdownSyntaxTree, you can traverse its nodes to inspect or process document elements. The Markdown API does not support index-based access to children: the NodeList returned by ChildNodes() has no indexer. Instead, traversal follows a pointer-based pattern using the FirstChild and NextSibling properties.

Example: iterating through top-level nodes of the document:

var node = syntaxTree.FirstChild;

while (node != null)
{
    // Process node
    Console.WriteLine(node.GetType().Name);

    node = node.NextSibling;
}

This pattern ensures efficient and safe traversal of the Markdown AST without relying on list indexing.
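If the while loop feels repetitive, the same FirstChild → NextSibling walk can be wrapped in an iterator so that foreach and LINQ work on child nodes. The sketch below is only an illustration: EnumerateChildren and MarkdownNodeExtensions are hypothetical names, not part of Aspose.HTML, and the code assumes FirstChild and NextSibling return MarkdownSyntaxNode, as the other examples in this article suggest.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Html.Toolkit.Markdown.Syntax;
using Aspose.Html.Toolkit.Markdown.Syntax.Parser;

var syntaxTree = new MarkdownParser().ParseFile("document.md");

// foreach over the direct children without needing an indexer
foreach (var child in syntaxTree.EnumerateChildren())
    Console.WriteLine(child.GetType().Name);

// Hypothetical helper (not an Aspose.HTML API): exposes the
// FirstChild -> NextSibling walk as a lazy sequence.
static class MarkdownNodeExtensions
{
    public static IEnumerable<MarkdownSyntaxNode> EnumerateChildren(this MarkdownSyntaxNode node)
    {
        for (var child = node.FirstChild; child != null; child = child.NextSibling)
            yield return child;
    }
}
```

Because the sequence is lazy, removing or moving the current node inside the loop can cut the walk short. When you plan to modify the tree, collect the nodes into a list first.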

For more advanced scenarios, such as visiting only specific node types (e.g., tables or headings), you can use CreateTreeWalker() together with a MarkdownSyntaxNodeFilter. This approach enables selective traversal of the syntax tree.

Extract Headings from Markdown in C#

After parsing a Markdown file into a MarkdownSyntaxTree, you can programmatically extract specific node types. This example demonstrates how to locate all ATX-style headings (#, ##, ###), determine their nesting level (H1–H6), and reconstruct their hierarchy using the Markdown AST.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using Aspose.Html.Toolkit.Markdown.Syntax;
using Aspose.Html.Toolkit.Markdown.Syntax.Parser;
...

// Parse a Markdown file and extract ATX headings (# H1, ## H2, ### H3)
// preserving their hierarchy and nesting level using Aspose.HTML for .NET.

// Initialize the parser and build a syntax tree from the .md file
var parser = new MarkdownParser();
var syntaxTree = parser.ParseFile("document.md");

var headings = new List<(int Level, string Text)>();

// Walk top-level nodes using FirstChild -> NextSibling
var topNode = syntaxTree.FirstChild;
while (topNode != null)
{
    if (topNode is AtxHeadingSyntaxNode heading)
    {
        // Determine heading level (1–6) by counting leading '#' tokens
        int level = heading.GetLeadingTrivia()
                           .Count(t => t.ToString().Trim() == "#");

        // Fallback: count '#' characters at the start of the node text
        if (level == 0)
            level = heading.ToString().TakeWhile(c => c == '#').Count();

        // Extract heading text from child nodes
        var sb = new StringBuilder();
        var child = heading.FirstChild;
        while (child != null)
        {
            if (child is TextSyntaxNode || child is WhitespaceSyntaxNode)
                sb.Append(child.ToString());
            child = child.NextSibling;
        }

        headings.Add((level, sb.ToString().Trim()));
    }

    topNode = topNode.NextSibling;
}

// Output heading hierarchy with indentation reflecting nesting depth
foreach (var (level, text) in headings)
{
    string indent = new string(' ', (level - 1) * 2);
    string marker = new string('#', level);
    Console.WriteLine($"{indent}{marker} {text}");
}

How ATX Heading Extraction Works

  1. MarkdownParser.ParseFile() builds a MarkdownSyntaxTree, which represents the full Markdown Abstract Syntax Tree (AST).
  2. Traversal begins at syntaxTree.FirstChild and continues through NextSibling. The Markdown API does not support indexed access to NodeList, so pointer-based traversal is required.
  3. AtxHeadingSyntaxNode identifies headings defined with leading # characters. (Setext-style headings are represented by SetextHeadingSyntaxNode and are not handled in this example.)
  4. The heading level is determined by counting leading # tokens via GetLeadingTrivia(). A fallback counts # characters directly from the node string.
  5. The heading text is reconstructed from TextSyntaxNode and WhitespaceSyntaxNode children. Structural nodes such as SoftBreakSyntaxNode are intentionally ignored.
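The example above skips Setext-style headings. One possible way to cover them is sketched below. It rests on two assumptions that are not confirmed by this article: that SetextHeadingSyntaxNode lives in the same namespace as AtxHeadingSyntaxNode, and that ToString() returns the node's source text including the underline. SetextLevel is a hypothetical helper; the mapping it uses ('=' for H1, '-' for H2) follows the Markdown syntax rules.

```csharp
using System;
using System.Linq;
using Aspose.Html.Toolkit.Markdown.Syntax;
using Aspose.Html.Toolkit.Markdown.Syntax.Parser;

var syntaxTree = new MarkdownParser().ParseFile("document.md");

for (var node = syntaxTree.FirstChild; node != null; node = node.NextSibling)
{
    if (node is SetextHeadingSyntaxNode setext)
    {
        // Assumption: ToString() yields the heading text followed by its underline
        string raw = setext.ToString();
        string firstLine = raw.Split('\n')[0].Trim();
        Console.WriteLine($"H{SetextLevel(raw)}: {firstLine}");
    }
}

// Hypothetical helper: a '=' underline marks H1, a '-' underline marks H2.
// Returns 0 when the last non-empty line is not a Setext underline.
static int SetextLevel(string raw)
{
    string underline = raw.Split('\n')
                          .Select(line => line.Trim())
                          .LastOrDefault(line => line.Length > 0) ?? "";
    if (underline.Length > 0 && underline.All(c => c == '=')) return 1;
    if (underline.Length > 0 && underline.All(c => c == '-')) return 2;
    return 0;
}
```

A Setext heading may span several lines of text, so taking only the first line is a simplification. If ToString() turns out not to include the underline, the level has to come from another source, such as the node's trivia.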

Extract GFM Tables from Markdown in C#

After parsing Markdown into a MarkdownSyntaxTree, you can selectively traverse the Markdown AST to locate specific node types. This example demonstrates how to extract GitHub Flavored Markdown (GFM) tables using a TreeWalker together with a custom MarkdownSyntaxNodeFilter.

Unlike manual top-level traversal, this approach walks the entire syntax tree and returns only nodes that match the specified type.

using System;
using System.IO;
using System.Text;
using Aspose.Html.Toolkit.Markdown.Syntax;
using Aspose.Html.Toolkit.Markdown.Syntax.Parser;
...

// Parse Markdown and extract GFM tables using TreeWalker

// Initialize the parser and build a syntax tree from the .md file
var parser = new MarkdownParser();
var syntaxTree = parser.ParseFile("document.md");

// Create a TreeWalker that visits only TableSyntaxNode nodes
using var tableWalker = syntaxTree.CreateTreeWalker(new TypeFilter<TableSyntaxNode>());

MarkdownSyntaxNode current;
while ((current = tableWalker.NextNode()) != null)
{
    if (current is not TableSyntaxNode table) continue;

    // Iterate rows via FirstChild -> NextSibling
    var row = table.FirstChild;
    while (row != null)
    {
        if (row is TableRowSyntaxNode tableRow)
        {
            var sb = new StringBuilder();
            var cell = tableRow.FirstChild;
            while (cell != null)
            {
                if (cell is TableCellSyntaxNode)
                    sb.Append($"| {cell.ToString().Trim()} ");
                cell = cell.NextSibling;
            }
            // Close the row with a trailing pipe and print it
            if (sb.Length > 0)
                Console.WriteLine(sb.Append('|').ToString());
        }
        row = row.NextSibling;
    }
}

// Custom filter that accepts only nodes of type T
class TypeFilter<T> : MarkdownSyntaxNodeFilter where T : MarkdownSyntaxNode
{
    // 1 = FILTER_ACCEPT, 3 = FILTER_SKIP
    public override short AcceptNode(MarkdownSyntaxNode node)
        => node is T ? (short)1 : (short)3;
}

How Markdown Table Extraction with TreeWalker Works

  1. MarkdownParser.ParseFile() builds a MarkdownSyntaxTree, representing the full Markdown AST.
  2. CreateTreeWalker(MarkdownSyntaxNodeFilter) creates a filtered traversal mechanism that walks the entire tree but returns only nodes accepted by the filter.
  3. The custom TypeFilter<TableSyntaxNode> inherits from MarkdownSyntaxNodeFilter and overrides AcceptNode().
    • 1 (FILTER_ACCEPT) includes the node
    • 3 (FILTER_SKIP) ignores it
  4. NextNode() advances through the Markdown AST and yields only TableSyntaxNode instances.
  5. Rows (TableRowSyntaxNode) and cells (TableCellSyntaxNode) are accessed using pointer-based traversal (FirstChild → NextSibling), since NodeList does not support index access.
  6. cell.ToString().Trim() extracts the textual content of each table cell.

Modify Markdown Syntax Tree in C#

MarkdownSyntaxTree is mutable. You can traverse nodes, modify their content, replace children, and save the updated Markdown document.

The following code modifies Markdown emphasis nodes in a syntax tree using Aspose.HTML for .NET. It replaces every emphasized occurrence of a deprecated product name (Aspose.HTML) with a new one (Aspose.HTML for .NET) – directly via the syntax tree, without using regex or string replacement.

using System.IO;
using Aspose.Html.Toolkit.Markdown.Syntax;
using Aspose.Html.Toolkit.Markdown.Syntax.Parser;
...

string markdownContent =
    "# Release Notes\n\n" +
    "*Aspose.HTML* is a powerful HTML processing library.\n" +
    "Use *Aspose.HTML* to convert, parse, and manipulate HTML documents.\n" +
    "The latest version of *Aspose.HTML* includes Markdown support.";

// Parse markdown
var markdown = new MarkdownParser().Parse(markdownContent);
var factory = markdown.SyntaxFactory;

// Recursively traverse and replace text in Emphasis nodes
void ReplaceInEmphasis(MarkdownSyntaxNode node, string oldText, string newText)
{
    // Check if current node is Emphasis with target text
    if (node is EmphasisSyntaxNode em &&
        em.ToString().Trim('*').Trim() == oldText)
    {
        // Clear existing content
        while (em.FirstChild != null)
            em.RemoveChild(em.FirstChild);
        // Insert new text
        em.AppendChild(factory.Text(newText));
    }
    // Recurse into children
    for (var c = node.FirstChild; c != null; c = c.NextSibling)
        ReplaceInEmphasis(c, oldText, newText);
}

// Run replacement on entire document
ReplaceInEmphasis(markdown, "Aspose.HTML", "Aspose.HTML for .NET");

// Save result to file
// OutputDir is assumed to be a string with the output folder path, defined elsewhere
markdown.Save(Path.Combine(OutputDir, "modified-release-notes.md"));

Quick Reference: Markdown Parsing in C#

How to parse a Markdown file into a syntax tree?

var syntaxTree = new MarkdownParser().ParseFile("document.md");

Builds a MarkdownSyntaxTree – the root of the Markdown AST that represents the full document structure.

How to iterate top-level nodes in the Markdown AST?

var node = syntaxTree.FirstChild;
while (node != null)
{
    /* process node */
    node = node.NextSibling;
}

NodeList returned by ChildNodes() does not support index access. Use pointer-based traversal: FirstChild → NextSibling.

How to check whether a node is a heading?

if (node is AtxHeadingSyntaxNode heading)
{
    /* H1–H6 heading */
}

AtxHeadingSyntaxNode represents ATX-style headings (#, ##, … ######) in the Markdown syntax tree.

How to traverse only specific node types using a filter?

using var walker = syntaxTree.CreateTreeWalker(new TypeFilter<TableSyntaxNode>());

while (walker.NextNode() != null)
{
    /* only TableSyntaxNode nodes */
}

TypeFilter<T> is a custom MarkdownSyntaxNodeFilter that accepts only nodes of type T, enabling selective traversal of the Markdown AST.

FAQ: Markdown Parsing with Aspose.HTML

Q: What is the difference between AtxHeadingSyntaxNode and SetextHeadingSyntaxNode?

AtxHeadingSyntaxNode represents headings defined with leading # characters (# H1, ## H2). SetextHeadingSyntaxNode represents headings underlined with = or - on the next line. Both are valid Markdown heading styles, but they map to different node classes in the syntax tree.

Q: Why does NodeList not support index access like children[0]?

NodeList in the Aspose.HTML Markdown API does not implement IList<T> and has no indexer. The correct traversal pattern is pointer-based: start from FirstChild and advance through siblings using NextSibling until the value is null.

Q: Does TreeWalker traverse the entire document or only direct children?

CreateTreeWalker(syntaxTree.FirstChild) starts traversal from the first child node and visits only its descendants. To traverse the entire document, pass a MarkdownSyntaxNodeFilter directly: CreateTreeWalker(filter) – this starts from the tree root and visits all nodes.
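As a rough illustration of the root-level form, the sketch below walks the whole document and prints every ATX heading, including any that are nested inside other blocks (for example, block quotes), which a top-level FirstChild → NextSibling loop would miss. It uses only calls that appear elsewhere in this article. AtxHeadingFilter is a hypothetical class name, and printing ToString() assumes that it returns the heading's source text.

```csharp
using System;
using Aspose.Html.Toolkit.Markdown.Syntax;
using Aspose.Html.Toolkit.Markdown.Syntax.Parser;

var syntaxTree = new MarkdownParser().ParseFile("document.md");

// Passing only a filter starts the walk at the tree root
using var walker = syntaxTree.CreateTreeWalker(new AtxHeadingFilter());

MarkdownSyntaxNode current;
while ((current = walker.NextNode()) != null)
    Console.WriteLine(current.ToString().Trim());

// Hypothetical filter: accept ATX headings, skip everything else
class AtxHeadingFilter : MarkdownSyntaxNodeFilter
{
    // 1 = FILTER_ACCEPT, 3 = FILTER_SKIP
    public override short AcceptNode(MarkdownSyntaxNode node)
        => node is AtxHeadingSyntaxNode ? (short)1 : (short)3;
}
```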

In this guide, you learned how to:

  • Parse Markdown
  • Traverse the AST
  • Extract structured elements
  • Modify content programmatically