Extract Content Between Nodes in a Document

When working with documents, it is important to be able to easily extract content from a specific range within a document. However, the content may consist of complex elements such as paragraphs, tables, images, etc.

Regardless of what content needs to be extracted, the extraction method is always determined by the nodes chosen as the start and end of the range. These can be entire text bodies or simple text runs.

There are many possible situations and therefore many different node types to consider when extracting content. For example, you might want to extract content between:

Two specific paragraphs
Specific runs of text
Fields of various types, such as merge fields
Start and end ranges of a bookmark or comment
Various bodies of text contained in separate sections

In some situations, you may even need to combine different node types, such as extracting content between a paragraph and a field, or between a run and a bookmark.

This article provides the code implementation for extracting text between different nodes, as well as examples of common scenarios.

These examples are just a few demonstrations of the many possibilities. We plan for the text extraction functionality to be part of the public API in the future, and no extra code will be required. In the meantime, feel free to post your requests regarding this functionality on the Aspose.Words forum.

Why Extract Content

Often the goal of extracting the content is to duplicate or save it separately in a new document. For example, you can extract content and:

Copy it into a separate document
Convert a specific part of a document to PDF or image
Duplicate the content in the document many times
Work with extracted content separate from the rest of the document

This can be easily achieved using Aspose.Words and the code implementation below.

Extracting Content Algorithm

The code in this section addresses all of the possible situations described above with one generalized and reusable method. The general outline of this technique involves:

Gathering the nodes which dictate the area of content that will be extracted from your document. Retrieving these nodes is handled by the user in their code, based on what they want to be extracted.
Passing these nodes to the ExtractContent method provided below. You must also pass a boolean parameter which states whether these nodes, acting as markers, should be included in the extraction or not.
Retrieving a list of cloned content (copied nodes) specified to be extracted. You can use this list of nodes in any applicable way, for example, creating a new document containing only the selected content.

How to Extract Content

We will work with the document below in this article. As you can see it contains a variety of content. Also note, the document contains a second section beginning in the middle of the first page. A bookmark and comment are also present in the document but are not visible in the screenshot below.

extract-content-aspose-words-java

To extract the content from your document you need to call the ExtractContent method below and pass the appropriate parameters.

The underlying basis of this method involves finding block-level nodes (paragraphs and tables) and cloning them to create identical copies. If the marker nodes passed are block-level then the method is able to simply copy the content on that level and add it to the array.

However, if the marker nodes are inline (a child of a paragraph), the situation becomes more complex, as it is necessary to split the paragraph at the inline node, be it a run, a bookmark, a field, etc. The parent paragraph is cloned, and any content in the clone that does not lie between the markers is removed. This ensures that the inline nodes still retain the formatting of the parent paragraph.
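
The clone-and-trim idea for inline markers can be illustrated without any Aspose.Words types. The sketch below is only a simplified model, not the library's implementation: ParagraphSketch, its alignment field and cloneBetween are hypothetical names, and a paragraph is reduced to one formatting value plus a list of run texts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a paragraph: one formatting value plus inline run texts.
class ParagraphSketch {
    final String alignment;
    final List<String> runs;

    ParagraphSketch(String alignment, List<String> runs) {
        this.alignment = alignment;
        this.runs = new ArrayList<>(runs); // defensive copy, so clones are independent
    }

    // Clones the paragraph, then drops every run outside [startRun, endRun].
    ParagraphSketch cloneBetween(int startRun, int endRun) {
        ParagraphSketch clone = new ParagraphSketch(alignment, runs);
        // Remove the trailing runs first so that the leading indices stay valid.
        clone.runs.subList(endRun + 1, clone.runs.size()).clear();
        clone.runs.subList(0, startRun).clear();
        return clone;
    }
}
```

Because the trimmed copy is still a paragraph, it carries the paragraph-level formatting (here, alignment) along with the surviving runs, and the source paragraph is left unchanged.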

The method also runs checks on the nodes passed as parameters and throws an exception if either node is invalid. The parameters to be passed to this method are:

StartNode and EndNode. The first two parameters are the nodes which define where the extraction of the content begins and ends, respectively. These nodes can be either block-level (Paragraph, Table) or inline (e.g. Run, FieldStart, BookmarkStart):
1. To pass a field you should pass the corresponding FieldStart object.
2. To pass bookmarks, the BookmarkStart and BookmarkEnd nodes should be passed.
3. To pass comments, the CommentRangeStart and CommentRangeEnd nodes should be used.
IsInclusive. Defines if the markers are included in the extraction or not. If this option is set to false and the same node or consecutive nodes are passed, then an empty list will be returned:
1. If a FieldStart node is passed then this option defines if the whole field is to be included or excluded.
2. If a BookmarkStart or BookmarkEnd node is passed, this option defines if the bookmark is included or just the content between the bookmark range.
3. If a CommentRangeStart or CommentRangeEnd node is passed, this option defines if the comment itself is to be included or just the content in the comment range.
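
The marker validation and the effect of IsInclusive can be sketched on a simplified model. The code below is an illustration under stated assumptions, not the real ExtractContent method: MarkerRangeSketch and extract are hypothetical names, the document is reduced to a flat list of node labels, and markers are given as indices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a "document" is a flat list of node labels.
class MarkerRangeSketch {

    // Returns copies of the nodes between the two markers.
    // isInclusive decides whether the markers themselves are part of the result.
    static List<String> extract(List<String> nodes, int startIndex, int endIndex, boolean isInclusive) {
        // Reject invalid markers, mirroring the checks described above.
        if (startIndex < 0 || endIndex >= nodes.size())
            throw new IllegalArgumentException("Marker is not part of the document");
        if (endIndex < startIndex)
            throw new IllegalArgumentException("The end marker must not precede the start marker");

        int from = isInclusive ? startIndex : startIndex + 1;
        int to = isInclusive ? endIndex : endIndex - 1;

        List<String> result = new ArrayList<>();
        for (int i = from; i <= to; i++)
            result.add(nodes.get(i));
        return result;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("P1", "P2", "Table", "P3");
        System.out.println(extract(doc, 1, 3, true));  // prints [P2, Table, P3]
        System.out.println(extract(doc, 1, 3, false)); // prints [Table]
        System.out.println(extract(doc, 1, 2, false)); // consecutive markers: prints []
    }
}
```

With IsInclusive set to false, passing the same node twice or two consecutive nodes leaves nothing between the markers, which is why an empty list is returned in those cases.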

You can find the implementation of the ExtractContent method in the Aspose.Words for Java GitHub repository. This method will be referred to in the scenarios in this article.

We will also define a custom method to easily generate a document from extracted nodes. This method is used in many of the scenarios below and simply creates a new document and imports the extracted content into it.

The following code example shows how to take a list of nodes and insert them into a new document:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	public static Document generateDocument(Document srcDoc, ArrayList<Node> nodes) throws Exception
	{
		Document dstDoc = new Document();
		// Remove the first paragraph from the empty document.
		dstDoc.getFirstSection().getBody().removeAllChildren();

		// Import each node from the list into the new document. Keep the original formatting of the node.
		NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
		for (Node node : nodes)
		{
			Node importNode = importer.importNode(node, true);
			dstDoc.getFirstSection().getBody().appendChild(importNode);
		}

		return dstDoc;
	}

Extract Content Between Paragraphs

This demonstrates how to use the method above to extract content between specific paragraphs. In this case, we want to extract the body of the letter found in the first half of the document. We can tell that this is between the 7th and 11th paragraphs.

The code below accomplishes this task. The appropriate paragraphs are retrieved using the getChild method on the body of the first section, passing zero-based indices (6 and 10 for the 7th and 11th paragraphs). We then pass these nodes to the ExtractContent method and state that they are to be included in the extraction. The method returns the copied content between these nodes, which is then inserted into a new document.

The following code example shows how to extract the content between specific paragraphs using the ExtractContent method above:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	Paragraph startPara = (Paragraph) doc.getFirstSection().getBody().getChild(NodeType.PARAGRAPH, 6, true);
	Paragraph endPara = (Paragraph) doc.getFirstSection().getBody().getChild(NodeType.PARAGRAPH, 10, true);
	// Extract the content between these nodes in the document. Include these markers in the extraction.
	ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(startPara, endPara, true);

	Document dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodes);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenParagraphs.docx");

You can download the sample file of this example from Aspose.Words GitHub.

The output document contains the two paragraphs that were extracted.

extract-content-result-aspose-words-java

Extract Content Between Different Types of Nodes

We can extract content between any combination of block-level or inline nodes. In the scenario below we extract the content between a paragraph and the table in the second section, inclusively. We get the marker nodes by calling the getChild method on the last (second) section of the document to retrieve the appropriate Paragraph and Table nodes. For a slight variation, let’s duplicate the content and insert it below the original instead of creating a new document.

The following code example shows how to extract the content between a paragraph and table using the ExtractContent method:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	Paragraph startPara = (Paragraph) doc.getLastSection().getChild(NodeType.PARAGRAPH, 2, true);
	Table endTable = (Table) doc.getLastSection().getChild(NodeType.TABLE, 0, true);
	// Extract the content between these nodes in the document. Include these markers in the extraction.
	ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(startPara, endTable, true);

	// Reverse the list so that inserting each node directly after the table preserves the original order.
	Collections.reverse(extractedNodes);
	for (Node extractedNode : extractedNodes)
	{
		// Each node is inserted right after the table, pushing the previously inserted nodes down.
		endTable.getParentNode().insertAfter(extractedNode, endTable);
	}

	doc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenBlockLevelNodes.docx");

The content between the paragraph and the table has been duplicated. Below is the result.

extract-content-between-paragraphs-aspose-words-java
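
The reason for reversing the list before inserting can be shown with plain Java collections. This is a minimal sketch on a list of strings, not Aspose.Words code: InsertAfterSketch and insertAllAfter are hypothetical names that mimic repeated insertAfter(node, anchor) calls against the same anchor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class InsertAfterSketch {
    // Inserts every item directly after the anchor, one at a time.
    static List<String> insertAllAfter(List<String> body, String anchor, List<String> items) {
        List<String> result = new ArrayList<>(body);
        int anchorIndex = result.indexOf(anchor);
        if (anchorIndex < 0)
            throw new IllegalArgumentException("Anchor not found: " + anchor);
        for (String item : items)
            result.add(anchorIndex + 1, item); // always lands right after the anchor
        return result;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("Intro", "Table", "Outro");
        List<String> extracted = new ArrayList<>(Arrays.asList("A", "B", "C"));

        // Without reversing, the copies end up in reverse order.
        System.out.println(insertAllAfter(body, "Table", extracted));
        // prints [Intro, Table, C, B, A, Outro]

        Collections.reverse(extracted);
        System.out.println(insertAllAfter(body, "Table", extracted));
        // prints [Intro, Table, A, B, C, Outro]
    }
}
```

Each insertion pushes the earlier ones further from the anchor, so feeding the nodes in reverse restores their original order.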

Extract Content Between Paragraphs Based on Style

You may need to extract the content between paragraphs of the same or different style, such as between paragraphs marked with heading styles.

The code below shows how to achieve this. It is a simple example which extracts the content between the first instances of the “Heading 1” and “Heading 3” styles without extracting the headings themselves. To do this we set the last parameter to false, which specifies that the marker nodes should not be included.

In a proper implementation, this should be run in a loop to extract content between all paragraphs of these styles from the document. The extracted content is copied into a new document.
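
One possible shape for such a loop is sketched below on a simplified model. It is an assumption about how the pairing could work, not code from the library: HeadingPairsSketch and findPairs are hypothetical names, and the document is reduced to the list of its paragraphs' style names in document order.

```java
import java.util.ArrayList;
import java.util.List;

class HeadingPairsSketch {
    // Returns {startIndex, endIndex} pairs: each startStyle paragraph is paired
    // with the next endStyle paragraph that follows it.
    static List<int[]> findPairs(List<String> styles, String startStyle, String endStyle) {
        List<int[]> pairs = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < styles.size(); i++) {
            String style = styles.get(i);
            if (style.equals(startStyle)) {
                start = i; // remember the most recent start marker
            } else if (style.equals(endStyle) && start >= 0) {
                pairs.add(new int[] {start, i});
                start = -1; // wait for the next start marker
            }
        }
        return pairs;
    }
}
```

In real code, each pair of indices would identify two paragraphs to pass to the ExtractContent method with the last parameter set to false, and the returned nodes would be appended to the output document.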

The following code example shows how to extract content between paragraphs with specific styles using the ExtractContent method:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	// Gather a list of the paragraphs using the respective heading styles.
	ArrayList<Paragraph> parasStyleHeading1 = paragraphsByStyleName(doc, "Heading 1");
	ArrayList<Paragraph> parasStyleHeading3 = paragraphsByStyleName(doc, "Heading 3");

	// Use the first instance of the paragraphs with those styles.
	Node startPara1 = parasStyleHeading1.get(0);
	Node endPara1 = parasStyleHeading3.get(0);

	// Extract the content between these nodes in the document. Don't include these markers in the extraction.
	ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(startPara1, endPara1, false);

	Document dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodes);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenParagraphStyles.docx");

The paragraphsByStyleName helper used above is defined as follows:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	public static ArrayList<Paragraph> paragraphsByStyleName(Document doc, String styleName)
	{
		// Create a list to collect paragraphs of the specified style.
		ArrayList<Paragraph> paragraphsWithStyle = new ArrayList<Paragraph>();

		NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);

		// Look through all paragraphs to find those with the specified style.
		for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
		{
			if (paragraph.getParagraphFormat().getStyle().getName().equals(styleName))
				paragraphsWithStyle.add(paragraph);
		}

		return paragraphsWithStyle;
	}

Below is the result of the previous operation.

extract-content-between-paragraph-style-aspose-words-java

Extract Content Between Specific Runs

You can extract content between inline nodes such as a Run as well. Runs from different paragraphs can be passed as markers. The code below shows how to extract specific text between two runs of the same Paragraph node.

The following code example shows how to extract content between specific runs of the same paragraph using the ExtractContent method:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	Paragraph para = (Paragraph) doc.getChild(NodeType.PARAGRAPH, 7, true);
	Run startRun = para.getRuns().get(1);
	Run endRun = para.getRuns().get(4);

	// Extract the content between these nodes in the document. Include these markers in the extraction.
	ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(startRun, endRun, true);
	for (Node extractedNode : extractedNodes)
		System.out.println(extractedNode.toString(SaveFormat.TEXT));

The extracted text is displayed on the console.

extract-content-between-runs-aspose-words-java

Extract Content using a Field

To use a field as a marker, the FieldStart node should be passed. The last parameter to the ExtractContent method defines whether the entire field is included or not. Let’s extract the content between the “Fullname” merge field and a paragraph in the document. We use the moveToMergeField method of the DocumentBuilder class, which moves the builder's cursor to the merge field with the given name; the FieldStart node is then obtained from the builder's current node.

In our case let’s set the last parameter passed to the ExtractContent method to false to exclude the field from the extraction. We then save the extracted content to a new document.

The following code example shows how to extract content between a specific field and paragraph in the document using the ExtractContent method:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");
	DocumentBuilder builder = new DocumentBuilder(doc);
	// Pass false as the first boolean parameter so that the DocumentBuilder moves to the FieldStart of the field.
	// We could also get the FieldStart of a field using the getChild method as in the other examples.
	builder.moveToMergeField("Fullname", false, false);

	// The builder cursor should be positioned at the start of the field.
	FieldStart startField = (FieldStart) builder.getCurrentNode();
	Paragraph endPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 5, true);
	// Extract the content between these nodes in the document. Don't include these markers in the extraction.
	ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(startField, endPara, false);

	Document dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodes);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentUsingField.docx");

The output contains the content between the field and the paragraph, without the field and paragraph marker nodes. To render the result to PDF instead, pass a file name with the .pdf extension to the save method.

extract-content-using-field-aspose-words-java

Extract Content from a Bookmark

In a document, the content that is defined within a bookmark is encapsulated by the BookmarkStart and BookmarkEnd nodes. The content found between these two nodes makes up the bookmark. You can pass either of these nodes as either marker, even nodes from different bookmarks, as long as the starting marker appears before the ending marker in the document.

In our sample document, we have one bookmark, named “Bookmark1”. The content of this bookmark is highlighted in our document:

extract-content-from-bookmark-aspose-words-java-1

We will extract this content into a new document using the code below. The IsInclusive parameter determines whether the bookmark is retained or discarded.

The following code example shows how to extract the content referenced by a bookmark using the ExtractContent method:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	Bookmark bookmark = doc.getRange().getBookmarks().get("Bookmark1");
	BookmarkStart bookmarkStart = bookmark.getBookmarkStart();
	BookmarkEnd bookmarkEnd = bookmark.getBookmarkEnd();

	// Firstly, extract the content between these nodes, including the bookmark.
	ArrayList<Node> extractedNodesInclusive = ExtractContentHelper.extractContent(bookmarkStart, bookmarkEnd, true);

	Document dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodesInclusive);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenBookmark.IncludingBookmark.docx");

	// Secondly, extract the content between these nodes this time without including the bookmark.
	ArrayList<Node> extractedNodesExclusive = ExtractContentHelper.extractContent(bookmarkStart, bookmarkEnd, false);

	dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodesExclusive);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenBookmark.WithoutBookmark.docx");

Below is the extracted output with the IsInclusive parameter set to true. The copy retains the bookmark as well.

extract-content-from-bookmark-aspose-words-java-2

Below is the extracted output with the IsInclusive parameter set to false. The copy contains the content but not the bookmark.

extract-content-from-bookmark-aspose-words-java-3

Extract Content from a Comment

A comment is made up of the CommentRangeStart, CommentRangeEnd and Comment nodes. All of these nodes are inline. The first two nodes encapsulate the content in the document which is referenced by the comment, as seen in the screenshot below.

The Comment node itself is an InlineStory that can contain paragraphs and runs. It represents the message of the comment, shown as a comment bubble in the review pane. Because this node is inline and a descendant of a body, you can also extract content from inside the message.

In our document we have one comment. Let’s display it by showing markup in the Review tab:

extract-content-from-comment-aspose-words-java-1

The comment encapsulates the heading, first paragraph and the table in the second section. Let’s extract the commented content into a new document. The IsInclusive option dictates whether the comment itself is kept or discarded.

The following code example shows how to do this:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	CommentRangeStart commentStart = (CommentRangeStart) doc.getChild(NodeType.COMMENT_RANGE_START, 0, true);
	CommentRangeEnd commentEnd = (CommentRangeEnd) doc.getChild(NodeType.COMMENT_RANGE_END, 0, true);

	// Firstly, extract the content between these nodes including the comment as well.
	ArrayList<Node> extractedNodesInclusive = ExtractContentHelper.extractContent(commentStart, commentEnd, true);

	Document dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodesInclusive);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenCommentRange.IncludingComment.docx");

	// Secondly, extract the content between these nodes without the comment.
	ArrayList<Node> extractedNodesExclusive = ExtractContentHelper.extractContent(commentStart, commentEnd, false);

	dstDoc = ExtractContentHelper.generateDocument(doc, extractedNodesExclusive);
	dstDoc.save(getArtifactsDir() + "ExtractContent.ExtractContentBetweenCommentRange.WithoutComment.docx");

First, the extracted output with the IsInclusive parameter set to true is shown below. The copy contains the comment as well.

extract-content-from-comment-aspose-words-java-2

Second, the extracted output with the IsInclusive parameter set to false is shown below. The copy contains the content but not the comment.

extract-content-from-comment-aspose-words-java-12

Extract Content using DocumentVisitor

Aspose.Words can be used not only for creating Microsoft Word documents by building them dynamically or merging templates with data, but also for parsing documents in order to extract separate document elements such as headers, footers, paragraphs, tables, images, and others. Another possible task is to find all text of specific formatting or style.

Use the DocumentVisitor class to implement this usage scenario. This class corresponds to the well-known Visitor design pattern. With DocumentVisitor, you can define and execute custom operations that require enumeration over the document tree.

DocumentVisitor provides a set of VisitXXX methods that are invoked when a particular document element (node) is encountered. For example, VisitParagraphStart is called when the beginning of a text paragraph is found and VisitParagraphEnd is called when the end of a text paragraph is found. Each DocumentVisitor.VisitXXX method accepts the corresponding object that it encounters so you can use it as needed (say retrieve the formatting), e.g. both VisitParagraphStart and VisitParagraphEnd accept a Paragraph object.

Each DocumentVisitor.VisitXXX method returns a VisitorAction value that controls the enumeration of nodes. You can request either to continue the enumeration, skip the current node (but continue the enumeration), or stop the enumeration of nodes.
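
How the three return values steer a traversal can be shown on a tiny stand-alone model. The sketch below does not use Aspose.Words: VisitorActionSketch, its Node and Visitor types and the Action enum are hypothetical names chosen for illustration (in Aspose.Words for Java the VisitorAction values are int constants).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VisitorActionSketch {
    enum Action { CONTINUE, SKIP_THIS_NODE, STOP }

    // Hypothetical composite node: a name plus child nodes.
    static class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    interface Visitor {
        Action visitStart(Node node);
    }

    // Depth-first enumeration. Returns false once the visitor asked to stop.
    static boolean accept(Node node, Visitor visitor) {
        Action action = visitor.visitStart(node);
        if (action == Action.STOP) return false;
        if (action == Action.SKIP_THIS_NODE) return true; // children are not visited
        for (Node child : node.children)
            if (!accept(child, visitor)) return false;
        return true;
    }

    // Records visited names, skips the header subtree, and stops at "Para2".
    static List<String> demo() {
        Node doc = new Node("Document",
                new Node("Header", new Node("HeaderRun")),
                new Node("Body", new Node("Para1"), new Node("Para2"), new Node("Para3")));
        List<String> visited = new ArrayList<>();
        accept(doc, node -> {
            visited.add(node.name);
            if (node.name.equals("Header")) return Action.SKIP_THIS_NODE;
            if (node.name.equals("Para2")) return Action.STOP;
            return Action.CONTINUE;
        });
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Document, Header, Body, Para1, Para2]
    }
}
```

Skipping prunes only the current subtree (HeaderRun is never seen), while stopping ends the whole enumeration (Para3 is never seen).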

These are the steps you should follow to programmatically determine and extract various parts of a document:

Create a class derived from DocumentVisitor.
Override and provide implementations for some or all of the DocumentVisitor.VisitXXX methods to perform some custom operations.
Call Node.accept on the node from which you want to start the enumeration. For example, if you want to enumerate the whole document, call Document.accept(DocumentVisitor).

DocumentVisitor provides default implementations for all of the DocumentVisitor.VisitXXX methods. This makes it easier to create new document visitors as only the methods required for the particular visitor need to be overridden. It is not necessary to override all of the visitor methods.

The following example shows how to use the Visitor pattern to add new operations to the Aspose.Words object model. In this case, we create a simple converter from a document to plain text:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Extract content.docx");

	ConvertDocToTxt convertToPlainText = new ConvertDocToTxt();
	// Note that every node in the object model has the accept method so the visiting
	// can be executed not only for the whole document, but for any node in the document.
	doc.accept(convertToPlainText);

	// Once the visiting is complete, we can retrieve the result of the operation,
	// which in this example has accumulated in the visitor.
	System.out.println(convertToPlainText.getText());

The ConvertDocToTxt visitor used above is implemented as follows:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	/**
	 * Simple implementation of saving a document in the plain text format. Implemented as a Visitor.
	 */
	static class ConvertDocToTxt extends DocumentVisitor {
		public ConvertDocToTxt() {
			mIsSkipText = false;
			mBuilder = new StringBuilder();
		}

		/**
		 * Gets the plain text of the document that was accumulated by the visitor.
		 */
		public String getText() {
			return mBuilder.toString();
		}

		/**
		 * Called when a Run node is encountered in the document.
		 */
		public int visitRun(Run run) {
			appendText(run.getText());
			// Let the visitor continue visiting other nodes.
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when a FieldStart node is encountered in the document.
		 */
		public int visitFieldStart(FieldStart fieldStart) {
			// In Microsoft Word, a field code (such as "MERGEFIELD FieldName") follows
			// after a field start character. We want to skip field codes and output the field
			// result only, therefore we use a flag to suspend the output while inside a field code.
			// Note this is a very simplistic implementation and will not work very well
			// if you have nested fields in a document.
			mIsSkipText = true;
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when a FieldSeparator node is encountered in the document.
		 */
		public int visitFieldSeparator(FieldSeparator fieldSeparator) {
			// Once we reach a field separator node, we enable the output because we are
			// now entering the field result nodes.
			mIsSkipText = false;
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when a FieldEnd node is encountered in the document.
		 */
		public int visitFieldEnd(FieldEnd fieldEnd) {
			// Make sure we enable the output when we reach a field end because some fields
			// do not have a field separator and do not have a field result.
			mIsSkipText = false;
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when visiting of a Paragraph node is ended in the document.
		 */
		public int visitParagraphEnd(Paragraph paragraph) {
			// When outputting to plain text we output Cr+Lf characters.
			appendText(ControlChar.CR_LF);
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when visiting of a Body node is started in the document.
		 */
		public int visitBodyStart(Body body) {
			// We can detect the beginning and end of all composite nodes such as Section, Body,
			// Table, Paragraph etc. and provide custom handling for them.
			mBuilder.append("* Body Started *\r\n");
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when visiting of a Body node is ended in the document.
		 */
		public int visitBodyEnd(Body body) {
			mBuilder.append("* Body Ended *\r\n");
			return VisitorAction.CONTINUE;
		}

		/**
		 * Called when a HeaderFooter node is encountered in the document.
		 */
		public int visitHeaderFooterStart(HeaderFooter headerFooter) {
			// Returning this value from a visitor method causes visiting of this
			// node to stop and move on to visiting the next sibling node.
			// The net effect in this example is that the text of headers and footers
			// is not included in the resulting output.
			return VisitorAction.SKIP_THIS_NODE;
		}

		/**
		 * Adds text to the current output. Honors the enabled/disabled output flag.
		 */
		private void appendText(String text) {
			if (!mIsSkipText)
				mBuilder.append(text);
		}

		private StringBuilder mBuilder;
		private boolean mIsSkipText;
	}

Extract Text Only

The ways to retrieve text from the document are:

Use Document.save with SaveFormat to save as plain text into a file or stream
Use Node.toString and pass the SaveFormat.Text parameter. Internally, this invokes save as text into a memory stream and returns the resulting string
Use Node.getText to retrieve text with all Microsoft Word control characters including field codes
Implement a custom DocumentVisitor to perform customized extraction

Using `Node.GetText` and `Node.ToString`

A Word document can contain control characters that designate special elements such as a field, end of cell, end of section, etc. The full list of possible Word control characters is defined in the ControlChar class. The GetText method returns text with all of the control characters present in the node.

Calling ToString returns only the plain text representation of the document, without control characters. For further information on exporting as plain text see Using SaveFormat.Text.
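
To make the difference concrete, the sketch below filters a raw string of the kind GetText could return. It is a rough stand-alone illustration, not what the library does internally: FieldTextSketch and stripFieldCodes are hypothetical names, and the three constants are assumed here to match the field start, field separator and field end control characters (0x13, 0x14 and 0x15) used by Word. Like the simple visitor shown earlier, it does not handle nested fields.

```java
class FieldTextSketch {
    // Assumed values of the field control characters.
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    // Drops field codes and field control characters, keeping only field results.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        boolean insideFieldCode = false;
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                insideFieldCode = true;   // the field code follows the start character
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                insideFieldCode = false;  // the field result (or normal text) follows
            } else if (!insideFieldCode) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Dear \u0013MERGEFIELD Name\u0014John\u0015!";
        System.out.println(stripFieldCodes(raw)); // prints Dear John!
    }
}
```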

The following code example shows the difference between calling the GetText and ToString methods on a node:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document();
	DocumentBuilder builder = new DocumentBuilder(doc);

	builder.insertField("MERGEFIELD Field");

	// getText retrieves the visible text as well as field codes and special characters.
	System.out.println("GetText() Result: " + doc.getText());

	// When converted to text, field codes and special characters are not retrieved,
	// but the result still contains some natural formatting characters such as paragraph markers.
	// This is the same as "viewing" the document as if it was opened in a text editor.
	System.out.println("ToString() Result: " + doc.toString(SaveFormat.TEXT));

Using `SaveFormat.Text`

This example saves the document as follows:

Filters out field characters and field codes, shape, footnote, endnote and comment references
Replaces end of paragraph ControlChar.Cr characters with ControlChar.CrLf combinations
Uses UTF8 encoding
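
The last two points of this list can be mimicked with standard Java alone. The sketch below is only an illustration of the described transformation, not the exporter itself: PlainTextSketch and its methods are hypothetical names, and the input is assumed to contain bare CR paragraph breaks with no existing CR+LF pairs.

```java
import java.nio.charset.StandardCharsets;

class PlainTextSketch {
    // Replaces each paragraph-ending CR with a CR+LF combination.
    // Assumes the input has no CR+LF pairs yet; otherwise the LF would be doubled.
    static String normalizeParagraphBreaks(String text) {
        return text.replace("\r", "\r\n");
    }

    // Encodes the text as UTF-8 bytes, ready to be written to a file or stream.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, normalizeParagraphBreaks("First\rSecond\r") yields "First\r\nSecond\r\n", and toUtf8 encodes a non-ASCII character such as "\u00e9" as two bytes.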

The following code example shows how to save a document in TXT format:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Document.docx");
	doc.save(getArtifactsDir() + "BaseConversions.DocxToTxt.txt");

Extract Images from Shapes

You may need to extract document images to perform some tasks. Aspose.Words allows you to do this as well.

The following code example shows how to extract images from a document:

	// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java.git.
	Document doc = new Document(getMyDir() + "Images.docx");

	NodeCollection shapes = doc.getChildNodes(NodeType.SHAPE, true);
	int imageIndex = 0;

	for (Shape shape : (Iterable<Shape>) shapes) {
		if (shape.hasImage()) {
			String imageFileName =
				MessageFormat.format("Image.ExportImages.{0}_{1}", imageIndex, FileFormatUtil.imageTypeToExtension(shape.getImageData().getImageType()));

			// Note, if you have only an image (not a shape with a text and the image),
			// you can use shape.getShapeRenderer().save(...) method to save the image.
			shape.getImageData().save(getArtifactsDir() + imageFileName);
			imageIndex++;
		}
	}
