When working with documents, it is important to be able to easily extract content from a specific range within a document. However, the content may consist of complex elements such as paragraphs, tables, images, etc.
Regardless of what content needs to be extracted, the method to extract that content will always be determined by which nodes are selected to extract content between. These can be entire text bodies or simple text runs.
There are many possible situations and therefore many different node types to consider when extracting content. For example, you might want to extract content between:
In some situations, you may even need to combine different node types, such as extracting content between a paragraph and a field, or between a run and a bookmark.
This article provides the code implementation for extracting text between different nodes, as well as examples of common scenarios.
Often the goal of extracting the content is to duplicate or save it separately in a new document. For example, you can extract content and:
This can be easily achieved using Aspose.Words and the code implementation below.
The code in this section addresses all of the possible situations described above with one generalized and reusable method. The general outline of this technique involves:
To extract the content from your document you need to call the extract_content method below and pass the appropriate parameters. The underlying basis of this method involves finding block level nodes (paragraphs and tables) and cloning them to create identical copies. If the marker nodes passed are block level then the method is able to simply copy the content on that level and add it to the array.
However, if the marker nodes are inline (a child of a paragraph), the situation becomes more complex, as it is necessary to split the paragraph at the inline node, be it a run, bookmark, field, etc. Content in the cloned parent nodes that is not present between the markers is removed. This process ensures that the inline nodes still retain the formatting of the parent paragraph. The method also runs checks on the nodes passed as parameters and throws an exception if either node is invalid. The parameters to be passed to this method are the start marker node, the end marker node, and a Boolean flag that specifies whether the marker nodes themselves are included in the extraction.
You can find the implementation of the extract_content method here. This method will be referred to in the scenarios in this article.
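To make the clone-and-trim idea above more concrete, here is a minimal sketch on a toy model. It does not use Aspose.Words at all: the `Paragraph` dataclass, the `(paragraph index, run index)` marker tuples, and the `extract_between_runs` function are all hypothetical names invented for illustration, and the real extract_content method works on document nodes rather than on lists of strings.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Paragraph:
    """Toy stand-in for a block-level node: a style name plus its runs."""
    style: str
    runs: List[str] = field(default_factory=list)


Marker = Tuple[int, int]  # (paragraph index, run index); hypothetical addressing


def extract_between_runs(body: List[Paragraph], start: Marker, end: Marker,
                         is_inclusive: bool) -> List[Paragraph]:
    """Clone every paragraph from start to end and trim the outer clones."""
    for para_index, run_index in (start, end):
        if not 0 <= para_index < len(body):
            raise ValueError("marker paragraph is not in the body")
        if not 0 <= run_index < len(body[para_index].runs):
            raise ValueError("marker run is not in its paragraph")
    if start > end:
        raise ValueError("start marker must not come after end marker")

    (start_para, start_run), (end_para, end_run) = start, end
    extracted = []
    for index in range(start_para, end_para + 1):
        # Clone the block so the source is untouched and the style is kept.
        clone = deepcopy(body[index])
        low, high = 0, len(clone.runs)
        if index == start_para:
            low = start_run if is_inclusive else start_run + 1
        if index == end_para:
            high = end_run + 1 if is_inclusive else end_run
        # Remove the content of the clone that lies outside the markers.
        clone.runs = clone.runs[low:high]
        extracted.append(clone)
    return extracted


body = [
    Paragraph("Normal", ["Dear ", "Sir, "]),
    Paragraph("Quote", ["thank ", "you ", "kindly."]),
    Paragraph("Normal", ["Regards"]),
]
inclusive = extract_between_runs(body, (0, 1), (1, 1), True)
exclusive = extract_between_runs(body, (0, 1), (1, 1), False)
print([p.runs for p in inclusive])
print([p.runs for p in exclusive])
```

Because each clone carries its own copy of the style, the trimmed runs keep the formatting of their parent, which is the reason the technique clones whole blocks instead of copying the inline nodes on their own.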
We will also define a custom method to easily generate a document from extracted nodes. This method is used in many of the scenarios below and simply creates a new document and imports the extracted content into it.
The following code example shows how to take a list of nodes and insert them into a new document:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
@staticmethod
def generate_document(src_doc: aw.Document, nodes):
    dst_doc = aw.Document()
    # Remove the first paragraph from the empty document.
    dst_doc.first_section.body.remove_all_children()
    # Import each node from the list into the new document. Keep the original formatting of the node.
    importer = aw.NodeImporter(src_doc, dst_doc, aw.ImportFormatMode.KEEP_SOURCE_FORMATTING)
    for node in nodes:
        import_node = importer.import_node(node, True)
        dst_doc.first_section.body.append_child(import_node)
    return dst_doc
This demonstrates how to use the method above to extract content between specific paragraphs. In this case, we want to extract the body of the letter found in the first half of the document. We can tell that this is between the 7th and 11th paragraphs.
The code below accomplishes this task. The appropriate paragraphs are retrieved using the CompositeNode.get_child method on the document and passing the specified zero-based indices. We then pass these nodes to the extract_content method and state that they are to be included in the extraction. This method returns the copied content between these nodes, which is then inserted into a new document.
The following code example shows how to extract the content between specific paragraphs using the extract_content method above:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
start_para = doc.first_section.body.get_child(aw.NodeType.PARAGRAPH, 6, True).as_paragraph()
end_para = doc.first_section.body.get_child(aw.NodeType.PARAGRAPH, 10, True).as_paragraph()
# Extract the content between these nodes in the document. Include these markers in the extraction.
extracted_nodes = helper.ExtractContentHelper.extract_content(start_para, end_para, True)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_paragraphs.docx")
We can extract content between any combination of block level or inline nodes. In the scenario below we will extract the content between the first paragraph and the table in the second section, inclusively. We get the marker nodes by calling Body.first_paragraph and the CompositeNode.get_child method on the second section of the document to retrieve the appropriate Paragraph and Table nodes. For a slight variation, let’s instead duplicate the content and insert it below the original.
The following code example shows how to extract the content between a paragraph and table using the extract_content method:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
start_para = doc.last_section.get_child(aw.NodeType.PARAGRAPH, 2, True).as_paragraph()
end_table = doc.last_section.get_child(aw.NodeType.TABLE, 0, True).as_table()
# Extract the content between these nodes in the document. Include these markers in the extraction.
extracted_nodes = helper.ExtractContentHelper.extract_content(start_para, end_table, True)
# Let's reverse the array to make inserting the content back into the document easier.
extracted_nodes.reverse()
for extracted_node in extracted_nodes:
    end_table.parent_node.insert_after(extracted_node, end_table)
doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_block_level_nodes.docx")
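The reason for reversing the extracted nodes before inserting them can be seen with a plain Python list. This is only an illustrative sketch: the `insert_all_after` and `insert_naively_after` helpers are hypothetical, and a list of strings stands in for the children of the parent node. Every item is inserted directly after the same fixed anchor, so the item inserted last ends up closest to the anchor.

```python
def insert_all_after(items, anchor, new_items):
    """Insert new_items right after anchor while preserving their order."""
    for item in reversed(new_items):
        # The anchor never moves, so each item lands directly after it
        # and pushes the previously inserted items further down.
        items.insert(items.index(anchor) + 1, item)
    return items


def insert_naively_after(items, anchor, new_items):
    """Same loop without reversing; shown only for comparison."""
    for item in new_items:
        items.insert(items.index(anchor) + 1, item)
    return items


in_order = insert_all_after(["heading", "table", "footer"], "table", ["copy A", "copy B"])
flipped = insert_naively_after(["heading", "table", "footer"], "table", ["copy A", "copy B"])
print(in_order)  # ['heading', 'table', 'copy A', 'copy B', 'footer']
print(flipped)   # ['heading', 'table', 'copy B', 'copy A', 'footer']
```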
You may need to extract the content between paragraphs of the same or different style, such as between paragraphs marked with heading styles.
The code below shows how to achieve this. It is a simple example which extracts the content between the first instances of the “Heading 1” and “Heading 3” styles without extracting the headings themselves. To do this we set the last parameter to False, which specifies that the marker nodes should not be included.
In a proper implementation this should be run in a loop to extract content between all paragraphs of these styles from the document. The extracted content is copied into a new document.
The following code example shows how to extract content between paragraphs with specific styles using the extract_content method:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
# Gather a list of the paragraphs using the respective heading styles.
paras_style_heading1 = self.paragraphs_by_style_name(doc, "Heading 1")
paras_style_heading3 = self.paragraphs_by_style_name(doc, "Heading 3")
# Use the first instance of the paragraphs with those styles.
start_para1 = paras_style_heading1[0]
end_para1 = paras_style_heading3[0]
# Extract the content between these nodes in the document. Don't include these markers in the extraction.
extracted_nodes = helper.ExtractContentHelper.extract_content(start_para1, end_para1, False)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_paragraph_styles.docx")
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
@staticmethod
def paragraphs_by_style_name(doc: aw.Document, style_name: str):
    # Create an array to collect paragraphs of the specified style.
    paragraphs_with_style = []
    paragraphs = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
    # Look through all paragraphs to find those with the specified style.
    for paragraph in paragraphs:
        paragraph = paragraph.as_paragraph()
        if paragraph.paragraph_format.style.name == style_name:
            paragraphs_with_style.append(paragraph)
    return paragraphs_with_style
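The looping approach mentioned earlier, extracting the content between every pair of styled paragraphs and not only the first, might be sketched as follows. This is a simplified model and not Aspose.Words code: paragraphs are represented as hypothetical `(style name, text)` tuples, and `extract_all_between_styles` is an illustrative name. With real documents, each start and end pair found this way would be passed to extract_content with the last parameter set to False.

```python
def extract_all_between_styles(paragraphs, start_style, end_style):
    """Collect the text between each start-style paragraph and the next end-style one."""
    sections = []
    current = None  # None means we are not inside a start/end range
    for style, text in paragraphs:
        if current is None:
            if style == start_style:
                current = []  # open a new range; the heading itself is excluded
        elif style == end_style:
            sections.append(current)  # close the range; the end heading is excluded
            current = None
        else:
            current.append(text)
    # A range that is opened but never closed is discarded.
    return sections


paragraphs = [
    ("Heading 1", "Intro"),
    ("Normal", "a"),
    ("Normal", "b"),
    ("Heading 3", "Details"),
    ("Normal", "c"),
    ("Heading 1", "Next"),
    ("Normal", "d"),
    ("Heading 3", "More"),
    ("Normal", "e"),
]
sections = extract_all_between_styles(paragraphs, "Heading 1", "Heading 3")
print(sections)  # [['a', 'b'], ['d']]
```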
You can extract content between inline nodes such as a Run as well. Runs from different paragraphs can be passed as markers. The code below shows how to extract specific text from within a single Paragraph node.
The following code example shows how to extract content between specific runs of the same paragraph using the extract_content method:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
para = doc.get_child(aw.NodeType.PARAGRAPH, 7, True).as_paragraph()
start_run = para.runs[1]
end_run = para.runs[4]
# Extract the content between these nodes in the document. Include these markers in the extraction.
extracted_nodes = helper.ExtractContentHelper.extract_content(start_run, end_run, True)
for extracted_node in extracted_nodes:
    print(extracted_node.to_string(aw.SaveFormat.TEXT))
To use a field as a marker, the FieldStart node should be passed. The last parameter to the extract_content method defines whether the entire field is included. Let’s extract the content between the “FullName” merge field and a paragraph in the document. We use the DocumentBuilder.move_to_merge_field method of the DocumentBuilder class. This moves the builder cursor to the merge field with the name passed to it, so that the FieldStart node can then be read from the builder's current node.
In our case let’s set the last parameter passed to the extract_content method to False to exclude the field from the extraction. We will save the extracted content to a new document.
The following code example shows how to extract content between a specific field and paragraph in the document using the extract_content method:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
builder = aw.DocumentBuilder(doc)
# Pass the first boolean parameter to get the DocumentBuilder to move to the FieldStart of the field.
# We could also get FieldStarts of a field using the get_child method as in the other examples.
builder.move_to_merge_field("Fullname", False, False)
# The builder cursor should be positioned at the start of the field.
start_field = builder.current_node.as_field_start()
end_para = doc.first_section.get_child(aw.NodeType.PARAGRAPH, 5, True).as_paragraph()
# Extract the content between these nodes in the document. Don't include these markers in the extraction.
extracted_nodes = helper.ExtractContentHelper.extract_content(start_field, end_para, False)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_using_field.docx")
In a document, the content defined within a bookmark is encapsulated by the BookmarkStart and BookmarkEnd nodes. Content found between these two nodes makes up the bookmark. You can pass either of these nodes as a marker, even ones from different bookmarks, as long as the starting marker appears before the ending marker in the document. We will extract this content into a new document using the code below. The isInclusive parameter determines whether the bookmark is retained or discarded.
The following code example shows how to extract the content referenced by a bookmark using the extract_content method:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
bookmark = doc.range.bookmarks.get_by_name("Bookmark1")
bookmark_start = bookmark.bookmark_start
bookmark_end = bookmark.bookmark_end
# Firstly, extract the content between these nodes, including the bookmark.
extracted_nodes_inclusive = helper.ExtractContentHelper.extract_content(bookmark_start, bookmark_end, True)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes_inclusive)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_bookmark.including_bookmark.docx")
# Secondly, extract the content between these nodes this time without including the bookmark.
extracted_nodes_exclusive = helper.ExtractContentHelper.extract_content(bookmark_start, bookmark_end, False)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes_exclusive)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_bookmark.without_bookmark.docx")
A comment is made up of the CommentRangeStart, CommentRangeEnd and Comment nodes. All of these nodes are inline. The first two nodes encapsulate the content in the document which is referenced by the comment, as seen in the screenshot below. The Comment node itself is an InlineStory that can contain paragraphs and runs. It represents the message of the comment as seen as a comment bubble in the review pane. As this node is inline and a descendant of a body you can also extract the content from inside this message as well.
The comment encapsulates the heading, first paragraph and the table in the second section. Let’s extract this comment into a new document. The isInclusive option dictates if the comment itself is kept or discarded.
The following code example shows how to do this:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Extract content.docx")
comment_start = doc.get_child(aw.NodeType.COMMENT_RANGE_START, 0, True).as_comment_range_start()
comment_end = doc.get_child(aw.NodeType.COMMENT_RANGE_END, 0, True).as_comment_range_end()
# Firstly, extract the content between these nodes including the comment as well.
extracted_nodes_inclusive = helper.ExtractContentHelper.extract_content(comment_start, comment_end, True)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes_inclusive)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_comment_range.including_comment.docx")
# Secondly, extract the content between these nodes without the comment.
extracted_nodes_exclusive = helper.ExtractContentHelper.extract_content(comment_start, comment_end, False)
dst_doc = helper.ExtractContentHelper.generate_document(doc, extracted_nodes_exclusive)
dst_doc.save(ARTIFACTS_DIR + "ExtractContent.extract_content_between_comment_range.without_comment.docx")
The ways to retrieve text from the document are:
A Word document can contain control characters that designate special elements such as a field, end of cell, end of section, etc. The full list of possible Word control characters is defined in the ControlChar class. The Node.get_text method returns text with all of the control characters present in the node.
Calling to_string returns the plain text representation of the document only without control characters. For further information on exporting as plain text see Using SaveFormat.TEXT.
The following code example shows the difference between calling the get_text and to_string methods on a node:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document()
builder = aw.DocumentBuilder(doc)
builder.insert_field("MERGEFIELD Field")
# get_text will retrieve the text together with field codes and control characters.
print("GetText() Result: " + doc.get_text())
# When converted to text it will not retrieve fields code or special characters,
# but will still contain some natural formatting characters such as paragraph markers etc.
# This is the same as "viewing" the document as if it was opened in a text editor.
print("ToString() Result: " + doc.to_string(aw.SaveFormat.TEXT))
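To illustrate what the control characters look like in practice, the sketch below works on a plain Python string and does not call Aspose.Words. It assumes the usual Word control character values (field start `\x13`, field separator `\x14`, field end `\x15`, section break `\x0c`), and the sample string is only shaped like the kind of text get_text might return for a document with one merge field. The `visible_text` helper is hypothetical; it is not how to_string is implemented, and the real plain-text output may differ in details such as trailing line breaks.

```python
# Assumed control character values; see the ControlChar class for the full list.
FIELD_START = "\x13"
FIELD_SEPARATOR = "\x14"
FIELD_END = "\x15"
SECTION_BREAK = "\x0c"


def visible_text(raw: str) -> str:
    """Drop field codes and the control characters listed above.

    Simplification: nested fields are not handled.
    """
    result = []
    in_field_code = False
    for ch in raw:
        if ch == FIELD_START:
            in_field_code = True  # the field code follows; hide it
        elif ch in (FIELD_SEPARATOR, FIELD_END):
            in_field_code = False  # the field result or ordinary text follows
        elif ch == SECTION_BREAK:
            continue  # structural marker, not visible text
        elif not in_field_code:
            result.append(ch)
    return "".join(result)


raw = "\x13MERGEFIELD Field\x14«Field»\x15\x0c"
print(repr(raw))
print(visible_text(raw))
```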
SaveFormat.Text
This example saves the document as follows:
The following code example shows how to save a document in TXT format:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Document.docx")
doc.save(ARTIFACTS_DIR + "BaseConversions.docx_to_txt.txt")
You may need to extract document images to perform some tasks. Aspose.Words allows you to do this as well.
The following code example shows how to extract images from a document:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET.git.
doc = aw.Document(MY_DIR + "Images.docx")
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
image_index = 0
for shape in shapes:
    shape = shape.as_shape()
    if shape.has_image:
        image_extension = aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)
        image_file_name = "Image.ExportImages." + str(image_index) + image_extension
        # Note, if you have only an image (not a shape with a text and the image),
        # you can use shape.get_shape_renderer().save(...) method to save the image.
        shape.image_data.save(ARTIFACTS_DIR + image_file_name)
        image_index += 1