Region-Based Extraction using Python
Extract text from a specific region of a page
Use TextAbsorber together with a Rectangle to limit extraction to a specific area of a page. This approach is useful for zone-based extraction from headers, footers, table cells, form fields, invoices, or other fixed-layout regions where the text position is known in advance.
- Open the source PDF as a Document.
- Create a TextAbsorber instance.
- Configure text_search_options to limit extraction to a rectangle.
- Accept the absorber on the target page.
- Write the extracted text to an output file.
import aspose.pdf as ap

def extract_text_from_region(infile, page_number, rect_coords, outfile):
    """
    Extract text from a specified rectangular region on a given page.
    Args:
        infile (str): Path to input PDF file.
        page_number (int): 1-based index of the page.
        rect_coords (tuple): (llx, lly, urx, ury) coordinates of the rectangle.
        outfile (str): Output text file path.
    """
    document = ap.Document(infile)
    try:
        absorber = ap.text.TextAbsorber()
        # Set options to restrict search to the rectangle
        absorber.text_search_options.limit_to_page_bounds = True
        llx, lly, urx, ury = rect_coords
        absorber.text_search_options.rectangle = ap.Rectangle(llx, lly, urx, ury, True)
        # Accept on the specific page
        document.pages[page_number].accept(absorber)
        extracted_text = absorber.text
        with open(outfile, "w", encoding="utf-8") as tw:
            tw.write(extracted_text)
    finally:
        document.close()
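The rectangle is given as its lower-left and upper-right corners. PDF user space places the origin at the bottom-left of the page by default, while many image and layout tools measure from the top-left. The sketch below converts a box measured from the top-left into the (llx, lly, urx, ury) tuple that extract_text_from_region expects. The helper name rect_from_top_left and its parameters are illustrative and not part of Aspose.PDF, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0), with all values in points.

```python
def rect_from_top_left(x, y, width, height, page_height):
    """Convert a box measured from the top-left corner of the page into
    (llx, lly, urx, ury) measured from the bottom-left corner.

    Hypothetical helper, not an Aspose.PDF API. Assumes an unrotated page
    whose lower-left corner is at (0, 0); all values are in points.
    """
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative")
    llx = x
    urx = x + width
    ury = page_height - y             # top edge of the box
    lly = page_height - (y + height)  # bottom edge of the box
    return (llx, lly, urx, ury)


# A 200 x 50 pt header zone, 36 pt from the left and top edges of a page
# that is 842 pt tall (roughly A4).
header_zone = rect_from_top_left(36, 36, 200, 50, page_height=842)
print(header_zone)  # (36, 756, 236, 806)
```

The resulting tuple can be passed as rect_coords. For rotated pages or pages with a shifted media box, the conversion would need to account for that offset.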
Extract Paragraphs by iterating through them
Use ParagraphAbsorber when you need paragraph-aware extraction instead of plain page text. Unlike TextAbsorber or TextFragmentAbsorber, this API organizes output by page, section, and paragraph, which is useful for text analysis, structured export, and layout-sensitive processing.
- Open the source PDF as a Document.
- Create a ParagraphAbsorber instance.
- Call absorber.visit(document) to analyze all pages.
- Iterate through page_markups, then through each section and paragraph.
- Read the text fragments from each paragraph and write the result to a file.
import aspose.pdf as ap

def extract_paragraphs_from_pdf(infile, outfile):
    """
    Extract all paragraphs from a PDF document, and write each paragraph’s text into an output file.
    Args:
        infile (str): Path to input PDF file.
        outfile (str): Path to output text file.
    """
    document = ap.Document(infile)
    try:
        absorber = ap.text.ParagraphAbsorber()
        absorber.visit(document)
        with open(outfile, "w", encoding="utf-8") as tw:
            for page_markup in absorber.page_markups:
                for sec_idx, section in enumerate(page_markup.sections, start=1):
                    for para_idx, paragraph in enumerate(section.paragraphs, start=1):
                        # Concatenate all fragments/lines in the paragraph
                        parts = []
                        for line in paragraph.lines:
                            for fragment in line:
                                parts.append(fragment.text)
                            # End each line with "\n"; text mode translates it
                            # to the platform line ending on write
                            parts.append("\n")
                        paragraph_text = "".join(parts)
                        tw.write(
                            f"Page {page_markup.number}, Section {sec_idx}, Paragraph {para_idx}:\n"
                        )
                        tw.write(paragraph_text + "\n")
    finally:
        document.close()
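The fragment-joining loop in the function above can be factored into a small helper, which is handy when the same logic is reused for several outputs. The sketch below relies only on each line being iterable and each fragment having a text attribute, which is the shape the loop over paragraph.lines uses. The name join_paragraph_lines is hypothetical, and the SimpleNamespace objects are stand-ins so the example runs without a PDF.

```python
from types import SimpleNamespace


def join_paragraph_lines(lines, line_sep="\n"):
    """Join the fragments of each line, then join the lines.

    Hypothetical helper: `lines` is any iterable of lines, and each line is
    an iterable of objects with a `text` attribute.
    """
    return line_sep.join(
        "".join(fragment.text for fragment in line) for line in lines
    )


# Stand-in data; real input would be paragraph.lines from a ParagraphAbsorber.
fake_lines = [
    [SimpleNamespace(text="Region-based "), SimpleNamespace(text="extraction")],
    [SimpleNamespace(text="second line")],
]
print(join_paragraph_lines(fake_lines))
```

Unlike the loop in the function, this version puts the separator only between lines, so the result has no trailing line break.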
Extract Paragraphs with bounding polygon rendering
You can also use ParagraphAbsorber to inspect paragraph geometry. In addition to extracting text, this approach records each section rectangle and paragraph polygon, which is useful for layout mapping, document analysis, accessibility tooling, or region-aware post-processing.
- Open the source PDF as a Document.
- Create a ParagraphAbsorber instance.
- Visit the target page.
- Read the page markup from absorber.page_markups.
- Iterate through sections and paragraphs to capture geometry and text.
- Write the rectangle, polygon, and text data to the output file.
import aspose.pdf as ap

def extract_paragraphs_with_geometry(infile, outfile):
    """
    Extract paragraphs and record geometry info (rectangle / polygon) for each paragraph in a PDF.
    Args:
        infile (str): Path to input PDF file.
        outfile (str): Path to output text file.
    """
    document = ap.Document(infile)
    try:
        absorber = ap.text.ParagraphAbsorber()
        absorber.visit(document.pages[1])  # Visit the first page (pages are 1-indexed)
        # Only one page was visited, so its markup is the first entry
        page_markup = absorber.page_markups[0]
        with open(outfile, "w", encoding="utf-8") as tw:
            for sec_idx, section in enumerate(page_markup.sections, start=1):
                tw.write(f"Section {sec_idx}: rectangle = {section.rectangle}\n")
                for para_idx, paragraph in enumerate(section.paragraphs, start=1):
                    tw.write(f" Paragraph {para_idx}: polygon = {paragraph.points}\n")
                    # Concatenate paragraph text
                    parts = []
                    for line in paragraph.lines:
                        for fragment in line:
                            parts.append(fragment.text)
                        # End each line with "\n"; text mode translates it
                        # to the platform line ending on write
                        parts.append("\n")
                    tw.write(" Text: " + "".join(parts) + "\n\n")
    finally:
        document.close()
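For region-aware post-processing it is often convenient to reduce a paragraph polygon to an axis-aligned bounding box. The following is a minimal sketch under one assumption: each entry in paragraph.points exposes numeric x and y attributes. The helper name bounding_box is hypothetical, and SimpleNamespace objects stand in for real points so the example runs without a PDF.

```python
from types import SimpleNamespace


def bounding_box(points):
    """Return (llx, lly, urx, ury) enclosing all polygon points.

    Assumption: each point exposes numeric `x` and `y` attributes.
    """
    points = list(points)
    if not points:
        raise ValueError("polygon has no points")
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Stand-in polygon; real input would be paragraph.points.
polygon = [
    SimpleNamespace(x=72, y=700),
    SimpleNamespace(x=300, y=700),
    SimpleNamespace(x=300, y=640),
    SimpleNamespace(x=72, y=640),
]
print(bounding_box(polygon))  # (72, 640, 300, 700)
```

The result uses the same (llx, lly, urx, ury) order as the rect_coords argument in the region-based extraction example, so a paragraph's box could in principle be fed back into a rectangle-limited TextAbsorber.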