Basic Text Extraction using Python
Extract text from all pages of a PDF document
This example shows how to extract text from every page of a PDF document with Aspose.PDF for Python. It uses the TextAbsorber class to capture the text of the entire document and saves it to a separate text file. This is useful for converting PDFs into searchable text, performing content analysis, or exporting text for indexing and further processing.
- Load the PDF file.
- Create a ‘TextAbsorber’ object.
- Call ‘document.pages.accept(text_absorber)’ so it scans all pages.
- Get the full text via ‘text_absorber.text’.
- Write the result into a text file.
import os
import aspose.pdf as ap

def extract_text_from_all_pages(infile, outfile):
    """
    Extract all text from every page of the PDF and write to an output text file.

    Args:
        infile (str): Path to input PDF file.
        outfile (str): Path to output text file.
    """
    # Open the PDF document
    document = ap.Document(infile)
    try:
        # Create a TextAbsorber to extract text
        text_absorber = ap.text.TextAbsorber()
        # Accept the absorber for all pages
        document.pages.accept(text_absorber)
        # Get extracted text
        extracted_text = text_absorber.text
        # Write the text to an output file
        with open(outfile, "w", encoding="utf-8") as tw:
            tw.write(extracted_text)
    finally:
        document.close()
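Once the text is in a plain file, it can be fed into ordinary Python tooling for the content-analysis use case mentioned above. The following is a minimal sketch using only the standard library; the helper name `top_words` and the sample string are hypothetical and not part of Aspose.PDF. In practice the string would be the contents of the output file written by the function above.

```python
import re
from collections import Counter


def top_words(text, limit=5):
    """Return the `limit` most common words in `text` as (word, count) pairs."""
    # Lowercase first so "Invoice" and "invoice" are counted together;
    # \w+ matches runs of letters, digits, and underscores.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(limit)


# Hypothetical stand-in for text read back from the output file.
sample_text = "Invoice total due. Invoice number 42. Total due today."
print(top_words(sample_text, limit=3))
```

The same function could be applied to a real file by reading it with `open(outfile, encoding="utf-8")`, matching the encoding used when the text was written.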
Extract text from a particular page
This example extracts text from a single page of a PDF document. By applying the TextAbsorber only to a specified page, you can isolate and save the text of one part of a multi-page PDF.
This is useful when you only need the content of one page, for instance the text of an invoice page, a report section, or a form field summary.
- Open the PDF.
- Create a TextAbsorber.
- Accept the absorber on the designated page only (pages[page_number], where page_number is 1-based).
- Extract text and write to file.
import os
import aspose.pdf as ap

def extract_text_from_page(infile, page_number, outfile):
    """
    Extract text from a specific page number of the PDF.

    Args:
        infile (str): Path to input PDF file.
        page_number (int): 1-based page index to extract.
        outfile (str): Path to output text file.
    """
    document = ap.Document(infile)
    try:
        text_absorber = ap.text.TextAbsorber()
        # Accept the absorber on only the specified page
        document.pages[page_number].accept(text_absorber)
        extracted_text = text_absorber.text
        with open(outfile, "w", encoding="utf-8") as tw:
            tw.write(extracted_text)
    finally:
        document.close()
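Because page_number is 1-based, passing 0 or a value past the last page is an easy mistake to make. One possible guard is sketched below. The helper `check_page_number` is hypothetical and not part of Aspose.PDF, and it assumes the caller has already obtained the document's page count by whatever means the library provides.

```python
def check_page_number(page_number, page_count):
    """Return page_number if it is a valid 1-based index, else raise ValueError."""
    # Valid pages run from 1 through page_count inclusive (no page 0).
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page_number must be between 1 and {page_count}, got {page_number}"
        )
    return page_number


# Illustration with a hypothetical 3-page document.
print(check_page_number(2, 3))
```

Calling a check like this before indexing into the page collection turns an out-of-range request into a clear error message instead of a failure deeper in the extraction code.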