Basic Text Extraction using Python

Extract text from all pages of a PDF document

Aspose.PDF for Python teaches you how to extract text from every page of a PDF document. It uses the TextAbsorber class to capture all textual content from the entire document and saves it into a separate text file. Ideal for converting PDFs into searchable text, performing content analysis, or exporting text for indexing and processing tasks.

  1. Load the PDF file.
  2. Create a ‘TextAbsorber’ object.
  3. Call ‘document.pages.accept(text_absorber)’ so it scans all pages.
  4. Get the full text via ’text_absorber.text’.
  5. Write the result into a text file.

import os
import aspose.pdf as ap

def extract_text_from_all_pages(infile, outfile):
    """
    Extract all text from every page of the PDF and write to an output text file.
    Args:
        infile (str): Path to input PDF file.
        outfile (str): Path to output text file.
    """
    # Open the PDF document
    document = ap.Document(infile)
    try:
        # Create a TextAbsorber to extract text
        text_absorber = ap.text.TextAbsorber()
        # Accept the absorber for all pages
        document.pages.accept(text_absorber)
        # Get extracted text
        extracted_text = text_absorber.text
        # Write the text to an output file
        with open(outfile, "w", encoding="utf-8") as tw:
            tw.write(extracted_text)
    finally:
        document.close()

Extract text from a particular page

Extract text from a single page of a PDF document. By applying the TextAbsorber only to a specified page, you can isolate and save text from a particular section of a multi-page PDF.

Useful when you only need content from one page — for instance, extracting text from an invoice page, report section, or form field summary.

  1. Open the PDF.
  2. Create a TextAbsorber.
  3. Accept only the designated page (pages[page_number]).
  4. Extract text and write to file.

import os
import aspose.pdf as ap

def extract_text_from_page(infile, page_number, outfile):
    """
    Extract text from a specific page number of the PDF.
    Args:
        infile (str): Path to input PDF file.
        page_number (int): 1-based page index to extract.
        outfile (str): Path to output text file.
    """
    document = ap.Document(infile)
    try:
        text_absorber = ap.text.TextAbsorber()
        # Accept the absorber on only the specified page
        document.pages[page_number].accept(text_absorber)
        extracted_text = text_absorber.text
        with open(outfile, "w", encoding="utf-8") as tw:
            tw.write(extracted_text)
    finally:
        document.close()