Extract Text from PDF using Python

Extract Text from PDF Document

This example converts PDF content into plain text, which can be used for further text analysis, search indexing, or data extraction.

  1. Load the PDF Document
  2. Initialize a Text Absorber
  3. Extract Text from All Pages
  4. Write the Extracted Text to a File

    import aspose.pdf as apdf
    from os import path

    # self.dataDir, infile, and outfile come from the surrounding example class;
    # substitute your own directory and file names.
    path_infile = path.join(self.dataDir, infile)
    path_outfile = path.join(self.dataDir, outfile)

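    # Load the PDF document
    document = apdf.Document(path_infile)
    # Create a text absorber and let it visit every page
    text_absorber = apdf.text.TextAbsorber()
    document.pages.accept(text_absorber)
    # Write the extracted text to the output file
    with open(path_outfile, "w", encoding="utf-8") as file:
        file.write(text_absorber.text)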
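
When the text of each page needs to stay separate (for example, to index pages individually), the same steps can be applied page by page. The following is a minimal sketch, not part of the original example: the helper name `extract_text_by_page` and its parameters are hypothetical, and it assumes that iterating `document.pages` yields pages in order and that each page accepts a text absorber in the same way the page collection does.

```python
from typing import Callable, Dict, Iterable


def extract_text_by_page(pages: Iterable, make_absorber: Callable[[], object]) -> Dict[int, str]:
    """Return {page_number: text}, using a fresh absorber for every page.

    `pages` is assumed to be an iterable of page objects with an
    `accept(absorber)` method, and `make_absorber` a callable that returns an
    object exposing a `text` attribute (hypothetical parameters).
    """
    texts = {}
    # Page numbers start at 1 to match the numbering used by document.pages
    for page_number, page in enumerate(pages, start=1):
        absorber = make_absorber()  # fresh absorber keeps each page's text separate
        page.accept(absorber)
        texts[page_number] = absorber.text
    return texts


# Possible usage with Aspose.PDF (assumes the library is installed):
#   import aspose.pdf as apdf
#   document = apdf.Document(path_infile)
#   texts = extract_text_by_page(document.pages, apdf.text.TextAbsorber)
```

Passing the absorber class as a factory keeps the helper independent of the library, so it can be exercised with stand-in objects before running it against real documents.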

Extract Highlighted Text from PDF Document

This snippet prints the text marked by each highlight annotation on the first page of a PDF document, which helps when reviewing key points or summarizing content:

    import aspose.pdf as apdf
    from os import path
    from aspose.pycore import cast, is_assignable

    # self.dataDir and infile come from the surrounding example class
    path_infile = path.join(self.dataDir, infile)

    document = apdf.Document(path_infile)
    # Page numbering in document.pages starts at 1, so this is the first page
    page = document.pages[1]

    for annotation in page.annotations:
        # Keep only highlight annotations and print the text they mark
        if is_assignable(annotation, apdf.annotations.HighlightAnnotation):
            highlight_annotation = cast(apdf.annotations.HighlightAnnotation, annotation)
            print(highlight_annotation.get_marked_text())
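
The snippet above reads only the first page. To gather highlighted text from the whole document, the same loop can be run over every page. This is a hedged sketch rather than the library's own method: `iter_marked_text` and `marked_text_of` are hypothetical names, and it assumes that iterating `document.pages` yields pages in order, each with an `annotations` collection.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_marked_text(
    pages: Iterable,
    marked_text_of: Callable[[object], Optional[str]],
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) for each annotation that has marked text.

    `marked_text_of` (hypothetical callback) returns the marked text of an
    annotation, or None when the annotation is not of interest.
    """
    # Page numbers start at 1 to match the numbering used by document.pages
    for page_number, page in enumerate(pages, start=1):
        for annotation in page.annotations:
            text = marked_text_of(annotation)
            if text:  # skip annotations with no text or empty text
                yield page_number, text


# Possible usage with Aspose.PDF (assumes the library is installed):
#   def marked_text_of(annotation):
#       if is_assignable(annotation, apdf.annotations.HighlightAnnotation):
#           return cast(apdf.annotations.HighlightAnnotation, annotation).get_marked_text()
#       return None
#
#   for page_number, text in iter_marked_text(document.pages, marked_text_of):
#       print(page_number, text)
```

Keeping the type check in a callback means the traversal itself has no dependency on the library and can be checked with simple stand-in objects.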

Extract Text from Stamp Annotations

Aspose.PDF for Python can also extract text from stamp annotations. The text is read from the annotation's appearance rather than from the page content, using the following steps:

  1. Load the PDF Document
  2. Access the First Page
  3. Iterate Through Annotations
  4. Check for Stamp Annotations
  5. Initialize a Text Absorber
  6. Extract Appearance Information
  7. Extract Text from the Appearance Stream
  8. Print the Extracted Text

    import aspose.pdf as apdf
    from os import path

    # self.dataDir and infile come from the surrounding example class
    path_infile = path.join(self.dataDir, infile)

    document = apdf.Document(path_infile)
    # Get the first page (page numbering in document.pages starts at 1)
    page = document.pages[1]
    for annotation in page.annotations:
        if annotation.annotation_type == apdf.annotations.AnnotationType.STAMP:
            absorber = apdf.text.TextAbsorber()
            xforms = []
            # Look up the normal ('N') appearance of the annotation
            if annotation.appearance.try_get_value('N', xforms):
                # Extract text from the appearance
                absorber.visit(xforms[0])

                # Print extracted text
                print(absorber.text)