Extract Tables from PDF in Python

Extract Table from PDF

Extracting tables from PDFs is useful for reporting, data migration, and analytics workflows. With Aspose.PDF for Python via .NET, you can detect and read table content from existing PDF documents efficiently.

This code snippet opens an existing PDF file, scans each page for tables, and extracts cell text content. It uses TableAbsorber to detect tables and then iterates through rows and cells to print the extracted text.

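  1. Load the PDF into an ap.Document object.
  2. Loop through the pages of the document.
  3. Create a TableAbsorber object and call visit() on the current page.
  4. Iterate through the tables in the absorber's table_list.
  5. Iterate through the rows and cells of each table.
  6. Extract and print the text from each cell.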

The complete function is shown below. It prints each table row on its own line, with cell values delimited by " | ".

import aspose.pdf as ap
from os import path
import sys

def extract(infile: str) -> None:
    """Extract and print all tables from a PDF file."""
    document = ap.Document(infile)
    for page in document.pages:
        # Create an absorber and scan the current page for tables.
        absorber = ap.text.TableAbsorber()
        absorber.visit(page)
        for table in absorber.table_list:
            print("Table ----")
            for row in table.row_list:
                print("Row:")
                cells = []
                for cell in row.cell_list:
                    # A cell's text is spread across fragments and their segments.
                    cell_txt = ""
                    for fragment in cell.text_fragments:
                        for seg in fragment.segments:
                            cell_txt += seg.text
                    cells.append(cell_txt)
                # Join with " | " so the row has no leading separator.
                print(" | ".join(cells))
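
For the reporting and data migration uses mentioned earlier, structured rows are often more convenient than printed output. The following is a minimal sketch, not part of the Aspose API: table_to_rows and rows_to_csv are hypothetical helper names, and the code assumes only that a table object exposes the same row_list, cell_list, text_fragments, segments, and text attributes used in the example above.

```python
import csv
import io
from typing import Iterable, List


def table_to_rows(table) -> List[List[str]]:
    """Return one list of cell strings per row of an absorbed table.

    Assumption: `table` exposes row_list, cell_list, text_fragments,
    segments, and text, as in the extract() example.
    """
    rows = []
    for row in table.row_list:
        cells = []
        for cell in row.cell_list:
            # Concatenate every segment of every fragment into one cell string.
            cells.append("".join(
                seg.text
                for fragment in cell.text_fragments
                for seg in fragment.segments
            ))
        rows.append(cells)
    return rows


def rows_to_csv(rows: Iterable[Iterable[str]]) -> str:
    """Serialize rows to CSV text; the csv module quotes cells as needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Inside the page loop, each table in absorber.table_list could then be converted with rows_to_csv(table_to_rows(table)) and written to a file instead of printed. Going through the csv module rather than joining with a separator keeps cells that contain commas or quotes intact.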