Extract Data from Table in PDF with Python

Extract Tables from PDF programmatically

Use TableAbsorber to detect tables on each page of a Document. After visiting a page, iterate through table_list, then walk through each row and cell to reconstruct the table content in a readable text format.

Open the PDF as a Document.
Iterate through the pages in document.pages.
Create a TableAbsorber for each page and call visit(page).
Loop through the detected tables, rows, and cells.
Read text fragments from each cell and assemble the extracted row output.

import aspose.pdf as apdf
from os import path

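# Placeholder location of the input PDF; adjust to your environment
data_dir = "."
infile = "input.pdf"
path_infile = path.join(data_dir, infile)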

# Open PDF document
document = apdf.Document(path_infile)

# Iterate through each page in the document
for page in document.pages:
    absorber = apdf.text.TableAbsorber()
    absorber.visit(page)

    for table in absorber.table_list:
        print("Table")
        for row in table.row_list:
            row_text = []
            for cell in row.cell_list:
                cell_text = []
                for fragment in cell.text_fragments:
                    cell_text.append("".join(seg.text for seg in fragment.segments))
                # Join fragments of one cell with a space so "|" only separates cells
                row_text.append(" ".join(cell_text))
            print("|".join(row_text))
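Once each row has been collected as a list of cell strings, a small helper can render the rows as aligned columns instead of a bare "|"-joined line. The sketch below is plain Python and does not call Aspose.PDF; the `format_rows` name and the sample data are illustrative assumptions, not part of the library.

```python
def format_rows(rows, sep=" | "):
    """Render rows (lists of cell strings) as aligned text columns."""
    if not rows:
        return ""
    # Pad short rows so every row has the same number of columns
    width = max(len(row) for row in rows)
    padded = [list(row) + [""] * (width - len(row)) for row in rows]
    # Find the widest cell in each column
    col_widths = [max(len(row[i]) for row in padded) for i in range(width)]
    lines = [
        sep.join(cell.ljust(col_widths[i]) for i, cell in enumerate(row)).rstrip()
        for row in padded
    ]
    return "\n".join(lines)


# Hypothetical rows, shaped like the row_text lists built in the loop above
sample_rows = [["Item", "Qty"], ["Widget", "12"], ["Bolt"]]
print(format_rows(sample_rows))
```

To use it with the extraction loop, append each `row_text` list to a per-table list and pass that list to the helper after the inner loops finish.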

Extract Tables from a Specific Area of a PDF Page

If you need to extract only tables located inside a marked region, combine TableAbsorber with a SquareAnnotation. In this example, the annotation rectangle is used as a boundary, and only tables fully contained within that region are processed.

Open the PDF as a Document.
Select the target page.
Find the square annotation that marks the region of interest.
Create a TableAbsorber and visit the page.
Compare each detected table rectangle with the annotation rectangle.
Process only the tables that fall completely inside the marked area.

import aspose.pdf as apdf
from os import path

# Placeholder location of the input PDF; adjust to your environment
data_dir = "."
infile = "input.pdf"
path_infile = path.join(data_dir, infile)

# Open PDF document
document = apdf.Document(path_infile)

# Get the first page (index starts from 1 in Aspose.PDF)
page = document.pages[1]

# Find the first square annotation
square_annotation = next(
    (
        ann
        for ann in page.annotations
        if ann.annotation_type == apdf.annotations.AnnotationType.SQUARE
    ),
    None,
)

if square_annotation is None:
    # Without a marked region there is nothing to filter against, so stop here
    raise SystemExit("No square annotation found.")

# Initialize the TableAbsorber
absorber = apdf.text.TableAbsorber()
absorber.visit(page)

# Iterate through tables on the page
for table in absorber.table_list:
    table_rect = table.rectangle
    annotation_rect = square_annotation.rect

    # Check if the table is inside the annotation region
    is_in_region = (
        annotation_rect.llx < table_rect.llx
        and annotation_rect.lly < table_rect.lly
        and annotation_rect.urx > table_rect.urx
        and annotation_rect.ury > table_rect.ury
    )

    if is_in_region:
        for row in table.row_list:
            row_text = []
            for cell in row.cell_list:
                cell_text = []
                for fragment in cell.text_fragments:
                    cell_text.append("".join(seg.text for seg in fragment.segments))
                # Join fragments of one cell with a space so "|" only separates cells
                row_text.append(" ".join(cell_text))
            print("|".join(row_text))
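The containment test can be factored into a small reusable function. The sketch below is a minimal illustration, assuming only that both rectangles expose the `llx`, `lly`, `urx`, and `ury` attributes read in the code above. The `Rect` stand-in, the `contains` name, and the `tolerance` parameter are hypothetical additions. Unlike the strict comparison above, it uses inclusive comparisons, and the tolerance gives some slack when an annotation is drawn tightly around a table.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Stand-in for a PDF rectangle (lower-left and upper-right corners)."""
    llx: float
    lly: float
    urx: float
    ury: float


def contains(outer, inner, tolerance=0.0):
    """Return True if inner lies inside outer, allowing `tolerance` units of slack."""
    return (
        outer.llx - tolerance <= inner.llx
        and outer.lly - tolerance <= inner.lly
        and outer.urx + tolerance >= inner.urx
        and outer.ury + tolerance >= inner.ury
    )


# Hypothetical coordinates in page units
region = Rect(50, 500, 550, 750)
table = Rect(60, 520, 540, 700)
print(contains(region, table))
```

Because the function relies only on attribute names, it should accept `square_annotation.rect` and `table.rectangle` directly, provided those objects expose the four coordinates as shown in the example above.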

Export Table Data from PDF to CSV

When you need the extracted data in a spreadsheet-friendly format, save the PDF using ExcelSaveOptions and set the output format to CSV. The resulting file can be opened in Excel, Google Sheets, or imported into analytics workflows.

Open the source PDF as a Document.
Create an ExcelSaveOptions instance.
Set excel_save.format to ExcelSaveOptions.ExcelFormat.CSV.
Save the document to the target CSV path.

import aspose.pdf as apdf
from os import path

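# Placeholder locations of the input PDF and output CSV; adjust to your environment
data_dir = "."
infile = "input.pdf"
outfile = "output.csv"
path_infile = path.join(data_dir, infile)
path_outfile = path.join(data_dir, outfile)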

document = apdf.Document(path_infile)
excel_save = apdf.ExcelSaveOptions()
excel_save.format = apdf.ExcelSaveOptions.ExcelFormat.CSV
document.save(path_outfile, excel_save)
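After the export, the CSV can be read back with Python's standard `csv` module for a quick check or further processing. The sketch below is an assumption-laden illustration: `load_csv_rows` is a hypothetical helper, the sample file is written by hand, and the delimiter, encoding, and layout of a real export may differ from it.

```python
import csv
import os
import tempfile


def load_csv_rows(csv_path, encoding="utf-8"):
    """Read a CSV file into a list of rows, skipping fully empty rows."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        return [row for row in csv.reader(handle) if any(cell.strip() for cell in row)]


# Demo with a hand-written sample file; a real export may use a different layout
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        handle.write('Item,Qty\r\n\r\n"Bolt, M4",12\r\n')
    rows = load_csv_rows(sample_path)

print(rows)
```

Using `csv.reader` rather than splitting on commas keeps quoted cells that contain the delimiter intact.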
