Extract Data from Table in PDF with Python
Extract Tables from PDF programmatically
Use TableAbsorber to detect tables on each page of a Document. After visiting a page, iterate through table_list, then walk through each row and cell to reconstruct the table content in a readable text format.
- Open the PDF as a Document.
- Iterate through the pages in document.pages.
- Create a TableAbsorber for each page and call visit(page).
- Loop through the detected tables, rows, and cells.
- Read text fragments from each cell and assemble the extracted row output.
import aspose.pdf as apdf
from os import path


def extract_tables(data_dir, infile):
    # Build the input path; data_dir and infile are supplied by the caller
    path_infile = path.join(data_dir, infile)

    # Open PDF document
    document = apdf.Document(path_infile)

    # Iterate through each page in the document
    for page in document.pages:
        # Detect the tables on the current page
        absorber = apdf.text.TableAbsorber()
        absorber.visit(page)

        for table in absorber.table_list:
            print("Table")
            for row in table.row_list:
                row_text = []
                for cell in row.cell_list:
                    # A cell can hold several text fragments, each made of segments
                    cell_text = []
                    for fragment in cell.text_fragments:
                        cell_text.append("".join(seg.text for seg in fragment.segments))
                    row_text.append("|".join(cell_text))
                # Print one pipe-delimited line per table row
                print("|".join(row_text))
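The pipe-joined output is convenient for a quick look, but cell text can itself contain `|`, which makes the printed rows ambiguous to parse later. A minimal sketch of one alternative, assuming the cell text has already been collected into a list of rows (each a list of strings): `rows_to_csv` is a hypothetical helper name, not part of Aspose.PDF, and it relies only on Python's standard `csv` module to handle quoting.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of cell strings) to CSV text."""
    buffer = io.StringIO()
    # csv.writer quotes any field that contains the delimiter or a quote
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Hand-written data standing in for extracted cell text
sample_rows = [["Name", "Note"], ["Widget", "size: 3|4"], ["Bolt", "steel, M6"]]
print(rows_to_csv(sample_rows))
```

In the extraction loop, this would mean appending each `row_text` list to a `rows` list instead of printing it, then serializing once per table.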
Extract table in specific area of PDF page
If you need to extract only tables located inside a marked region, combine TableAbsorber with a SquareAnnotation. In this example, the annotation rectangle is used as a boundary, and only tables fully contained within that region are processed.
- Open the PDF as a Document.
- Select the target page.
- Find the square annotation that marks the region of interest.
- Create a TableAbsorber and visit the page.
- Compare each detected table rectangle with the annotation rectangle.
- Process only the tables that fall completely inside the marked area.
import aspose.pdf as apdf
from os import path


def extract_tables_in_region(data_dir, infile):
    # The path to the documents directory; data_dir and infile are supplied by the caller
    path_infile = path.join(data_dir, infile)

    # Open PDF document
    document = apdf.Document(path_infile)

    # Get the first page (index starts from 1 in Aspose.PDF)
    page = document.pages[1]

    # Find the first square annotation
    square_annotation = next(
        (
            ann
            for ann in page.annotations
            if ann.annotation_type == apdf.annotations.AnnotationType.SQUARE
        ),
        None,
    )
    if square_annotation is None:
        print("No square annotation found.")
        return

    # The annotation rectangle is the boundary for every table on the page
    annotation_rect = square_annotation.rect

    # Initialize the TableAbsorber
    absorber = apdf.text.TableAbsorber()
    absorber.visit(page)

    # Iterate through tables on the page
    for table in absorber.table_list:
        table_rect = table.rectangle

        # Check if the table lies strictly inside the annotation region
        is_in_region = (
            annotation_rect.llx < table_rect.llx
            and annotation_rect.lly < table_rect.lly
            and annotation_rect.urx > table_rect.urx
            and annotation_rect.ury > table_rect.ury
        )
        if is_in_region:
            for row in table.row_list:
                row_text = []
                for cell in row.cell_list:
                    cell_text = []
                    for fragment in cell.text_fragments:
                        cell_text.append("".join(seg.text for seg in fragment.segments))
                    row_text.append("|".join(cell_text))
                print("|".join(row_text))
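The containment check can be tried in isolation, without a PDF. The sketch below is illustrative only: `Rect` is a hypothetical stand-in for any object exposing `llx`, `lly`, `urx`, and `ury`, and `is_inside` with its `tolerance` parameter is an assumed helper, not an Aspose.PDF API. It shows that the strict comparison rejects a table whose edge coincides with the annotation edge, and how a small tolerance could relax that.

```python
from collections import namedtuple

# Hypothetical stand-in for a rectangle with lower-left and upper-right corners
Rect = namedtuple("Rect", ["llx", "lly", "urx", "ury"])


def is_inside(inner, outer, tolerance=0.0):
    """Return True if inner lies strictly within outer, widened by tolerance."""
    return (
        outer.llx - tolerance < inner.llx
        and outer.lly - tolerance < inner.lly
        and outer.urx + tolerance > inner.urx
        and outer.ury + tolerance > inner.ury
    )


region = Rect(50, 50, 500, 700)
print(is_inside(Rect(100, 100, 400, 600), region))               # True
print(is_inside(Rect(50, 100, 400, 600), region))                # False: shares the left edge
print(is_inside(Rect(50, 100, 400, 600), region, tolerance=1))   # True
```

A tolerance of this kind may be worth considering when the annotation is drawn tightly around the table, since detected table bounds rarely match a hand-drawn rectangle exactly.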
Export Table Data from PDF to CSV
When you need the extracted data in a spreadsheet-friendly format, save the PDF using ExcelSaveOptions and set the output format to CSV. The resulting file can be opened in Excel, Google Sheets, or imported into analytics workflows.
- Open the source PDF as a Document.
- Create an ExcelSaveOptions instance.
- Set excel_save.format to ExcelSaveOptions.ExcelFormat.CSV.
- Save the document to the target CSV path.
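import aspose.pdf as apdf
from os import path


def export_tables_to_csv(data_dir, infile, outfile):
    # Build input and output paths; data_dir, infile and outfile are supplied by the caller
    path_infile = path.join(data_dir, infile)
    path_outfile = path.join(data_dir, outfile)

    # Open PDF document
    document = apdf.Document(path_infile)

    # Configure the save options to produce CSV output
    excel_save = apdf.ExcelSaveOptions()
    excel_save.format = apdf.ExcelSaveOptions.ExcelFormat.CSV

    # Save the document as CSV
    document.save(path_outfile, excel_save)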
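Once the CSV file exists, it can be loaded back for further processing with Python's standard `csv` module. The sketch below assumes the exported file is comma-delimited and UTF-8 encoded, which has not been verified against the actual output and may need adjusting. `read_csv_rows` is a hypothetical helper, and the demonstration uses a small hand-written file rather than a real export.

```python
import csv
import os
import tempfile


def read_csv_rows(csv_path, encoding="utf-8"):
    """Load a CSV file into a list of rows (lists of strings)."""
    # newline="" lets the csv module handle line endings inside quoted fields
    with open(csv_path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


# Demonstration on a hand-written file standing in for the exported CSV
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("Item,Qty\nBolt,12\n")
    demo_rows = read_csv_rows(demo_path)

print(demo_rows)
```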