Extract Table from PDF Document

Extract Table from PDF

Extracting tables from PDFs with Python turns data that is locked in a page layout into text you can process and analyze. With the Aspose.PDF for Python via .NET library, you can detect the tables embedded in a PDF document and read their contents cell by cell.

This code snippet opens an existing PDF file, scans each page for tables, and extracts their cell text content. It uses the ‘TableAbsorber’ to detect tables and then iterates through rows and cells to print out the text inside.

  1. Load the PDF into an ap.Document object.
  2. Loop through the pages.
  3. Create a TableAbsorber object for each page and call visit(page) on it.
  4. Iterate through the tables in table_list.
  5. Iterate through the rows and cells of each table.
  6. Extract and print the text from each cell.

This example reads a PDF, finds all tables, and prints out their cell contents row by row.


    import aspose.pdf as ap
    from os import path

    def extract_tables(data_dir, infile):
        # Build the input path and load the PDF.
        path_infile = path.join(data_dir, infile)
        document = ap.Document(path_infile)
        for page in document.pages:
            # A fresh TableAbsorber detects the tables on this page.
            absorber = ap.text.TableAbsorber()
            absorber.visit(page)
            for table in absorber.table_list:
                print("Table ----")
                for row in table.row_list:
                    print("Row")
                    for cell in row.cell_list:
                        # A cell can hold several text fragments,
                        # and each fragment is split into segments.
                        for fragment in cell.text_fragments:
                            txt = "".join(seg.text for seg in fragment.segments)
                            print(txt)
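
Printing is enough for inspection, but for analysis it is usually more convenient to collect the cell text into rows. The following is a minimal sketch of one way to do that, not part of the Aspose.PDF API: the helper names cell_text, table_to_rows, and rows_to_csv are hypothetical, and the code relies only on the attributes used in the example above (row_list, cell_list, text_fragments, segments, text). Joining the fragments of one cell with a space is an assumption; adjust it if your tables need a different separator.

```python
import csv
import io


def cell_text(cell):
    """Return the text of one cell.

    Segments inside a fragment are concatenated directly; separate
    fragments are joined with a space (an assumption of this sketch).
    """
    return " ".join(
        "".join(seg.text for seg in fragment.segments)
        for fragment in cell.text_fragments
    )


def table_to_rows(table):
    """Convert one absorbed table into a list of rows of cell strings."""
    return [[cell_text(cell) for cell in row.cell_list] for row in table.row_list]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Inside the page loop, you could call table_to_rows(table) for each table in absorber.table_list and pass the result to rows_to_csv, or hand the rows to any other tool that accepts a list of lists.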