Extract Table Data from PDF

Extract Tables from PDF programmatically

Extracting tables from PDFs is not a trivial task because a table can be constructed in many different ways.

Aspose.PDF for PHP via Java provides the TableAbsorber class, which makes it easy to retrieve tables. To extract table data, perform the following steps:

  1. Open the PDF document by instantiating a Document object.
  2. Create a TableAbsorber object to extract tables from the document.
  3. Iterate through each page of the document.
  4. Decide which pages to analyze and call visit on each of them. The tabular data is scanned, and the result is stored as a list of AbsorbedTable objects, available through the getTableList method.
  5. To get the data, iterate through the table list and process each table's list of absorbed rows and each row's list of absorbed cells. The rows are returned by getRowList and the cells by getCellList.
  6. Each AbsorbedCell contains a TextFragmentCollection, returned by getTextFragments. You can process it for your own purposes.

The following example shows table extraction from all pages:


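// Open the PDF document.
$document = new Document($inputFile);

// Create a TableAbsorber object to extract tables.
$tableAbsorber = new TableAbsorber();

// Get the page collection and start with an empty output buffer.
$pages = $document->getPages();
$responseData = "";
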
for ($pageIndex = 1; $pageIndex <= java_values($pages->size()); $pageIndex++) {
    $page = $pages->get_Item($pageIndex);
    $tableAbsorber->visit($page);
    $tableList = $tableAbsorber->getTableList();
    $tableIterator = $tableList->iterator();

    while (java_values($tableIterator->hasNext())) {
        $table = $tableIterator->next();
        $tableRowList = $table->getRowList();
        $tableRowListIterator = $tableRowList->iterator();

        while (java_values($tableRowListIterator->hasNext())) {
            $row = $tableRowListIterator->next();
            $cellList = $row->getCellList();
            $cellListIterator = $cellList->iterator();

            // Iterate through each cell in the row.
            while (java_values($cellListIterator->hasNext())) {
                $cell = $cellListIterator->next();
                $fragmentList = $cell->getTextFragments();

                // Iterate through each text fragment in the cell.
                for ($fragmentIndex = 1; $fragmentIndex <= java_values($fragmentList->size()); $fragmentIndex++) {
                    $fragment = $fragmentList->get_Item($fragmentIndex);
                    $segments = $fragment->getSegments();

                    // Iterate through each segment in the text fragment.
                    for ($segmentIndex = 1; $segmentIndex <= java_values($segments->size()); $segmentIndex++) {
                        $segment = $segments->get_Item($segmentIndex);
                        $responseData .= $segment->getText();
                    }
                }
                $responseData .= "|";
            }
            $responseData .= PHP_EOL;
        }
    }
}

// Save the table data to the output file.
file_put_contents($outputFile, $responseData);

// Close the PDF document.
$document->close();
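
The example above writes every cell followed by `|` and ends every row with a line break, with the rows of all tables placed one after another. If that text needs to be processed further in PHP, it can be split back into rows and cells. The following is a minimal sketch, assuming exactly that output format and assuming that no cell text itself contains `|` or a line break. The function name `parsePipeDelimitedTable` and the sample values are hypothetical and are not part of Aspose.PDF.

```php
<?php
// Hypothetical helper: split the pipe-delimited text produced by the
// extraction loop into an array of rows, each an array of cell strings.
// Assumes no cell text itself contains "|" or a line break.
function parsePipeDelimitedTable(string $data): array
{
    $rows = [];
    // Split on any common line ending, since PHP_EOL differs by platform.
    foreach (preg_split('/\r\n|\r|\n/', $data) as $line) {
        if ($line === '') {
            continue; // skip empty pieces, such as the one after the final line break
        }
        // Every cell is followed by "|", so drop the trailing one first.
        if (substr($line, -1) === '|') {
            $line = substr($line, 0, -1);
        }
        $rows[] = explode('|', $line);
    }
    return $rows;
}

// Sample data in the same shape as $responseData (illustrative values only).
$sample = "Name|Qty|" . PHP_EOL . "Widget|3|" . PHP_EOL;
$rows = parsePipeDelimitedTable($sample);
print_r($rows);
```

Because the delimiter is a plain character with no quoting, this only works when the cell text is known not to contain it. For arbitrary cell content, a quoted format such as CSV is a safer choice.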

Extract Table Data from PDF and store it in CSV file

The following example shows how to extract tables and store them in a CSV file. To see how to convert a PDF to an Excel spreadsheet, please refer to the Convert PDF to Excel article.


    // Load the input PDF document using the Document class.
    $document = new Document($inputFile);

    // Create an instance of the ExcelSaveOptions class to specify the save options.
    $saveOption = new ExcelSaveOptions();

    // Set the output format to CSV.
    $saveOption->setFormat(ExcelSaveOptions_ExcelFormat::$CSV);

    // Save the PDF document as a CSV file using the specified save options.
    $document->save($outputFile, $saveOption);
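
The save call above hands the conversion to ExcelSaveOptions. When only selected tables are wanted, one possible alternative is to collect the cell text with TableAbsorber and write it with PHP's built-in `fputcsv`. The following is a minimal sketch, assuming PHP 7.4 or later and assuming the rows have already been gathered into a PHP array of arrays of strings. The function name `writeRowsAsCsv`, the temporary file, and the sample data are hypothetical.

```php
<?php
// Hypothetical helper: write rows of cell strings to a CSV file.
// $rows is assumed to be an array of arrays of strings, one inner array per table row.
function writeRowsAsCsv(array $rows, string $path): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    try {
        foreach ($rows as $row) {
            // Explicit delimiter, enclosure and an empty escape character, so that
            // quotes inside a cell are written as doubled quotes.
            fputcsv($handle, $row, ',', '"', '');
        }
    } finally {
        fclose($handle);
    }
}

// Illustrative data only; in practice the rows would come from TableAbsorber.
$rows = [
    ['Item', 'Note'],
    ['Widget', 'contains, a comma'],
    ['Gadget', 'says "hello"'],
];
$csvPath = tempnam(sys_get_temp_dir(), 'tables');
writeRowsAsCsv($rows, $csvPath);
echo file_get_contents($csvPath);
```

Letting `fputcsv` handle the quoting means that commas, quotes, and line breaks inside a cell do not shift the columns when the file is read back.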