Extract Data from Table in PDF with C#

Extract Tables from PDF programmatically

Extracting tables from PDFs is not a trivial task because a table can be created in various ways.

Aspose.PDF for .NET provides a tool that makes it easy to retrieve tables. To extract table data, perform the following steps:

  1. Open the document by instantiating a Document object.
  2. Create a TableAbsorber object.
  3. Decide which pages to analyze and call Visit on each of them. The tabular data is scanned and the result is stored in TableList.
  4. TableList is a list of AbsorbedTable objects. To get the data, iterate through TableList and process RowList and CellList.
  5. Each AbsorbedCell contains a TextFragments collection. You can process it for your own purposes.

The following example shows table extraction from all pages:

using System;
using System.Text;
using Aspose.Pdf.Text;

public static void Extract_Table()
{
    // Load the source PDF document
    var filePath = "<... enter path to pdf file here ...>";
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filePath);
    foreach (var page in pdfDocument.Pages)
    {
        Aspose.Pdf.Text.TableAbsorber absorber = new Aspose.Pdf.Text.TableAbsorber();
        absorber.Visit(page);
        foreach (AbsorbedTable table in absorber.TableList)
        {
            Console.WriteLine("Table");
            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {                                 
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        var sb = new StringBuilder();
                        foreach (TextSegment seg in fragment.Segments)
                            sb.Append(seg.Text);
                        Console.Write($"{sb.ToString()}|");
                    }                           
                }
                Console.WriteLine();
            }
        }
    }
}

Extract table in specific area of PDF page

Each absorbed table has a Rectangle property that describes the position of the table on the page.

So, if you need to extract tables located in a specific region, you have to work with specific coordinates.

The following example shows how to extract a table marked with a Square Annotation:

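using System;
using System.Linq;
using System.Text;
using Aspose.Pdf.Text;

public static void Extract_Marked_Table()
{
    // Load the source PDF document
    var filePath = "<... enter path to pdf file here ...>";
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filePath);
    var page = pdfDocument.Pages[1];

    // Find the first Square annotation on the page; it marks the region of interest
    var squareAnnotation =
        page.Annotations.FirstOrDefault(ann => ann.AnnotationType == Aspose.Pdf.Annotations.AnnotationType.Square)
        as Aspose.Pdf.Annotations.SquareAnnotation;

    // Without a marker there is no region to compare against
    if (squareAnnotation == null)
    {
        Console.WriteLine("No Square annotation found on the page.");
        return;
    }
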
    Aspose.Pdf.Text.TableAbsorber absorber = new Aspose.Pdf.Text.TableAbsorber();
    absorber.Visit(page);

    foreach (AbsorbedTable table in absorber.TableList)
    {
        var isInRegion = (squareAnnotation.Rect.LLX < table.Rectangle.LLX) &&
        (squareAnnotation.Rect.LLY < table.Rectangle.LLY) &&
        (squareAnnotation.Rect.URX > table.Rectangle.URX) &&
        (squareAnnotation.Rect.URY > table.Rectangle.URY);

        if (isInRegion)
        {
            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {

                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        var sb = new StringBuilder();
                        foreach (TextSegment seg in fragment.Segments)
                        {
                            sb.Append(seg.Text);
                        }
                        var text = sb.ToString();
                        Console.Write($"{text}|");
                    }
                }
                Console.WriteLine();
            }
        }
    }
}
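
The containment test in the example above uses strict comparisons, so a table whose border touches or slightly crosses the annotation edge is skipped. The following is a minimal sketch of the same check with an optional tolerance. Box and RegionSketch are hypothetical names introduced only for illustration and are not part of Aspose.PDF; Box simply mirrors the LLX, LLY, URX and URY coordinates used above, and the tolerance is assumed to be in the same units as those coordinates.

```csharp
using System;

// Hypothetical stand-in for a rectangle with lower-left (LLX, LLY)
// and upper-right (URX, URY) corners, as used in the example above.
public readonly struct Box
{
    public double LLX { get; }
    public double LLY { get; }
    public double URX { get; }
    public double URY { get; }

    public Box(double llx, double lly, double urx, double ury)
    {
        LLX = llx;
        LLY = lly;
        URX = urx;
        URY = ury;
    }
}

public static class RegionSketch
{
    // Returns true when 'inner' lies within 'outer'.
    // 'tolerance' lets 'inner' overshoot each side of 'outer' by that amount.
    public static bool IsInside(Box outer, Box inner, double tolerance = 0.0)
    {
        return outer.LLX - tolerance <= inner.LLX &&
               outer.LLY - tolerance <= inner.LLY &&
               outer.URX + tolerance >= inner.URX &&
               outer.URY + tolerance >= inner.URY;
    }
}
```

To use it with the example above, the coordinates of squareAnnotation.Rect and table.Rectangle could be copied into two Box values before calling IsInside. Note that, unlike the original check, this sketch treats touching edges as inside.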

Extract Table Data from PDF and store it in CSV file

The following example shows how to extract a table and store it as a CSV file. To see how to convert a PDF to an Excel spreadsheet, please refer to the Convert PDF to Excel article.

public static void Extract_Table_Save_CSV()
{
    // For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET

    // Load PDF document
    Document pdfDocument = new Document(_dataDir + "input.pdf");

    // Instantiate ExcelSaveOptions with CSV as the output format
    ExcelSaveOptions excelSave = new ExcelSaveOptions { Format = ExcelSaveOptions.ExcelFormat.CSV };

    // Save the output in CSV format
    pdfDocument.Save("PDFToCSV_out.csv", excelSave);
}
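
The example above converts the whole document. If only the cell text collected with TableAbsorber is needed, one option is to build the CSV lines by hand. The sketch below is a minimal, library-independent helper for that purpose. CsvSketch, EscapeField and ToCsvLine are hypothetical names, not part of Aspose.PDF, and the code assumes a comma delimiter with RFC 4180 style quoting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvSketch
{
    // Characters that force a field to be wrapped in double quotes.
    private static readonly char[] SpecialChars = { ',', '"', '\r', '\n' };

    // Quotes a field when it contains a delimiter, a quote or a line break;
    // embedded quotes are doubled.
    public static string EscapeField(string field)
    {
        if (string.IsNullOrEmpty(field))
        {
            return string.Empty;
        }

        bool needsQuotes = field.IndexOfAny(SpecialChars) >= 0;
        return needsQuotes
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
    }

    // Joins the escaped cell texts of one table row into a single CSV line.
    public static string ToCsvLine(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(EscapeField));
    }
}
```

In the extraction loops shown earlier, the text of each cell in a row could be collected into a list and passed to ToCsvLine, and the resulting lines written out with File.WriteAllLines.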