Extract Data from Table in PDF with C#
Extract Tables from PDF programmatically
Extracting tables from PDFs is not a trivial task because table can be created in the various way.
Aspose.PDF for .NET has a tool to make it easy to retrieve tables. To extract table data you shoud perform the following steps:
- Open document - instantiate a Document object.
- Create a TableAbsorber object.
- Decide which pages to be analyzed and apply Visit to the desired pages. The tabular data will be scanned and the result will be stored in TableList.
TableList
is a List of AbsorbedTable. To get the date iterate throughtTableList
and handle RowList and CellList.- Each AbsorbedCell contains TextFragments collection. You can process it for your own purposes.
The following code snippet also work with Aspose.PDF.Drawing library.
The following example shows table extraction from the all pages:
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractTable()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf_Tables();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
{
foreach (var page in document.Pages)
{
Aspose.Pdf.TableAbsorber absorber = new Aspose.Pdf.TableAbsorber();
absorber.Visit(page);
foreach (var table in absorber.TableList)
{
Console.WriteLine("Table");
foreach (var row in table.RowList)
{
foreach (var cell in row.CellList)
{
foreach (var fragment in cell.TextFragments)
{
var sb = new StringBuilder();
foreach (var seg in fragment.Segments)
{
sb.Append(seg.Text);
}
Console.Write($"{sb.ToString()}|");
}
}
Console.WriteLine();
}
}
}
}
}
Extract table in specific area of PDF page
Each abosorbed table has Rectangle property that describes position of the table on page.
If you need to extract tables located in a specific region, you have to work with specific coordinates.
The following code snippet also work with Aspose.PDF.Drawing library.
The following example show how to extract table marked with Square Annotation:
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractMarkedTable()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf_Tables();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
{
var page = document.Pages[1];
var squareAnnotation =
page.Annotations.FirstOrDefault(ann => ann.AnnotationType == Annotations.AnnotationType.Square)
as Aspose.Pdf.Annotations.SquareAnnotation;
var absorber = new Aspose.Pdf.Text.TableAbsorber();
absorber.Visit(page);
foreach (var table in absorber.TableList)
{
var isInRegion = (squareAnnotation.Rect.LLX < table.Rectangle.LLX) &&
(squareAnnotation.Rect.LLY < table.Rectangle.LLY) &&
(squareAnnotation.Rect.URX > table.Rectangle.URX) &&
(squareAnnotation.Rect.URY > table.Rectangle.URY);
if (isInRegion)
{
foreach (var row in table.RowList)
{
foreach (var cell in row.CellList)
{
foreach (var fragment in cell.TextFragments)
{
var sb = new StringBuilder();
foreach (var seg in fragment.Segments)
{
sb.Append(seg.Text);
}
var text = sb.ToString();
Console.Write($"{text}|");
}
}
Console.WriteLine();
}
}
}
}
}
Extract Table Data from PDF and store it in CSV file
The following example shows how to extract table and store it as CSV file. To see how to convert PDF to Excel Spreadsheet please refer to Convert PDF to Excel article.
The following code snippet also work with Aspose.PDF.Drawing library.
// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractTableSaveExcel()
{
// The path to the documents directory
var dataDir = RunExamples.GetDataDir_AsposePdf_Tables();
// Open PDF document
using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
{
// Instantiate ExcelSave Option object
Aspose.Pdf.ExcelSaveOptions excelSave = new Aspose.Pdf.ExcelSaveOptions { Format = ExcelSaveOptions.ExcelFormat.CSV };
// Save the output in XLS format
document.Save(dataDir + "ExtractTableSaveXLS_out.xlsx", excelSave);
}
}