Extract Table Data from PDF using C#

Extract Tables from PDF Programmatically

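Extracting tables from a PDF is not a trivial task, because a table can be created in many different ways.

Aspose.PDF for .NET provides a tool that makes retrieving tables straightforward. To extract table data, perform the following steps: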

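1. Open the document: instantiate a Document object.
2. Create a TableAbsorber object.
3. Decide which pages to analyze and apply Visit to each of them. The table data is scanned and the result is stored in TableList.
4. TableList is a list of AbsorbedTable objects. To get the data, iterate over TableList and process RowList and CellList.
5. Each AbsorbedCell contains a TextFragments collection, which you can process as needed.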

The following code snippet also works with the Aspose.PDF.Drawing library.

The following example shows how to extract tables from all pages:

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractTable()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Tables();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {                    
        foreach (var page in document.Pages)
        {
            // TableAbsorber lives in the Aspose.Pdf.Text namespace
            Aspose.Pdf.Text.TableAbsorber absorber = new Aspose.Pdf.Text.TableAbsorber();
            absorber.Visit(page);
            foreach (var table in absorber.TableList)
            {
                Console.WriteLine("Table");
                foreach (var row in table.RowList)
                {
                    foreach (var cell in row.CellList)
                    {                                 
                        foreach (var fragment in cell.TextFragments)
                        {
                            var sb = new StringBuilder();
                            foreach (var seg in fragment.Segments)
                            {
                                sb.Append(seg.Text);
                            }
                            Console.Write($"{sb.ToString()}|");
                        }                           
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}

Extract Tables from a Specific Area of a PDF Page

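Each absorbed table has a Rectangle property that describes the table's position on the page.

If you need to extract tables located in a specific region, compare each table's Rectangle with the coordinates of that region.

The following code snippet also works with the Aspose.PDF.Drawing library.

The following example shows how to extract a table marked with a square annotation: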

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractMarkedTable()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Tables();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    { 
        var page = document.Pages[1];
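        // Find the first square annotation on the page; it marks the region of interest
        // (FirstOrDefault requires 'using System.Linq;')
        var squareAnnotation =
            page.Annotations.FirstOrDefault(ann => ann.AnnotationType == Aspose.Pdf.Annotations.AnnotationType.Square)
            as Aspose.Pdf.Annotations.SquareAnnotation;

        // Nothing to extract if the page has no square annotation
        if (squareAnnotation == null)
        {
            return;
        }
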
        var absorber = new Aspose.Pdf.Text.TableAbsorber();
        absorber.Visit(page);

        foreach (var table in absorber.TableList)
        {
            var isInRegion = (squareAnnotation.Rect.LLX < table.Rectangle.LLX) &&
            (squareAnnotation.Rect.LLY < table.Rectangle.LLY) &&
            (squareAnnotation.Rect.URX > table.Rectangle.URX) &&
            (squareAnnotation.Rect.URY > table.Rectangle.URY);

            if (isInRegion)
            {
                foreach (var row in table.RowList)
                {
                    foreach (var cell in row.CellList)
                    {
                        foreach (var fragment in cell.TextFragments)
                        {
                            var sb = new StringBuilder();
                            foreach (var seg in fragment.Segments)
                            {
                                sb.Append(seg.Text);
                            }
                            var text = sb.ToString();
                            Console.Write($"{text}|");
                        }
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}
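
When the region is given as plain coordinates rather than an annotation, the same containment test can be factored into a small helper. The sketch below is self-contained and does not use Aspose.PDF: the Region type and its StrictlyContains method are hypothetical names introduced here for illustration, and only the LLX, LLY, URX, URY naming follows the example above.

```csharp
using System;

// Hypothetical stand-in for a page rectangle (not an Aspose.PDF type):
// lower-left corner (LLX, LLY) and upper-right corner (URX, URY).
public readonly struct Region
{
    public Region(double llx, double lly, double urx, double ury)
    {
        LLX = llx;
        LLY = lly;
        URX = urx;
        URY = ury;
    }

    public double LLX { get; }
    public double LLY { get; }
    public double URX { get; }
    public double URY { get; }

    // True when 'inner' lies strictly inside this region. This mirrors the
    // four comparisons used in ExtractMarkedTable above.
    public bool StrictlyContains(Region inner)
    {
        return LLX < inner.LLX
            && LLY < inner.LLY
            && URX > inner.URX
            && URY > inner.URY;
    }
}
```

In the extraction loop, one would presumably build a Region from each table.Rectangle and keep only the tables for which the target region's StrictlyContains returns true. Because the comparison is strict, a table whose border coincides with the region boundary is treated as outside.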

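Extract Table Data from PDF and Store It in a CSV File

The following example shows how to extract tables and save them as a CSV file. To see how to convert a PDF to an Excel spreadsheet, refer to the Convert PDF to Excel article.

The following code snippet also works with the Aspose.PDF.Drawing library.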

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractTableSaveExcel()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf_Tables();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "input.pdf"))
    {
        // Instantiate an ExcelSaveOptions object and select CSV as the output format
        Aspose.Pdf.ExcelSaveOptions excelSave = new Aspose.Pdf.ExcelSaveOptions { Format = Aspose.Pdf.ExcelSaveOptions.ExcelFormat.CSV };

        // Save the output in CSV format
        document.Save(dataDir + "ExtractTableSaveCSV_out.csv", excelSave);
    }
}
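
The example above converts the whole document. If only the cells gathered by the TableAbsorber loops shown earlier are needed, the rows could instead be written out by hand. The following is a minimal sketch under stated assumptions: the cell text has already been collected as plain strings, CsvSketch and its methods are hypothetical helpers that are not part of Aspose.PDF, and the quoting follows the common CSV convention of wrapping fields that contain commas, quotes, or line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of Aspose.PDF) that turns cell text into CSV lines.
public static class CsvSketch
{
    // Quote a field when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled.
    public static string EscapeField(string field)
    {
        if (field == null)
        {
            return string.Empty;
        }

        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
    }

    // Join the cells of one table row into a single CSV line.
    public static string ToCsvLine(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(c => EscapeField(c)));
    }
}
```

Each line returned by ToCsvLine can then be appended to a file, for example with System.IO.File.AppendAllLines. Escaping matters here because text extracted from PDF cells often contains commas or line breaks that would otherwise shift columns.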
