Extract Data from Table in PDF with Java
Extract tables from PDF
Use TableAbsorber to find tables on each page and iterate through rows, cells, text fragments, and text segments.
- Open the source PDF in a Document instance.
- Iterate through the document Page objects because tables are detected page by page.
- Create a TableAbsorber for each page and call visit(page) to populate the detected table list.
- Iterate through the detected AbsorbedTable, AbsorbedRow, AbsorbedCell, TextFragment, and TextSegment objects.
- Build the extracted row text from the fragment content and print the table data.
// Assumed imports at the top of the file:
// import com.aspose.pdf.*;
// import java.nio.file.Path;
public static void extractTablesFromPdf(Path inputFile) {
    // try-with-resources closes the document when extraction finishes.
    try (Document document = new Document(inputFile.toString())) {
        for (Page page : document.getPages()) {
            // Tables are detected page by page, so use a fresh absorber for each page.
            TableAbsorber absorber = new TableAbsorber();
            absorber.visit(page);
            for (AbsorbedTable table : absorber.getTableList()) {
                System.out.println("Table");
                for (AbsorbedRow row : table.getRowList()) {
                    StringBuilder rowText = new StringBuilder();
                    for (AbsorbedCell cell : row.getCellList()) {
                        // Separate cells with "|".
                        if (rowText.length() > 0) {
                            rowText.append("|");
                        }
                        StringBuilder cellText = new StringBuilder();
                        for (TextFragment fragment : cell.getTextFragments()) {
                            // A fragment's text is the concatenation of its segments.
                            StringBuilder fragmentText = new StringBuilder();
                            for (TextSegment segment : fragment.getSegments()) {
                                fragmentText.append(segment.getText());
                            }
                            // Fragments inside one cell are also separated with "|".
                            if (cellText.length() > 0) {
                                cellText.append("|");
                            }
                            cellText.append(fragmentText);
                        }
                        rowText.append(cellText);
                    }
                    System.out.println(rowText);
                }
            }
        }
    }
}
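The example above uses the same "|" separator between cells and between fragments inside a cell, so cell boundaries cannot be recovered if the text itself contains that character. One possible alternative, sketched here, is to collect each cell's text into a list of strings and write the row as CSV. CsvRows, escape, and toCsvLine are hypothetical helper names, not part of the library; the sketch works on plain strings and assumes the cell text has already been extracted as shown above.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper: formats already-extracted cell text as CSV lines. */
final class CsvRows {

    /** Quotes a cell only when it contains a comma, a double quote, or a line break. */
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuotes) {
            return cell;
        }
        // Double any embedded quotes, then wrap the whole value in quotes.
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    /** Joins one row of cell strings into a single CSV line. */
    static String toCsvLine(List<String> cells) {
        StringJoiner line = new StringJoiner(",");
        for (String cell : cells) {
            line.add(escape(cell));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Sample row with a comma and an embedded quote (made-up data).
        System.out.println(toCsvLine(List.of("Item", "Price, USD", "5\" disk")));
    }
}
```

In the extraction loop, each cell's accumulated text would be added to a List<String> per row and passed to toCsvLine instead of being appended to rowText.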
Extract a table from a specific marked area
This example finds a square annotation, compares its rectangle to each detected table, and outputs only tables inside the marked region.
- Open the source PDF in a Document instance.
- Get the target Page and locate the square Annotation that marks the extraction region.
- Create a TableAbsorber and call visit(page) to detect tables on that page.
- Compare each detected AbsorbedTable Rectangle with the annotation rectangle bounds.
- Iterate through the matching AbsorbedRow and AbsorbedCell objects and reconstruct the row text.
- Print the table data for the marked region only.
// Assumed imports at the top of the file:
// import com.aspose.pdf.*;
// import java.nio.file.Path;
public static void extractTableFromSpecificArea(Path inputFile) {
    try (Document document = new Document(inputFile.toString())) {
        // Page numbering is 1-based, so this is the first page.
        Page page = document.getPages().get_Item(1);
        // Find the first square annotation; it marks the extraction region.
        Annotation squareAnnotation = null;
        for (Annotation annotation : page.getAnnotations()) {
            if (annotation.getAnnotationType() == AnnotationType.Square) {
                squareAnnotation = annotation;
                break;
            }
        }
        if (squareAnnotation == null) {
            System.out.println("No square annotation found.");
            return;
        }
        Rectangle annotationRect = squareAnnotation.getRect();
        TableAbsorber absorber = new TableAbsorber();
        absorber.visit(page);
        for (AbsorbedTable table : absorber.getTableList()) {
            Rectangle tableRect = table.getRectangle();
            // Keep only tables that lie strictly inside the annotation rectangle.
            boolean isInRegion = annotationRect.getLLX() < tableRect.getLLX()
                    && annotationRect.getLLY() < tableRect.getLLY()
                    && annotationRect.getURX() > tableRect.getURX()
                    && annotationRect.getURY() > tableRect.getURY();
            if (isInRegion) {
                for (AbsorbedRow row : table.getRowList()) {
                    StringBuilder rowText = new StringBuilder();
                    for (AbsorbedCell cell : row.getCellList()) {
                        // Separate cells with "|".
                        if (rowText.length() > 0) {
                            rowText.append("|");
                        }
                        StringBuilder cellText = new StringBuilder();
                        for (TextFragment fragment : cell.getTextFragments()) {
                            // A fragment's text is the concatenation of its segments.
                            StringBuilder fragmentText = new StringBuilder();
                            for (TextSegment segment : fragment.getSegments()) {
                                fragmentText.append(segment.getText());
                            }
                            if (cellText.length() > 0) {
                                cellText.append("|");
                            }
                            cellText.append(fragmentText);
                        }
                        rowText.append(cellText);
                    }
                    System.out.println(rowText);
                }
            }
        }
    }
}
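The containment test in the example above uses strict inequalities, so a table whose border coincides with the annotation edge is skipped. The sketch below isolates that comparison and adds a tolerant variant. Region is a hypothetical stand-in for the library's Rectangle (lower-left and upper-right corners), and the tolerance parameter is an assumption about what may help when the annotation is drawn right on the table border, not documented library behavior.

```java
/** Hypothetical stand-in for a PDF rectangle: lower-left and upper-right corners. */
final class Region {
    final double llx;
    final double lly;
    final double urx;
    final double ury;

    Region(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /** True when other lies strictly inside this region (the same test as in the example above). */
    boolean strictlyContains(Region other) {
        return llx < other.llx && lly < other.lly
                && urx > other.urx && ury > other.ury;
    }

    /** Tolerant variant: other may touch or overshoot each edge by up to tolerance units. */
    boolean containsWithTolerance(Region other, double tolerance) {
        return llx - tolerance <= other.llx && lly - tolerance <= other.lly
                && urx + tolerance >= other.urx && ury + tolerance >= other.ury;
    }

    public static void main(String[] args) {
        Region marked = new Region(100, 100, 400, 300);   // made-up annotation bounds
        Region table = new Region(100, 120, 380, 280);    // left edge touches the annotation
        System.out.println(marked.strictlyContains(table));
        System.out.println(marked.containsWithTolerance(table, 2.0));
    }
}
```

With the library's Rectangle, the same comparisons would be written against getLLX(), getLLY(), getURX(), and getURY(), as in the example above.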
Export tables to Excel
- Open the source PDF in a Document instance.
- Create ExcelSaveOptions for the export.
- Set the Excel output format to XLSX so the detected table layout is written as an Excel workbook.
- Call document.save(outputFile.toString(), excelSave) to export the document in Excel format.
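// Assumed imports at the top of the file:
// import com.aspose.pdf.*;
// import java.nio.file.Path;
public static void exportTablesToExcel(Path inputFile, Path outputFile) {
    try (Document document = new Document(inputFile.toString())) {
        // Request an XLSX workbook as the output format.
        ExcelSaveOptions excelSave = new ExcelSaveOptions();
        excelSave.setFormat(ExcelSaveOptions.ExcelFormat.XLSX);
        // Convert the document and write the workbook to outputFile.
        document.save(outputFile.toString(), excelSave);
    }
}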
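The export method above takes both an input and an output path. When the workbook should sit next to the source PDF, the output path can be derived from the input, as in this minimal sketch. OutputPaths and withXlsxExtension are hypothetical names introduced for illustration; the helper only manipulates paths and does not touch the file system.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: derives an .xlsx output path next to a source file. */
final class OutputPaths {

    /** Replaces the last extension of the file name (if any) with ".xlsx". */
    static Path withXlsxExtension(Path source) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // A leading dot (hidden file) is not treated as an extension separator.
        String base = dot > 0 ? name.substring(0, dot) : name;
        return source.resolveSibling(base + ".xlsx");
    }

    public static void main(String[] args) {
        // Made-up sample path.
        System.out.println(withXlsxExtension(Paths.get("reports", "q1.pdf")));
    }
}
```

A call such as exportTablesToExcel(inputFile, OutputPaths.withXlsxExtension(inputFile)) would then write the workbook beside the PDF.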