Extract Table Data from PDF
Extract Tables from PDF programmatically
Extracting tables from PDFs is not a trivial task because a table can be constructed in many different ways.
Aspose.PDF for Android via Java provides a tool that makes retrieving tables easy. To extract table data, perform the following steps:
- Open the document: instantiate a Document object.
- Create a TableAbsorber object.
- Decide which pages to analyze and call visit on each of them. The tabular data is scanned and the result is stored in a list of AbsorbedTable objects, which you can get with the getTableList method.
- To get the data, iterate through the table list and handle the list of absorbed rows and the list of absorbed cells. You can access the first list by calling getRowList and the second by calling getCellList.
- Each AbsorbedCell contains a TextFragmentCollection, which you can process for your own purposes.
The following example shows table extraction from all pages:
public void extractTable() {
    // Open document
    try {
        document = new Document(inputStream);
    } catch (Exception e) {
        resultMessage.setText(e.getMessage());
        return;
    }
    com.aspose.pdf.TableAbsorber absorber = new com.aspose.pdf.TableAbsorber();
    File file = new File(fileStorage, "extracted-text.txt");
    FileOutputStream fileOutputStream;
    try {
        fileOutputStream = new FileOutputStream(file);
    } catch (FileNotFoundException e) {
        resultMessage.setText(e.getMessage());
        return;
    }
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fileOutputStream));
    // Scan pages
    for (Page page : (Iterable<? extends Page>) document.getPages()) {
        absorber.visit(page);
        for (com.aspose.pdf.AbsorbedTable table : absorber.getTableList()) {
            try {
                bw.write("Table");
                bw.newLine();
                // Iterate through the list of rows
                for (com.aspose.pdf.AbsorbedRow row : table.getRowList()) {
                    // Iterate through the list of cells
                    for (com.aspose.pdf.AbsorbedCell cell : row.getCellList()) {
                        for (com.aspose.pdf.TextFragment fragment : cell.getTextFragments()) {
                            // Concatenate the segments of the fragment
                            StringBuilder sb = new StringBuilder();
                            for (TextSegment seg :
                                    (Iterable<? extends TextSegment>) fragment.getSegments()) {
                                sb.append(seg.getText());
                            }
                            bw.write(sb.toString() + "|");
                        }
                    }
                    bw.newLine();
                }
            } catch (IOException e) {
                resultMessage.setText(e.getMessage());
                return;
            }
        }
    }
    try {
        bw.close();
    } catch (IOException e) {
        resultMessage.setText(e.getMessage());
        return;
    }
    resultMessage.setText(R.string.success_message);
}
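Note that the loop above writes a delimiter after every text fragment, so a cell that holds several fragments is spread over several columns and every row ends with a trailing "|". If one column per cell is preferred, the cell texts can be collected first and joined once per row. The following is a minimal sketch in plain Java: RowFormatter and formatRow are hypothetical names that are not part of Aspose.PDF, and the input list stands in for the strings you would build from each AbsorbedCell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: formats one table row
// from cell texts that have already been extracted.
final class RowFormatter {
    private RowFormatter() {
    }

    // Trims each cell text and joins the cells with '|',
    // without leaving a trailing delimiter at the end of the row.
    static String formatRow(List<String> cellTexts) {
        List<String> cleaned = new ArrayList<>();
        for (String text : cellTexts) {
            cleaned.add(text == null ? "" : text.trim());
        }
        return String.join("|", cleaned);
    }

    public static void main(String[] args) {
        // Prints: Name|Qty|Price
        System.out.println(formatRow(Arrays.asList("Name", " Qty ", "Price")));
    }
}
```

In the extraction loop, this would mean appending all fragments of a cell to one StringBuilder, adding the result to a list, and writing formatRow(list) once per row.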
Extract table in specific area of PDF page
Each absorbed table has a Rectangle property that describes the position of the table on the page.
So, if you need to extract tables located in a specific region, you have to compare these coordinates with the bounds of that region.
The following example shows how to extract a table marked with a Square Annotation:
public void extractMarkedTable() {
    // Open document
    try {
        document = new Document(inputStream);
    } catch (Exception e) {
        resultMessage.setText(e.getMessage());
        return;
    }
    com.aspose.pdf.Page page = document.getPages().get_Item(1);
    // Select the square annotations on the first page
    com.aspose.pdf.AnnotationSelector annotationSelector =
            new com.aspose.pdf.AnnotationSelector(
                    new com.aspose.pdf.SquareAnnotation(page, com.aspose.pdf.Rectangle.getTrivial()));
    // The selector collects annotations only after the page accepts it
    page.accept(annotationSelector);
    List list = annotationSelector.getSelected();
    if (list.size() == 0) {
        resultMessage.setText("Marked tables not found.");
        return;
    }
    com.aspose.pdf.SquareAnnotation squareAnnotation = (com.aspose.pdf.SquareAnnotation) list.get(0);
    com.aspose.pdf.TableAbsorber absorber = new com.aspose.pdf.TableAbsorber();
    absorber.visit(page);
    for (com.aspose.pdf.AbsorbedTable table : absorber.getTableList()) {
        // The table must lie strictly inside the annotation rectangle
        boolean isInRegion = (squareAnnotation.getRect().getLLX() < table.getRectangle().getLLX())
                && (squareAnnotation.getRect().getLLY() < table.getRectangle().getLLY())
                && (squareAnnotation.getRect().getURX() > table.getRectangle().getURX())
                && (squareAnnotation.getRect().getURY() > table.getRectangle().getURY());
        if (isInRegion) {
            File file = new File(fileStorage, "extracted-text.txt");
            FileOutputStream fileOutputStream;
            try {
                fileOutputStream = new FileOutputStream(file);
            } catch (FileNotFoundException e) {
                resultMessage.setText(e.getMessage());
                return;
            }
            BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fileOutputStream));
            try {
                // Parse table
                for (com.aspose.pdf.AbsorbedRow row : table.getRowList()) {
                    for (com.aspose.pdf.AbsorbedCell cell : row.getCellList()) {
                        for (com.aspose.pdf.TextFragment fragment :
                                cell.getTextFragments()) {
                            bw.write(fragment.getText());
                        }
                        bw.write("|");
                    }
                    bw.newLine();
                }
                bw.close();
            } catch (IOException e) {
                resultMessage.setText(e.getMessage());
                return;
            }
            resultMessage.setText(R.string.success_message);
        }
    }
}
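The region test in the example above is a strict rectangle containment check: a table whose border touches the annotation edge is rejected. The sketch below isolates that check in plain Java so it can be reasoned about on its own. RegionBox is a hypothetical stand-in for a PDF rectangle (it is not an Aspose.PDF class), and the tolerance variant is an assumption about what may be useful when an annotation is drawn tightly around a table.

```java
// Hypothetical stand-in for a PDF rectangle. In the example above the
// coordinates would come from getLLX(), getLLY(), getURX() and getURY().
final class RegionBox {
    final double llx; // lower-left x
    final double lly; // lower-left y
    final double urx; // upper-right x
    final double ury; // upper-right y

    RegionBox(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    // True when 'inner' lies strictly inside this box,
    // mirroring the '<' and '>' comparisons of the example.
    boolean strictlyContains(RegionBox inner) {
        return llx < inner.llx && lly < inner.lly
                && urx > inner.urx && ury > inner.ury;
    }

    // Variant that accepts boxes touching or slightly crossing the border,
    // by at most 'tolerance' units on each side.
    boolean containsWithTolerance(RegionBox inner, double tolerance) {
        return llx - tolerance <= inner.llx && lly - tolerance <= inner.lly
                && urx + tolerance >= inner.urx && ury + tolerance >= inner.ury;
    }

    public static void main(String[] args) {
        RegionBox annotation = new RegionBox(0, 0, 100, 100);
        RegionBox table = new RegionBox(0, 10, 90, 90);
        System.out.println(annotation.strictlyContains(table));           // false: left edges coincide
        System.out.println(annotation.containsWithTolerance(table, 0.5)); // true
    }
}
```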
Extract Table Data from PDF and store it in CSV file
The following example shows how to extract a table and store it as a CSV file. To see how to convert a PDF to an Excel spreadsheet, please refer to the Convert PDF to Excel article.
public void extractTableSaveCSV() {
    // Open document
    try {
        document = new Document(inputStream);
    } catch (Exception e) {
        resultMessage.setText(e.getMessage());
        return;
    }
    File file = new File(fileStorage, "PDFToXLS_out.csv");
    // Instantiate ExcelSaveOptions object
    com.aspose.pdf.ExcelSaveOptions excelSave = new com.aspose.pdf.ExcelSaveOptions();
    excelSave.setFormat(com.aspose.pdf.ExcelSaveOptions.ExcelFormat.CSV);
    try {
        document.save(file.toString(), excelSave);
    } catch (Exception e) {
        resultMessage.setText(e.getMessage());
        return;
    }
    resultMessage.setText(R.string.success_message);
}
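The example above delegates the whole conversion to ExcelSaveOptions. If you would rather write rows obtained from TableAbsorber to a CSV file yourself, cell texts that contain commas, quotes or line breaks need to be quoted. The following is a minimal sketch under that assumption: CsvLine, escape and toCsvLine are hypothetical names, not part of Aspose.PDF, and the quoting follows the common RFC 4180 convention.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: builds one CSV line
// from cell texts that have already been extracted.
final class CsvLine {
    private CsvLine() {
    }

    // Quotes a field when it contains a comma, a double quote or a line break;
    // embedded double quotes are doubled.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Escapes every cell and joins the results with commas.
    static String toCsvLine(List<String> cells) {
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            escaped.add(escape(cell));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Prints: Item,"1,5 kg",plain
        System.out.println(toCsvLine(Arrays.asList("Item", "1,5 kg", "plain")));
    }
}
```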