Region-Based Extraction using Java
Extract text from a rectangular page region
Use TextSearchOptions with a Rectangle to restrict extraction to a defined area on a page.
- Open the source PDF Document.
- Create a TextAbsorber.
- Create TextSearchOptions for the target Rectangle and limit extraction to page bounds.
- Apply the search options to the absorber.
- Visit the target Page and write the extracted text to the output file.
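import com.aspose.pdf.Document;
import com.aspose.pdf.Rectangle;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextSearchOptions;
import java.nio.file.Files;
import java.nio.file.Path;

public static void extractTextFromRegion(Path inputFile, Path outputFile, int pageNumber, Rectangle rectangle)
        throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        TextAbsorber absorber = new TextAbsorber();
        // Restrict the search to the rectangle and keep it within the page bounds.
        TextSearchOptions options = new TextSearchOptions(rectangle);
        options.setLimitToPageBounds(true);
        absorber.setTextSearchOptions(options);
        // Page numbers are 1-based.
        document.getPages().get_Item(pageNumber).accept(absorber);
        Files.writeString(outputFile, absorber.getText());
    }
}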
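The rectangle passed to TextSearchOptions is expressed in PDF page coordinates, whose origin is the bottom-left corner of the page, while many viewers and screenshots report regions from the top-left corner. The following is a minimal sketch of that conversion, not part of the library: the class RegionCoordinates and the method fromTopLeft are hypothetical names, and it assumes an unrotated page whose lower-left corner is at (0, 0) and whose height in points is known (842 is used here as an assumed A4 height).

```java
// Hypothetical helper, not part of the PDF library: converts a box measured
// from the top-left corner of the page into lower-left / upper-right corner
// values in PDF coordinates (origin at the bottom-left corner).
final class RegionCoordinates {

    /** Returns {llx, lly, urx, ury} for a box given as left, top, width, height. */
    static double[] fromTopLeft(double left, double top, double width, double height, double pageHeight) {
        if (width < 0 || height < 0) {
            throw new IllegalArgumentException("width and height must be non-negative");
        }
        double llx = left;
        double urx = left + width;
        double ury = pageHeight - top;             // top edge, measured from the page bottom
        double lly = pageHeight - (top + height);  // bottom edge, measured from the page bottom
        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // Assumed values: A4 page height of 842 pt, and a 200 x 100 pt box
        // placed 50 pt from the left edge and 72 pt from the top edge.
        double[] r = fromTopLeft(50, 72, 200, 100, 842);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

Assuming the Rectangle class accepts its corners in the order (llx, lly, urx, ury), the four values can be used to build the rectangle argument for extractTextFromRegion.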
Extract paragraphs with geometry information
Use ParagraphAbsorber to inspect section rectangles and paragraph polygons together with the extracted text.
- Open the source PDF Document.
- Create a ParagraphAbsorber and visit the target Page.
- Get the page markup from the absorber results.
- Iterate through the sections and paragraphs and read their geometry information.
- Build the output text with rectangles, polygons, and extracted paragraph text.
- Write the extracted details to the output file.
import com.aspose.pdf.Document;
import com.aspose.pdf.MarkupParagraph;
import com.aspose.pdf.MarkupSection;
import com.aspose.pdf.PageMarkup;
import com.aspose.pdf.ParagraphAbsorber;
import com.aspose.pdf.TextFragment;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public static void extractParagraphsWithGeometry(Path inputFile, Path outputFile) throws Exception {
    try (Document document = new Document(inputFile.toString())) {
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        // Visit the first page (page numbers are 1-based).
        absorber.visit(document.getPages().get_Item(1));
        // Only one page was visited, so its markup is the first entry in the list.
        PageMarkup pageMarkup = absorber.getPageMarkups().get(0);
        StringBuilder text = new StringBuilder();
        int sectionIndex = 1;
        for (MarkupSection section : pageMarkup.getSections()) {
            text.append("Section ").append(sectionIndex)
                    .append(": rectangle = ").append(section.getRectangle()).append("\n");
            int paragraphIndex = 1;
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                text.append(" Paragraph ").append(paragraphIndex)
                        .append(": polygon = ").append(Arrays.toString(paragraph.getPoints())).append("\n");
                // Join the fragments of each line, then the lines of the paragraph.
                StringBuilder paragraphText = new StringBuilder();
                for (List<TextFragment> line : paragraph.getLines()) {
                    for (TextFragment fragment : line) {
                        paragraphText.append(fragment.getText());
                    }
                    paragraphText.append("\r\n");
                }
                text.append(" Text: ").append(paragraphText).append("\n\n");
                paragraphIndex++;
            }
            sectionIndex++;
        }
        Files.writeString(outputFile, text.toString());
    }
}
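A paragraph polygon is a list of vertices, which is awkward to compare directly against a rectangular region. One common approach is to reduce the polygon to its axis-aligned bounding box first. The following is a minimal sketch of that idea, not part of the library: PolygonBounds, boundingBox, and contains are hypothetical names, and the vertices are represented as plain {x, y} pairs rather than the library's point type.

```java
// Hypothetical helper for illustration; not part of the PDF library used above.
final class PolygonBounds {

    /** Returns {minX, minY, maxX, maxY} for polygon vertices given as {x, y} pairs. */
    static double[] boundingBox(double[][] points) {
        if (points == null || points.length == 0) {
            throw new IllegalArgumentException("at least one point is required");
        }
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : points) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    /** True when the inner box lies entirely within the outer box (edges inclusive). */
    static boolean contains(double[] outer, double[] inner) {
        return inner[0] >= outer[0] && inner[1] >= outer[1]
                && inner[2] <= outer[2] && inner[3] <= outer[3];
    }

    public static void main(String[] args) {
        // Assumed sample polygon: a 100 x 40 box with its lower-left corner at (10, 20).
        double[][] polygon = {{10, 20}, {110, 20}, {110, 60}, {10, 60}};
        double[] box = boundingBox(polygon);
        System.out.println(java.util.Arrays.toString(box));
        System.out.println(contains(new double[] {0, 0, 200, 100}, box));
    }
}
```

To use this with the paragraphs extracted above, each point returned by paragraph.getPoints() would first be mapped to an {x, y} pair, assuming the point type exposes its coordinates through accessors such as getX() and getY().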