Improving Text Extraction from Multi‑Column PDFs
Reduce font size manually and then extract
In some multi-column layouts, reducing the font size of text fragments before extraction can improve reading order and reduce overlap issues. This technique can help with tightly formatted documents such as magazines, research papers, brochures, or reports with dense text columns.
- Load the PDF.
- Use TextFragmentAbsorber to collect the text fragments.
- Reduce the font size of each fragment, then save and reopen the document.
- Use TextAbsorber to extract the text.
- Write the extracted text to an output file.
import io

import aspose.pdf as ap


def extract_text_reduce_font(infile, outfile, reduce_ratio=0.7):
    """
    Extract text from a multi-column PDF by first reducing font size of all text fragments.

    Args:
        infile (str): Path to input PDF.
        outfile (str): Output text file.
        reduce_ratio (float): Factor applied to each font size
            (e.g., 0.7 scales fonts to 70% of their original size).
    """
    doc = ap.Document(infile)

    # Collect every text fragment in the document
    frag_absorber = ap.text.TextFragmentAbsorber()
    doc.pages.accept(frag_absorber)

    # Shrink each fragment's font size in place
    for frag in frag_absorber.text_fragments:
        frag.text_state.font_size = frag.text_state.font_size * reduce_ratio

    # Save to memory stream and reopen (to apply changes)
    ms = io.BytesIO()
    doc.save(ms)
    ms.seek(0)
    doc2 = ap.Document(ms)

    # Extract text from the modified copy
    text_absorber = ap.text.TextAbsorber()
    doc2.pages.accept(text_absorber)
    extracted_text = text_absorber.text

    with open(outfile, "w", encoding="utf-8") as tw:
        tw.write(extracted_text)
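The best `reduce_ratio` depends on the layout, so it can help to try a few values and compare the outputs side by side. The sketch below is one possible way to do that. `sweep_reduce_ratios` and its output file naming are hypothetical and not part of Aspose.PDF; the helper takes the extraction function as an argument and assumes it has the same signature as `extract_text_reduce_font` above.

```python
from pathlib import Path


def sweep_reduce_ratios(extract_fn, infile, out_dir, ratios=(0.9, 0.7, 0.5)):
    """Run extract_fn once per ratio and report the size of each output.

    extract_fn is assumed to have the signature
    extract_fn(infile, outfile, reduce_ratio), like extract_text_reduce_font.
    Returns a dict mapping each ratio to (output path, character count).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for ratio in ratios:
        # This helper is only meant for shrinking, so reject other values.
        if not 0 < ratio <= 1:
            raise ValueError(f"reduce_ratio must be in (0, 1], got {ratio}")
        # Hypothetical naming scheme: one text file per ratio, e.g. "report_r0.70.txt".
        outfile = out_dir / f"{Path(infile).stem}_r{ratio:.2f}.txt"
        extract_fn(str(infile), str(outfile), ratio)
        text = outfile.read_text(encoding="utf-8")
        results[ratio] = (outfile, len(text))
    return results
```

Calling `sweep_reduce_ratios(extract_text_reduce_font, "input.pdf", "out")` would write one text file per ratio into `out`. The character count is only a rough signal; reading order still needs to be checked by opening the files.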
Extract text with scale factor
Another option for multi-column extraction is to configure TextExtractionOptions with a scale factor. Adjusting the scale factor can improve interpretation of tightly packed fragments and help preserve a more accurate reading order in dense layouts, tables, or column-based documents.
- Load the PDF.
- Create a TextAbsorber.
- Configure TextExtractionOptions.scale_factor.
- Assign the extraction options to the absorber.
- Extract the page text and write the result to an output file.
import aspose.pdf as ap


def extract_text_scale_factor(infile, outfile, scale_factor=0.5):
    """
    Extract text from a PDF with multi-column layout using scale factor.

    Args:
        infile (str): Input PDF path.
        outfile (str): Output text file path.
        scale_factor (float): Scale factor between 0.1 and 1.0 or 0 for auto-scaling.
    """
    doc = ap.Document(infile)
    text_absorber = ap.text.TextAbsorber()
    ext_opts = ap.text.TextExtractionOptions(
        ap.text.TextExtractionOptions.TextFormattingMode.PURE
    )
    ext_opts.scale_factor = scale_factor
    text_absorber.extraction_options = ext_opts
    doc.pages.accept(text_absorber)
    extracted_text = text_absorber.text

    with open(outfile, "w", encoding="utf-8") as tw:
        tw.write(extracted_text)
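The docstring above describes the accepted values as 0.1 to 1.0, or 0 for auto-scaling. A small guard can catch out-of-range values before they reach the extraction options. This is a minimal sketch based only on that stated range. `check_scale_factor` is a hypothetical helper, not an Aspose.PDF API, and the library itself may accept other values.

```python
def check_scale_factor(scale_factor):
    """Return scale_factor as a float if it is 0 (auto-scaling) or in [0.1, 1.0].

    The allowed range is taken from the docstring of extract_text_scale_factor;
    it is an assumption about intended use, not a documented library limit.
    """
    value = float(scale_factor)
    if value == 0.0 or 0.1 <= value <= 1.0:
        return value
    raise ValueError(
        f"scale_factor must be 0 (auto) or between 0.1 and 1.0, got {scale_factor!r}"
    )
```

For example, `ext_opts.scale_factor = check_scale_factor(scale_factor)` could replace the direct assignment in `extract_text_scale_factor`.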