Improving Text Extraction from Multi-Column PDFs

Reduce font size manually and then extract

In some multi-column layouts, reducing the font size of text fragments before extraction can improve reading order and reduce overlap issues. This technique can help with tightly formatted documents such as magazines, research papers, brochures, or reports with dense text columns.

  1. Load the PDF.
  2. Use TextFragmentAbsorber to collect the text fragments.
  3. Reduce the font size of each fragment, then save and reopen the document.
  4. Use TextAbsorber to extract the text.
  5. Write the extracted text to an output file.

import io
import aspose.pdf as ap


def extract_text_reduce_font(infile, outfile, reduce_ratio=0.7):
    """
    Extract text from a multi-column PDF by first reducing font size of all text fragments.
    Args:
        infile (str): Path to input PDF.
        outfile (str): Output text file.
        reduce_ratio (float): Factor applied to each font size (e.g., 0.7 scales fonts to 70% of their original size).
    """
    doc = ap.Document(infile)

    # Collect every text fragment in the document.
    frag_absorber = ap.text.TextFragmentAbsorber()
    doc.pages.accept(frag_absorber)

    # Scale each fragment's font size by the given ratio.
    for frag in frag_absorber.text_fragments:
        frag.text_state.font_size = frag.text_state.font_size * reduce_ratio

    # Save to an in-memory stream and reopen so the changes take effect.
    ms = io.BytesIO()
    doc.save(ms)
    ms.seek(0)
    doc2 = ap.Document(ms)

    # Extract the text from the modified copy.
    text_absorber = ap.text.TextAbsorber()
    doc2.pages.accept(text_absorber)
    extracted_text = text_absorber.text

    # Write the result as UTF-8 text.
    with open(outfile, "w", encoding="utf-8") as tw:
        tw.write(extracted_text)
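
The function above applies reduce_ratio to every fragment without checking it. The following is a minimal sketch of a guard, assuming you want to reject ratios outside the range (0, 1] and keep fonts from shrinking below a floor. The names reduced_font_size and min_size are hypothetical and are not part of Aspose.PDF.

```python
def reduced_font_size(size, ratio, min_size=1.0):
    """Return size scaled by ratio, never below min_size.

    Hypothetical helper for illustration; not an Aspose.PDF API.
    """
    # A ratio of 0 or less would erase the text; above 1 would enlarge it.
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be greater than 0 and at most 1")
    # Apply the ratio, but keep the result at or above the floor.
    return max(size * ratio, min_size)
```

Inside the loop, the assignment could then read frag.text_state.font_size = reduced_font_size(frag.text_state.font_size, reduce_ratio).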

Extract text with scale factor

Another option for multi-column extraction is to set a scale factor on TextExtractionOptions. The scale factor accepts a value from 0.1 to 1.0, or 0 for automatic scaling. Adjusting it can improve how tightly packed fragments are interpreted and help preserve a more accurate reading order in dense layouts, tables, or column-based documents.

  1. Load the PDF.
  2. Create a TextAbsorber.
  3. Configure TextExtractionOptions.scale_factor.
  4. Assign the extraction options to the absorber.
  5. Extract the page text and write the result to an output file.

import aspose.pdf as ap


def extract_text_scale_factor(infile, outfile, scale_factor=0.5):
    """
    Extract text from a PDF with multi-column layout using scale factor.
    Args:
        infile (str): Input PDF path.
        outfile (str): Output text file path.
        scale_factor (float): Value from 0.1 to 1.0, or 0 for automatic scaling.
    """
    doc = ap.Document(infile)
    text_absorber = ap.text.TextAbsorber()

    # Use pure formatting mode and apply the requested scale factor.
    ext_opts = ap.text.TextExtractionOptions(
        ap.text.TextExtractionOptions.TextFormattingMode.PURE
    )
    ext_opts.scale_factor = scale_factor
    text_absorber.extraction_options = ext_opts

    # Extract the text from all pages.
    doc.pages.accept(text_absorber)
    extracted_text = text_absorber.text

    # Write the result as UTF-8 text.
    with open(outfile, "w", encoding="utf-8") as tw:
        tw.write(extracted_text)
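
The docstring above gives the accepted range for scale_factor as 0.1 to 1.0, or 0 for auto-scaling. The following is a minimal sketch of a check that could run before the value is assigned to the extraction options. The name normalize_scale_factor is hypothetical and is not part of Aspose.PDF, and raising an error for out-of-range values is an assumption about the desired behavior, not something the library is known to do.

```python
def normalize_scale_factor(value):
    """Return value as a float if it fits the documented range.

    Hypothetical helper for illustration; not an Aspose.PDF API.
    0 is passed through as the auto-scaling setting; any other value
    must lie between 0.1 and 1.0 inclusive.
    """
    if value == 0:
        return 0.0
    if 0.1 <= value <= 1.0:
        return float(value)
    raise ValueError("scale_factor must be 0 or between 0.1 and 1.0")
```

With this in place, the assignment could read ext_opts.scale_factor = normalize_scale_factor(scale_factor), so that an invalid value fails early instead of reaching the extractor.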