Advanced Text Extraction from PowerPoint Presentations in Python

Overview

Extracting text from presentations is a common task for developers working with slide content. Whether you’re dealing with Microsoft PowerPoint files in PPT or PPTX format, or with OpenDocument presentations (ODP), retrieving their text can be critical for analysis, automation, indexing, or content migration.

This article shows how to extract text from PPT, PPTX, and ODP presentations using Aspose.Slides for Python. You’ll learn how to iterate through presentation elements to retrieve the text content you need.

Extract Text from a Slide

Aspose.Slides for Python provides the aspose.slides.util namespace, which includes the SlideUtil class. This class exposes several overloaded static methods for extracting all text from a presentation or slide. To extract text from a slide in a presentation, use the get_all_text_boxes method. This method accepts an object of type Slide as a parameter. When executed, the method scans the entire slide for text and returns an array of objects of type TextFrame, preserving any text formatting.

The following code snippet extracts all the text from the first slide of the presentation:

import aspose.slides as slides

# Instantiate the Presentation class that represents a PPTX file.
with slides.Presentation("sample.pptx") as presentation:
    slide = presentation.slides[0]
    # Get an array of TextFrame objects from the first slide.
    text_frames = slides.util.SlideUtil.get_all_text_boxes(slide)
    # Loop through the array of the text frames.
    for text_frame in text_frames:
        # Loop through paragraphs in the current text frame.
        for paragraph in text_frame.paragraphs:
            # Loop through text portions in the current paragraph.
            for portion in paragraph.portions:
                # Display the text in the current portion.
                print(portion.text)
                # Display the font height of the text.
                print(portion.portion_format.font_height)
                # Display the font name of the text.
                if portion.portion_format.latin_font is not None:
                    print(portion.portion_format.latin_font.font_name)
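
Printing each portion separately splits a paragraph wherever its formatting changes. The sketch below shows one way to reassemble paragraph text from portions. The frame_lines helper is hypothetical (it is not part of Aspose.Slides), and the demo uses stand-in objects that mimic only the paragraphs, portions, and text attributes from the snippet above, so that it runs without a presentation file.

```python
from types import SimpleNamespace


def frame_lines(text_frame):
    """Return one string per paragraph, joining the text of its portions."""
    return [
        "".join(portion.text for portion in paragraph.portions)
        for paragraph in text_frame.paragraphs
    ]


# Stand-in objects for illustration only; they are not Aspose.Slides types.
demo_frame = SimpleNamespace(paragraphs=[
    SimpleNamespace(portions=[SimpleNamespace(text="Hello, "),
                              SimpleNamespace(text="world")]),
    SimpleNamespace(portions=[SimpleNamespace(text="Second paragraph")]),
])

for line in frame_lines(demo_frame):
    print(line)
```

With a real presentation, the same helper could presumably be applied to each frame returned by get_all_text_boxes(slide), since it relies only on the attributes already used above.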

Extract Text from a Presentation

To scan text from the entire presentation, use the get_all_text_frames static method exposed by the SlideUtil class. It accepts two parameters:

  1. A Presentation object representing a PowerPoint or OpenDocument presentation from which text will be extracted.
  2. A Boolean value indicating whether the master slides should be included when scanning text from the presentation.

The method returns an array of objects of type TextFrame, including text formatting information. The code below scans the text and formatting details from a presentation, including the master slides.

import aspose.slides as slides

# Instantiate the Presentation class that represents a PPTX file.
with slides.Presentation("pres.pptx") as presentation:
    # Get an array of TextFrame objects from all slides, including the master slides.
    text_frames = slides.util.SlideUtil.get_all_text_frames(presentation, True)
    # Loop through the array of text frames.
    for text_frame in text_frames:
        # Loop through paragraphs in the current text frame.
        for paragraph in text_frame.paragraphs:
            # Loop through text portions in the current paragraph.
            for portion in paragraph.portions:
                # Display text in the current portion.
                print(portion.text)
                # Display the font height of the text.
                print(portion.portion_format.font_height)
                # Display the font name of the text.
                if portion.portion_format.latin_font is not None:
                    print(portion.portion_format.latin_font.font_name)
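
Because each portion carries its formatting, the same loop can feed a simple summary instead of a printout. The following is a minimal sketch that counts characters per font name and font height. The font_usage helper and the _portion stand-in factory are hypothetical names introduced for illustration, and the NaN handling rests on the assumption that an unset font height may be reported as NaN.

```python
import math
from collections import Counter
from types import SimpleNamespace


def font_usage(text_frames):
    """Count characters per (font name, font height) pair."""
    usage = Counter()
    for text_frame in text_frames:
        for paragraph in text_frame.paragraphs:
            for portion in paragraph.portions:
                fmt = portion.portion_format
                name = fmt.latin_font.font_name if fmt.latin_font is not None else None
                height = fmt.font_height
                # Assumption: an unset height may come back as NaN; map it to
                # None so that such portions share one key.
                if isinstance(height, float) and math.isnan(height):
                    height = None
                usage[(name, height)] += len(portion.text)
    return usage


def _portion(text, font_name, height):
    """Build a stand-in portion (illustration only, not an Aspose.Slides type)."""
    font = SimpleNamespace(font_name=font_name) if font_name else None
    fmt = SimpleNamespace(latin_font=font, font_height=height)
    return SimpleNamespace(text=text, portion_format=fmt)


demo_frames = [
    SimpleNamespace(paragraphs=[
        SimpleNamespace(portions=[_portion("Title", "Arial", 32.0)]),
    ]),
    SimpleNamespace(paragraphs=[
        SimpleNamespace(portions=[_portion("Body ", "Calibri", 18.0),
                                  _portion("text", "Calibri", 18.0),
                                  _portion("!", None, float("nan"))]),
    ]),
]

demo_usage = font_usage(demo_frames)
for (name, height), chars in demo_usage.most_common():
    print(name, height, chars)
```

With a real presentation, the list returned by get_all_text_frames(presentation, True) would take the place of demo_frames.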

Categorized and Fast Text Extraction

The PresentationFactory class also provides methods for extracting all text from a presentation:

PresentationFactory.get_presentation_text(stream, mode)
PresentationFactory.get_presentation_text(file, mode)
PresentationFactory.get_presentation_text(stream, mode, options)

The mode argument is a value of the TextExtractionArrangingMode enum. It determines how the extracted text is organized and can be set to one of the following values:

  • UNARRANGED - The raw text without regard to its position on the slide.
  • ARRANGED - The text is arranged in the same order as on the slide.

The UNARRANGED mode can be used when speed is critical; it’s faster than the ARRANGED mode.

These methods return a PresentationText object, which represents the raw text extracted from the presentation. It contains the slides_text property, which returns an array of objects of type ISlideText. Each object represents the text on the corresponding slide and has the following properties:

  • text - The text within the slide’s shapes.
  • master_text - The text within the master slide’s shapes associated with this slide.
  • layout_text - The text within the layout slide’s shapes associated with this slide.
  • notes_text - The text within the notes slide’s shapes associated with this slide.
  • comments_text - The text within comments associated with this slide.

The following code snippet extracts the text of the first slide in the UNARRANGED mode:

import aspose.slides as slides

arranging_mode = slides.TextExtractionArrangingMode.UNARRANGED
presentation_text = slides.PresentationFactory().get_presentation_text("sample.pptx", arranging_mode)
slide_text = presentation_text.slides_text[0]
print(slide_text.text)
print(slide_text.layout_text)
print(slide_text.master_text)
print(slide_text.notes_text)
print(slide_text.comments_text)
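
Some of these properties are often empty for a given slide, so printing them all can produce blank lines. The sketch below collects only the non-empty ones into a dictionary. The non_empty_sections helper and the FIELDS tuple are hypothetical names, and the demo object is a stand-in that mimics the five properties listed above rather than a real ISlideText.

```python
from types import SimpleNamespace

# Property names taken from the ISlideText list above.
FIELDS = ("text", "layout_text", "master_text", "notes_text", "comments_text")


def non_empty_sections(slide_text):
    """Map each non-empty text property of a slide-text object to its value."""
    sections = {}
    for name in FIELDS:
        value = getattr(slide_text, name, None)
        # Skip missing, empty, and whitespace-only values.
        if value and value.strip():
            sections[name] = value.strip()
    return sections


# Stand-in object for illustration only; it is not an Aspose.Slides type.
demo_slide_text = SimpleNamespace(
    text="Quarterly results\n",
    layout_text="",
    master_text="Company Inc.",
    notes_text=None,
    comments_text="  Check figures ",
)

demo_sections = non_empty_sections(demo_slide_text)
for name, value in demo_sections.items():
    print(f"{name}: {value}")
```

With a real presentation, each element of presentation_text.slides_text could be passed to the helper in turn.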