Extract Text from Presentation

Extract Text from Slide

Aspose.Slides for Node.js via Java provides the SlideUtil class, which exposes several overloaded static methods for extracting all the text from a presentation or slide. To extract the text from a slide in a PPTX presentation, use the getAllTextBoxes overloaded static method of the SlideUtil class. This method accepts a Slide object as a parameter. When executed, the method scans all the text on the slide passed as the parameter and returns an array of TextFrame objects, so any text formatting associated with the text is also available. The following code loops over the slides of the presentation and prints the text, font height, and font name of every portion on each slide:

// Instantiate the Presentation class that represents a PPTX file
var pres = new aspose.slides.Presentation("demo.pptx");
try {
    for (var s = 0; s < pres.getSlides().size(); s++) {
        let slide = pres.getSlides().get_Item(s);
        // Get an array of ITextFrame objects from the current slide
        var textFramesPPTX = aspose.slides.SlideUtil.getAllTextBoxes(slide);
        // Loop through the Array of TextFrames
        for (var i = 0; i < textFramesPPTX.length; i++) {
            // Loop through paragraphs in current ITextFrame
            for (let j = 0; j < textFramesPPTX[i].getParagraphs().getCount(); j++) {
                let para = textFramesPPTX[i].getParagraphs().get_Item(j);
                // Loop through portions in the current IParagraph
                for (let k = 0; k < para.getPortions().getCount(); k++) {
                    let port = para.getPortions().get_Item(k);
                    // Display text in the current portion
                    console.log(port.getText());
                    // Display font height of the text
                    console.log(port.getPortionFormat().getFontHeight());
                    // Display font name of the text
                    if (port.getPortionFormat().getLatinFont() != null) {
                        console.log(port.getPortionFormat().getLatinFont().getFontName());
                    }
                }
            }
        }
    }
} finally {
    pres.dispose();
}
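
When only the plain text is needed, the nested loops above can be wrapped in a small helper. The following is a minimal sketch rather than part of the library: the function name textFramesToLines is hypothetical, and it relies only on the getParagraphs, getPortions, getCount, get_Item, and getText calls already used in the example above.

```javascript
// Join the portions of each paragraph into one line of plain text.
// `textFrames` is assumed to be an array such as the one returned by
// SlideUtil.getAllTextBoxes(slide).
function textFramesToLines(textFrames) {
    const lines = [];
    for (let i = 0; i < textFrames.length; i++) {
        const paragraphs = textFrames[i].getParagraphs();
        for (let j = 0; j < paragraphs.getCount(); j++) {
            const portions = paragraphs.get_Item(j).getPortions();
            let line = "";
            for (let k = 0; k < portions.getCount(); k++) {
                line += portions.get_Item(k).getText();
            }
            // One entry per paragraph, even if the paragraph is empty
            lines.push(line);
        }
    }
    return lines;
}

// Hypothetical usage inside the slide loop:
// console.log(textFramesToLines(aspose.slides.SlideUtil.getAllTextBoxes(slide)).join("\n"));
```

Returning one string per paragraph keeps the paragraph boundaries visible while discarding the formatting information.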

Extract Text from Presentation

To scan the text from the whole presentation, use the getAllTextFrames static method exposed by the SlideUtil class. It takes two parameters:

  1. First, a Presentation object that represents the presentation from which the text is being extracted.
  2. Second, a boolean value determining whether the master slide is to be included when the text is scanned from the presentation.

The method returns an array of TextFrame objects, complete with text formatting information. The code below scans the text and formatting information from a presentation, including the master slides.

// Instantiate the Presentation class that represents a PPTX file
var pres = new aspose.slides.Presentation("demo.pptx");
try {
    // Get an Array of ITextFrame objects from all slides in the PPTX
    var textFramesPPTX = aspose.slides.SlideUtil.getAllTextFrames(pres, true);
    // Loop through the Array of TextFrames
    for (var i = 0; i < textFramesPPTX.length; i++) {
        // Loop through paragraphs in current ITextFrame
        for (let j = 0; j < textFramesPPTX[i].getParagraphs().getCount(); j++) {
            let para = textFramesPPTX[i].getParagraphs().get_Item(j);
            // Loop through portions in the current IParagraph
            for (let k = 0; k < para.getPortions().getCount(); k++) {
                let port = para.getPortions().get_Item(k);
                // Display text in the current portion
                console.log(port.getText());
                // Display font height of the text
                console.log(port.getPortionFormat().getFontHeight());
                // Display font name of the text
                if (port.getPortionFormat().getLatinFont() != null) {
                    console.log(port.getPortionFormat().getLatinFont().getFontName());
                }
            }
        }
    }
} finally {
    pres.dispose();
}
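
If the text and its formatting need to be processed rather than printed, the same traversal can collect them into plain objects. This is a sketch under stated assumptions: collectPortionInfo and the record field names are hypothetical, and only the accessors shown in the example above are used.

```javascript
// Collect text, font height, and Latin font name for every portion.
// `textFrames` is assumed to be an array such as the one returned by
// SlideUtil.getAllTextFrames(pres, true).
function collectPortionInfo(textFrames) {
    const records = [];
    for (let i = 0; i < textFrames.length; i++) {
        const paragraphs = textFrames[i].getParagraphs();
        for (let j = 0; j < paragraphs.getCount(); j++) {
            const portions = paragraphs.get_Item(j).getPortions();
            for (let k = 0; k < portions.getCount(); k++) {
                const port = portions.get_Item(k);
                const format = port.getPortionFormat();
                const font = format.getLatinFont();
                records.push({
                    text: port.getText(),
                    fontHeight: format.getFontHeight(),
                    // The Latin font may be unset on a portion; keep null in that case
                    fontName: font != null ? font.getFontName() : null,
                });
            }
        }
    }
    return records;
}

// Hypothetical usage:
// const records = collectPortionInfo(aspose.slides.SlideUtil.getAllTextFrames(pres, true));
```

The records can then be filtered or grouped, for example by font name, without touching the presentation objects again.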

Categorized and Fast Text Extraction

The getPresentationText method is exposed by the PresentationFactory class, whose instance is obtained with PresentationFactory.getInstance(). There are three overloads of this method:

IPresentationText getPresentationText(String file, int mode);
IPresentationText getPresentationText(InputStream stream, int mode);
IPresentationText getPresentationText(InputStream stream, int mode, ILoadOptions options);

The TextExtractionArrangingMode enum argument specifies how the extracted text is organized. It can be set to the following values:

  • Unarranged - The raw text, without regard to its position on the slide
  • Arranged - The text is positioned in the same order as on the slide

Unarranged mode can be used when speed is critical; it is faster than Arranged mode.
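
The choice between the two modes can be made in one place. The sketch below is illustrative only: the function extractPresentationText and its speedCritical flag are hypothetical, and the aspose parameter is assumed to be the loaded aspose.slides.via.java module.

```javascript
// Pick the arranging mode from a flag and run the extraction.
// `aspose` is assumed to be the loaded Aspose.Slides module and is passed in
// explicitly so the function has no hidden dependencies.
function extractPresentationText(aspose, file, speedCritical) {
    const modes = aspose.slides.TextExtractionArrangingMode;
    // Unarranged is the faster mode; Arranged keeps the on-slide order
    const mode = speedCritical ? modes.Unarranged : modes.Arranged;
    return aspose.slides.PresentationFactory.getInstance().getPresentationText(file, mode);
}

// Hypothetical usage:
// const text = extractPresentationText(aspose, "presentation.pptx", true);
```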

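PresentationText represents the raw text extracted from the presentation. It contains a getSlidesText method, which returns an array of SlideText objects. Each object represents the text on the corresponding slide. A SlideText object has the following methods:

  • getText - The text on the slide's shapes
  • getLayoutText - The text on the shapes of the slide's layout
  • getMasterText - The text on the shapes of the slide's master
  • getNotesText - The text on the shapes of the slide's notes page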

There is also a SlideText class, which implements the ISlideText interface.

The new API can be used like this:

var text1 = aspose.slides.PresentationFactory.getInstance().getPresentationText("presentation.pptx", aspose.slides.TextExtractionArrangingMode.Unarranged);
console.log(text1.getSlidesText()[0].getText());
console.log(text1.getSlidesText()[0].getLayoutText());
console.log(text1.getSlidesText()[0].getMasterText());
console.log(text1.getSlidesText()[0].getNotesText());
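
The example above reads only the first slide. To cover every slide, the array returned by getSlidesText can be walked in a loop. The following is a minimal sketch, assuming the array index matches the slide order; the function slidesTextToRecords and the record field names are hypothetical.

```javascript
// Convert the extraction result into one plain object per slide.
// `presentationText` is assumed to be the object returned by getPresentationText.
function slidesTextToRecords(presentationText) {
    const slidesText = presentationText.getSlidesText();
    const records = [];
    for (let i = 0; i < slidesText.length; i++) {
        records.push({
            slideNumber: i + 1, // 1-based position in the returned array
            text: slidesText[i].getText(),
            layoutText: slidesText[i].getLayoutText(),
            masterText: slidesText[i].getMasterText(),
            notesText: slidesText[i].getNotesText(),
        });
    }
    return records;
}

// Hypothetical usage with text1 from the example above:
// console.log(JSON.stringify(slidesTextToRecords(text1), null, 2));
```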