Extract SuperScripts and SubScripts text from PDF

Extract SuperScripts and SubScripts Text

Extracting text from a PDF document is a common task. However, the SuperScripts and SubScripts that are typical for technical documents and articles may be lost in the extracted text. A SubScript or SuperScript is a character, number, or letter placed below or above the regular line of text. It is usually smaller than the rest of the text.

SubScripts and SuperScripts are most often used in formulas, mathematical expressions, and specifications of chemical compounds. They are hard to edit when many of them appear in the same passage of text. In one of its recent releases, the Aspose.PDF for .NET library added support for extracting SuperScripts and SubScripts text from PDF.

Use the TextFragmentAbsorber class to collect the text from a page; you can then work with the result however you need. In the simplest case, use the entire extracted text, as in the following code snippet:

The following code snippet also works with the Aspose.PDF.Drawing library.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractSuperScriptsAndSubScripts()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "SuperScriptExample.pdf"))
    {
        // Create an absorber
        var absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
        document.Pages[1].Accept(absorber);
        using (var writer = new System.IO.StreamWriter(dataDir + "SuperScriptExample_out.txt"))
        {
            // Write the whole extracted text to a text file
            writer.WriteLine(absorber.Text);
        }
    }
}

Alternatively, process the TextFragments individually and manipulate them as needed, for example, sort them by coordinates or by size.

The following code snippet also works with the Aspose.PDF.Drawing library.

// For complete examples and data files, visit https://github.com/aspose-pdf/Aspose.PDF-for-.NET
private static void ExtractSuperScriptsAndSubScriptsWithTextFragments()
{
    // The path to the documents directory
    var dataDir = RunExamples.GetDataDir_AsposePdf();

    // Open PDF document
    using (var document = new Aspose.Pdf.Document(dataDir + "SuperScriptExample.pdf"))
    {
        // Create an absorber
        var absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
        document.Pages[1].Accept(absorber);
        using (var writer = new System.IO.StreamWriter(dataDir + "SuperScriptExample_out.txt"))
        {
            foreach (var textFragment in absorber.TextFragments)
            {
                // Write the text of each fragment to a text file
                writer.Write(textFragment.Text);
            }
        }
    }
}
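
The sorting mentioned above could look roughly like the following sketch. It is an illustration under stated assumptions, not an official example: the class name `FragmentSortingSketch`, the method name, and the `inputPath` and `outputPath` parameters are hypothetical, and the sort keys (font size first, then position) are just one possible choice. It relies on the `TextState.FontSize` and `Position.XIndent` / `Position.YIndent` properties of `TextFragment`.

```csharp
using System.IO;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class FragmentSortingSketch
{
    // inputPath and outputPath are hypothetical parameters used for illustration
    public static void WriteFragmentsSortedBySize(string inputPath, string outputPath)
    {
        // Open PDF document
        using (var document = new Document(inputPath))
        {
            // Collect the text fragments of the first page
            var absorber = new TextFragmentAbsorber();
            document.Pages[1].Accept(absorber);

            // Largest font size first; fragments of equal size are ordered
            // top to bottom (PDF Y coordinates grow upwards), then left to right
            var sorted = absorber.TextFragments
                .Cast<TextFragment>()
                .OrderByDescending(f => f.TextState.FontSize)
                .ThenByDescending(f => f.Position.YIndent)
                .ThenBy(f => f.Position.XIndent);

            using (var writer = new StreamWriter(outputPath))
            {
                foreach (var fragment in sorted)
                {
                    // Write the font size and the text of each fragment on its own line
                    writer.WriteLine($"{fragment.TextState.FontSize}\t{fragment.Text}");
                }
            }
        }
    }
}
```

Because SuperScripts and SubScripts are usually set in a smaller font than the surrounding text, ordering by size tends to group them at the end of the output. Sorting by position alone may place a SuperScript before the regular text of its own line, since it sits slightly higher on the page.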

Extract Paragraph from PDF C#