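Annotations and Special Text Using Python

Extract Text from Stamp Annotations

Use TextAbsorber to extract text embedded in a stamp annotation's appearance stream. This is useful when the stamp content is rendered as a Form XObject rather than stored as plain text.

Open the document.
Access the target annotation through page.annotations.
Check that it is a StampAnnotation, then retrieve its normal appearance XForm.
Pass the XForm to TextAbsorber.visit() to extract the embedded text.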

import os
import aspose.pdf as ap


def extract_text_from_stamp(infile, page_number, annotation_index, outfile):
    """
    Extracts text from a stamp annotation on a given page in a PDF document.
    Args:
        infile (str): Path to the input PDF file.
        page_number (int): 1-based index of the page containing the stamp.
        annotation_index (int): 1-based index of the annotation in that page.
        outfile (str): Path to the output text file where extracted text will be saved.
    """
    document = ap.Document(infile)
    try:
        page = document.pages[page_number]
        annot = page.annotations[annotation_index]
        # Ensure it's a StampAnnotation
        if isinstance(annot, ap.annotations.StampAnnotation):
            # Get normal appearance XForm of the stamp
            xform = annot.appearance["N"]
            absorber = ap.text.TextAbsorber()
            absorber.visit(xform)
            extracted = absorber.text
            with open(outfile, "w", encoding="utf-8") as f:
                f.write(extracted)
    finally:
        document.close()

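Extract Highlighted Text

Iterate over a page's annotations and call HighlightAnnotation.get_marked_text() to read the text span covered by each highlight. The page annotation collection is 1-based.

Open the document and select the target page.
Loop through page.annotations.
Use is_assignable to filter for HighlightAnnotation instances.
Cast the annotation and call get_marked_text() to retrieve the highlighted content.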

import aspose.pdf as ap
from aspose.pycore import cast, is_assignable


def extract_highlight_text(infile):
    """
    Extract text from highlight annotations.

    Args:
        infile (str): Input PDF filename

    Returns:
        None

    Example:
        extract_highlight_text("sample.pdf")

    Note:
        Prints marked text from each highlight annotation on first page.
    """
    document = ap.Document(infile)
    page = document.pages[1]

    for annotation in page.annotations:
        if is_assignable(annotation, ap.annotations.HighlightAnnotation):
            highlight_annotation = cast(ap.annotations.HighlightAnnotation, annotation)
            print(highlight_annotation.get_marked_text())
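
The function above prints each highlight as it is found. When the text needs to be reused, returning a list is often more convenient. The following is a minimal sketch of that filter-and-collect step, written so it runs without a PDF: collect_marked_text, is_highlight, and to_highlight are hypothetical names introduced here for illustration, and FakeHighlight and FakeLink are stand-ins for real annotation objects, not Aspose.PDF classes.

```python
from typing import Any, Callable, Iterable, List


def collect_marked_text(
    annotations: Iterable[Any],
    is_highlight: Callable[[Any], bool],
    to_highlight: Callable[[Any], Any] = lambda a: a,
) -> List[str]:
    """Return the marked text of every annotation accepted by is_highlight."""
    results = []
    for annotation in annotations:
        if is_highlight(annotation):
            # to_highlight plays the role of the cast step in the real API.
            results.append(to_highlight(annotation).get_marked_text())
    return results


# Stand-in classes (hypothetical) so the sketch can run without Aspose.PDF.
class FakeHighlight:
    def __init__(self, text):
        self._text = text

    def get_marked_text(self):
        return self._text


class FakeLink:
    pass


annotations = [FakeHighlight("first"), FakeLink(), FakeHighlight("second")]
print(collect_marked_text(annotations, lambda a: isinstance(a, FakeHighlight)))
# ['first', 'second']
```

With a real document, the two callables would presumably wrap the same calls used in extract_highlight_text: is_assignable(a, ap.annotations.HighlightAnnotation) for the filter and cast(ap.annotations.HighlightAnnotation, a) for the conversion.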

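Extract Superscript and Subscript Text

Superscripts and subscripts appear frequently in formulas, mathematical expressions, and chemical compound names. Aspose.PDF for Python via .NET supports extracting this content with TextFragmentAbsorber, which detects character-level positioning metadata.

Open the document.
Create a TextFragmentAbsorber instance.
Call document.pages[page_number].accept(absorber) to scan the target page.
Retrieve the full extracted text from absorber.text.
Write the result to a file and close the document.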

import os
import aspose.pdf as ap


def extract_super_sub_text(infile, outfile, page_number=1):
    """
    Extract text (including superscript/subscript) from a specified page of a PDF and write to a text file.
    Args:
        infile (str): Path to input PDF file.
        outfile (str): Path to output text file.
        page_number (int): 1-based index of the page to extract.
    """
    document = ap.Document(infile)
    try:
        absorber = ap.text.TextFragmentAbsorber()
        # Accept only the specific page for extraction
        document.pages[page_number].accept(absorber)
        extracted_text = absorber.text
        with open(outfile, "w", encoding="utf-8") as f:
            f.write(extracted_text)
    finally:
        document.close()

Iterate over Text Fragments to Detect Superscript/Subscript

For fragment-level inspection, iterate over absorber.text_fragments and read the Boolean flags text_state.superscript and text_state.subscript on each TextFragment.

Open the document and create a TextFragmentAbsorber.
Have the target page accept the absorber to populate absorber.text_fragments.
For each fragment, read fragment.text, fragment.text_state.superscript, and fragment.text_state.subscript.
Write the results to the output file and close the document.

import os
import aspose.pdf as ap


def extract_super_sub_details(infile, outfile, page_number=1):
    """
    Extract details of each text fragment on a page, identifying superscript and subscript items.
    Args:
        infile (str): Path to input PDF file.
        outfile (str): Path to output text file.
        page_number (int): 1-based page index.
    """
    document = ap.Document(infile)
    try:
        absorber = ap.text.TextFragmentAbsorber()
        document.pages[page_number].accept(absorber)

        with open(outfile, "w", encoding="utf-8") as f:
            for fragment in absorber.text_fragments:
                text = fragment.text
                is_sup = fragment.text_state.superscript  # True if superscript
                is_sub = fragment.text_state.subscript  # True if subscript
                f.write(
                    f"Text: '{text}' | Superscript: {is_sup} | Subscript: {is_sub}\n"
                )
    finally:
        document.close()
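
Once the per-fragment flags are available, one possible next step is to rebuild a single string in which superscripts and subscripts stay visible. The sketch below assumes the fragments have already been reduced to (text, is_superscript, is_subscript) tuples, as read in the loop above. render_with_markers and the ^{...} / _{...} notation are illustrative choices made here, not part of Aspose.PDF.

```python
from typing import Iterable, Tuple

# Each tuple stands in for
# (fragment.text, fragment.text_state.superscript, fragment.text_state.subscript).
Fragment = Tuple[str, bool, bool]


def render_with_markers(fragments: Iterable[Fragment]) -> str:
    """Join fragment texts, wrapping superscripts as ^{...} and subscripts as _{...}."""
    parts = []
    for text, is_sup, is_sub in fragments:
        if is_sup:
            parts.append("^{" + text + "}")
        elif is_sub:
            parts.append("_{" + text + "}")
        else:
            parts.append(text)
    return "".join(parts)


sample = [("H", False, False), ("2", False, True), ("O", False, False)]
print(render_with_markers(sample))  # H_{2}O
```

The fragments are concatenated without separators, so this only reproduces the original reading order if the absorber returns them in that order. Spacing between words would need separate handling.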

Improving Text Extraction from Multi-Column PDFs