Convert PDF to Microsoft Word Documents in Python

Overview

This article explains how to convert PDF files to Microsoft Word documents using Python. It covers the following topics:

Format: DOC

Format: DOCX

Format: Word

Python PDF to DOC and DOCX Conversion

One of the most popular features is the PDF to Microsoft Word DOC conversion, which makes content management easier. Aspose.PDF for Python allows you to convert PDF files not only to DOC but also to DOCX format, easily and efficiently.

Convert PDF to DOC (Word 97-2003) file

Convert PDF file to DOC format with ease and full control. Aspose.PDF for Python is flexible and supports a wide variety of conversions. Converting pages from PDF documents to images, for example, is a very popular feature.

A conversion that many of our customers have requested is PDF to DOC: converting a PDF file to a Microsoft Word document. Customers want this because PDF files cannot easily be edited, whereas Word documents can. Some companies want their users to be able to manipulate text, tables and images in files that started as PDFs.

In keeping with the tradition of making things simple and understandable, Aspose.PDF for Python lets you transform a source PDF file into a DOC file with two lines of code. To support this, the API provides an enumeration named SaveFormat; its Doc value saves the source file in Microsoft Word format.

The following Python code snippet shows the process of converting a PDF file into DOC format.

Steps: Convert PDF to DOC in Python

  1. Create an instance of the Document object with the source PDF document.
  2. Save it in SaveFormat.Doc format by calling the save() method of the Document object.

from asposepdf import Api

# Open the source PDF document
input_pdf = "testdata/Hello.pdf"
doc = Api.Document(input_pdf)

# Save it in DOC (Word 97-2003) format
output_doc = "testout/out.doc"
doc.save(output_doc, Api.SaveFormat.Doc)
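
The sample above uses hard-coded paths and writes into a testout/ folder that may not exist yet. As a minimal sketch, assuming the same asposepdf package and only the calls shown above, the output path can be derived from the input file name and the folder created before saving. The helper names output_path_for and convert_pdf_to_doc are hypothetical and not part of the library.

```python
from pathlib import Path


def output_path_for(pdf_path, out_dir, extension=".doc"):
    """Return out_dir/<input file stem><extension> (hypothetical helper)."""
    return Path(out_dir) / (Path(pdf_path).stem + extension)


def convert_pdf_to_doc(pdf_path, out_dir):
    """Convert one PDF to DOC using the calls from the sample above."""
    # Imported here so the path helper works without the library installed.
    from asposepdf import Api

    target = output_path_for(pdf_path, out_dir)
    # Create the output folder if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)

    doc = Api.Document(str(pdf_path))
    doc.save(str(target), Api.SaveFormat.Doc)
    return target
```

For example, convert_pdf_to_doc("testdata/Hello.pdf", "testout") would write to testout/Hello.doc.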

Using the DocSaveOptions Class

The DocSaveOptions class provides numerous properties that refine the conversion of PDF files to DOC format. Among these properties, Mode lets you specify the recognition mode for PDF content, using any value from the RecognitionMode enumeration. Each of these values has specific benefits and limitations. The example below uses the Flow mode.


from asposepdf import Api

DIR_INPUT = "testdata/"
DIR_OUTPUT = "testout/"

input_pdf = DIR_INPUT + "Hello.pdf"
output_doc = DIR_OUTPUT + "convert_pdf_to_doc_with_options.doc"
# Open the PDF document
document = Api.Document(input_pdf)

save_options = Api.DocSaveOptions()
# Write the result in DOC (Word 97-2003) format
save_options.format = Api.DocSaveOptions.DocFormat.Doc
# Set the recognition mode to Flow
save_options.mode = Api.DocSaveOptions.RecognitionMode.Flow
# Set the relative horizontal proximity to 2.5
save_options.relative_horizontal_proximity = 2.5
# Recognize bullets during the conversion process
save_options.recognize_bullets = True

# Save the file in MS Word document format
document.save(output_doc, save_options)
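
After saving, it can be useful to confirm that a file was actually written before handing it to another step. The following is a minimal sketch that uses only the Python standard library. The helper name output_written is hypothetical, and the check looks at the file size only, not at the Word content.

```python
from pathlib import Path


def output_written(path, min_bytes=1):
    """Return True if path is an existing file of at least min_bytes bytes."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes
```

A call such as output_written("testout/convert_pdf_to_doc_with_options.doc") could follow the save step above.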

Convert PDF to DOCX

Aspose.PDF for Python API lets you read and convert PDF documents to DOCX using Python via .NET. DOCX is a well-known format for Microsoft Word documents whose structure was changed from plain binary to a combination of XML and binary files. DOCX files can be opened with Word 2007 and later versions, but not with the earlier versions of MS Word, which support DOC file extensions.

The following Python code snippet shows the process of converting a PDF file into DOCX format.

Steps: Convert PDF to DOCX in Python

  1. Create an instance of the Document object with the source PDF document.
  2. Save it in DOCX format by calling the save() method with a DocSaveOptions object whose format is set to DocFormat.Docx.


from asposepdf import Api

DIR_INPUT = "testdata/"
DIR_OUTPUT = "testout/"

input_pdf = DIR_INPUT + "Hello.pdf"
output_docx = DIR_OUTPUT + "convert_pdf_to_doc_with_options.docx"
# Open the PDF document
document = Api.Document(input_pdf)

save_options = Api.DocSaveOptions()
# Write the result in DOCX format
save_options.format = Api.DocSaveOptions.DocFormat.Docx
# Set the recognition mode to Flow
save_options.mode = Api.DocSaveOptions.RecognitionMode.Flow
# Set the relative horizontal proximity to 2.5
save_options.relative_horizontal_proximity = 2.5
# Recognize bullets during the conversion process
save_options.recognize_bullets = True

# Save the file in MS Word document format
document.save(output_docx, save_options)

The DocSaveOptions class has a property named Format, which specifies the format of the resultant document: DOC or DOCX. To convert a PDF file to DOCX format, set it to the Docx value of the DocSaveOptions.DocFormat enumeration.
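
Because the Format property is the only difference between the DOC and DOCX samples, the choice can be driven by the output file extension. The following is a minimal sketch under that assumption. The names doc_format_name and save_as_word are hypothetical, and only the Document, DocSaveOptions and DocFormat members shown in the samples above are used.

```python
from pathlib import Path

# Maps an output extension to the member name on DocSaveOptions.DocFormat.
_FORMAT_BY_EXTENSION = {".doc": "Doc", ".docx": "Docx"}


def doc_format_name(output_path):
    """Return "Doc" or "Docx" for the given output file name."""
    suffix = Path(output_path).suffix.lower()
    try:
        return _FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported Word extension: {suffix!r}") from None


def save_as_word(pdf_path, output_path):
    """Save a PDF as DOC or DOCX, depending on the output extension."""
    # Imported here so doc_format_name works without the library installed.
    from asposepdf import Api

    options = Api.DocSaveOptions()
    # Look up the enumeration member by name, e.g. DocFormat.Docx.
    options.format = getattr(Api.DocSaveOptions.DocFormat, doc_format_name(output_path))

    document = Api.Document(str(pdf_path))
    document.save(str(output_path), options)
```

Raising ValueError for any other extension keeps a mistyped output name from silently producing a file in the wrong format.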

See Also

This article also covers these topics. The code is the same as above.

Format: Word

Format: DOC

Format: DOCX