What is a PDF file? | Knowledge Base

Introduction

As you already know from the article about “PDLs”, PDF is a static Page Description Language that has a strict unchangeable structure.

PDF is one of the most popular Page Description Languages, if not the most popular, thanks to the wide variety of features that Adobe's developers added to its specification. Adobe also provides tools that implement these features in documents. This article is a brief review of the syntax, structure, and features of PDF.

What is a PDF file?

The initial goal of PDF (Portable Document Format) was to create a document format that satisfies the numerous requirements of digital document interchange in device-independent and resolution-independent environments. These requirements include interactive viewing, high-performance navigation, low disk space usage, collaboration on documents, support for different media content, encryption, signing, form creation, presentation, and so on. Although the initial intent was to provide enterprises with an exhaustive format for digital document interchange, high-quality printing features were added to the specification later.

Syntax of PDF file

PDF has an imaging model derived from PostScript's. Like the AI format, it uses short operators, typically 1-2 characters long, and it has a postfix syntax in which all necessary operands go before the operator:

operand1...operandm operator

Besides operator length, there are other differences between PDF and PostScript operators. In PDF, all necessary operands must immediately precede the operator, while in PostScript operands are taken from the PostScript stack. A PDF operator does not return a result, as a PostScript operator can. A PDF operator performs an action that composes a page, such as drawing graphics or text, or sets a property of the graphics state. In PostScript, operators do all the work, including computation and control flow.
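The postfix rule can be illustrated with a small Python sketch. It assumes a simplified content stream that contains only numbers, names, and operators (no strings, arrays, or dictionaries), and the function name `parse_operations` is hypothetical, not part of any library.

```python
from typing import List, Tuple, Union

Operand = Union[float, str]


def parse_operations(content: str) -> List[Tuple[str, List[Operand]]]:
    """Group a simplified content stream into (operator, operands) pairs.

    Hypothetical helper: it only understands numbers and /Names as operands.
    """
    operations: List[Tuple[str, List[Operand]]] = []
    operands: List[Operand] = []
    for token in content.split():
        if token.startswith("/"):
            # A name such as /Alpha1 is an operand.
            operands.append(token)
            continue
        try:
            # Numbers are operands and are collected until an operator appears.
            operands.append(float(token))
        except ValueError:
            # Anything else is treated as an operator that consumes
            # all operands collected since the previous operator.
            operations.append((token, operands))
            operands = []
    return operations


ops = parse_operations("0 0 m\n595 0 l\n/Alpha1 gs\nh\nS")
```

Each operator is paired only with the operands written directly before it, which is why no operand stack has to survive from one operator to the next.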

Usually, most of the content of a PDF file is compressed with Flate encoding and is therefore binary. In addition to compression, PDF files can be encrypted to limit access to document content. Therefore, the whole file must be treated as binary. Only when a PDF file is neither compressed nor encrypted and contains no binary content, such as images, sound, or video, can it be considered textual.

PDF specification Objects

In the PDF specification, object is a synonym for type, while PostScript has types that can be primitive or complex, and only the latter can be called “objects”. Therefore all types in PDF, whether simple or complex, are objects. The PDF object types are boolean values, integers, real numbers, names, strings, arrays, dictionaries, streams, and the null object. Strings can be written in literal or hexadecimal format, as shown below.

( This is a string )                    literal format
<4E6F762073686D6F7A206B6120706F702E>    hexadecimal format
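As a minimal sketch of how a hexadecimal string maps to bytes, the Python function below decodes the hexadecimal example. The name `decode_hex_string` is hypothetical. It relies on two rules of the format: whitespace between digits is ignored, and a missing final digit is assumed to be 0.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as <4E6F76> into raw bytes."""
    token = token.strip()
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("a hexadecimal string must be enclosed in < and >")
    # Drop the delimiters and any whitespace between the digits.
    digits = "".join(token[1:-1].split())
    # An odd number of digits means the last digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


text = decode_hex_string("<4E6F762073686D6F7A206B6120706F702E>").decode("ascii")
```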

Arrays are enclosed in square brackets. A common special case is the Rectangle, an array of 4 numbers.

Dictionaries store data as key-value pairs, where the key is a name (or a string, in the case of the Names dictionary) and the value is an object or an object reference. A dictionary is enclosed in double angle brackets. Dictionaries usually have a Type entry that shows what kind of data is stored in the dictionary.

<< /Type /Example
  /Subtype /DictionaryExample
  /Version 0.01
  /IntegerItem 12
  /StringItem (a string)
  /Subdictionary << /Item1 0.4
    /Item2 true
    /LastItem (not!)
    /VeryLastItem (OK)
  >>
>>

Objects can be direct or indirect. Indirect objects are those that other objects can refer to by their ID, which consists of an object number and a generation number.
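In the samples in this article, a reference to an indirect object is written as the object number, the generation number, and the keyword R, for example 473 0 R. The Python sketch below collects such references from a piece of PDF source text. It is deliberately naive (it would also match the same pattern inside a string), and `find_references` is a hypothetical name.

```python
import re
from typing import List, Tuple

# object number, generation number, then the keyword R as a whole word
REFERENCE = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def find_references(source: str) -> List[Tuple[int, int]]:
    """Return (object number, generation number) for every reference found."""
    return [(int(num), int(gen)) for num, gen in REFERENCE.findall(source)]


catalog_refs = find_references("<< /Type /Catalog /Pages 3 0 R /Outlines 4 0 R >>")
```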

PDF indirect objects

Streams are objects that usually contain binary or encoded data. They are generally not human-readable and have no limit on length. In PDF files, streams usually contain compressed page content, images, or other media. A stream object consists of a direct dictionary, which gives the length of the stream and an array of filters used to encode it, followed by the encoded data between the keywords stream and endstream.

181 0 obj
  <<
    /Length 473 0 R
    /Subtype /Image
    /Width 2
    /Height 19
    /BitsPerComponent 8
    /ColorSpace /DeviceGray
    /Filter [/ASCII85Decode /FlateDecode]
  >>
stream
Gb"[2*s<F2i'/7_!,1%/hZ~>
endstream
endobj
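The /Filter array lists the filters in the order in which they are applied when decoding: here ASCII85Decode first, then FlateDecode. The Python sketch below shows that round trip with the standard `base64` and `zlib` modules. The function names are hypothetical, the payload is made-up sample data rather than the image above, and the `~>` end-of-data marker is appended and stripped by hand.

```python
import base64
import zlib


def encode_stream(raw: bytes) -> bytes:
    """Encode data for /Filter [/ASCII85Decode /FlateDecode]."""
    # Encoding runs in the reverse order of the filter array:
    # compress with Flate first, then wrap the result in ASCII85.
    compressed = zlib.compress(raw)
    return base64.a85encode(compressed) + b"~>"


def decode_stream(encoded: bytes) -> bytes:
    """Decode data stored with /Filter [/ASCII85Decode /FlateDecode]."""
    body = encoded.strip()
    if body.endswith(b"~>"):
        body = body[:-2]  # drop the ASCII85 end-of-data marker
    # Decoding follows the filter array from left to right.
    return zlib.decompress(base64.a85decode(body))


sample = b"0 0 m 595 0 l 595 842 l 0 842 l h S"
round_trip = decode_stream(encode_stream(sample))
```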

PDF Operators

Operators are keywords in a content stream that build the page graphics and, as mentioned earlier, are 1 or 2 letters long in most cases. There are two kinds of PDF operators:

* executing actions or setting properties of the graphics state.

PDF operator            Description

x y m                   begin a new subpath by moving the current point to coordinates (x, y)
x y l                   append a straight line segment from the current point to the point (x, y)
x1 y1 x2 y2 x3 y3 c     append a cubic Bezier curve to the current path
h                       close the current subpath
x y width height re     append a rectangle to the current path
a b c d e f cm          modify the current transformation matrix by concatenating the specified matrix
S                       stroke the path
s                       close and stroke the path
f                       fill the path
F                       equivalent to f; included for compatibility
W                       modify the current clipping path by intersecting it with the current path
font size Tf            set the text font to font and the text font size to size
charSpace Tc            set the character spacing to charSpace
q                       save the current graphics state on the graphics state stack
Q                       restore the graphics state from the graphics state stack
lineWidth w             set the line width in the graphics state
lineCap J               set the line cap style in the graphics state

* grouping

PDF operator    Description

BT...ET         begin and end a text object
BI...EI         begin and end an image object
BMC...EMC       begin and end a marked-content sequence
BX...EX         begin and end a compatibility section

BX…EX are a special kind of grouping operator. They enclose portions of page content in which unrecognized operators must be ignored. Thus, they are equivalent to the AI %_ pseudo-comments.

PDF file structure

A PDF file has four mandatory structural elements.

PDF file structure

  1. One-line header, where the version of PDF language is written

%PDF-1.5

  2. Body, which contains the document’s objects. The structure of the body is described later in this article.

  3. Cross-reference table, which is used for quick random access to the document’s objects. For each object, it stores the offset in bytes from the start of the file to the beginning of the object.

xref
0 6
0000000003 65535 f
0000000017 00000 n
0000000081 00000 n
0000000000 00007 f
0000000331 00000 n
0000000409 00000 n
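To make the layout of the table concrete, here is a Python sketch that parses the sample above. It assumes a plain-text cross-reference section in which each subsection starts with the first object number and the number of entries, and each entry holds a 10-digit offset, a 5-digit generation number, and the flag n (in use) or f (free). The names `XrefEntry` and `parse_xref_section` are hypothetical.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    obj_num: int
    offset: int       # byte offset for "n" entries; next free object for "f"
    generation: int
    in_use: bool


def parse_xref_section(text: str) -> List[XrefEntry]:
    """Parse a textual cross-reference section into a list of entries."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("section must start with the keyword xref")
    entries: List[XrefEntry] = []
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for n in range(count):
            offset, generation, flag = lines[i + n].split()
            entries.append(
                XrefEntry(first + n, int(offset), int(generation), flag == "n")
            )
        i += count
    return entries


SAMPLE = """xref
0 6
0000000003 65535 f
0000000017 00000 n
0000000081 00000 n
0000000000 00007 f
0000000331 00000 n
0000000409 00000 n
"""

entries = parse_xref_section(SAMPLE)
```

With the entries in hand, a reader can seek straight to `entries[n].offset` for any object that is in use, without scanning the body.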

  4. Trailer, which points to the last cross-reference table and contains the total number of entries in the cross-reference tables, the ID of the document, and references to:
    • the previous cross-reference table, if there are several in the file;
    • the document Root, which is represented by the Catalog dictionary;
    • the Meta information dictionary, with Author, Creator, Title, Keywords, and creation and modification date fields;
    • the Encryption dictionary, if the document is encrypted.

trailer
  <<
    /Size 15
    /Root 2 0 R
    /Info 1 0 R
  >>
startxref
6224

A new cross-reference table and trailer are added after every update of the document. This is described later in the article.

Document structure

The PDF document has a tree-like structure where the root is a Catalog dictionary.

PDF document structure

The Catalog contains references to the page description subtree, the outline subtree, and other document-level subtrees and leaf nodes.

2 0 obj
  << /Type /Catalog
    /Pages 3 0 R
    /Outlines 4 0 R
    /PageMode /UseOutlines
    /ViewerPreferences 5 0 R
    /OpenAction [6 0 R /Fit]
  >>
endobj

The Pages tree defines the ordering of page-tree nodes and page-leaf nodes. It is precisely this tree-like structure of the set of pages, together with a search algorithm, that allows quick navigation across thousands of pages to find the one needed.
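One possible search algorithm is sketched below in Python. It assumes that every page-tree node carries a Count entry with the number of page leaves beneath it, so whole subtrees can be skipped without being visited. The nodes are modelled as plain dictionaries, and `find_page` is a hypothetical helper rather than an API of any PDF library.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def find_page(node: Node, index: int) -> Node:
    """Return the page leaf with the given zero-based index under node."""
    if node["Type"] == "Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node
    for kid in node["Kids"]:
        # A page-tree node covers Count pages; a page leaf covers one.
        count = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < count:
            return find_page(kid, index)
        index -= count  # skip this whole subtree
    raise IndexError("page index out of range")
```

For a balanced tree, the number of nodes inspected grows with the depth of the tree rather than with the total number of pages.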

PDF page content stream

A Page dictionary contains a reference to a Content stream, which can be compressed, as in the figure above, or uncompressed. In the latter case, the PDF operators appear as human-readable text, as in the example below.

7 0 obj
  <<
    /Length 8 0 R
  >>
stream
1 0 0 1 0 0 cm
0 0 m
595 0 l
595 842 l
0 842 l
h
W
n
q
/Alpha1 gs
0 0 0 rg
0 0 0 RG
0 J
q
0.96593 0.25882 -0.25882 0.96593 0 0 cm
1 0 0 1 0 0.25882 cm
0.02 w
-0.96593 0 m
0 -0.25882 l
0 -0.25882 0 -0.25882 0 -0.25882 c
0.14294 -0.25882 0.25882 -0.14294 0.25882 0 c
0.25882 0.14294 0.14294 0.25882 0 0.25882 c
h
S
Q
endstream
endobj
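The three cm operators in this stream can be followed with a few lines of Python. The sketch assumes the usual convention that `a b c d e f cm` replaces the current transformation matrix (CTM) with the product of the given matrix and the old CTM, and that a point maps to (a·x + c·y + e, b·x + d·y + f). The helper names are hypothetical. The second matrix is a rotation by 15 degrees, since cos 15° ≈ 0.96593 and sin 15° ≈ 0.25882.

```python
import math
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # a b c d e f

IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m: Matrix, ctm: Matrix) -> Matrix:
    """Return m x ctm, the CTM after the operator `m cm`."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def transform(ctm: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a point from user space through the CTM."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


ctm = IDENTITY
for matrix in [
    (1.0, 0.0, 0.0, 1.0, 0.0, 0.0),                   # 1 0 0 1 0 0 cm
    (0.96593, 0.25882, -0.25882, 0.96593, 0.0, 0.0),  # rotation by 15 degrees
    (1.0, 0.0, 0.0, 1.0, 0.0, 0.25882),               # translation along y
]:
    ctm = concat(matrix, ctm)

# Where the operator "-0.96593 0 m" puts the start of the subpath.
start_x, start_y = transform(ctm, -0.96593, 0.0)
```

Under these assumptions the start point of the subpath lands at approximately (-1, 0) in the original coordinate system.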

Besides an array of child nodes (which can be page-tree or page nodes), the Pages dictionary contains a reference to the Resources dictionary, which in turn refers to Fonts, ProcSets, Images (XObject), etc.

9 0 obj
  <<
    /ProcSet 10 0 R
    /XObject 11 0 R
    /Font 12 0 R
    /ExtGState 13 0 R
  >>
endobj

Annotation and other subtrees are mentioned briefly in the Features section.

Features

Graphics possibilities of PDF format

There is no point in listing the capabilities for drawing graphics and text that are common to most Page Description Languages. We will just say that the range of supported fonts and color spaces is the same as in PostScript.

Fonts

- Adobe Type 0
- Adobe Type 1
- Compact Fonts (CFF)
- Chameleon
- TrueType
- CID-keyed

Color spaces

- DeviceGray
- DeviceRGB
- DeviceCMYK
- DeviceN
- Separated colors
- Spot
- CIE-based

Transparency

PDF supports transparency.

External files

Any media or document file can be embedded in a PDF or referred to from a document.

Hyperlinks are supported in PDF.

Selective and interactive view

PDF allows showing only the parts of the content, and of its appearance, that are necessary for a certain use while hiding the others. This is useful, for example, when importing Adobe Illustrator graphics that have layers, some of which are needed for working in Adobe Illustrator but not for viewing in Adobe Acrobat Reader. Another case of selective view is an article that is written in different languages, or presented for users with disabilities, but saved in one document. There can also be different views for different uses: viewing, designing, and printing.

An interactive view of PDF includes abilities:

Annotation is a sort of floating box containing some notes, sound, video, or some other content.

Interactive navigation

Navigating between different parts of documents can be realized in several ways:

Moving by viewports and hiding some parts of the document is realized by means of Viewport and NavigationNode dictionaries.

Incremental updates

All changes made to a PDF document are appended to the file without erasing the previous content. Every time the document is changed, a new xref (cross-reference table) and trailer are added. The new cross-reference table contains entries for the added or removed objects, and the new trailer refers to the previous cross-reference table. This mechanism makes it possible to assemble the final document content and, at the same time, keep the previous states of the document.
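The way a reader assembles the final state from several cross-reference tables can be sketched in Python as follows. Each table is modelled as a mapping from object number to byte offset, with None marking an object that the update removed. The function name `resolve_objects` and all the numbers are made up for illustration.

```python
from typing import Dict, List, Optional

XrefTable = Dict[int, Optional[int]]  # object number -> offset, None if removed


def resolve_objects(tables: List[XrefTable]) -> Dict[int, int]:
    """Merge cross-reference tables, ordered from oldest to newest."""
    current: Dict[int, int] = {}
    for table in tables:
        for obj_num, offset in table.items():
            if offset is None:
                # The update marked this object as free.
                current.pop(obj_num, None)
            else:
                # A newer entry overrides any older one for the same object.
                current[obj_num] = offset
    return current


original = {1: 17, 2: 81, 4: 331}
update = {2: 6300, 4: None, 5: 6450}
latest = resolve_objects([original, update])
```

Passing only `[original]` yields the document as it was before the update, which is how the previous states remain recoverable.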

Performance

Fast navigation through pages is provided by the tree-like structure of Pages and an effective search algorithm. However, performance can be increased further by combining repetitive graphics elements into one object, called a Form XObject, and using that one object in all the places where it is needed. There is also a way to optimize the whole document for high-performance viewing: linearization. Linearization was originally invented for efficient viewing of PDF documents accessed over the web. A linearized PDF document is effectively read-only: any change to it requires linearizing the document again.

High performance of navigating between document objects is realized by cross-reference tables that store object offsets from the start of the file.

Compression

Compression of PDF documents, usually with Flate encoding, allows the creation of large documents that take up relatively little disk space. For example, the PDF specification file, which contains 758 pages with an outline, thumbnails, images, and tables, is about 9 MB in size.

Security

PDF documents can be encrypted to give differentiated access to certain users only, and they can be signed. The digital signing feature allows authentication of the identity of the user and of the document’s content. A digital signature binds the state of the document at the time of signing to the user's information. A digital signature can take any form, from purely mathematical to a retinal scan, if a corresponding signature handler is provided.

Interactive forms

Interactive forms, also called AcroForms, are used for gathering information from users. They can validate, format, and send user data to a server.

Presentation

There are several means of presentation in PDF:

Media content

Images, sounds, movie clips and 3D graphics can be added to PDF documents.

Data extraction

PDF allows adding markup that enables external applications to extract the necessary data. A document with such markup is called Tagged PDF.

Prepress support

Preparing for publishing includes printer’s marks, color separation, output intents and trapping.

What is the use of a PDF file?

The main application of PDF documents is electronic document interchange and viewing in different environments.

How do I make a PDF file?

Creation and editing of PDF documents are possible in standalone Adobe Acrobat applications.

How do I open a PDF file?

You can open and view PDF files in the standalone Adobe Acrobat Reader application or in the Google Chrome browser with a PDF plugin. Simple utilities such as Sumatra PDF, Foxit Reader, or Free PDFReader will also work. Another way is to view the PDF online, for example, on Google Drive.
