PDF Document

--work in progress--

A PDF Document is a series of 8-bit bytes that can be grouped into tokens. The PDF character set is divided into three classes.

white-space :

PDF treats any sequence of consecutive white-space characters as one character. White-space is mainly used to separate tokens (such as names and numbers) from each other.

Null
Horizontal Tab
Line Feed
Form Feed
Carriage Return
Space

The above characters are considered white-space characters. Line Feed (LF) and Carriage Return (CR) are also end-of-line (EOL) markers. A CR followed immediately by an LF is considered one EOL marker. EOL markers are mostly treated like any other white-space character, except that they are sometimes required to precede a token that must appear at the beginning of a line.

delimiter :

Delimiter characters are used to separate syntactic entities such as arrays, names and comments. The delimiter characters are ( ) < > [ ] { } / and %.

regular :

All characters except white-space and delimiter characters are regular characters. They include bytes outside the ASCII set. A sequence of consecutive regular characters forms a single token. PDF is case-sensitive.
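The three character classes can be sketched in a few lines of Python. This is only an illustration: the function names are made up for these notes, the byte values are the ones given in the PDF specification (which counts Null and Space as white-space too), and the splitter ignores strings and comments entirely.

```python
# Byte values per the PDF specification; names below are illustrative only.
WHITE_SPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def char_class(byte: int) -> str:
    """Return the character class of a single byte value (0-255)."""
    if byte in WHITE_SPACE:
        return "white-space"
    if byte in DELIMITERS:
        return "delimiter"
    return "regular"


def regular_tokens(data: bytes) -> list:
    """Collect each run of consecutive regular characters as one token."""
    tokens = []
    current = bytearray()
    for byte in data:
        if char_class(byte) == "regular":
            current.append(byte)
        elif current:
            # A white-space or delimiter character ends the current run.
            tokens.append(bytes(current))
            current.clear()
    if current:
        tokens.append(bytes(current))
    return tokens
```

For example, `regular_tokens(b"12 0  obj")` yields the three tokens `12`, `0` and `obj`, regardless of how much white-space separates them.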

PDF syntax can be thought of as consisting of four parts.

Objects :

There are eight basic types of objects: Boolean values, integer and real numbers, Strings, Names, Arrays, Dictionaries, Streams and the null object. Indirect objects are labelled objects that can be referenced by other objects.

boolean -> true, false.

number -> integer, real numbers. The range and precision are limited by the computer on which the PDF processor is running.

Example : 0 , -34 , 87 , 63.5
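A small sketch of how a numeric token could be recognised, in Python. The function name `parse_number` is hypothetical. The patterns assume the forms allowed by the PDF specification: an optional sign, decimal digits, and for reals a decimal point; exponent notation such as 6.02E23 is not a PDF number. Python's own int and float stand in for the processor's range and precision limits.

```python
import re

# Optional sign followed by digits only.
INTEGER_RE = re.compile(rb"[+-]?\d+")
# Optional sign, digits with a decimal point; forms like "4." and "-.002" are allowed.
REAL_RE = re.compile(rb"[+-]?(\d+\.\d*|\.\d+)")


def parse_number(token: bytes):
    """Convert a PDF numeric token to a Python int or float."""
    if INTEGER_RE.fullmatch(token):
        return int(token.decode("ascii"))
    if REAL_RE.fullmatch(token):
        return float(token.decode("ascii"))
    raise ValueError("not a PDF number: %r" % token)
```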

string objects - can be written in two ways

string literals - a sequence of characters enclosed in parentheses. Eg: (alex)

Backslash (\) is used to escape characters. String literals can span multiple lines. A backslash at the end of a line indicates that the string continues on the next line; the line break is then not part of the string.

The \ddd escape sequence is used to represent a character by its octal character code, where ddd is one to three octal digits.
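The escape handling can be illustrated with a rough Python decoder for the bytes between the outer parentheses. The name `decode_literal_string` is made up for these notes, and the table of single-character escapes (\n, \r, \t, \b, \f, \(, \), \\) is taken from the PDF specification rather than from the text above. Treat it as a sketch; it does not check that unescaped parentheses are balanced.

```python
# Single-character escapes listed in the PDF specification.
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
}
OCTAL_DIGITS = b"01234567"


def decode_literal_string(body: bytes) -> bytes:
    """Decode the bytes found between the outer ( and ) of a string literal."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != ord("\\"):
            out.append(byte)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # lone trailing backslash: nothing left to escape
        nxt = body[i]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 1
        elif nxt in OCTAL_DIGITS:
            # \ddd: one to three octal digits give the byte value.
            j = i
            while j < len(body) and j < i + 3 and body[j] in OCTAL_DIGITS:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in b"\r\n":
            # Backslash at end of line: the string continues on the next line.
            i += 2 if body[i:i + 2] == b"\r\n" else 1
        else:
            # Any other character: the backslash is simply dropped.
            out.append(nxt)
            i += 1
    return bytes(out)
```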

hexadecimal string - Eg: <901FA3>. Each pair of hexadecimal digits represents one byte of the string. If there is an odd number of digits, the missing final digit is assumed to be 0.
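A minimal Python sketch of the hexadecimal form, including the odd-digit rule. The function name `decode_hex_string` is hypothetical. Skipping white-space between the digits is an assumption based on the PDF specification, not on the text above.

```python
WHITE_SPACE = b"\x00\t\n\x0c\r "


def decode_hex_string(token: bytes) -> bytes:
    """Decode a hexadecimal string token such as b'<901FA3>'."""
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("a hexadecimal string is enclosed in < and >")
    # Drop any white-space between the digits.
    digits = bytes(b for b in token[1:-1] if b not in WHITE_SPACE)
    if len(digits) % 2:
        digits += b"0"  # odd count: the missing final digit is assumed to be 0
    return bytes.fromhex(digits.decode("ascii"))
```

With this sketch, `<901FA3>` decodes to the three bytes 90 1F A3, while `<901FA>` decodes to 90 1F A0.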


Array Objects:

An array is a sequence of objects enclosed within square brackets. PDF arrays are heterogeneous: the elements of one array may be of different types. PDF directly supports only one-dimensional arrays. Arrays of higher dimensions can be constructed by using arrays as elements of arrays.

Name Objects:

A name may contain any character except NULL. When writing a name in a PDF file, a SOLIDUS (/) is used to start the name. The solidus is not part of the name itself.

Any character that is not a regular character has to be written as its two-digit hexadecimal code preceded by the NUMBER SIGN (#). Eg: #20 means space. Keywords are not preceded by the # sign.
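The # notation can be sketched in Python as a pair of helpers. Both function names are made up for these notes. The encoder also escapes the number sign itself, which is an assumption taken from the PDF specification; it does not reject NULL, so that check is left to the caller.

```python
import re

WHITE_SPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def decode_name(token: bytes) -> bytes:
    """Return the bytes of the name written as a token such as b'/Lime#20Green'."""
    if not token.startswith(b"/"):
        raise ValueError("a name token starts with a solidus")
    # Replace every #xx pair with the byte it stands for.
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  token[1:])


def encode_name(name: bytes) -> bytes:
    """Write a name as a token, using #xx for characters that are not regular."""
    out = bytearray(b"/")
    for byte in name:
        if byte in WHITE_SPACE or byte in DELIMITERS or byte == ord("#"):
            out += b"#%02X" % byte
        else:
            out.append(byte)
    return bytes(out)
```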

Dictionary Objects:

A dictionary contains pairs of objects. The first element of each pair is the key and the second element is the value. The key must be a name and must be a direct object. The value may be any kind of object (including another dictionary). A dictionary is written as a sequence of key-value pairs enclosed in double angle brackets (<< and >>).

The dictionary is the main building block and is used to collect the attributes of complex objects. By convention, certain keys are used for specific purposes.

Type - identifies the type of object the dictionary describes.

Subtype (sometimes S) - identifies a specific type within the general category. Example: Type might be Font, and Subtype then specifies the kind of font.
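To see the written forms of the basic objects side by side, here is a rough Python serialiser. Everything in it is illustrative: the `Name` class and the `to_pdf` function are invented for these notes, Python lists and dicts stand in for arrays and dictionaries, and the font values in the usage note are just sample data. Names are written without # escaping and reals are limited to six decimal places, so this is a sketch rather than a complete writer.

```python
class Name(str):
    """Marks a Python string as a PDF name rather than a PDF string."""


def to_pdf(value) -> str:
    """Write a Python value using the PDF object syntax (simplified)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # Fixed-point only; PDF has no exponent notation.
        return format(value, "f").rstrip("0").rstrip(".")
    if isinstance(value, Name):
        return "/" + value
    if isinstance(value, str):
        escaped = (value.replace("\\", "\\\\")
                        .replace("(", "\\(")
                        .replace(")", "\\)"))
        return "(" + escaped + ")"
    if isinstance(value, list):
        return "[" + " ".join(to_pdf(item) for item in value) + "]"
    if isinstance(value, dict):
        parts = []
        for key, item in value.items():
            if not isinstance(key, Name):
                raise TypeError("dictionary keys must be names")
            parts.append(to_pdf(key) + " " + to_pdf(item))
        return "<< " + " ".join(parts) + " >>"
    raise TypeError("unsupported type: " + type(value).__name__)
```

For instance, `to_pdf({Name("Type"): Name("Font"), Name("Subtype"): Name("Type1")})` gives `<< /Type /Font /Subtype /Type1 >>`, and a nested list such as `[[1, 2], [3, 4]]` is written as `[[1 2] [3 4]]`.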

Stream Object :

A stream object is a sequence of bytes and can be of unlimited length. Streams are used for objects with large amounts of data, such as images and page descriptions.

A stream consists of a dictionary followed by zero or more bytes bracketed between the keywords stream (followed by an EOL marker: either CR and LF, or LF alone) and endstream.

There should be an EOL marker after the data and before endstream; this marker is not included in the stream length. The stream dictionary includes the length of the stream.

Length - indicates how many bytes of the PDF file are used for the stream's data.

From PDF 1.2, the bytes may be contained in an external file. In that case the stream dictionary specifies the file, and any bytes between the keywords are ignored.
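For the common case where the data sits in the PDF file itself, the role of Length can be sketched in Python as follows. The function name `read_stream_data` is hypothetical, and the sketch assumes the stream dictionary has already been parsed, so that the buffer starts at the keyword stream and the Length value is known.

```python
def read_stream_data(buf: bytes, length: int) -> bytes:
    """Return the `length` data bytes of a stream whose keyword starts `buf`."""
    if not buf.startswith(b"stream"):
        raise ValueError("buffer must start at the 'stream' keyword")
    start = len(b"stream")
    # The keyword is followed by CR LF or by LF alone; neither is stream data.
    if buf.startswith(b"\r\n", start):
        start += 2
    elif buf.startswith(b"\n", start):
        start += 1
    else:
        raise ValueError("expected an EOL marker after 'stream'")
    end = start + length
    if end > len(buf):
        raise ValueError("Length runs past the end of the buffer")
    # The EOL marker before 'endstream' is not counted in Length.
    rest = buf[end:].lstrip(b"\r\n")
    if not rest.startswith(b"endstream"):
        raise ValueError("'endstream' not found after the stream data")
    return buf[start:end]
```

Relying on Length rather than searching for endstream matters because the data itself may contain any byte sequence, including EOL markers.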


NULL Object:

The null object is the only object of type null and is denoted by the keyword null.

Indirect Object:

Any object can be labelled using a positive integer object number and a non-negative integer generation number. Together they identify the object. The object value is bracketed between the keywords obj and endobj. Eg: 12 0 obj (alex) endobj

An indirect object can be used anywhere through an indirect reference. A reference consists of the object number, the generation number and the keyword R. Eg: 12 0 R
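A naive Python sketch of how the labels and references fit together. The helper names are invented for these notes, and the regular expressions are a deliberate simplification: they would be fooled by obj, endobj or R appearing inside strings or stream data, and Python's \s is only an approximation of the PDF white-space set.

```python
import re

# "<object number> <generation number> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)
# "<object number> <generation number> R"
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the raw bytes of the object value."""
    return {(int(m.group(1)), int(m.group(2))): m.group(3).strip()
            for m in OBJ_RE.finditer(data)}


def find_references(value: bytes) -> list:
    """List the (object number, generation number) pairs referenced in a value."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(value)]


def resolve(objects: dict, value: bytes) -> list:
    """Look up the raw value of every object that `value` refers to."""
    return [objects[key] for key in find_references(value)]
```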

File Structure.

Document Structure.

Content streams.