What is a PDF File?


Portable Document Format (PDF) is a document format created by Adobe in the 1990s. The aim of the format is to provide a standard representation of documents that is independent of application software, hardware, and operating system. PDF files can be opened in Adobe Acrobat Reader as well as in browsers such as Chrome, Safari, and Firefox through plugins or extensions. A PDF file can contain images, text, hyperlinks, rich media, digital signatures, attachments, metadata, and 3D objects.

Users generally convert their existing documents into PDF format, but PDF files can also be created and manipulated directly by software. Adobe Acrobat is Adobe’s own application for creating PDF files.

History of PDF format

Adobe made the PDF specification available free of charge in 1993. PDF was released as an open standard in July 2008 and published by the International Organization for Standardization as ISO 32000-1. In 2008, Adobe also published a Public Patent License to ISO 32000-1, granting royalty-free rights to all patents owned by Adobe that are necessary to make, use, sell, and distribute PDF-compliant implementations.

The first edition of the format, PDF 1.0, went through a series of revisions up to PDF 1.7. PDF 1.7, the sixth edition of the specification and the one that became ISO 32000-1, includes some proprietary technologies defined only by Adobe, such as the Adobe XML Forms Architecture (XFA) and the JavaScript extension for Acrobat.

PDF 2.0 was published in July 2017 as ISO 32000-2:2017. It does not depend on any proprietary, non-standardized technologies.

PDF File Specifications

A PDF file is a sequence of bytes grouped into tokens according to the syntax rules defined by the PDF specification.

File Structure

A PDF file consists of the following four parts, in this order:

  • Header
  • Body
  • Cross-reference table
  • Trailer
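
To make the four parts concrete, here is a minimal Python sketch that assembles a one-page, blank PDF in memory. The function name build_minimal_pdf and the three objects it writes are illustrative choices, not something the text above prescribes, and the result has not been validated against any particular viewer.

```python
def build_minimal_pdf() -> bytes:
    """Assemble header, body, cross-reference table, and trailer in order."""
    # Illustrative objects: a catalog, a page tree, and one blank page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]

    out = bytearray(b"%PDF-1.4\n")  # 1. header

    offsets = []  # byte offset of each indirect object
    for number, body in enumerate(objects, start=1):  # 2. body
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)  # 3. cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0, always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)  # 4. trailer
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


pdf = build_minimal_pdf()
print(pdf.decode("ascii"))
```

Because the offsets are computed while the bytes are written, the cross-reference table stays consistent with the body even if the objects change.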

File Header

Irrespective of the PDF version, every PDF file starts with a header containing a unique identifier for PDF and the version of the format, such as %PDF-1.x, where x ranges from 0 to 7, or %PDF-2.0 for PDF 2.0.
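
As a small illustration, the header can be read with a regular expression. This is a sketch only: the function name parse_pdf_version is made up for this example, and it assumes the header sits at the very first byte of the data.

```python
import re

# Matches the header at the very start of the data, e.g. b"%PDF-1.7".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_pdf_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if it is missing."""
    match = HEADER_RE.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_pdf_version(b"%PDF-1.7\n"))  # (1, 7)
print(parse_pdf_version(b"not a pdf"))   # None
```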

File Body

The body of a PDF file consists of a sequence of indirect objects that represent the contents of the document. These objects describe components such as fonts, pages, and sampled images. Beginning with PDF 1.5, the body can also contain object streams, each of which holds a sequence of indirect objects.

Cross-Reference Table

The cross-reference table holds information that allows random access to indirect objects, so that the complete file does not need to be read to locate any particular object.

The cross-reference table, also known as the index table, is located near the end of the file and gives the byte offset of each indirect object from the beginning of the file.
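
In the classic table form, each entry is a fixed-width line holding a 10-digit byte offset, a 5-digit generation number, and the flag n (in use) or f (free). The following Python sketch parses such a section. The function name parse_xref_section and the sample offsets are invented for illustration, and cross-reference streams (used by newer files) are not handled.

```python
def parse_xref_section(text: str):
    """Map object number -> (byte offset, generation, in_use) for one xref section."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the xref keyword")

    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first_obj, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first_obj, first_obj + count):
            offset, generation, kind = lines[i].split()
            entries[obj_num] = (int(offset), int(generation), kind == "n")
            i += 1
    return entries


sample = """xref
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
"""
entries = parse_xref_section(sample)
print(entries[1])  # (17, 0, True)
```

With this mapping, a reader can seek straight to byte 17 to load object 1 instead of scanning the whole file.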

File Trailer

The file trailer enables an application reading the file to quickly find the cross-reference table and certain special objects. The last line of the file contains only the end-of-file marker, %%EOF. The two preceding lines contain, one per line and in order, the keyword startxref and the byte offset from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
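
A reader therefore typically starts at the end of the file. The sketch below shows one way to recover that offset in Python. The function name find_startxref and the 1024-byte window are assumptions made for this example.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-1024:]  # assumption: the trailer fits in the last 1024 bytes
    pos = tail.rfind(b"startxref")
    if pos == -1:
        raise ValueError("startxref keyword not found")

    # After the keyword we expect the offset and then the %%EOF marker.
    after = tail[pos + len(b"startxref"):].split()
    if len(after) < 2 or after[1] != b"%%EOF":
        raise ValueError("malformed end of file")
    return int(after[0])


sample = b"...body and xref table...\ntrailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n142\n%%EOF\n"
print(find_startxref(sample))  # 142
```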

PDF Objects

PDF supports eight basic types of objects:

  • Boolean values, representing true or false
  • Numbers
  • Strings, enclosed in parentheses ((…)), which may contain 8-bit characters
  • Names, starting with a forward slash (/)
  • Arrays, ordered collections of objects enclosed in square brackets ([…])
  • Dictionaries, enclosed in double angle brackets (<<…>>)
  • Streams, representing sequences of bytes that can be of unlimited length
  • The null object

In addition, a file may contain comments, which are introduced with the % sign and may contain 8-bit characters.
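
The syntax of each type is distinctive enough that a single token can usually be classified by how it starts. The following Python sketch shows the idea. The function name classify is hypothetical, the input is assumed to be one complete object, and this is not a full PDF tokenizer. It also accepts hexadecimal strings written as <…>, a second string form that the list above does not mention.

```python
import re

# Integers and reals such as 42, -3.14, +.5
NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def classify(token: str) -> str:
    """Guess the PDF object type of a single, complete token."""
    token = token.strip()
    if token in ("true", "false"):
        return "boolean"
    if token == "null":
        return "null"
    if NUMBER_RE.fullmatch(token):
        return "number"
    if token.startswith("/"):
        return "name"
    if token.startswith("<<"):
        # A stream is a dictionary followed by stream ... endstream.
        return "stream" if token.endswith("endstream") else "dictionary"
    if token.startswith("(") or token.startswith("<"):
        return "string"
    if token.startswith("["):
        return "array"
    if token.startswith("%"):
        return "comment"
    return "unknown"


for sample in ["true", "-3.14", "(Hello)", "/Type", "[1 2 3]", "<< /Type /Page >>", "null"]:
    print(sample, "->", classify(sample))
```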

Indirect Objects

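Indirect objects are numbered with an object number and a generation number and are defined between the obj and endobj keywords, so that other objects can refer to them. Beginning with PDF 1.5, indirect objects may also be located in special streams known as object streams. References to indirect objects are resolved through the index table, which is marked with the xref keyword, follows the main body, and gives the byte offset of each indirect object from the beginning of the file.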
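
The relationship between indirect objects and byte offsets can be illustrated by scanning a file for the pattern "object number, generation number, obj". This Python sketch is deliberately naive: the function name index_indirect_objects is made up, and a plain regular-expression scan can be fooled by matching bytes inside stream data, which is one reason real readers rely on the index table instead.

```python
import re

# "<object number> <generation number> obj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def index_indirect_objects(data: bytes):
    """Map (object number, generation) -> byte offset of its definition."""
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in OBJ_RE.finditer(data)
    }


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
for key, offset in index_indirect_objects(sample).items():
    print(key, "starts at byte", offset)
```

Note that the reference 2 0 R inside the first dictionary is not reported, since only definitions end with the obj keyword.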

Linear and Non-linear PDF layouts

The layout of a PDF file is either linear or non-linear, depending on the target application and other factors.

Linear PDF: Linear (also called linearized or web-optimized) PDF files are written to disk in a linear fashion, so that a browser plugin can start displaying the document without waiting for the whole file to load first.

Non-linear PDF: Non-linear PDF files use less disk space than linear PDF files. However, the data needed to assemble the pages of the document is scattered throughout the file, so non-linear files are slower to access than linear ones.
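
A linearized file announces itself with a parameter dictionary near the start of the file that contains a /Linearized key. The Python sketch below uses that as a rough heuristic. The function name looks_linearized is hypothetical, and the check does not verify that the rest of the file is actually laid out as the dictionary claims.

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: is there a /Linearized key within the first 1024 bytes?"""
    return b"/Linearized" in data[:1024]


linear = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(looks_linearized(linear))  # True
print(looks_linearized(plain))   # False
```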

Objects Overview

As discussed above, the body of a PDF file contains objects. PDF is largely based on PostScript, but without the flow-control features of a programming language, such as the if and loop commands. When a PDF is generated from PostScript, the graphics commands issued by the PostScript code are collected and tokenized, along with any files, graphics, or fonts that the document refers to.

Text

Text in a PDF document is represented by text elements in page content streams. A text element specifies that characters should be drawn at certain positions.
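
For illustration, a text element in a content stream is bracketed by the BT and ET operators, with Tf selecting a font and size, Td setting the position, and Tj showing a string. The Python sketch below generates such an element. The function name text_content_stream and the font resource name /F1 are assumptions for this example, and the font would still have to be declared in the page's resources.

```python
def text_content_stream(text: str, x: int, y: int, font: str = "/F1", size: int = 12) -> str:
    """Build a text element that draws `text` at position (x, y)."""
    # Backslashes and parentheses must be escaped inside a literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return (
        "BT\n"                  # begin text object
        f"{font} {size} Tf\n"   # select font and size
        f"{x} {y} Td\n"         # move to the drawing position
        f"({escaped}) Tj\n"     # show the string
        "ET"                    # end text object
    )


print(text_content_stream("Hello, PDF", 72, 720))
```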

Graphics

The graphics operators in PDF content streams describe the appearance of pages that are to be reproduced on a raster output device. These operators fall into six main groups:

  • Graphics state operators manipulate the data structure known as the graphics state. The graphics state includes the current transformation matrix, which maps the user space coordinates used within a PDF content stream into output device coordinates. It also includes the current color, the current clipping path, and other parameters.
  • Path construction operators define paths, shapes, and regions of various sorts. They include operators for beginning a new path and for adding line segments and curves to it.
  • Path painting operators fill a path with color, paint a stroke along it, or use it as a clipping boundary.
  • Text operators select and show character glyphs from fonts. Because PDF treats glyphs as general graphical shapes, many of the text operators could be grouped with the graphics state or painting operators. However, the data structures and mechanisms for dealing with glyph and font descriptions are specialized.
  • Other painting operators paint self-describing graphics objects. These include sampled images, geometrically defined shadings, and entire content streams that in turn contain sequences of graphics operators.
  • Marked content operators associate higher-level logical information with objects in the content stream. This information does not affect the appearance of the content and is useful to applications that use PDF for document interchange.
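
To see how these groups show up in practice, the Python sketch below labels the operators found in a short content stream. Only a handful of operators are listed, the assignment of each operator to one of the six groups is this example's own reading of the list above, and the whitespace-based tokenizing is a simplification that would misread strings containing spaces.

```python
# A small, incomplete sample of operators per group (illustrative only).
OPERATOR_GROUPS = {
    "graphics state": {"q", "Q", "cm", "w", "rg", "RG"},
    "path construction": {"m", "l", "c", "h", "re"},
    "path painting": {"S", "f", "B", "n", "W"},
    "text": {"BT", "ET", "Tf", "Td", "Tj"},
    "other painting": {"Do", "sh"},
    "marked content": {"BMC", "BDC", "EMC"},
}

# Reverse lookup: operator -> group name.
GROUP_OF = {op: group for group, ops in OPERATOR_GROUPS.items() for op in ops}


def classify_operators(content: str):
    """Return (operator, group) pairs for the known operators in a stream."""
    return [(tok, GROUP_OF[tok]) for tok in content.split() if tok in GROUP_OF]


# Save state, set a blue fill, draw and fill a rectangle, restore state.
stream = "q 0 0 1 rg 50 50 100 40 re f Q"
for operator, group in classify_operators(stream):
    print(operator, "->", group)
```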