Hacking at 0300

Creating PDFs with PHP: Syntax

I wanted to allow my webservices to create PDF files, and I figured it couldn't be too hard; after all, it's just a bunch of graphics commands in a text file, right? Foolish me. The reference manual is 756 pages long, not including the JavaScript reference, another 769 pages. The place to start is FPDF, which is open source and pretty easy to understand, and its derivative tFPDF that lets you use and embed TrueType fonts (it's the 21st century; who uses anything but TrueType fonts?). Using it is simple:

define('_SYSTEM_TTFONTS', '/path/to/your/truetype/fonts/'); // Took a bit of experimenting to find the right values for these
define('FPDF_FONTPATH', _SYSTEM_TTFONTS);
putenv('GDFONTPATH='._SYSTEM_TTFONTS); // so we can use GD images as well
require('tfpdf.php'); // assumes tFPDF's tfpdf.php is on your include path
$pdf=new tFPDF();
$pdf->AddPage();
$pdf->SetFont('Arial','B',16);
$pdf->Cell(40,10,'Hello World!');
$pdf->Output();

One gotcha is that you need to create the unifont directory within the fonts folder, and copy tFPDF's ttfonts.php file into that.

The result is here.

PDF Syntax

But to do much more you have to know what a PDF file looks like. The language itself is relatively simple (this has been further simplified so I can remember it):

Comments start with % and go to the end of the line. Whitespace is, in general, the only delimiter. There are 6 data types:

Number

Can be real or integer; no exponential notation. E.g.: 1, -1.2
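
The "no exponential notation" rule bites in PHP, which will happily print a small float as 1.0E-5. Here's a minimal sketch of a formatter, assuming five decimal places is enough precision for your drawing. The name pdf_number is mine, not something FPDF provides.

```php
<?php
// Format a PHP number as a PDF number: fixed notation only, no exponent.
// pdf_number() is a hypothetical helper; 5 decimals is an arbitrary choice.
function pdf_number($n) {
    if (is_int($n)) {
        return (string)$n;
    }
    // %F is fixed-point and ignores the locale, so the decimal mark is always "."
    $s = sprintf('%.5F', $n);
    // Trim trailing zeros, then a dangling decimal point
    $s = rtrim(rtrim($s, '0'), '.');
    // A tiny negative value can round down to "-0"
    return ($s === '-0') ? '0' : $s;
}

echo pdf_number(-1.2), "\n";
echo pdf_number(0.00001), "\n";
```

Integers pass straight through; only floats need the fixed-point treatment.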

String

ASCII only! This is a format from 1993, after all. Delimited by parentheses, with the backslash as an escape character. It's smart enough to count parentheses so you technically don't have to escape them if they're matched, but that's not useful in real life for me. I use regular expressions, and regular expressions can't count. Just escape the parentheses. E.g. (Hello, World), (Escaped \(parentheses\) )

(Strings can
contain newlines)

If you want Unicode, the string has to be UTF-16 (big endian) encoded with a byte order mark (!), written as hex digits enclosed in angle brackets: שלום, עולם is <FEFF05E905DC05D505DD002C002005E205D505DC05DD>. A pain, but it's codable:

function textstring($s){
  // Assumes $s is UTF8
  // use mb strings with explicit encodings to avoid problems with overloaded
  // regular string functions
  if (mb_strlen($s, 'UTF-8') == mb_strlen($s, '8bit')){
    // Ascii
    return '('.str_replace(array('\\', '(', ')',), array ('\\\\', '\(','\)'), $s).')';
  }else{
    $ret = '';
    $s = mb_convert_encoding ($s, 'UTF-16BE', 'UTF-8');
    foreach (str_split($s) as $char) $ret .= sprintf('%02X', ord($char)); // str_split should not be overloaded according to the manual
    return "<FEFF$ret>";
  }
}

Name

These are the symbols used to represent external objects (like fonts or images) or keywords and indices into associative arrays. A name is a slash followed by "regular characters" which for all intents and purposes means alphanumerics. E.g. /Name, /Image3

Technically, names can include any character if encoded as # then 2-digit hex value, with all characters 8-bit. Programmatically:

define('PDF_DELIMITERS', '()<>[]{}%#/'); // the constant name must be quoted; square brackets are delimiters too
function name ($data){
	$ret = '/';
	foreach(str_split($data) as $c){
		$ord = ord($c);
		if ($ord == 0){
			 // str_split('') can return a single empty string, and ord('') is 0; ignore it
		}elseif ($ord < 33 /* whitespace and control characters */ || $ord > 126 /* hi bit set */ || strpos(PDF_DELIMITERS, $c) !== FALSE){
			$ret .= sprintf('#%02X', $ord);
		}else{
			$ret .= $c;
		}
	}
	return $ret;
}

Dictionary

An associative array, with the key for each pair being a name and the data being any single data item. Delimited by << and >>, with the key/data pairs just listed (the order is irrelevant). E.g.

<<
  /Name (textbox)
  /Width 20
  /Rules << /Numeric /Yes /Positive /No >> % dictionaries are data and can be nested. 
                % This subdictionary has two entries, /Numeric and /Positive
>>

Programmatically,

function dictionary($arr){
	$ret = "<<\n";
	foreach($arr as $key=>$value) $ret .= $this->name($key)." $value\n";
	return "$ret>>\n";
}

Array

A linear list of data, delimited by square brackets. The order matters. E.g. [ 1 2 3 ], [ /First 2 /Third (Fourth) <</dic 1>> ]

That's simply

function pdfarray($arr){
	$ret = "[\n";
	foreach($arr as $value) $ret .= " $value ";
	return "$ret]\n";
}
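
Note that both functions expect every value to be serialized PDF syntax already, so nesting just means calling the inner function first. Here's a sketch that builds something like the dictionary example above. The pdf_* functions are standalone stand-ins I wrote so this runs by itself, and the /Sizes entry is made up to show an array inside a dictionary.

```php
<?php
// Standalone stand-ins for the methods above, so this runs by itself.
// pdf_name() skips the #xx escaping and assumes plain alphanumeric names.
function pdf_name($s) {
    return '/' . $s;
}

function pdf_dictionary(array $arr) {
    $ret = "<<\n";
    foreach ($arr as $key => $value) $ret .= pdf_name($key) . " $value\n";
    return "$ret>>\n";
}

function pdf_array(array $arr) {
    return '[ ' . implode(' ', $arr) . " ]\n";
}

// Build the inner dictionary first, then drop the resulting string into the outer one
$rules = pdf_dictionary(array(
    'Numeric'  => pdf_name('Yes'),
    'Positive' => pdf_name('No'),
));

echo pdf_dictionary(array(
    'Name'  => '(textbox)',   // a PDF string, parentheses included
    'Width' => 20,
    'Sizes' => pdf_array(array(1, 2, 3)),
    'Rules' => $rules,
));
```

The output has a few extra newlines compared to the hand-written version, but since whitespace is only a delimiter that doesn't matter.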

Stream

A string of bytes, surrounded by the words stream and endstream on lines by themselves, preceded by a dictionary that describes the string. At a minimum the dictionary needs a /Length entry with the length of the string in bytes. Yes, that's redundant in a delimited string, but that's the definition. Adobe Reader is smart enough to figure out the length if you leave out this entry (the spec still requires it). The bytes are not necessarily ASCII and their meaning is not defined by the PDF syntax itself. A page's contents are a stream, with the dictionary providing the metadata and the string being the list of drawing commands (which is its own language). E.g.

<<
  /Length 54
  /Fonts [/F1 /F2 /F3]
  /Width 37
  /Height 92
>> % the following is *not* the actual PDF drawing language
stream
1 2 moveto
4 5 lineto
/F1 setfont
(Hello, World) text
endstream

It's possible to store a compressed string if there's a /Filter entry in the dictionary whose value is the name of the compression algorithm used. The algorithms are built in; there are a couple for image files and zlib for text (that filter name is /FlateDecode for reasons I can't fathom). Yes, it would make sense to just gzip the whole file rather than pieces, but I'm not in charge. The compression makes the PDF harder to debug, so FPDF includes a function, $pdf->SetCompression(false), to turn it off. The /Length refers to the final, compressed length.

Programmatically,

function stream($arr, $data){
	if ($this->bCompress){ // assumes this is a flag set somewhere
		$data = gzcompress($data); // zlib format, which is what /FlateDecode expects
		$arr['Filter'] = $this->name('FlateDecode');
	}
	$arr['Length'] = strlen($data); // bytes, after compression
	return $this->dictionary($arr)."stream\n$data\nendstream";
}
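
The thing to watch is that /Length counts bytes, not characters, and is measured after any compression. Here's a sketch that shows the difference. pdf_stream is a standalone stand-in I wrote for the method above, taking the dictionary entries as an already-serialized string.

```php
<?php
// pdf_stream() is a hypothetical standalone version of stream() above.
// $dictEntries is the serialized inside of the dictionary, minus /Length.
function pdf_stream($dictEntries, $data, $compress = false) {
    if ($compress) {
        $data = gzcompress($data);              // needs PHP's zlib extension
        $dictEntries .= ' /Filter /FlateDecode';
    }
    $length = strlen($data);                    // byte count of what is actually written
    return "<<$dictEntries /Length $length >>\nstream\n$data\nendstream";
}

// "café" in UTF-8 is 4 characters but 5 bytes, so this payload is 12 bytes long
$payload = "(caf\xC3\xA9) text";
echo pdf_stream('', $payload), "\n";
```

Pass true as the third argument to get the compressed form; the /Length will then be whatever gzcompress produced, which for very short strings can be longer than the original.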

In an ordinary programming language, you would declare variables to hold the data above. In PDF, these are called "indirect objects" and they are numbered, not named. In fact, they get two numbers, the object number and the "generation number" which is used when the PDF file is updated. Since we're generating PDFs from scratch, all our generation numbers are zero. Indirect objects are assigned with object-number generation-number obj datum endobj. E.g.

1 0 obj
  [ (array) (of) (strings) ]
endobj

creates object 1. You can use an indirect object anywhere data is required, with object-number generation-number R. That's R as in reference. As you may have noticed, PDF uses a lot of reverse Polish notation, from its origins in PostScript and Forth. E.g.

2 0 obj
  <<
    /Words 1 0 R % Use the array we declared above
    /Language 3 0 R % Forward references are fine
  >>
endobj

3 0 obj
  /English
endobj
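
In code, the three objects above could be produced with a pair of one-liners. This is only a sketch; pdf_object and pdf_ref are names I made up, and the generation number is hard-coded to 0 because we're writing files from scratch.

```php
<?php
// Hypothetical helpers: wrap a serialized datum as an indirect object,
// and build a reference to one. Generation numbers are always 0 here.
function pdf_object($number, $datum) {
    return "$number 0 obj\n$datum\nendobj\n";
}

function pdf_ref($number) {
    return "$number 0 R";
}

echo pdf_object(1, '[ (array) (of) (strings) ]');
// Object 3 doesn't exist yet when object 2 refers to it; that's allowed
echo pdf_object(2, '<< /Words ' . pdf_ref(1) . ' /Language ' . pdf_ref(3) . ' >>');
echo pdf_object(3, '/English');
```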

Streams have to be indirect objects on their own; they can't be members of a dictionary or array. But references to streams are legal. Thus:

% Illegal
1 0 obj
  <<
    /Type /Page
    /Content
      <</Length 11>>
stream
      1 2 lineto
endstream
  >>
endobj

% Legal
2 0 obj
  <<
    /Type /Page
    /Content 3 0 R
  >>
endobj
3 0 obj
<</Length 11>>
stream
      1 2 lineto
endstream
endobj

The object numbers start at 1 and go up (technically they don't need to be contiguous, but it's a headache otherwise). Object 0 is a special magic object with generation number 65,535 that acts as the head of a linked list of deleted objects, used for updating PDF files.

PDF File Structure

A PDF file is just a list of indirect objects, with a Catalog dictionary containing a reference to a Pages dictionary that contains an array of references to Page dictionaries that each contain references to their Contents streams, which hold the drawing instructions for that page. The Page dictionary also contains a reference to a Resources dictionary, which associates names with other objects like fonts, graphic states and images. The drawing instructions use those names, not the object references.

A logical way to organize this would be just to list the objects and have the reading program parse the file and create an array of objects. PDF assumes that you don't have enough memory for that, so the file itself contains the table of byte offsets of each object and acts as its own internal representation. Thus when creating the PDF, FPDF does something similar to:

function newobject($data){
  $objectNumber = count($this->objectOffsets)+1; // need to start from object 1
  $this->objectOffsets[] = strlen($this->buffer); // keep track of where this object starts in the final file
  $this->buffer .= "$objectNumber 0 obj \n $data \n endobj \n";
}

The actual file starts with a comment with the PDF version number (the ISO standard is 1.7): %PDF-1.7, then the objects (in any order; the object number is determined by the n 0 obj statement). This is followed by the cross-reference table (the table of byte offsets mentioned above), which has a fixed byte-level format to make access faster:

function xref(){
  $numObjects = count($this->objectOffsets) + 1; // include the magic object 0
  $this->xrefOffset = strlen($this->buffer); // we'll need this later
  $this->buffer .= "xref\n";
  $this->buffer .= "0 $numObjects\n";
  $this->buffer .= "0000000000 65535 f \n"; // the magic object 0
  // output the offset, the generation number (always 0) and "n" for "in use"
  // use sprintf to make sure it has exactly the right number of bytes
  foreach ($this->objectOffsets as $offset) $this->buffer .= sprintf("%010d %05d n \n", $offset, 0); // two placeholders need two arguments
}
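
To see why the fixed width matters, here's a sketch of the reader's side: with every entry exactly 20 bytes long, finding object n is arithmetic, not parsing. It assumes a single table section starting at object 0 with "\n" line endings, as xref() writes it, and that the caller already knows where the table starts. xref_lookup is a made-up name.

```php
<?php
// Return the byte offset of object $n, given the offset of the "xref" keyword.
// Hypothetical helper; handles only one table section that starts at object 0.
function xref_lookup($pdf, $xrefOffset, $n) {
    // The header is "xref\n0 <count>\n"; the entries follow immediately
    if (!preg_match('/\Gxref\n0 (\d+)\n/', $pdf, $m, 0, $xrefOffset)) return null;
    if ($n < 0 || $n >= (int)$m[1]) return null;
    // Every entry is exactly 20 bytes, so entry $n is found by multiplication
    $entry = substr($pdf, $xrefOffset + strlen($m[0]) + 20 * $n, 20);
    // Byte 17 is "n" for in-use objects and "f" for free ones like object 0
    if (strlen($entry) !== 20 || $entry[17] !== 'n') return null;
    return (int)substr($entry, 0, 10);
}

// A tiny hand-built buffer: one object, then its table
$buf  = "%PDF-1.7\n";
$off1 = strlen($buf);
$buf .= "1 0 obj\n/English\nendobj\n";
$xref = strlen($buf);
$buf .= "xref\n0 2\n0000000000 65535 f \n" . sprintf("%010d %05d n \n", $off1, 0);

$found = xref_lookup($buf, $xref, 1);
echo substr($buf, $found, 7), "\n"; // the text at that offset is the start of object 1
```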

After that is the "trailer," a dictionary that tells the PDF reader how many objects there are and which one is the "root" (the main Catalog dictionary), followed by a pointer to the start of the cross-reference table and a comment to mark the end of the file (yes, there's lots of redundancy here):

function trailer(){
  $xrefOffset = $this->xrefOffset;
  $numObjects = count($this->objectOffsets) + 1; // include the magic object 0
  $root = $this->rootObject; // this needs to have been set at some point
  $this->buffer .= 'trailer << ';
  $this->buffer .= "/Size $numObjects ";
  $this->buffer .= "/Root $root 0 R ";
  $this->buffer .= ">>\n";
  $this->buffer .= "startxref\n$xrefOffset\n%%EOF";
}

You can put more information in the trailer dictionary; see page 43 of the PDF spec, table 15.

And then dump the buffer (with the appropriate content header for the web) and you're done! Of course, this page doesn't say anything about what goes into those objects, but at least it's a start to understanding what FPDF does and debugging the resulting document.
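
Putting the pieces together, here's a sketch of a complete one-page file. MiniPdf is a made-up class for illustration and is not how FPDF is organized. The page, font and content stream entries are the bare minimum I'd expect a viewer to need; the drawing commands inside the stream aren't explained on this page, so treat them as a black box for now.

```php
<?php
// A minimal one-page PDF: header, five objects, cross-reference table, trailer.
// MiniPdf is a hypothetical class; output() should only be called once.
class MiniPdf {
    private $buffer = "%PDF-1.7\n";
    private $objectOffsets = array();

    // Append an indirect object and return its object number
    function newobject($data) {
        $this->objectOffsets[] = strlen($this->buffer);
        $n = count($this->objectOffsets);
        $this->buffer .= "$n 0 obj\n$data\nendobj\n";
        return $n;
    }

    // Uncompressed stream with its mandatory /Length
    function stream($data) {
        return '<< /Length ' . strlen($data) . " >>\nstream\n$data\nendstream";
    }

    // Write the cross-reference table and trailer, and return the whole file
    function output($root) {
        $xrefOffset = strlen($this->buffer);
        $size = count($this->objectOffsets) + 1; // include the magic object 0
        $this->buffer .= "xref\n0 $size\n0000000000 65535 f \n";
        foreach ($this->objectOffsets as $offset) {
            $this->buffer .= sprintf("%010d %05d n \n", $offset, 0);
        }
        $this->buffer .= "trailer << /Size $size /Root $root 0 R >>\n";
        $this->buffer .= "startxref\n$xrefOffset\n%%EOF";
        return $this->buffer;
    }
}

$pdf = new MiniPdf();
// Numbers are handed out in order, so the forward references (2, 3, 4, 5)
// can be written down before the objects they point to exist.
$catalog = $pdf->newobject('<< /Type /Catalog /Pages 2 0 R >>');
$pdf->newobject('<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>');
$pdf->newobject('<< /Type /Page /Parent 2 0 R /MediaBox [ 0 0 612 792 ] '
    . '/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>');
$pdf->newobject($pdf->stream('BT /F1 24 Tf 72 720 Td (Hello World!) Tj ET'));
$pdf->newobject('<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>');

$file = $pdf->output($catalog);
echo strlen($file), " bytes\n";
```

Save $file with file_put_contents(), or send it to the browser after a header('Content-Type: application/pdf'), and open it with compression out of the picture so you can read every byte.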

Continued…

This entry was posted by Danny on March 18, 2011 at 5:06 am under PDF, PHP.