PDF is a really useful format for conveying information when you care about how it will look. While HTML formatting is left to the user agent, PDF formatting is well defined by the document itself, so a PDF document will always look the same, no matter what you use to view it (depending on the quality of the viewer I guess). Also, unlike HTML, which does not lend itself to being printed, PDF documents are usually ideal for printing.

The PDF document specification is available from Adobe at http://partners.adobe.com/asn/developer/acrosdk/docs/filefmtspecs/PDFReference.pdf, along with a lot of other documentation about Acrobat and PDF in the developer's section of Adobe's website.

PDF documents can be generated from XML by using formatting objects and a formatting objects parser, such as the Apache project's FOP.
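
To make the formatting-objects idea concrete, here is a minimal sketch that builds a tiny XSL-FO document in Python. The element and attribute names come from the XSL-FO vocabulary; the helper names (`fo`, `build_fo_document`, `to_fo_string`), the page size, and the font settings are my own assumptions for illustration. The output would still have to be saved and handed to a formatter such as FOP to become a PDF.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name for an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def build_fo_document(paragraphs):
    """Build an XSL-FO tree with one page master and one block per paragraph."""
    root = ET.Element(fo("root"))

    # Page geometry: a single US Letter master with one-inch margins (assumed).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "letter",
        "page-width": "8.5in",
        "page-height": "11in",
        "margin": "1in",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: blocks flow into the body region of pages built from the master.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "letter"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for text in paragraphs:
        block = ET.SubElement(flow, fo("block"), {"font-size": "12pt", "space-after": "6pt"})
        block.text = text
    return root


def to_fo_string(paragraphs):
    """Serialize the XSL-FO tree to a string."""
    return ET.tostring(build_fo_document(paragraphs), encoding="unicode")


if __name__ == "__main__":
    print(to_fo_string(["PDF from XML.", "One block per paragraph."]))
```

The point of the split is that the XML carries the content while the page master carries the layout, which is how the formatter can promise the same pages everywhere.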

After PDF became a standard for inter-system formatted documents, Microsoft started using PDF as a file extension on some of their system data files in early versions of Windows beta programs. The outcry from their beta testers (also called paying customers) and a threatened lawsuit from Adobe changed their tune. Their excuse: "We didn't know what Adobe Acrobat was." Duh.

With the advent of Adobe Acrobat 5.0, it is now possible to use a PDF form in your web applications. Adobe has beefed up their JavaScript implementation so that most of the things you would ever want to do with a PDF can be done programmatically. There is even an object called the ADBC, or Acrobat Database Connectivity, object that allows for database access in much the same way as Microsoft's ActiveX Data Objects, and it serves as an interface to ODBC for the PDF. Subordinate to it there are the Connection object and Statement object. As the names might imply, the Connection object manages the actual connectivity to the database, and the Statement object allows for the execution of SQL statements.

This is all well and good, but why do this when you could do a standard HTML form with server scripting behind it? There are several reasons.

  • Your form looks just like the printed version. WYSIWYG. This may make the learning curve easier if you are moving from paper forms to the web. Or perhaps your forms are already in PDF, and you don't want to reinvent the wheel.
  • The ability to save the document in PDF. Duh.
  • Collaboration. Version 5.0 allows users to annotate PDFs as if they were on their desktop, and the annotations are instantly updated on the server so that real time collaboration is possible. It is also possible to download the form, mark it up offline, and upload the annotations later.
  • Encryption. Acrobat 5.0 offers 128-bit RC4 encryption right out of the box. You can publish your document on your web server, restricting access to only those you specify on your trusted certificate list.
  • Digital signatures. This is a pretty cool feature. It uses the RSA algorithm to produce the public/private key pair and the X.509 standard for certificates.

The neat thing is, when you have your form published on your web server (intranet, internet, whatever), users who are in your trusted certificate list can log in to the document, fill out the form, sign it, and submit it. Through JavaScript, you can have data validation down to the field level. You can also specify, through code, that the document be submitted only if the signature is valid and authenticated. The submitted form data can be sent in Adobe's Forms Data Format, HTML, or XML. You can also send the entire PDF. Choosing to send it as HTML puts the field data into the body of an HTTP request, which can be parsed and dealt with at the server.
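
As a rough sketch of the server side of an HTML-style submission: I am assuming here that the field data arrives as an ordinary URL-encoded form body, and the field names (`applicant_name`, `program_id`) and the function name are hypothetical, not anything Acrobat defines. Signature checking is a separate step and is not shown.

```python
from urllib.parse import parse_qs

# Hypothetical field names; a real form defines its own.
REQUIRED_FIELDS = ("applicant_name", "program_id")


def parse_submission(body: bytes) -> dict:
    """Decode a URL-encoded form body and check that required fields are filled."""
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    # parse_qs returns a list of values per field; keep the last one sent.
    flat = {name: values[-1] for name, values in fields.items()}
    missing = [name for name in REQUIRED_FIELDS if not flat.get(name, "").strip()]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return flat


if __name__ == "__main__":
    print(parse_submission(b"applicant_name=Ada+Lovelace&program_id=42&notes="))
```

Validating again on the server is deliberate: field-level JavaScript in the form is a convenience for the user, not something the server should trust.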

I'm using this setup at work. I have a customer that does oversight for a federal program, and he wants to move away from paper. The problem was that a signature is a requirement for this program. So he sent me the form in Microsoft Word format, I distilled it into PDF, added Acrobat form fields, and published it. The signatures are all housed in a secure directory, and the scripting writes data to the database only when Acrobat validates the signature on the form. Though we haven't put it into production yet, it seems to work as expected.

Caveat: This only works in version 5.0. Also, everyone accessing the form has to have either Acrobat 5.0 or Acrobat Approval 5.0 on their machine. The annotation, saving to disk, and signature features are not available in Acrobat Reader. As for the encryption, the crack people usually bring up, Ian Goldberg's in 1997 (about 250 computers, a little under four hours), was against a 40-bit key, the strength that earlier PDF encryption used. A 128-bit key is far beyond that kind of brute-force attack. I still wouldn't rely on this if you are doing top secret work, but it will keep the boys in accounting or whatever from peeping where they shouldn't. Unless they are hella 1337 accountants and have a beowulf cluster or something.
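
To put key lengths in perspective: each extra bit doubles the work of an exhaustive key search. The sketch below assumes, purely for illustration, a setup that can exhaust a 40-bit key space in four hours (roughly the scale of the 1997 result mentioned above) and scales that figure to other key lengths. The function names and the four-hour baseline are my assumptions, and real attacks rarely need to search the whole key space.

```python
HOURS_PER_YEAR = 24 * 365


def brute_force_hours(key_bits, reference_bits=40, reference_hours=4.0):
    """Scale a known exhaustive-search time to a different key length.

    Assumes the search cost doubles with every additional key bit.
    """
    return reference_hours * 2 ** (key_bits - reference_bits)


def brute_force_years(key_bits, reference_bits=40, reference_hours=4.0):
    """Same estimate as brute_force_hours, expressed in years."""
    return brute_force_hours(key_bits, reference_bits, reference_hours) / HOURS_PER_YEAR


if __name__ == "__main__":
    for bits in (40, 56, 128):
        print(bits, "bits:", brute_force_years(bits), "years")
```

Under these assumptions a 128-bit search comes out on the order of 10^23 years, which is why the practical weak points are passwords and key handling rather than the key length.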

PDF stands for Portable Document Format. It is a proprietary format, owned and maintained by Adobe Systems Incorporated, but it has become something of a lingua franca for document exchange on the web and elsewhere.

Its popularity is no accident.

PDF and PostScript

Adobe Systems also developed (and still owns) the PostScript language, which remains unrivaled as a Page Description Language for print applications. Their experience with PostScript was an important proving ground for the concepts that have made PDF so successful, but they have changed and extended them with PDF:

  • Device Independence - Unlike HTML, which achieves device independence by leaving most of the duties of rendering the appearance of the content to the tender mercies of the client, the goal of PostScript was to create page descriptions that render as nearly identically as possible on whatever display device they are thrown at. PDF remains true to that ideal, but it allows document creators to trade some absolute fidelity for file size based on the document's target purpose (though still not on the target device).
  • Extensibility - PostScript is a full-featured programming language optimized for rendering text and graphics. PDF is 'just' a document format, but its specification expressly enables the document creator to include information of any type, and rules for creating new types of content (including 'program' content) in a PDF.
  • Accessibility - Adobe Systems went against the common wisdom of the day when it made the PostScript Language Specification freely available, but this move allowed anyone to create PostScript, which required a PostScript interpreter to run. Adobe sells PS interpreters for large sums of money to makers of printers and other display systems. When it comes to PDF, the idea is flipped around - Adobe gives away a PDF interpreter (Acrobat Reader), but it markets the canonical PDF creation tool, Adobe Acrobat/Distiller, as a reasonably priced mass-market application.

Notwithstanding their considerable commonalities, PDF is not really a variant or an extension of PostScript, as is often claimed. For one thing, their structures are radically different - a PostScript document is a program whose instructions are executed sequentially. A PDF document is a collection of 'contents', not necessarily in any particular order, with a catalog that allows them to be looked up when needed. The master catalog contains pages, document information, and many other things, quite a few of which may also be catalogs in themselves. These hierarchical structures are referred to as 'dictionaries'; it is a construct borrowed from PS, but it's used for purposes unimaginable to PostScript.
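
The catalog-and-lookup structure is easier to see in a file than in prose. Below is a minimal sketch that writes a one-page PDF by hand: five numbered objects (each a dictionary), a cross-reference table recording the byte offset of every object, and a trailer that points at the catalog. The function name and the page contents are my own for illustration, and a real generator would handle string escaping, compression, and much more.

```python
def build_minimal_pdf(text="Hello, PDF"):
    """Assemble a one-page PDF: numbered objects, an xref table, and a trailer.

    `text` must be plain ASCII without parentheses or backslashes,
    because this sketch does no string escaping.
    """
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        # 1: the master catalog, which points at the page tree.
        b"<< /Type /Catalog /Pages 2 0 R >>",
        # 2: the page tree, listing its pages by reference.
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # 3: one page, referring to its content stream and its font.
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        # 4: the page content, a short stream of drawing operators.
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        # 5: a font dictionary naming one of the base 14 fonts.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # The cross-reference table lets a viewer jump straight to any object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets


if __name__ == "__main__":
    data, _ = build_minimal_pdf()
    with open("minimal.pdf", "wb") as handle:
        handle.write(data)
```

Notice that nothing forces the objects to appear in reading order; a viewer starts from the trailer at the end of the file, finds the catalog, and follows references from there.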

It is correct in a way to say that PDF is a superset of PostScript, because a PDF page content may be, for instance, an Encapsulated PostScript picture, and the viewer must be able to render it. However, the content may also be a bitmap or a movie or sound file, or a backup of E2. The viewer may have to pass the content off to some other application (or plugin) to handle, but from the document's viewpoint they are all on an equal footing.

Adobe's Free Viewer

Adobe makes available free PDF viewers for Windows, Mac, and many Unix-like operating systems, as well as corresponding plugins for the Big Browsers. This is an important factor in PDF's wide acceptance, but not enough to define a de facto standard. Toward that end, they cleverly embedded their base 14 PostScript fonts in the reader, and made it easy for PDF creators to make font-free documents that would render reasonably (okay, extremely) well across platforms. Throw in painless downsampling of raster graphics, and you have a tightly controlled layout in an economical file size. Oh, by the way, it also has form fillout capability and internal and external hyperlinking. All in all, an attractive choice for network document sharing where layout is paramount.

Adobe's PDF Suite

The reader is built to take maximum advantage of PDF's capabilities, but the hub of a PDF environment is the Acrobat suite. It includes Acrobat, which is like an Acrobat Reader with editing features, and Distiller, which is a PostScript to PDF converter.

Acrobat has never really caught on at the typical desktop level, but in print graphic and some e-biz circles, it's as common as MS Word. Of course, it's nothing like Word - you don't type up a PDF. Usually, original document creation is done in any application of choice, then "printed" to a PostScript file and converted to PDF using Distiller.

Distiller has settings that enable a PostScript file to be repurposed for different target uses. You may want to embed fonts, high-resolution graphics, and complex color information for printing, or you may want a stripped-down version for the web. Maybe both.

Once you have a PDF to work with, you can open it in Acrobat, where you can add form fields, digital signatures, media clips, annotations, JavaScript, bookmarks, and embedded comments; you can scan and OCR hardcopy, index directories of PDFs for full-text searching... whew, it's a Swiss Army Knife. The suite lists for US $249.00, and I know of no other application that drags in as much cool stuff for the price.

Still, the app's assumption is that you'll be doing all your document creation and heavy editing somewhere else, so there are only the most rudimentary tools for changing the page content. We can forgive Adobe somewhat for this, because they have also made available an extensive, thoroughly documented API for extending Acrobat.

Acrobat SDK

The PDF Reference and Acrobat Core API Reference comprise nearly 4000 pages of densely written material, and there are thousands more on ancillary features and programming interfaces. The SDK and all the documentation are free for downloading at adobe.com, and Adobe does not impose a license fee on applications or Acrobat plugins developed with them*. You have complete freedom to add menu items and tools into Acrobat, and to define your own actions in the application, and your own extensions to the PDF format. Adobe only asks that you register a four-character identifier with them so that your plugins and PDF content types don't end up with the same name as someone else's. There is no charge for registering an identifier.

Finding your way through all that documentation to begin writing Acrobat plugins is not very easy (or it wasn't for me), and the cross-platform imperative produces some oddities, but there are some very powerful Acrobat plugins on the market, and more on the horizon.

* There are license fees associated with plugins for the Acrobat Reader, but not for the full version of Acrobat.

Contrary Musings

Much of the PDF/Acrobat design seems oriented toward the web, almost as though Adobe thought of PDF as some kind of replacement for HTML. That's a kooky idea, and I really hate having to load that browser plugin to view content that would have been just as nifty in HTML.

It's handy to have documents in a compact, portable format to move them around, but the usual reason to exert fine control over a document's layout is because you expect it to be printed. Many of the byte-trimming features of PDF are deleterious to the quality of printed output. A default installation of Distiller will downsample raster graphics to 150 or 72 DPI, depending on the version. Your family album will look fine on the screen, but it will print like crap.
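
The arithmetic behind that complaint is simple enough to sketch. The 4 by 6 inch print and the 300 DPI source resolution are assumptions picked for illustration, and the function names are mine; only the 150 and 72 DPI targets come from the Distiller defaults mentioned above.

```python
def pixels_at(width_in, height_in, dpi):
    """Pixel dimensions needed to cover a print size at a given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def retained_fraction(source_dpi, target_dpi):
    """Fraction of the original pixels left after downsampling.

    Resolution drops in both directions, so the pixel count
    falls with the square of the DPI ratio.
    """
    return (target_dpi / source_dpi) ** 2


if __name__ == "__main__":
    for dpi in (300, 150, 72):
        print(dpi, "DPI:", pixels_at(4, 6, dpi), "pixels,",
              retained_fraction(300, dpi), "of the original data")
```

Under these assumptions, dropping from 300 to 72 DPI keeps under 6 percent of the original pixels, which is plenty for a screen and nowhere near enough for a printer.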

Acrobat enforces TrueType font license protection rigidly. This will result in some fonts being unable to be embedded in a PDF, even some fonts that are not intended to be so restricted. If you intend to prepare print-ready documents using freeware fonts or those from Corel products, you should be especially aware of this issue.

Acrobat contains safeguards against malicious content, but PDF's extensibility has inevitably led to the advent of PDF-borne trojans. This is not a widespread problem yet, but Pandora's box is open.

Conclusion

I can't heap enough compliments on Adobe for their care, skill, and cleverness in the design of PDF and Acrobat. Much of that skill went toward the marketing of the product, but their strategy seems to have been to make it irresistible.

It's great for sharing business documents around, and the forms capability has some interesting possibilities. It really shines in a printing environment where, for instance, a high-res "real" version and a lightweight electronic proof can be made from the same source document. It's a big pain in the neck when graphic designers overuse it for reasons of ego, rather than functionality.

PDF has become ubiquitous, and it will probably stay that way, not only because of its intrinsic merits, but because it is maintained and marketed by such a deft company.


Author's Note (January 2008): The above was written in 2002, when Acrobat was shipping Version 5, which corresponds to PDF version 1.4. As I write this, Adobe is shipping Acrobat 8. Some of the details in this writeup are no longer accurate (for instance, the document format formerly known as PDF 1.7 is now managed not by Adobe, but by the ISO, under Standard 32000). I was going to delete this, because I'm no longer qualified to opine on current PDF internals, but I find that the overall flavor of the writeup has held up pretty well so far. I'll leave it here for now, but it's a fat target for a good superseding.
