XML, Network Design, and Content Management

For some time now, I've been thinking about content management systems that we could use for our web sites at work. From the start, I looked at using XML. Initially this may have been because I was just learning about XML and was eager to put it to use. More than that, I knew that data stored as XML could be usefully transformed with things like XSLT into a variety of formats, which sounded like exactly what we needed.

Now that I've been actively working with XML for a couple years, I keep coming back to the idea of using it in a content management system. Why? Am I falling prey to the buzzwords and to the enticements of working with cool technologies? Does the XML hammer make everything look like a nail, or is XML honestly a critical part of a sensible solution?


Here's the situation: we have a couple dozen web sites, maintained by people whose experience with the web varies widely. The large majority of the "web site coordinators," as we call them, are department secretaries with little or no training in web-related matters. All of them use Dreamweaver for site maintenance. I'm sure this is not so different from how web sites are managed in many organizations, perhaps your own.

A few problems with this setup:

In short, we need a content management system. Desperately.

Content Management Systems

A content management system — at least as it pertains to the web — is a tool for managing the production, storage, and delivery of web content. The basic idea is to use templates built by a designer to deliver content that is produced by a content expert. The templates and content are entered into the system separately and are merged before being delivered to the end user as a web page. In addition, a CMS may involve ways of controlling workflow, letting an editor give final approval before publishing to a live site.
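To make the merge step concrete, here is a minimal sketch in Python. It assumes a designer-owned template with two placeholders and content supplied as plain paragraphs. The names PAGE_TEMPLATE and render_page and the sample text are hypothetical, invented for illustration, and not taken from any particular CMS.

```python
from html import escape
from string import Template

# Hypothetical template a designer might maintain; the placeholder
# names ($title, $body) are assumptions made for this sketch.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1>$body</body></html>"
)

def render_page(title, paragraphs):
    """Merge plain-text content into the designer's template."""
    # Escape the content so stray characters cannot break the markup.
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return PAGE_TEMPLATE.substitute(title=escape(title), body=body)

page = render_page("Office Hours", ["Open 9-5.", "Closed on weekends & holidays."])
print(page)
```

The content expert only ever supplies the title and paragraphs. The markup lives in one place and can be changed without touching the content.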

[Figure: CMS workflow]

A CMS would help address many of the problems I cited earlier, because the template design could be left to those who understand accessibility and web standards. The content itself could be cleaned up by the CMS so that it, too, would meet our standards. The coordinators / content experts could focus on just supplying content without worrying about everything else.

The Search

So far, I haven't found a CMS that meets our needs.

I've looked at a number of different systems (everyone on my team has; I'm describing only my experiences here). At first I looked at XML-based systems like Cocoon and AxKit, thinking that it'd be great to use XML and be able to transform the content into a variety of formats on demand (HTML, PDF, plain text, RTF, even Excel). I very much feel like the only one in the office interested in XML, though, and I certainly couldn't imagine teaching people who can't handle HTML how to edit XML, much less create it. Rather than simplifying the process, we'd be throwing up roadblocks.

I spent a lot of time looking at browser-based systems like Zope's CMF, Spectra, Midgard, Bricolage, and even blogging systems like PostNuke and Movable Type. This is all great software and can make for a very powerful system. In the end, though, the great failing of a browser-based CMS is that it's all handled through the browser. Who wants to always have to edit large files in a tiny little text box? That works reasonably well with weblogs, where generally only short entries are being made, but with documents of any size it becomes unmanageable.

And what do we do in five years when the system no longer meets our needs? The site would be in a database and it would take a lot of work to get it out. My team went through that a couple years ago with a site that had been maintained in an ill-fitting cookie-cutter-style database-driven CMS designed by an external developer. It took a long time to extract the content and convert it to static HTML pages. We are not eager to do that again.

Once I had faced the problems of a browser-based CMS, I started to give serious consideration to applications like HTML Transit. Transit translates Microsoft Office files into HTML, formatting them according to a variety of templates, maintaining links, etc. It's actually quite impressive, especially in its more recent incarnations like Stellent's Content Server. And there's a bonus: we already own Transit and have been using it for a couple years on one site. That experience, however, has shown me little more than that HTML Transit (at least the three-year-old version that we own) is a pain to use and that we're unlikely to upgrade. What happens when we upgrade to a version of Office that Transit can't read? We'd either be stuck with thousands of Word documents to convert to HTML, or back where we started with far too many static HTML documents to edit by hand. No good.

Still, I'm entranced by the notion of leveraging site coordinators' and content experts' familiarity with office software. Most of our site content originates in Word or Excel. We should find a way to use those files as the content source in a CMS.

The Problem is the Page

A large part of what's been blinding us in our search for a good CMS is our emphasis on what the page will look like in a PC-based browser. We have to stop thinking about things in those terms. The focus should be on the content.

In theory a CMS should address this issue. But because our focus has so far been on page and template design rather than content production and storage, we've tended to direct our search toward systems whose emphasis is on templating or database management -- managing content once it's in the system. Yes, these are all important considerations. I've come to realize, though, that this is a backwards way of approaching the problem.

The goal is a system that can be used long-term, in which the same content can be repurposed or processed so that it can be delivered to end users in accessible formats, including some that may not have been invented yet. It also needs to be dead simple for non-techies to use. If we don't want to recreate the content in five or ten years or every time we change the look and feel of a site, then we need to focus on finding a system that stores the content in a way that facilitates delivering the content in a variety of formats.

Sounds a lot like XML, doesn't it?

End-to-End Network Design

Let me take a quick diversion into the design of the Internet and draw some parallels with the CMS situation. This will make sense in the end, trust me.

The Internet was designed as an end-to-end network. That is, the intelligence in the network is placed in the applications at the fringes, while the core of the network is kept simple. This core serves one purpose: move packets of data. That's it. The network doesn't care if the packets are fragments of an email message, a web page, a JPEG image, streaming video, a phone call -- they're just data that need to be routed.

The engineers who designed it this way did so deliberately, because they could not foresee all the ways the network would be used.

This elegant design has allowed and inspired tremendous innovation. As Lawrence Lessig writes in his excellent The Future of Ideas [1]:

[B]ecause applications run on computers at the edge of the network, innovators with new applications need only connect their computers to the network to let their applications run. No change to the computers within the network is required. If you are a developer, for example, who wants to use the Internet to make telephone calls, you need only develop that application and get users to adopt it for the Internet to be capable of making "telephone" calls. You can write the application and send it to the person on the other end of the network. Both of you install it and start talking. That's it.

The web itself is a good example. HTTP works on top of TCP/IP, allowing all sorts of different documents on all sorts of computers and operating environments to be linked and shared across a network. The network itself needs no modification for the web to run; the complexity in the system is pushed to the edges, to the applications that connect to the Net. I think it fair to claim that much of the web's success is due, as with TCP/IP, to its open nature: not only does it connect many different documents over HTTP, it allows a variety of protocols, as well: Gopher, FTP, NNTP, etc. The core is kept simple, application and platform agnostic, leaving room for innovative uses and developments.

What does this have to do with content management?

Much as the simplicity of the end-to-end network design has meant a tremendous flexibility and innovative productivity on the Internet, using XML as the basis for a content management system will open the door to a great deal of power and flexibility.

The content could come from anywhere: MS Office files, StarOffice files, hand-edited XML, a relational database. The output could be anything: (X)HTML, PDF, RTF, RSS. As long as the core files are XML, the system can remain flexible and accessible.

Chances are, for some time to come the content will originate in Word documents. Over time, as we change or upgrade office software, we can tweak the system to convert files to XML. Perhaps in a few years there will be software available that makes editing XML as easy as using Word. Or maybe we'll be able to switch to StarOffice / OpenOffice, whose native file format is XML. Whatever happens, we'll have this core collection of XML documents to work with, which will leave us free to maneuver at the edges of the system.

[Figure: CMS workflow]

Because the core is XML, we can transform or repurpose our XML content in any number of different ways to meet end user needs. We can transform it to XHTML, wrap templates around it, and send it out as a traditional web page. The system can produce high-end DHTML versions for the latest browsers, or more basic versions for older or text-only browsers. Versions for PDAs and cell phones. SOAP messages. Or uses we haven't thought of yet.
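As a rough illustration of one source feeding several outputs, here is a small Python sketch that renders the same stored XML as an HTML fragment and as plain text. In practice XSLT would likely do this work. The plain Python functions stand in for stylesheets, and the element names (article, title, para) and sample content are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stored document; the element names are assumptions.
SOURCE = """<article>
  <title>Parking Permits</title>
  <para>Permits are sold in the main office.</para>
  <para>Bring your staff ID.</para>
</article>"""

def to_html(root):
    """Render the document as an HTML fragment, ready to wrap in a template."""
    parts = [f"<h1>{escape(root.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in root.findall("para")]
    return "\n".join(parts)

def to_text(root):
    """Render the same document as plain text with an underlined title."""
    title = root.findtext("title")
    lines = [title, "=" * len(title), ""]
    lines += [p.text for p in root.findall("para")]
    return "\n".join(lines)

root = ET.fromstring(SOURCE)
html_version = to_html(root)
text_version = to_text(root)
print(html_version)
print(text_version)
```

Adding another output format means adding another function at the edge. The stored XML does not change.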

All through this, as the content is created in different ways, as we use new systems for delivering it to a variety of end purposes, the XML core will remain the same. Flexible, extensible, and powerful.

So where does this leave us?

As you can tell, I'm pretty solidly convinced that XML is the way to go for storing content. More than likely, we'll end up using Apache and AxKit to handle the delivery of styled content to the end user. For web services, we may use XML-RPC, SOAP, even Jabber.

The trick is going to be creating the XML content in the first place. I'm honestly not sure how best to do that. Whatever we do, it needs to be dead simple for the site coordinators, or they'll never use it. That's part of my motivation for relying on office software like Word. There are third-party tools like wvWare and Majix that could help us convert Word files to XML, and it looks like Microsoft Office does a lot with XML to start with. That will take a good deal of experimentation.

There are also issues of metadata to consider, and perhaps most daunting: link management. In order to manage links — update pages if a document is deleted or moved, that sort of thing — we may need to create some sort of check-in system. Perhaps we can use AxKit for that, too, or Zope. I'm really not sure how to handle those things and am quite happy to take suggestions.
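One piece of link management is at least easy to sketch: if the documents are XML, a script can walk them and report links whose targets no longer exist. The Python below is only a sketch under assumptions. The in-memory DOCUMENTS store, the paths, and the link element with an href attribute are all hypothetical, and a real check-in system would need to do much more.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store keyed by path; the <link href="..."> element
# is an assumed convention for this sketch, not an established schema.
DOCUMENTS = {
    "hr/benefits.xml": '<article><para>See <link href="hr/forms.xml">forms</link>.</para></article>',
    "hr/forms.xml": '<article><para>Back to <link href="hr/benefits.xml">benefits</link>.</para></article>',
    "news/2002.xml": '<article><para>Old <link href="news/2001.xml">archive</link>.</para></article>',
}

def broken_links(documents):
    """Return (source, target) pairs whose target is not in the store."""
    broken = []
    for path, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # iter() visits link elements at any depth in the document.
        for link in root.iter("link"):
            target = link.get("href")
            if target not in documents:
                broken.append((path, target))
    return broken

print(broken_links(DOCUMENTS))
```

A check like this could run whenever a document is deleted or moved, flagging the pages that need updating before the change goes live.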

It may turn out that focusing on data storage is blinding me to some obvious and top-notch ideas for handling content creation and entry into the system. If so, hopefully that will become apparent soon. Nevertheless, I remain convinced that building around an XML core will allow us to remain flexible and able to build web resources in a way that takes advantage of changes in technology rather than being hampered by them.

Comments? Ideas? Let me know! Send email to sam at afongen.com
Sam Buchanan, February 2002

[1] Lawrence Lessig, The Future of Ideas: The Fate of the Commons in a Connected World (New York: Random House, 2001), 36-37.