Moving to an XML-based Web Site

This article describes the experiences and lessons learnt when transferring a static, HTML-based Web site to a DITA and XML-based solution.

By Tony Self

Introduction

In early 2007, I started the task of reworking the ageing HyperWrite Web site. The site was originally created in 1995. It underwent a major rework (to a frames-based design) in 1997, and was reworked in 1999, 2000 and 2002. In the decade since the Web site was launched, not only has Web technology moved on, but HyperWrite's activities, focus and business direction have also changed considerably.

Screen capture of HyperWrite Web site circa 1995

Time and budget were allocated to renovate the site to better serve HyperWrite's business needs, and to serve as a practical example of the company's capabilities.

Reasons for the Site Renovation

For the last few years, the site was maintained using Microsoft FrontPage 2003®. When it was decided to completely revise the site in 2007, one of the prime motivators was to move to being fully standards-based (XHTML and CSS). But the reasons for updating the site were not only technical.

Analysis of the Web site logs over a 12 month period showed the most popular area of the site was the knowledge part (where magazine-style articles relating to technical documentation and Help technologies were published). It was decided to give that area greater prominence. The services offered by HyperWrite were better categorised into training, consultancy and conferences (rather than lumped together as services).

The Web logs also showed that Firefox was used on average by 15% of site visitors. Because Firefox adoption was increasing over that period, the figure for the last month of the year was probably well above that average. Previously, when HyperWrite was mainly providing Windows Help systems consultancy, we could assume that our target audience nearly all used Internet Explorer for browsing. The greater importance of open systems in our business was another argument towards fully embracing XHTML.

Web Statistics for HyperWrite site showing 15% Firefox Users

The site had many inconsistencies, accidentally introduced over the years via editing tool changes and style changes. Any site revamp will provide the opportunity to standardise the pages, but I was keen to find a way to reduce the likelihood of the site drifting in future.

The Role of XML

XML is great for enforcing standards; if a document doesn't conform to its XML rules, a validating editor will flag the problem, and may refuse to save the file! But there are hundreds of XML languages. For Web sites and similar types of content, the XML applications most appropriate are RSS, DITA, DocBook and, of course, XHTML.

DITA plays an increasingly important role in HyperWrite's consultancy and training business, so I wanted to include DITA content in the site. For the past twelve months, all new articles had been written in DITA or Simplified DocBook, and transformed to HTML for use on the Web site. Ideally, these articles would be dynamically transformed to HTML, and "wrapped" within the Web navigation and branding elements.

As the project developed, a site map format was required to store information about the structure of the site. As ASP.Net was the technology platform on which the site would be deployed, the ASP.Net "sitemap" format was an option, as was the "ditamap" format.

ASP.Net and Visual Web Developer

Some portions of the site, such as the newsletter subscription page, required server-side processing. Previously, the site had used Microsoft's ASP technology for this purpose. The site was, and would continue to be, hosted on Windows Server 2003 with Internet Information Server, which supports both ASP and ASP.Net.

For page editing, FrontPage was discounted as an option, because of its inability to work in pure XHTML. Adobe Dreamweaver® was considered, but Microsoft Visual Web Developer® (VWD) was selected as the page editing software. VWD is a solid XHTML and generic XML editor, with an integrated CSS editor. Its primary role, though, is as a Web application development tool for ASP.Net. A single editor could therefore be used for programming of server-side logic, and for any static Web pages.

ASP.Net allows easy server-side XSL transformations of XML content. Provided an XSL-T file is already available for the transformation, the task of creating an ASP.Net page to turn an XML data file into HTML can take as little as 30 seconds. Similarly, if a site map is available (in the ASP.Net "sitemap" XML format), a dynamic table of contents (TOC) for the site can be created instantly. As soon as the sitemap is updated, the TOC is automatically updated.

Architecture

The Web site's new architecture is essentially a three column design, with major navigation buttons in the left column, the main content in the centre, and sidebar information in the right. A branding banner and a breadcrumbs trail run across the top of the design, and a footer block along the bottom.

The branding banner is an ASP.Net "included page". As the server delivers a page to the browser, it inserts the included page content at the top. The actual banner code only occurs once, in the included page itself. It is re-used on every page in the site. If the banner needs to be changed, only the included page needs to be altered.

The breadcrumb trail is automatically generated through a standard ASP.Net design-time control. The design-time control simply references the sitemap XML file, and automatically generates the breadcrumb trail.
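The idea of deriving a breadcrumb trail from a sitemap file can be sketched in a few lines. The Python below is only an illustration of the principle, not how the ASP.Net control is implemented; the sample sitemap omits the XML namespace for brevity, and the page titles are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; titles are invented for illustration.
SITEMAP_XML = """
<siteMap>
  <siteMapNode url="/default.aspx" title="Home">
    <siteMapNode url="/Training/default.aspx" title="Training">
      <siteMapNode url="/Training/crs_xdkintro.aspx" title="Course outline"/>
    </siteMapNode>
    <siteMapNode url="/Conferences/default.aspx" title="Conferences"/>
  </siteMapNode>
</siteMap>
"""


def breadcrumb(sitemap_xml, url):
    """Return the titles from the root node down to the node matching url."""
    root = ET.fromstring(sitemap_xml)

    def walk(node, trail):
        # Extend the trail with this node, then search its children.
        trail = trail + [node.get("title")]
        if node.get("url") == url:
            return trail
        for child in node.findall("siteMapNode"):
            found = walk(child, trail)
            if found:
                return found
        return None

    for top in root.findall("siteMapNode"):
        found = walk(top, [])
        if found:
            return found
    return []  # The URL is not listed in the sitemap.


print(" > ".join(breadcrumb(SITEMAP_XML, "/Training/crs_xdkintro.aspx")))
```

Because the trail is computed from the sitemap each time, moving a page in the sitemap changes its breadcrumbs without touching the page itself.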

Likewise, the main navigation buttons in the left column are derived through a design-time control referencing the same sitemap XML file. The ASP.Net sitemap XML file format follows a simple sitemap/sitemapnode/sitemapnode structure. For the HyperWrite site, the sitemap XML is generated (through an XSL-T file) from a ditamap file.
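The ditamap-to-sitemap conversion is essentially a one-to-one mapping of nested elements. On the HyperWrite site this is done with an XSL-T file; the Python below is a stand-in that shows the same mapping, assuming each topicref carries href and navtitle attributes. The sample content is invented, and a real ASP.Net sitemap file would also need its XML namespace declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical ditamap extract; hrefs and titles are invented.
DITAMAP_XML = """
<map>
  <topicref href="/default.aspx" navtitle="Home">
    <topicref href="/Training/default.aspx" navtitle="Training"/>
    <topicref href="/Conferences/default.aspx" navtitle="Conferences"/>
  </topicref>
</map>
"""


def topicref_to_node(topicref):
    """Convert one topicref (and, recursively, its children) to a siteMapNode."""
    node = ET.Element("siteMapNode", {
        "url": topicref.get("href", ""),
        "title": topicref.get("navtitle", ""),
    })
    for child in topicref.findall("topicref"):
        node.append(topicref_to_node(child))
    return node


def ditamap_to_sitemap(ditamap_xml):
    """Build a siteMap element tree from ditamap text."""
    sitemap = ET.Element("siteMap")
    for topicref in ET.fromstring(ditamap_xml).findall("topicref"):
        sitemap.append(topicref_to_node(topicref))
    return sitemap


sitemap = ditamap_to_sitemap(DITAMAP_XML)
print(ET.tostring(sitemap, encoding="unicode"))
```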

To further simplify matters, ASP.Net provides a "master page" feature, which allows common (repeated) elements of a page to be locked into a template-like skeleton. The new site uses a master page to set the banner, breadcrumbs, navigation, sidebar and the footer block. This leaves just the main content to be composed for each page.

The main content can be:

normal XHTML, typed directly in Visual Web Developer
RSS, transformed on the server by an XSL-T file
DocBook or DITA XML, transformed on the server by an XSL-T file.

The transformed RSS, DITA and DocBook content is dynamically placed within the master page template.

Like the banner, the footer block is an included page.

The sidebar was used in the previous design, and was intended to carry snippets of news, hints and related links. However, experience showed that the material was very rarely updated, and was often stale. This was probably because we focussed on keeping the main content up-to-date. If we didn't happen to notice that the sidebar information was obsolete, it would never get changed.

The new approach is to make the sidebar information a "conditional included page". If the master page script finds a file with a .inc extension and the same name as the current page, it displays that .inc page in the sidebar. If it can't find a specific .inc page, it looks in the current folder for a file named sidebar.inc, and places that file's content in the sidebar column. The .inc file can be XHTML or RSS; if it's RSS, it will be transformed (by an XSL-T file) to XHTML on the server.

For example, when a page within the /Training folder is requested, the server pulls in the sidebar.inc within the /Training folder. Likewise, a page within the /Conferences folder will pull in the /Conferences/sidebar.inc file. This approach meant that the master page could still be used for sidebar content, and that any changes to sidebar content would only have to be made once per section in the applicable sidebar.inc file.
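The lookup order described above (page-specific .inc file first, then the folder's sidebar.inc) can be sketched as follows. This is an illustration in Python rather than the site's actual master page script; the function name is hypothetical, and a set of paths stands in for the file system check the server would make.

```python
import posixpath


def resolve_sidebar(page_path, existing_files):
    """Return the sidebar file to include for page_path, or None.

    existing_files is a set of site-relative paths standing in for the
    file system check a server-side script would make.
    """
    folder, filename = posixpath.split(page_path)
    stem, _ext = posixpath.splitext(filename)

    # First choice: an .inc file with the same name as the page.
    page_specific = posixpath.join(folder, stem + ".inc")
    if page_specific in existing_files:
        return page_specific

    # Fallback: the sidebar.inc shared by the whole folder.
    folder_default = posixpath.join(folder, "sidebar.inc")
    if folder_default in existing_files:
        return folder_default

    return None


# Hypothetical set of files present on the server.
files = {
    "/Training/sidebar.inc",
    "/Training/crs_xdkintro.inc",
    "/Conferences/sidebar.inc",
}
print(resolve_sidebar("/Training/crs_xdkintro.aspx", files))
print(resolve_sidebar("/Training/default.aspx", files))
```

The first call finds the page-specific file; the second falls back to the folder's shared sidebar.inc.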

As you can deduce, the whole idea is highly dependent upon XSL-T. This should make perfect sense, because the source content needs to be turned into XHTML before it reaches the browser. Additionally, it is more efficient to transform to XHTML on-the-fly, as required, rather than pre-transform the content using an XSL processor.

The animation below shows the major components of the design and their provenance.

Legacy Links

A search through Google found that many other sites were linking to pages within the HyperWrite site. So that the rework of the site would not break these links, a re-direction facility was needed.

The previous site used a very flat structure, with most pages at the top level (root directory). Most pages had a .htm file extension, where the new pages would have a .aspx (ASP.Net) file extension. The new structure proposed having pages stored in section-level folders. So pages relating to training courses would be stored in a /Training folder. A link to crs_xdkintro.htm would need to be re-directed to /Training/crs_xdkintro.aspx.

One approach would be to keep files with the old names, with those pages comprising a single HTTP Redirect code line to bounce the user to the correct page. This seemed cumbersome.

The solution I eventually devised was to use the fact that custom Error 404 pages can be used within Internet Information Server (IIS). When a page that no longer exists is browsed to, the custom 404 page would display. This custom 404 page would bounce to a special .aspx page, passing it the name of the page originally requested. The .aspx page would then work out the name of the corresponding new page, and suggest the user link to it. (You can see an example of this in action: this is a link to a non-existent page called dita_training.asp.) This also gave the opportunity for the user to be prompted to update his or her bookmarks or links to the correct file name.

Flowchart showing the logic of the page redirection approach
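The step where the old page name is mapped to a new page could work in several ways. The sketch below assumes a simple lookup keyed on the file name without its extension; that assumption, the function name and the lookup table are all hypothetical, and the code is Python rather than the ASP.Net used on the site.

```python
import posixpath

# Hypothetical index of the new site: page stem -> new site-relative URL.
NEW_PAGES = {
    "crs_xdkintro": "/Training/crs_xdkintro.aspx",
}


def suggest_new_url(requested_path, new_pages=NEW_PAGES):
    """Map an old URL path to the matching new page, or None if unknown."""
    filename = posixpath.basename(requested_path)
    stem, ext = posixpath.splitext(filename)
    # Only old-style static and ASP pages are candidates for redirection.
    if ext.lower() not in (".htm", ".html", ".asp"):
        return None
    return new_pages.get(stem.lower())


print(suggest_new_url("/crs_xdkintro.htm"))
```

When no match is found, the function returns None, and the 404 handler can fall back to an ordinary "page not found" message.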

Roles of DITA and DocBook

If the driving force behind the site revamp was to improve efficiency and consistency, the underlying concept was the separation of content and form. For this separation to work, content has to be marked up semantically. The two main semantic markup languages in the documentation field are DocBook and DITA. Simplistically, DocBook is best suited to articles and books, while DITA is best suited to topic-based documents such as Web sites and Help systems.

In the HyperWrite scenario, apart from news items, most new content was magazine-style articles. For this type of content, DocBook and DITA were both appropriate. In fact, some "legacy" articles had been originally written in DITA or DocBook. Ideally, the new site should be able to accept both DocBook and DITA content, as well as plain old XHTML content.

Benefits of a DITA/DocBook Approach

The two main benefits of using DITA and DocBook in the Web maintenance cycle are:

efficiency (only bothering with content and semantics, rather than formatting and layout); and
consistency (content all shares the same look-and-feel when transformed, and can easily be changed globally).

Efficiency and consistency benefits are nicely demonstrated in a single example. Information articles sometimes include a by-line. Normally, a by-line appears at the top of the article, under the heading. But it may appear at the bottom of the article. It may be centred, right-aligned, or left-aligned, and may be italic, or perhaps in a smaller font. When writing in HTML, the author has to remember what settings are the standard for this particular Web site. (Such rules are often incorporated in a site style guide.) Typically, the author has to find a previous article with a by-line, and repeat its formatting in the new article. In DITA, there is a prolog section, with an author element. That's the only place the author name can go, and that's the only rule the author has to remember. That <author> tag will always be represented the same way in the HTML output.

Example of inconsistent alignment and size presentation of by-line in HTML

Authoring in DITA: the by-line information can only go in one place
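The by-line example comes down to one rule applied in one place. As a rough sketch (in Python rather than XSL-T, with an invented class name and sample topic), the transformation reads the author element from the prolog and always emits the same markup:

```python
import xml.etree.ElementTree as ET
from html import escape

# Minimal sample topic, invented for illustration.
DITA_TOPIC = """
<topic id="sample">
  <title>Sample Article</title>
  <prolog>
    <author>Tony Self</author>
  </prolog>
  <body><p>Article text.</p></body>
</topic>
"""


def render_byline(topic_xml):
    """Return the XHTML by-line for a topic, or '' if it has no author."""
    author = ET.fromstring(topic_xml).findtext("prolog/author")
    if not author:
        return ""
    # The presentation decision lives here, once, not in each article.
    return '<p class="byline">By {}</p>'.format(escape(author.strip()))


print(render_byline(DITA_TOPIC))
```

Changing the position or styling of every by-line on the site then means changing this one rule (or the CSS for the class), not editing each article.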

How DITA is Transformed

When a new article (DITA, DocBook or XHTML) needs to be added to the site, the first task is to update the site master ditamap. That ditamap provides a hierarchical map of the site.

Extract from the HyperWrite site's master ditamap

Metadata attributes in the <topicref> elements are used to identify the type of topic (XHTML, DITA or DocBook), the priority (order in which it will be presented), whether it should be included in a contents list, whether it is obsolete, and its unique identification number.

Once a new article's details have been added to the ditamap, the article itself has to be written, and then copied to the Web server. And that is the entire process.

The server-side scripts then:

automatically add a link from the Articles menu page to the showarticle.aspx page, including the ID of the new article in the link (eg, showarticle.aspx?id=68);
build breadcrumb links at the top of the page when the article is displayed;
make the page findable through the site's search feature; and
add a link to the page on the Sitemap page.
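The first of these steps, building the Articles menu from the ditamap metadata, might look something like the Python sketch below. The attribute names used for priority, obsolescence and the article ID are invented for the example (the article does not name them), as are the file names and titles.

```python
import xml.etree.ElementTree as ET

# Hypothetical ditamap extract. The priority, obsolete and articleid
# attribute names are invented for this sketch.
DITAMAP_XML = """
<map>
  <topicref href="articles/a.dita" navtitle="Sample Article A"
            format="dita" toc="yes" priority="1" obsolete="no" articleid="68"/>
  <topicref href="articles/b.xml" navtitle="Sample Article B"
            format="docbook" toc="yes" priority="2" obsolete="yes" articleid="41"/>
  <topicref href="articles/c.dita" navtitle="Sample Article C"
            format="dita" toc="no" priority="3" obsolete="no" articleid="70"/>
  <topicref href="articles/d.xml" navtitle="Sample Article D"
            format="docbook" toc="yes" priority="0" obsolete="no" articleid="12"/>
</map>
"""


def article_menu(ditamap_xml):
    """Return (title, link) pairs for current, listable articles in priority order."""
    refs = ET.fromstring(ditamap_xml).iter("topicref")
    # Skip articles excluded from contents lists or marked obsolete.
    visible = [
        r for r in refs
        if r.get("toc", "yes") == "yes" and r.get("obsolete", "no") != "yes"
    ]
    visible.sort(key=lambda r: int(r.get("priority", "0")))
    return [
        (r.get("navtitle"), "showarticle.aspx?id=" + r.get("articleid"))
        for r in visible
    ]


for title, link in article_menu(DITAMAP_XML):
    print(title, link)
```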

When the "showarticle" link is followed, the ASP.Net scripts dynamically assemble the page by including the banner, navigation, sidebar and footer panels, and then transforming the referenced DITA or DocBook page (after looking up the ditamap to see how it should be processed) into XHTML and placing it into the main content area. The page delivered to the user's Web browser is just XHTML.

The page you are reading now is stored on the HyperWrite Web server only in DITA format. It was written (and is maintained) using XMetaL 5 DITA Edition. On the Articles menu page, I have displayed the type of document in the right column, just for interest's sake. Likewise, my XSL-T transformations add a DITA or DocBook logo at the bottom of each transformed page.

Complications Set In - Searching

Being able to search through the content of a site is a very welcome feature for many readers. The existing HyperWrite site used Microsoft's Indexing Service to provide a useable search facility through an ASP page. Indexing Service is a base service for Windows, intended to allow users to search through text on their own PCs, but can be integrated with Internet Information Server. It knows how to index HTML files, PDF files, and most Microsoft Office formats.

It should be easy, I thought, to migrate the search page from ASP to ASP.Net, and the task would be complete. How wrong I was. Complications set in.

The first complication was that I no longer had HTML files to search. Much of my content was in DITA (usually ".dita") or DocBook (usually ".xml"). Indexing Service didn't know how to "index" these XML files.

After spending some time trawling the Web, I found a Beta Microsoft XML "filter" for Indexing Service, which would let it index XML files. (It turned out to be in the nick of time. Shortly after, the filter disappeared from the Microsoft site!)

I thought my problems were over. But alas, they were not. The Service now correctly found search terms in the DITA and DocBook content, but linked back to those XML files. I didn't want to display the raw XML file to readers! With a lot of work, I managed to change the link returned by the Indexing Service (when it found an XML file) to an ASP.Net page called "getarticle", with the name of the XML file passed as a query string. So a phrase found in "abcd.xml" would link to "getarticle.aspx?stub=abcd".

I then had to work on that getarticle ASP.Net page. It would have to look up a list of files, find the one with a file name matching the "stub", find out whether it was DITA or DocBook, and then transform it to XHTML using the appropriate XSL-T file. This ended up being easier than first thought, because I already had a list of files, with their file type. It was the ditamap file that was already in place to provide a list of links to the articles.
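The two halves of this work-around, rewriting the search result link and then resolving the "stub" against the ditamap, can be sketched as below. This is Python for illustration only; the stylesheet file names, the folder name and the use of a format attribute to record the file type are assumptions.

```python
import posixpath
import xml.etree.ElementTree as ET

# Hypothetical ditamap extract listing two articles and their types.
DITAMAP_XML = """
<map>
  <topicref href="articles/abcd.xml" format="docbook"/>
  <topicref href="articles/efgh.dita" format="dita"/>
</map>
"""

# Hypothetical stylesheet names, one per source format.
STYLESHEETS = {"dita": "dita2xhtml.xsl", "docbook": "docbook2xhtml.xsl"}


def search_result_link(xml_path):
    """Turn the path of an indexed XML file into a getarticle link."""
    stub = posixpath.splitext(posixpath.basename(xml_path))[0]
    return "getarticle.aspx?stub=" + stub


def find_article(ditamap_xml, stub):
    """Return (href, stylesheet) for the article whose file name matches stub."""
    for ref in ET.fromstring(ditamap_xml).iter("topicref"):
        href = ref.get("href", "")
        name = posixpath.splitext(posixpath.basename(href))[0]
        if name == stub:
            return href, STYLESHEETS.get(ref.get("format"))
    return None  # No article with that file name is listed.


print(search_result_link("articles/abcd.xml"))
print(find_article(DITAMAP_XML, "abcd"))
```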

Just when I thought I'd solved the problem, another one arose. Microsoft's Indexing Service does not, it transpired, correctly reference ASP.Net pages that use Master Pages. And my ASP.Net pages all used Master Pages. The problem was that the title of the page was not collected in the indexing process. So I was in the ironic position of being able to index my XML files, but not my plain old ASPX pages.

Another trawl through the Web saw me find a work-around, which tricked the Indexing Service into collecting the title. So the task of migrating the search from ASP to ASP.Net proved to be a very complex nest of complications, but ultimately, a solution was found. Fortunately, once a solution is found to this sort of indexing problem, the job is done; the Indexing Service is set-and-forget, and doesn't need to be maintained or administered.

RSS

One of the ambitions of the site re-design was to make it easier to update. It's all very well having content in DITA and DocBook, and in ASP.Net, but ultimately, someone with DITA/DocBook/VWD expertise will have to edit the content.

The most volatile content is news. A workshop is coming up, a new software tool has been released, there's a position open, or some survey results have been published. RSS, or "Really Simple Syndication" format, was made for this information. RSS is an XML schema for news snippets. It is extremely easy to edit, and extremely easy to incorporate in ASP.Net pages.

RSS has an optional category element. I decided to use this element so that I could present the same RSS file in different ways on different pages. On the home page, I display the most recent three items of the RSS file. On the News page, I display the six most recent news items. On the Conferences page, I display the four most recent news items with a category of "Events".
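Selecting the most recent items, optionally by category, is a small filtering job. On the site each view is produced by its own XSL-T file; the Python sketch below shows the same selection logic, with an invented sample feed and on the assumption that items are stored newest first.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; items are assumed to be stored newest first.
RSS_XML = """
<rss version="2.0">
  <channel>
    <title>News</title>
    <link>http://www.example.com/</link>
    <description>Sample feed</description>
    <item><title>Workshop announced</title><category>Events</category></item>
    <item><title>New tool released</title><category>Tools</category></item>
    <item><title>Conference dates set</title><category>Events</category></item>
    <item><title>Survey results published</title><category>Research</category></item>
  </channel>
</rss>
"""


def latest_titles(rss_xml, limit, category=None):
    """Return up to limit item titles, optionally restricted to one category."""
    titles = [
        item.findtext("title")
        for item in ET.fromstring(rss_xml).iter("item")
        if category is None
        or category in [c.text for c in item.findall("category")]
    ]
    return titles[:limit]


print(latest_titles(RSS_XML, 3))            # home page style: three newest
print(latest_titles(RSS_XML, 4, "Events"))  # conferences style: events only
```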

Over time, I may choose to make further use of the categories. Every re-use of an RSS file simply requires a different XSL-T file.

Limitations

Two of DITA's key features are ditamaps (a file collecting a group of topics into a publishable collection) and conrefs (a technique to embed content from one topic into another). The HyperWrite site does not (currently) cope with either of these important features. It can only handle articles written as standalone DITA topics, though this hasn't been a significant restriction because the articles are self-contained anyway. (The system can cope with links from one DITA topic to another DITA topic.)

Conclusion

Since the site went live, everything has run smoothly. The site is easy to update, and requires very little maintenance other than refreshing content. Most articles for the site are written in DITA; in fact, no new HTML content has been added since the site relaunch. (Finding time to write content is still a struggle!) The search has worked flawlessly, and interestingly, the site is noticeably quicker than before. The look-and-feel is totally consistent.

The primary time-saving feature stems from the separation of content from format. Provided my ditamap, RSS, DocBook or DITA files are valid, they will display and interact correctly.

Although the development of this XML approach required a lot of XML expertise and, to a lesser extent, ASP.Net expertise, once the infrastructure is in place, it is a simple environment to write in. I'm confident the approach will stand the test of time!