Transforming XML with XSLT

David Nash's article in C Vu of October 2003 covered reading XML data into a program. Here, I hope to introduce newcomers to manipulating XML with XSLT scripts, using an example drawn from that article.

What Is It?

Simply stated, XML is portable, self-describing data structured as a tree of named nodes, each possibly containing named attributes, text, and sub-nodes. It may already be the most popular way to bridge systems built using disparate technologies. XSLT stands for Extensible Stylesheet Language Transformations. The stylesheet part concerns the presentation of XML ('pure content'), but that's about all I have to say about that. The fact that transformations are part of XSLT shows the W3C's recognition of the need for manipulation, as well as presentation, of XML content.

If you have ever accidentally opened an XML file in a Microsoft environment, you might have been surprised to see a collapsible tree view like that in Figure 1. That is produced by the web browser using a built-in XSL transform which converts the data to HTML, and is perhaps the most widespread example of using XSLT to adapt input for a pre-existing parser. (It's also a handy way of looking at non-trivial XML files.)

Figure 1. XSLT used to render XML as HTML

An XSL transform is itself valid XML, usually stored in a file with the extension ".xsl" and commonly called an XSL script. Programs that execute XSL scripts against XML data are called XSLT processors (see 'Tools' below). Modern integrated development environments (IDEs) often contain a built-in XSLT processor to support rapid prototyping of transforms. For example, in IBM's WebSphere IDE you can view two XML files side by side and point and click to specify how the source data should map to the destination, and the IDE will create a transform for you.

What Can I Do With It?

Although XML is rapidly becoming the preferred mechanism for data sharing, the XML that one system produces might not be structured precisely the way all of its consumers expect, despite containing the needed information [1]. For example, David's article shows a C++ program that can read personal data like person.xml, shown below.

<?xml version="1.0"?>
<Person>
  <FirstName>Elvis</FirstName>
  <LastName>Presley</LastName>
  <DateOfBirth>
    <Year>1935</Year>
    <Month>01</Month>
    <Day>08</Day>
  </DateOfBirth>
</Person>

Assume today you build and debug a parser to read person.xml. What if tomorrow you're confronted with the need to read personal information from a new source which structures its data slightly differently? For example, consider flatperson.xml, which contains the same information as person.xml but more concisely [2]:

<?xml version="1.0"?>
<flatperson firstname="Elvis"
            surname="Presley" dob="19350108" />

Often, as I hope to show here, an XSLT script can save you having to touch your parser code by adapting the new data:

$ xsltproc expand.xsl flatperson.xml
<?xml version="1.0"?>
<Person>
  <FirstName>Elvis</FirstName>
  <LastName>Presley</LastName>
  <DateOfBirth>
    <Year>1935</Year>
    <Month>01</Month>
    <Day>08</Day>
  </DateOfBirth>
</Person>

This article guides the novice, in stages, through constructing scripts like expand.xsl (and its converse, flatten.xsl), which convert between a flatperson and a Person.

Naturally, if your system produces output for others, you can be sure that the XML structure it uses internally will not perfectly match the format expected by each potential consumer. Here, a set of XSLT scripts can adapt that internal representation for each consumer. The classic example of this is a web portal that transforms XML into browser- or device-specific markup as a final stage of request processing. (Despite the ECMAScript standard, browsers still have their quirks in the way they handle web content.)

Tools

XSLT processors, like XML parsers, are available in many flavours. They typically provide an API allowing you to incorporate a transformation capability into a program, as well as a command line wrapper for experimenting. By default, a processor will interpret scripts (for rapid prototyping), but can also be told to precompile them for performance. I often use the Cygwin environment, which includes GNOME's libxslt and its xsltproc command (which we have seen in action above). My JDK installation includes the Apache Xalan [3] processor, so I could achieve the same effect thus:

$ java org.apache.xalan.xslt.Process -XSL expand.xsl -IN flatperson.xml

See Resources at the end for pointers to where you can download these. I use xsltproc in this article for brevity.

How Does a Transform Work?

An XSL script has a top-level xsl:transform element typically containing a number of 'functions' that you write, having the job of operating on some part of the input XML to produce some part of the output data. These 'functions' are embodied in xsl:template elements and can be written declaratively and/or imperatively, depending on how you want them to be invoked. By default a template will just copy any text it contains to the output when called; however templates also have at their disposal rich data manipulation and control flow constructs similar to those lurking in your favourite programming languages.

The XSLT processor treats the input XML data as a tree of nodes, each of which has an associated path (a route, by name, down the tree to that node). The processor starts at the top of this tree (path: "/") then searches the transform script for a template matching the current path (via the template's match="..." attribute).

If the processor finds a matching template, it invokes it. What happens next depends on how the template is coded. E.g. it could imperatively call other templates as part of its processing, just like a C function can call other functions.

If no user-supplied template matches a node, a built-in "default template" is applied instead. For the root node and for elements it simply recurses into the children; for text nodes it copies the text to the output. The net effect is that all the text in the document is printed, as happens when transforming person.xml using an 'empty' XSL transform (one containing no user templates):

$ xsltproc empty.xsl person.xml
  Elvis
  Presley
    1935
    01
    08
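To make the default behaviour less mysterious, here is a sketch of a script that spells out the built-in template rules defined by the XSLT 1.0 specification. It should behave the same as an 'empty' transform. The article does not show empty.xsl, so the xsl:output line is an assumption, chosen so that only plain text is emitted.

```xml
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Assumption: plain text output, so no XML header is emitted -->
  <xsl:output method="text"/>

  <!-- Built-in rule 1: for the root and for any element, just process the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Built-in rule 2: for text and attribute nodes, copy the string value to the output -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Built-in rule 3: comments and processing instructions produce nothing -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:transform>
```

Note that rule 1 never selects attributes, so although rule 2 could copy them, attribute values do not appear in the output unless a template asks for them explicitly.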

Hello...

Time for a concrete example [4]. This is the hello.xsl script, containing one simple template.

<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/"> Hello World
</xsl:template>
</xsl:transform>

Don't worry too much about the packaging; focus on the single xsl:template element. You'll often see scripts with a template declared to match "/" whose job is simply to invoke other templates in the desired order, a role akin to a main() function. Our template is less ambitious, doing its work with no help.

Two things to note are: (1) the match attribute declares that the template is to be called in the context of the root input node, and (2) the template ignores the input data and just outputs the greeting we've all grown to love, with a nod to Kernighan & Ritchie.

This shows hello.xsl being applied to person.xml on the command line.

$ xsltproc hello.xsl person.xml 
Hello World

As you can see, the contents of person.xml don't figure in the output - all we see is the greeting.

So, we've seen one script blindly copy all text from its input to the output and another ignoring its input altogether. Time for something with a bit more intent behind it.

Personal Hello

When manipulating XML you generally want your templates to read parts of the input structure and create output with a structure more suitable for your purposes. Let me introduce the xsl:value-of tag, which is a bit like the SQL select statement, as it returns the value of some specified aspect of the data. You specify what you want to retrieve, in its select attribute, with a path (i.e. a particular route to a particular node in the input tree).

For example, in our hello template we could query the Person's first name with the path /Person/FirstName. To upgrade hello.xsl so it greets the right person on first name terms, replace the text World with the tag <xsl:value-of select="/Person/FirstName"/>.

This shows the resulting script personal.xsl and its effect on our Person data.

$ cat personal.xsl
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
  <xsl:output method="text"/>
  <xsl:template match="/" >
    Hello <xsl:value-of select="/Person/FirstName"/>
  </xsl:template>
</xsl:transform>

$ xsltproc personal.xsl person.xml
Hello Elvis
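The main()-like role of a root template, and the difference between declarative and imperative invocation, can be sketched with the tags seen so far plus three that this article does not otherwise cover: xsl:apply-templates, xsl:call-template and xsl:text. The template name "newline" is purely illustrative.

```xml
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>

  <!-- Root template: like main(), it decides what gets processed and in which order -->
  <xsl:template match="/">
    <xsl:apply-templates select="/Person/LastName"/>
    <xsl:text>, </xsl:text>
    <xsl:apply-templates select="/Person/FirstName"/>
    <xsl:call-template name="newline"/>
  </xsl:template>

  <!-- Declarative: chosen by the processor whenever a FirstName or LastName element is processed -->
  <xsl:template match="FirstName | LastName">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Imperative: a named template (hypothetical name), invoked explicitly via xsl:call-template -->
  <xsl:template name="newline">
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:transform>
```

Applied to person.xml, this should print "Presley, Elvis". Wrapping literal text in xsl:text keeps the indentation of the script itself out of the output.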

Expand an Attribute

We're nearing the original goal of converting flatperson data to Person data (for which, hypothetically, we have a pre-existing parser). Two aspects are new: a flatperson holds its information in attributes (like firstname), which the 'expanding' transform must read; and the output must contain XML elements (like FirstName) encapsulating this information, rather than plain text like "Hello ...".

We can reference an attribute (in xsl:value-of and elsewhere) using @<AttributeName> in a path. For example, a flatperson's first name has the path "/flatperson/@firstname". (This has a modicum of charm, I confess.)

Creating XML structure in the output is achieved by simply embedding XML tags in the template body.

This shows the script expandfirst.xsl and how it partly reconstitutes a flatperson to a Person (for brevity, it only reconstitutes the first name).

$ cat expandfirst.xsl
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
  <xsl:output indent="yes" />
  <xsl:template match="/" >
    <Person>
      <FirstName>
        <xsl:value-of select="/flatperson/@firstname"/>
      </FirstName>
    </Person>
  </xsl:template>
</xsl:transform>

$ xsltproc expandfirst.xsl flatperson.xml
<?xml version="1.0"?>
<Person>
  <FirstName>Elvis</FirstName>
</Person>

(Note the xsl:output tag tells the processor what kind of output is expected. Without the method="text" attribute, the processor defaults to emitting an <?xml..?> header. Without the indent attribute, redundant whitespace like newlines would not be added, leading to slightly leaner output at the expense of readability.)

Flatten an Element

For completeness, let's look at the reverse direction. The xsl:attribute tag can inject an attribute into an output XML element. To illustrate, here is flattenfirst.xsl, the 'inverse' of expandfirst.xsl. It takes a Person as input and outputs a minimal flatperson containing just the first name:

$ cat flattenfirst.xsl
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
  <xsl:template match="/" >
    <flatperson>
      <xsl:attribute name="firstname">
        <xsl:value-of select="/Person/FirstName"/>
      </xsl:attribute>
    </flatperson>
  </xsl:template>
</xsl:transform>

$ xsltproc flattenfirst.xsl person.xml
<?xml version="1.0"?>
<flatperson firstname="Elvis"/>

(Note that all xsl:attribute tags must precede other content for an element. You might see why if you consider where attributes would end up in the output in relation to, say, sub-elements.)
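When the attribute name is fixed, XSLT 1.0 also offers a shorthand known as an attribute value template. The following sketch is an alternative to flattenfirst.xsl, not a replacement for the approach above, and should produce the same output for person.xml.

```xml
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <!-- The braces mark an attribute value template: the path inside them is
         evaluated and its string value becomes the value of the attribute -->
    <flatperson firstname="{/Person/FirstName}"/>
  </xsl:template>
</xsl:transform>
```

The xsl:attribute tag remains the tool to reach for when an attribute's name has to be computed, or when the attribute should only be added under some condition.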

Completing the Scripts

The path "/flatperson/@dob" refers to the dob attribute of a flatperson, whose value (for Elvis) is "19350108". The function substring() can pick out individual parts, so we can populate the Year, Month and Day elements of a reconstituted Person. Character positions count from 1, so this tag extracts the YYYY digits of a flatperson's dob attribute by taking four characters starting at the first:

<xsl:value-of select="substring(/flatperson/@dob,1,4)"/>

Armed with this, the hands-on reader is encouraged to upgrade expandfirst.xsl into the script expand.xsl, which reconstitutes the whole Person from a flatperson.
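For readers who would like to check their work, here is one possible completion. It is a sketch rather than the definitive expand.xsl, and it assumes dob always holds exactly eight digits in YYYYMMDD order.

```xml
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output indent="yes"/>
  <xsl:template match="/">
    <Person>
      <FirstName><xsl:value-of select="/flatperson/@firstname"/></FirstName>
      <LastName><xsl:value-of select="/flatperson/@surname"/></LastName>
      <DateOfBirth>
        <!-- substring(string, start, length): positions count from 1 -->
        <Year><xsl:value-of select="substring(/flatperson/@dob,1,4)"/></Year>
        <Month><xsl:value-of select="substring(/flatperson/@dob,5,2)"/></Month>
        <Day><xsl:value-of select="substring(/flatperson/@dob,7,2)"/></Day>
      </DateOfBirth>
    </Person>
  </xsl:template>
</xsl:transform>
```

Running it against flatperson.xml should reproduce the Person shown near the start of the article, with Month and Day keeping their leading zeros because substring() returns text rather than numbers.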

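XPath 1.0, the expression language used by XSLT 1.0, additionally provides a useful set of string-related functions such as concat(), substring-before() and translate(). Regular expressions arrive only with the XPath 2.0 function library (matches(), replace(), tokenize()), which needs an XSLT 2.0 processor; xsltproc implements XSLT 1.0. For details see the XQuery and XPath specifications (Resources).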
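As a taste of the string functions that an XSLT 1.0 processor such as xsltproc does support, the following sketch applies three of them to flatperson.xml. The xsl:variable tag, not otherwise covered in this article, simply names a value for reuse; the variable name "dob" is illustrative.

```xml
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:variable name="dob" select="/flatperson/@dob"/>

    <!-- concat() joins its arguments into one string: 1935-01-08 -->
    <xsl:value-of select="concat(substring($dob,1,4), '-', substring($dob,5,2), '-', substring($dob,7,2))"/>
    <xsl:text>&#10;</xsl:text>

    <!-- translate() maps characters one for one, here lower case to upper case: PRESLEY -->
    <xsl:value-of select="translate(/flatperson/@surname, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- string-length() counts characters: 5 -->
    <xsl:value-of select="string-length(/flatperson/@firstname)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:transform>
```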

Now, by default, the XSL processor merges a sequence of text items into a single text item when creating an output node, so the following would be one way to concatenate a Person's DateOfBirth sub-elements, for readers wishing to complete flattenfirst.xsl:

<xsl:attribute name="dob">
  <xsl:value-of select="/Person/DateOfBirth/Year"/>
  <xsl:value-of select="/Person/DateOfBirth/Month"/>
  <xsl:value-of select="/Person/DateOfBirth/Day"/>
</xsl:attribute>
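Putting the pieces together, a complete flatten.xsl might look like the sketch below. It assumes, as in person.xml, that Month and Day are already zero-padded to two digits.

```xml
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <flatperson>
      <xsl:attribute name="firstname">
        <xsl:value-of select="/Person/FirstName"/>
      </xsl:attribute>
      <xsl:attribute name="surname">
        <xsl:value-of select="/Person/LastName"/>
      </xsl:attribute>
      <!-- The three values are merged into a single attribute value: 19350108 -->
      <xsl:attribute name="dob">
        <xsl:value-of select="/Person/DateOfBirth/Year"/>
        <xsl:value-of select="/Person/DateOfBirth/Month"/>
        <xsl:value-of select="/Person/DateOfBirth/Day"/>
      </xsl:attribute>
    </flatperson>
  </xsl:template>
</xsl:transform>
```

If a source might supply a Month of "1" rather than "01", the XSLT 1.0 function format-number(), with a picture string such as '00', is one way to restore the padding.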

Conclusion

Of course, you can parse XML and navigate/manipulate the resulting DOM tree using various languages. However, XSLT was specifically designed to transform XML, so it supports working at a higher level than SAX or DOM. Though transforming then parsing can be slower than one-step parsing with a new parser, building and debugging that new parser will often be overkill as a first port of call when a simple XSLT script lets you reuse an existing parser. Once you're happy with a script, you would typically dispense with the command line interpreter in favour of programmatically invoking a precompiled version of your script from your application.

I hope I have helped curious readers in their first few steps with XSLT, with simple but self-contained examples, and shown how relatively painlessly it can adapt XML data for a pre-existing parser.

Acknowledgements

LOTS of people kindly read drafts! My thanks in particular to Frederek Althoff, Phil Bass, Dr Islam Choudhury, Dr Trevor Hopkins and Dirk Laessig for helpful feedback.

References & Resources

David Nash, "Combining the STL with SAX and Xpath for Effective XML Parsing", C Vu Volume 15 No 5 (October 2003, pp.18-20)

Ivan Kiselev, Aspect Oriented Programming with AspectJ , SAMS Publishing.

Kernighan & Ritchie, The C Programming Language, Prentice-Hall (The archetypal use of "Hello World" to introduce a programming language.)

XSLT specification on the W3C site: http://tinyurl.com/2ewsm

XPath/XQuery functions/operators: http://tinyurl.com/2puva

Gnome project's libxslt: http://xmlsoft.org/XSLT

Windows binary distribution libxslt: http://tinyurl.com/2p9yt

Xalan-c at Apache website: http://xml.apache.org/xalan-c

Sun have a Web Services tutorial with a good intro to XML and XSLT: http://tinyurl.com/2572z

The newsgroup comp.text.xml is full of helpful stuff.

Check http://cocoon.apache.org for an approach that heavily uses XSLT for multi channel user interfaces.

[1] See http://www.oasis-open.org for an effort underway to change this situation.

[2] Ivan Kiselev's soother for people (like me) who find XML configuration files overly verbose for name-value pairs: "..any design decision is a compromise, some like it hot and nobody's perfect".

[3] In JDK 1.4 the Xalan classes live in <JAVA_HOME>\jre\lib\rt.jar. In case of classpath woes, try adding this jar to your classpath. If your JDK precedes 1.4 you can download the Xalan classes (see Resources).

[4] I recommend installing one of the free XSLT processors and trying out the examples.