XML is big business. Although it's far too early to declare whether it really is the killer content format, its popularity, coupled with increasing support from industry heavyweights, ensures it will be around for a good while yet. If you're interested in what XML can do for you, then read on. This is the first of three articles on XML in applications, and introduces SAX, the Simple API for XML. The remaining articles will introduce DOM, the Document Object Model, and XSL, the Extensible Stylesheet Language.
There are two main ways to process XML for use in an application. The first is an event-based approach, with handler methods being fired in response to certain parsing events (for example, a start element, some data, an error). The second approach is to build an internal tree representation of the XML document in order to query or traverse it.
The standard for the first approach is called SAX, the Simple API for XML. The standard for the second approach is called DOM (Document Object Model) level 1, and is a W3C recommendation. This article will describe SAX, how it came about, and how to use it to parse your XML documents.
A brief history of SAX
The first XML parsers began appearing in early 1997. These early applications mainly displayed XML documents as tree views. In late 1997 on the XML-Dev mailing list, Peter Murray-Rust (author of the JUMBO application for viewing CML (Chemical Markup Language) documents) insisted that parser writers should all support a common Java event-based API. In discussions with Tim Bray (author of the Lark parser) and David Megginson (author of Microstar's Ælfred parser), the idea for SAX was born. The design discussion took place publicly on the XML-Dev mailing list, and many people contributed ideas, comments, and criticisms. The first draft of the SAX interfaces was released in January 1998, and SAX 1.0 followed shortly afterwards, in June of 1998.
A SAX compliant XML parser reports parsing events to the application using callbacks on an interface implemented by the handler class. This isolation of reporting from processing logic enables the same SAX parser to be used with different handlers for different purposes (e.g. validation, display, data import). XML parser implementations using SAX have been written in Java, Python, Perl and C++. Sun, IBM, Oracle and DataChannel/Microsoft have all produced Java XML parsers with SAX 1.0 drivers.
SAX 1.0 consists of two Java packages, org.xml.sax and org.xml.sax.helpers.
org.xml.sax interfaces
Six interfaces are defined in org.xml.sax. The following four are the most helpful.
- DocumentHandler
This is the main interface that most SAX applications implement: if the application needs to be informed of basic parsing events, it implements this interface and registers an instance with the SAX parser using the setDocumentHandler method. The parser uses the instance to report basic document-related events like the start and end of elements and character data.
- ErrorHandler
If a SAX application needs to implement customised error handling, it must implement this interface and then register an instance with the SAX parser using the parser's setErrorHandler method. The parser will then report all errors and warnings through this interface.
- DTDHandler
If a SAX application needs information about notations and unparsed entities, then the application implements this interface and registers an instance with the SAX parser using the parser's setDTDHandler method. The parser uses the instance to report notation and unparsed entity declarations to the application.
- Parser
All SAX parsers must implement this basic interface: it allows applications to register handlers for different types of events and to initiate a parse from a URI, or a character stream.
All SAX parsers must also implement a zero-argument constructor (though other constructors are also allowed).
SAX parsers are reusable but not re-entrant: the application may reuse a parser object (possibly with a different input source) once the first parse has completed successfully, but it may not invoke the parse() methods recursively within a parse.
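As an illustration of the ErrorHandler interface described above, here is a minimal sketch of a handler that collects problems instead of printing them. The class names CollectingErrorHandler and ErrorHandlerDemo are made up for this example, and the parser is obtained through JAXP's SAXParserFactory from the JDK, which is an assumption of the sketch and not something SAX 1.0 itself defines. Current JDKs mark the SAX 1.0 Parser interface as deprecated, but it is still present.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler: records every problem the parser reports.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    public void warning(SAXParseException e) { record("warning", e); }

    public void error(SAXParseException e) { record("error", e); }

    // A fatal error means the document is not well formed, so stop the parse.
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e;
    }

    private void record(String level, SAXParseException e) {
        problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }
}

class ErrorHandlerDemo {
    // Parses the given text and returns whatever the parser reported.
    static List<String> check(String xml) throws Exception {
        // Assumption: the JDK's bundled parser, obtained through JAXP.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        parser.setErrorHandler(handler);
        try {
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError(); nothing more to do here.
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        // The <author> element is never closed, so this is not well formed.
        System.out.println(check("<recipe><author>Steve</recipe>"));
    }
}
```

Rethrowing from fatalError() is a design choice: it ends the parse at the first fatal problem, while warnings and recoverable errors are simply collected.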
org.xml.sax classes
- HandlerBase
This class implements the default behaviour for four SAX interfaces: EntityResolver, DTDHandler, DocumentHandler, and ErrorHandler.
Application writers can extend this class when they need to implement only part of an interface; parser writers can instantiate this class to provide default handlers when the application has not supplied its own.
- InputSource
This class allows a SAX application to encapsulate information about an input source in a single object, which may include a public identifier, a system identifier, a byte stream (possibly with a specified encoding), and/or a character stream.
There are two places that the application will deliver this input source to the parser: as the argument to the Parser.parse method, or as the return value of the EntityResolver.resolveEntity method.
The org.xml.sax package also defines two exceptions for use with SAX applications: SAXException and SAXParseException.
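The two most common ways of filling in an InputSource, as described above, might look like the following sketch. The class name InputSourceDemo, its method names, and the file URI are all hypothetical; only the InputSource constructors and setters come from SAX.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

class InputSourceDemo {
    // Character stream: the application has already decoded the text.
    static InputSource fromText(String xml, String systemId) {
        InputSource source = new InputSource(new StringReader(xml));
        // The system identifier lets the parser resolve relative URIs
        // and gives error messages something to point at.
        source.setSystemId(systemId);
        return source;
    }

    // Byte stream: the parser decodes it, using the encoding we name.
    static InputSource fromBytes(byte[] data, String encoding) {
        InputSource source = new InputSource(new ByteArrayInputStream(data));
        source.setEncoding(encoding);
        return source;
    }

    public static void main(String[] args) {
        // Hypothetical location, used only for illustration.
        InputSource a = fromText("<recipe/>", "file:///recipes/stew.xml");
        InputSource b = fromBytes("<recipe/>".getBytes(StandardCharsets.UTF_8), "UTF-8");
        System.out.println(a.getSystemId() + " " + b.getEncoding());
    }
}
```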
Using SAX to get what you want
There is little point in writing your own XML parser when there are so many out there already. It's far more productive to reap the benefits of someone else's hard labour! Here's how you use an existing parser:
- Create an instance of the parser object.
- Register your handler with the parser.
- Wrap your input with an InputSource object.
- Pass the InputSource to the parse() method of the parser.
As an example, here is how to achieve the above using the IBM XML parser, XML4J.
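import org.xml.sax.*;
import com.ibm.xml.parsers.*;   // SAXParser
import java.io.*;

public void ImportXML(File inputFile) throws SAXException, IOException {
    // create an instance of the parser
    SAXParser parser = new SAXParser();

    // register the handler
    EchoHandler eHandler = new EchoHandler();
    parser.setDocumentHandler(eHandler);

    // wrap the input in an InputSource
    InputSource iStream = new InputSource(new FileInputStream(inputFile));

    // start the parse; events are reported to eHandler
    parser.parse(iStream);
}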
Listing 1 - Using the IBM XML parser with a SAX compliant handler.
Simple, isn't it?
In the above example, EchoHandler is a handler I wrote to echo the input file to the standard output. Here is the implementation of it:
// EchoHandler.java - a SAX handler for echoing back input XML
package org.accu.cornish.xml;

import org.xml.sax.*;

public class EchoHandler extends HandlerBase {
    // indentation printed once per nesting level
    protected final String spaces = " ";
    // current nesting depth
    protected int numspaces = 0;

    public EchoHandler() {
    }

    // print the indentation for the current depth
    private void spaces() {
        for (int i = 0; i < numspaces; ++i) {
            System.out.print(spaces);
        }
    }

    public void startElement(String name, AttributeList atts)
            throws org.xml.sax.SAXException {
        spaces();
        System.out.println("<" + name + ">");
        ++numspaces;
    }

    public void endElement(String name)
            throws org.xml.sax.SAXException {
        --numspaces;
        spaces();
        System.out.println("</" + name + ">");
    }

    // the text is the slice ch[start] .. ch[start + length - 1]
    public void characters(char[] ch, int start, int length)
            throws org.xml.sax.SAXException {
        spaces();
        System.out.println(new String(ch, start, length));
    }
}
Listing 2 - EchoHandler.java
The SAX API provides a class called HandlerBase that implements all the handler interfaces, with do-nothing versions of nearly all the methods (the notable exception is fatalError(), whose default implementation throws the exception it is given). Since the EchoHandler only needs to override a small part of the four interfaces, I have derived EchoHandler from HandlerBase. This handler only implements a subset of the org.xml.sax.DocumentHandler interface, but it's enough to demonstrate how to use SAX compliant parsers.
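If XML4J is not to hand, the same four steps can be tried with the parser bundled in the JDK. The sketch below is one way to do that, under a few assumptions: the parser comes from JAXP's SAXParserFactory (which is not part of SAX 1.0), and ElementLister and FourSteps are names invented for the example. HandlerBase and Parser are marked deprecated in current JDKs but are still available.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical handler: overrides one callback, inherits the rest from HandlerBase.
class ElementLister extends HandlerBase {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList atts) {
        names.add(name);
    }
}

class FourSteps {
    // Returns the element names in the order their start tags are reported.
    static List<String> listElements(String xml) throws Exception {
        // 1. Create the parser (assumption: JAXP stands in for "new SAXParser()").
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        // 2. Register the handler.
        ElementLister handler = new ElementLister();
        parser.setDocumentHandler(handler);
        // 3. Wrap the input in an InputSource.
        InputSource input = new InputSource(new StringReader(xml));
        // 4. Parse; the handler is called back as elements are encountered.
        parser.parse(input);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(listElements("<recipe><author>Steve</author><item>leeks</item></recipe>"));
    }
}
```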
A better example
In the following example, we will be parsing a simple XML document which just contains tags (elements) and data; the elements have no attributes. We use a Document Type Definition (DTD) to define a set of rules about the XML structure. Here is the DTD our example XML has to conform to:
<!ELEMENT recipe (recipe_name, author, meal, preptime, cooktime, ingredients, directions)>
<!ELEMENT ingredients (item)+>
<!ELEMENT meal (#PCDATA | course)*>
<!ELEMENT recipe_name (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT course (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ELEMENT directions (#PCDATA)>
<!ELEMENT preptime (#PCDATA)>
<!ELEMENT cooktime (#PCDATA)>
Listing 3 - recipe.dtd
What does the DTD tell us?
Line 1 says our root element is called "recipe", and has seven sub-elements, all compulsory, which must appear in the order given.
Line 2 says the "ingredients" element has one or more "item" elements.
Line 3 says the "meal" element has mixed content: it may have some text, and it may have "course" sub-elements. (XML only allows text and elements to be mixed in this (#PCDATA | name)* form, so the DTD cannot limit "meal" to a single "course".)
The other lines show that the remaining elements contain only text.
An example of XML conforming to this DTD is shown below:
<?xml version="1.0"?>
<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe>
  <recipe_name>Thick Veg Stew</recipe_name>
  <author>Steve Cornish</author>
  <meal>Dinner
    <course>Main</course>
  </meal>
  <preptime>15 minutes</preptime>
  <cooktime>30 minutes</cooktime>
  <ingredients>
    <item>2 carrots</item>
    <item>2 parsnips</item>
    <item>2 leeks…</item>
  </ingredients>
  <directions>Chop the vegetables into large discs, etc. ... </directions>
</recipe>
Listing 4 - VegetableStew.xml
The first line of VegetableStew.xml is the XML declaration. It is optional but recommended, and states that the document conforms to the XML 1.0 standard (see www.w3.org/xml). The second line declares that the rules for our "recipe" element can be found in the file "recipe.dtd".
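A parser only checks a document against its DTD if it is a validating parser and validation is switched on. The following sketch shows one way to see this in action. It makes several assumptions: the JDK's parser is obtained through JAXP with setValidating(true), the DTD is a cut-down two-element version placed inline in the DOCTYPE so the example needs no files, and the class name DtdCheck is invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXParseException;

class DtdCheck {
    // Cut-down, hypothetical DTD held in the document's internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE recipe [\n"
        + "<!ELEMENT recipe (recipe_name, author)>\n"
        + "<!ELEMENT recipe_name (#PCDATA)>\n"
        + "<!ELEMENT author (#PCDATA)>\n"
        + "]>\n";

    // Returns the validity errors reported for the given document body.
    static List<String> validate(String body) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);   // ask for DTD validation
        Parser parser = factory.newSAXParser().getParser();

        final List<String> errors = new ArrayList<>();
        // Validity problems arrive through ErrorHandler.error().
        parser.setErrorHandler(new HandlerBase() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage());
            }
        });
        parser.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Elements in the declared order: no errors expected.
        System.out.println(validate(
            "<recipe><recipe_name>Stew</recipe_name><author>Steve</author></recipe>"));
        // Same elements in the wrong order: the parser should complain.
        System.out.println(validate(
            "<recipe><author>Steve</author><recipe_name>Stew</recipe_name></recipe>"));
    }
}
```

Note that a document can be well formed and still fail validation: the second call parses to completion, but the element order does not match the content model.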
Our handler has to be able to extract the data from the recipe, and populate a Java Recipe object. The public and package interface for the Recipe class is shown here:
package org.accu.cornish.xml;

import java.util.Vector;
import java.io.*;

public class Recipe {
    public Recipe() { /* … */ }

    public void setName(String name) { /* … */ }
    public void setAuthor(String author) { /* … */ }
    public void setPreparationTime(String time) { /* … */ }
    public void setCookingTime(String time) { /* … */ }
    public void setDirections(String directions) { /* … */ }
    public void setMeal(String meal) { /* … */ }
    public void setCourse(String course) { /* … */ }
    public void addIngredients(String name) { /* … */ }

    public String toString() { /* … */ }

    void printSelfAsXML() { /* print self as XML */ }
}
Listing 5 - Recipe.java
Note that printSelfAsXML() has no visibility modifier - this means it is visible to the package org.accu.cornish.xml. This is fine by me since I want my other classes to be able to use this method for diagnostic purposes.
Now to write the handler. I think a good strategy is to run through the source file, and store all the tag data in a HashMap. Then after parsing, we can request the constructed Recipe object from the handler, and the handler can create it on demand.
package org.accu.cornish.xml;

import org.xml.sax.*;
import java.util.HashMap;
import java.util.Stack;

public class RecipePopulator extends HandlerBase {
    protected HashMap properties;   // tag name -> text content
    protected Stack tagStack;       // names of the elements currently open
    private String currentTag;      // innermost open element, or null
    private int item_suffix = 0;    // numbers the "item" keys: item1, item2, ...

    public RecipePopulator() {
        properties = new HashMap();
        tagStack = new Stack();
    }

    public void startElement(String name, AttributeList atts)
            throws SAXException {
        currentTag = (String) tagStack.push(name);
        if (name.equals("item")) {
            ++item_suffix;
        }
    }

    public void endElement(String name) throws SAXException {
        if (name.equals("ingredients")) {
            item_suffix = 0;
        }
        if (tagStack.empty()) {
            throw new SAXException("End tag without start tag: " + name);
        }
        tagStack.pop();
        // the enclosing element, if any, becomes the current tag again
        currentTag = tagStack.empty() ? null : (String) tagStack.peek();
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // extract string and strip whitespace; runs of pure whitespace
        // between elements carry no data, so ignore them
        String data = new String(ch, start, length).trim();
        if (data.length() == 0) {
            return;
        }
        // do we have a current tag?
        if (currentTag == null) {
            throw new SAXException("Data with no element");
        }
        String keyname = currentTag;
        // if the currentTag is "item", add a unique suffix
        if (currentTag.equals("item")) {
            keyname += item_suffix;
        }
        // note: a parser may deliver one run of text in several calls;
        // this simple handler keeps only the most recent piece
        properties.put(keyname, data);
    }

    public Recipe getRecipe() throws InstantiationException {
        /* create and populate Recipe object */
    }
}
Listing 6 - RecipePopulator.java
The RecipePopulator class maintains two collections: a HashMap of tag and data pairs, and a Stack of the tag names. Both HashMap and Stack are defined in java.util. Because the "ingredients" tag can have many "item" tags, an index has to be suffixed to the key to prevent overwriting the previous items.
The methods startElement() and endElement() maintain the value of the current tag and any suffix values for the "item" tags.
The method characters() does the work of putting the key / value pairs into the HashMap.
Our getRecipe() method has to check that the compulsory fields of the target Recipe object exist. If they don't, we throw an InstantiationException (java.lang). If they do, we can get on with the work of creating the Recipe object.
public Recipe getRecipe() throws InstantiationException {
    // check the compulsory fields
    String author = (String) properties.get("author");
    String name = (String) properties.get("recipe_name");
    String prepTime = (String) properties.get("preptime");
    String cookTime = (String) properties.get("cooktime");
    String directions = (String) properties.get("directions");

    if (author == null || name == null || prepTime == null
            || cookTime == null || directions == null) {
        throw new InstantiationException(
            "Cannot create Recipe object" + " - missing elements");
    }

    // otherwise, we can carry on
    Recipe r = new Recipe();
    r.setAuthor(author);
    r.setName(name);
    r.setMeal((String) properties.get("meal"));
    r.setCourse((String) properties.get("course"));
    r.setPreparationTime(prepTime);
    r.setCookingTime(cookTime);
    r.setDirections(directions);

    // now, add the ingredients: item1, item2, ... until one is missing
    int item_index = 1;
    String ingredient = null;
    while ((ingredient = (String) properties.get("item" + item_index)) != null) {
        ++item_index;
        r.addIngredients(ingredient);
    }
    return r;
}
Listing 7 - RecipePopulator.getRecipe()
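To experiment with the technique without the Recipe class, the condensed sketch below collects tag data into a map and numbers the "item" keys in the same way. It is an illustration under assumptions, not the handler from the listings: TagCollector is an invented name, it uses java.util.ArrayDeque and LinkedHashMap where the article uses Stack and HashMap, and the parser is the JDK's own, obtained through JAXP.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical, cut-down cousin of RecipePopulator.
class TagCollector extends HandlerBase {
    final Map<String, String> properties = new LinkedHashMap<>();
    private final Deque<String> open = new ArrayDeque<>();  // currently open elements
    private int itemSuffix = 0;

    @Override
    public void startElement(String name, AttributeList atts) {
        open.push(name);
        if (name.equals("item")) {
            ++itemSuffix;
        }
    }

    @Override
    public void endElement(String name) {
        open.pop();   // the enclosing element is now on top again
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String data = new String(ch, start, length).trim();
        if (data.isEmpty() || open.isEmpty()) {
            return;   // whitespace between elements carries no data
        }
        String key = open.peek();
        if (key.equals("item")) {
            key += itemSuffix;   // item1, item2, ...
        }
        properties.put(key, data);
    }

    // Parses the text and returns the collected tag/data pairs.
    static Map<String, String> collect(String xml) throws Exception {
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        TagCollector handler = new TagCollector();
        parser.setDocumentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.properties;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect(
            "<recipe><recipe_name>Stew</recipe_name>"
            + "<ingredients><item>2 carrots</item><item>2 leeks</item></ingredients>"
            + "</recipe>"));
    }
}
```

Reading the items back is then a matter of asking for "item1", "item2" and so on until a key is missing, exactly as getRecipe() does.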
Although this is a highly trivial example (the elements have no attributes), it is not hard to see that a handler could be written to populate data objects belonging to an existing application. For example, what if the Recipe class above belonged to a recipe catalogue application we wrote? Imagine that the only way to enter new recipes was to fill out a GUI form by hand. Using the steps above, we can easily provide for import of new recipes using XML.
SAX is a very simple API (hence the name), but its simplicity is also its strength. SAX parsers are best suited to processing XML documents that only need to be read, and only need to be read once. In the next article, I will offer a different approach to parsing XML: using DOM, the Document Object Model.
Free SAX parsers
There are a number of free XML Parsers that support the SAX 1.0 interface. Here are the main players:
- XML4J (IBM)
- Ælfred (Microstar)
- Java Project X (Sun): http://developer.java.sun.com/developer/earlyAccess/xml/index.html
- XML Parser for Java 2 (Oracle)
- XP (James Clark)
Further References
David Megginson's SAX site - http://www.megginson.com/SAX
SAX online API - http://www.megginson.com/SAX/javadoc/packages.html
The World Wide Web Consortium - http://www.w3.org/
XML-Dev mailing list - xml-dev@ic.ac.uk
A good XML site from Seybold and O'Reilly - www.xml.com