ACCU Mentored Developers XML Project

By Paul Grenyer, Jez Higgins

Overload, 12(62):, August 2004


This article was originally written in December 2002 as part of the ACCU Mentored Developers [ MDevelopers ] XML [ XMLRec ] project. It has now been revised, with considerable help from Jez Higgins, for publication in Overload.

The first exercise set for the project students by the project mentors was as follows:

Incorporate either the Xerces[ Xerces ] or Microsoft XML[ MSXML ] parsers into a C++ project and use it to:

  1. Parse XML strings and files.

  2. Output the element structure as an indented tree.

As most of my development experience has been on Windows I followed the MSXML route.

Downloading and Installing MSXML

The MSXML parser can be downloaded from the Microsoft website. The latest version at the time of writing is version 4.0, which requires the latest Windows Installer; this is incorporated into Windows XP and comes with Windows 2000 Service Pack 3. The installer can also be downloaded as a single executable [ InstMsi ].

Assuming the latest Windows Installer is present on your system, installing MSXML is simply a case of running the installer package. As MSXML is Component Object Model (COM) based, this will register the MSXML dynamic link library (msxml4.dll). The installer also creates a directory containing all the files needed to use the parser in a C++ project.

An XML Mini-Glossary

Attributes

XML elements can have attributes. An attribute is a name-value pair attached to the element's start tag. Names are separated from their values by an equals sign, and values are enclosed in single or double quotes. Attribute order is not significant.

<bigbrain invented="SGML">Charles Goldfarb</bigbrain>

DOM

The Document Object Model is a W3C recommendation which defines an application programming interface for well-formed XML documents [ DOMRec ], defining the logical structure of documents and the way a document is accessed and manipulated. The DOM is defined in programming-language neutral terms. This leads to some slightly clumsy looking code, but that aside the DOM is widely used (if not necessarily well loved). Its in-memory representation makes it well suited to document editing, navigation and data retrieval applications.

DTD

Document Type Definition, the original XML schema language described in the XML recommendation. A Document Type Definition defines the legal building blocks of an XML document. It defines the document structure with a list of legal elements, each element's allowed content and so on.

Elements & Tags

Here's a tiny XML document:

<bigbrain>Charles Goldfarb</bigbrain>

It consists of a single element named bigbrain and the element's content, the text string Charles Goldfarb. The element is delimited by the start tag <bigbrain> and the end tag </bigbrain>.

Valid

Documents which conform to a particular XML application are said to be valid. In the early days of XML (all of five years ago) validity meant conforming to a DTD. With the development and widespread adoption of other schema languages, valid has come to mean valid to whatever schema you happen to be using.

Well-formed

Not all XML documents are valid (quite probably most are not), nor do they need to be. However, they are all well-formed. An XML document is well-formed if it satisfies the basic XML grammar - the elements are properly delimited, start and end tags match and so on. A document which is not well-formed is like a C++ program with a missing semi-colon: no good for anything.
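To make "start and end tags match" concrete, here is a toy check written in portable C++. It is only a sketch: the function name TagsBalance is invented for illustration, and it assumes a document containing nothing but plain start tags, end tags and text (no attributes, comments, empty-element tags or declarations). A real parser checks far more than this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy check (hypothetical helper, not part of any parser):
// returns true if every <name> is closed by a matching </name>
// in the right order. Attributes, comments and so on are NOT
// handled.
bool TagsBalance(const std::string& xml) {
  std::vector<std::string> open;  // stack of unclosed tag names
  std::size_t pos = 0;
  while ((pos = xml.find('<', pos)) != std::string::npos) {
    const std::size_t end = xml.find('>', pos);
    if (end == std::string::npos) {
      return false;               // tag never terminated
    }
    const std::string tag = xml.substr(pos + 1, end - pos - 1);
    if (!tag.empty() && tag[0] == '/') {
      // End tag: must match the most recently opened element.
      if (open.empty() || open.back() != tag.substr(1)) {
        return false;
      }
      open.pop_back();
    }
    else {
      open.push_back(tag);        // start tag
    }
    pos = end + 1;
  }
  return open.empty();            // nothing may be left open
}
```

For example, TagsBalance("<bigbrain>Charles Goldfarb</bigbrain>") is true, while overlapping elements such as "<a><b></a></b>" are rejected.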

XML Application

A set of XML elements and attributes for a particular purpose - for instance DocBook, SVG, WSDL, Open Office file format - is called an XML application. An XML application is often expressed in one of the many available schema languages - DTD, XML Schema, RelaxNG, Schematron, etc. An XML application is not an application which uses XML.

Testing MSXML

Although there are the usual Microsoft help files incorporated with MSXML there aren't any examples, so I used Google to try and find some and found the PerfectXML[ PerfectXML ] website. The website includes a number of MSXML C++ examples and one in particular, Using DOM [ UsingDOM ], that downloads an XML file from an Internet location, parses it, modifies it and writes it to the local hard disk. I used this example as a template for the following simple MSXML console application test program:

#include <iostream>
#include <string>
#include <windows.h>
#include <atlbase.h>
#import "msxml4.dll"
int main() {
  std::cout << "MSXML DOM: Simple Test 1: Creating"
     << " of COM object and parsing of XML.\n\n";
  ::CoInitialize(0);
  {
    MSXML2::IXMLDOMDocument2Ptr pXMLDoc = 0;
    // Create MSXML DOM object
    HRESULT hr = pXMLDoc.CreateInstance(
                   "Msxml2.DOMDocument.4.0");
    if (SUCCEEDED(hr)) {
      // Load the document synchronously
      pXMLDoc->async = false;
      _variant_t varLoadResult((bool)false);
      const std::string xmlFile("poem.xml");
      // Load the XML document
      varLoadResult = pXMLDoc->load(xmlFile.c_str());
      if(varLoadResult) {
        std::cout << "Successfully loaded XML "
                  << "file: " << xmlFile << "\n";
      }
      else {
        std::cout << "Failed to load XML file: " 
                  << xmlFile << "\n";
        // Get parseError interface
        MSXML2::IXMLDOMParseErrorPtr pError = 0;
        if(SUCCEEDED(pXMLDoc->get_parseError(
                                      &pError))) {
          USES_CONVERSION;
          std::cout << "Error: " 
                    << W2A(pError->reason) << "\n";
        }
      }
    }
    else {
      std::cout << "Failed to create MS XML COM "
                << "object.\n";
    }
  }
  ::CoUninitialize();
  return 0;
}

This program takes the following XML file and parses it:

<?xml version="1.0" encoding="UTF-8"?>
<poem>
  <line>Roses are red,</line> 
  <line>Violets are blue.</line> 
  <line>Sugar is sweet,</line> 
  <line>and I love you</line>
</poem>

If the parse fails an error message is written to std::cout giving the reason. Although this code snippet does the intended job, it is a bit rough and needs some work in order to achieve the objective of this exercise. Among other things it would benefit from wrapping of MSXML and some proper exception handling.

It is worth noting that #import is specific to Microsoft Visual C++ and is not supported by other Win32 compilers.

Engineering the Exercise Solution: Part 1

I'm going to look at the exercise solution in two parts. The first part will reengineer the PerfectXML example into a more general solution with a clean interface, proper runtime handling and exception handling. The second part will look at writing the element structure to a stream.

COM Runtime

As MSXML is COM based, the COM runtime must be started before any COM objects can be instantiated. The COM runtime is started by the CoInitializeEx API function and stopped with CoUninitialize. MSDN states that every call to CoInitializeEx must be matched by a call to CoUninitialize, even if CoInitializeEx fails.

CoUninitialize must not be called until all COM objects have been released. For instance in the example above there is an extra scope wrapping the MSXML code so that the IXMLDOMDocument2Ptr smart pointer destructor is called, destroying the DOM, before CoUninitialize is called.

The easiest way to achieve this, even in the presence of exceptions, is to take advantage of C++'s RAII (Resource Acquisition Is Initialization): place CoInitializeEx in the constructor of a class and CoUninitialize in the destructor, and create an instance of the class on the stack at the beginning of the program, before anything else. COMRuntimeInit, shown below, is just such a class. The copy constructor and copy-assignment operator are both private and undefined, to prevent copying. A COMRuntimeInit object has no state and therefore it does not make sense to copy it. This method of preventing copying and some more of the reasons behind it are discussed by Scott Meyers in Effective C++ [ ECpp ].

#include <stdexcept>
#include <string>
#include <windows.h>
class COMRuntimeInit {
public:
  COMRuntimeInit() {
    HRESULT hr = ::CoInitializeEx(0,
                         COINIT_APARTMENTTHREADED);
    if(FAILED(hr)) {
      UnInitialize();
      std::string errorMsg = "Failed to start COM "
                             "Runtime: ";
      switch(hr) {
        case E_INVALIDARG:
          errorMsg += "An invalid parameter was "
                      "passed to the returning "
                      "function.";
          break;
        case E_OUTOFMEMORY:
          errorMsg += "Out of memory.";
          break;
        case E_UNEXPECTED:
          errorMsg += "Unexpected error.";
          break;
        case S_FALSE:
          errorMsg += "The COM library is already "
                      "initialized on this "
                      "thread.";
          break;
        default:
          errorMsg += "Unknown.";
          break;
      }
      throw std::runtime_error(errorMsg);
    }
  }
  ~COMRuntimeInit() {
    UnInitialize();
  }
private:
  void UnInitialize() const {
    ::CoUninitialize();
  }
  COMRuntimeInit(const COMRuntimeInit&);
  COMRuntimeInit& operator=(const COMRuntimeInit&);
};

There are of course times when the initial call to CoInitializeEx may fail. The cause of the failure can be ascertained from its return value. The obvious way to communicate the cause of the failure to the user is via an exception. This has the drawback that the destructor will not be called when the constructor throws and therefore CoUninitialize must be called manually. For now std::runtime_error will be thrown when CoInitializeEx fails; later on we'll look at a custom exception type.
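The point about the destructor not running can be demonstrated without COM. The sketch below is an illustration only: FakeInitialize, FakeUninitialize, RuntimeGuard and FailedConstructionLog are invented names, and the two "Fake" functions simply record that they were called in place of the real CoInitializeEx and CoUninitialize.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for CoInitializeEx / CoUninitialize.
// They only record the call so the sketch runs on any platform.
std::vector<std::string> g_log;

bool FakeInitialize(bool succeed) {
  g_log.push_back("initialize");
  return succeed;
}

void FakeUninitialize() {
  g_log.push_back("uninitialize");
}

class RuntimeGuard {
public:
  explicit RuntimeGuard(bool succeed) {
    if (!FakeInitialize(succeed)) {
      // The destructor never runs for an object whose
      // constructor throws, so clean up by hand first.
      FakeUninitialize();
      throw std::runtime_error("Failed to start runtime.");
    }
  }
  ~RuntimeGuard() {
    g_log.push_back("destructor");
    FakeUninitialize();
  }
private:
  RuntimeGuard(const RuntimeGuard&);            // not copyable
  RuntimeGuard& operator=(const RuntimeGuard&);
};

// Returns the sequence of calls made when construction fails.
std::vector<std::string> FailedConstructionLog() {
  g_log.clear();
  try {
    RuntimeGuard guard(false);
  }
  catch (const std::runtime_error&) {
    g_log.push_back("caught");
  }
  return g_log;
}
```

The log returned by FailedConstructionLog contains no "destructor" entry: the only clean-up that happens is the explicit call made in the constructor before the throw.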

As stated above, the COMRuntimeInit instance must be declared before any other object on the stack. The instance cannot be put at file scope as it throws an exception if it fails, so the obvious place is at the top of main's scope. A try / catch block is also needed to detect the failure.

#include <iostream>
#include "comruntimeinit.h"
int main() {
  try {
    COMRuntimeInit comRuntime;
  }
  catch( const std::runtime_error& e) {
    std::cout << e.what() << "\n";
  }
  return 0;
}
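The reason the declaration order matters is that automatic objects are destroyed in the reverse order of their construction. The following portable sketch shows this; Tracer and ScopeOrder are invented names, with one Tracer playing the part of COMRuntimeInit and the other the part of a COM object created later in the same scope.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> g_events;

// Records its own construction ("+name") and
// destruction ("-name").
class Tracer {
public:
  explicit Tracer(const std::string& name) : name_(name) {
    g_events.push_back("+" + name_);
  }
  ~Tracer() { g_events.push_back("-" + name_); }
private:
  std::string name_;
  Tracer(const Tracer&);             // not copyable
  Tracer& operator=(const Tracer&);
};

std::vector<std::string> ScopeOrder() {
  g_events.clear();
  {
    Tracer runtime("runtime");    // stands in for COMRuntimeInit
    Tracer document("document");  // stands in for a COM object
  } // document is destroyed first, then runtime
  return g_events;
}
```

Because the runtime object is declared first, it is the last to be destroyed, so everything created after it is released before the runtime is shut down.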

Instantiating the MSXML DOM

Code that uses COM, as with most Microsoft API code, is just plain ugly and really should be hidden behind an interface. Exercise 1 of the XML project states that either the Xerces parser or the MSXML parser can be used. Ideally they should be easily interchangeable and their use completely hidden from the user. Hiding the ugly code and making the parsers easily interchangeable can be achieved with the Pimpl Idiom, as discussed by Herb Sutter in Exceptional C++ [ ExCpp ].

The first stage in the exercise is to create the MSXML DOM parser. This is achieved with the DOM class:

// dom.h
// Forward declaration so that implementation 
// can be completely hidden.
class DOMImpl;
class DOM {
private:
  DOMImpl *impl_;
public:
  DOM();
  ~DOM();
private:
  DOM(const DOM&);
  DOM& operator=(const DOM&);
};

The DOM class will form a basic wrapper for the DOMImpl class which will do all the work. DOMImpl is forward declared, so that its implementation can be completely hidden.

The DOM class implementation is shown below. It creates an instance of the DOMImpl class on the heap in the constructor and deletes it in the destructor.

// dom.cpp
#include "dom.h"
#include "domimpl.h"
DOM::DOM() : impl_(new DOMImpl) {}
DOM::~DOM() { delete impl_; }
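For readers without MSXML to hand, the shape of the idiom can be tried in a single portable file. This is a sketch under stated assumptions: Parser, ParserImpl and Name are invented for illustration, and the two halves that would normally live in a header and a source file are shown together.

```cpp
#include <cassert>
#include <string>

// --- what would live in the header ---
class ParserImpl;              // forward declaration only

class Parser {
public:
  Parser();
  ~Parser();
  std::string Name() const;    // forwards to the implementation
private:
  ParserImpl* impl_;
  Parser(const Parser&);       // not copyable
  Parser& operator=(const Parser&);
};

// --- what would live in the source file ---
// Clients of Parser never see this class, so it can be
// swapped for another implementation without recompiling
// them.
class ParserImpl {
public:
  std::string Name() const { return "stub parser"; }
};

Parser::Parser() : impl_(new ParserImpl) {}
Parser::~Parser() { delete impl_; }
std::string Parser::Name() const { return impl_->Name(); }
```

Only the second half would need to change to move from one parser to another; code that includes the header depends on nothing but the forward declaration.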

DOMImpl creates the MSXML DOM parser in the same way as the PerfectXML example:

// domimpl.h
#import "msxml4.dll"
class DOMImpl {
private:
  MSXML2::IXMLDOMDocument2Ptr xmlDoc_;
public:
  DOMImpl() : xmlDoc_(0) {
    xmlDoc_.CreateInstance(
                    "Msxml2.DOMDocument.4.0");
  }
private:
  DOMImpl(const DOMImpl&);
  DOMImpl& operator=(const DOMImpl&);
};

Both DOM and DOMImpl have private copy constructors and copy assignment operators, again to prevent copying.

The above code does not include any error checking. It is possible for the call to CreateInstance to fail. The msxml4.dll may not be registered, for example. The success or failure of the CreateInstance call can be detected by its return value.

DOMImpl() : xmlDoc_(0) {
  HRESULT hr = xmlDoc_.CreateInstance(
                   "Msxml2.DOMDocument.4.0");
  if(FAILED(hr)) {
    std::string errorMsg = "Failed to "
                           "create MSXML "
                           "DOM: ";
    switch(hr) {
      case CO_E_NOTINITIALIZED:
        errorMsg += "CoInitialize has not "
                    "been called.";
        break;
      case CO_E_CLASSSTRING:
        errorMsg += "Invalid class string.";
        break;
      case REGDB_E_CLASSNOTREG:
        errorMsg += "A specified class is "
                    "not registered.";
        break;
      case CLASS_E_NOAGGREGATION:
        errorMsg += "This class cannot be "
                    "created as part of an "
                    "aggregate.";
        break;
      case E_NOINTERFACE:
        errorMsg += "The specified class "
                    "does not implement the "
                    "requested interface.";
        break;
      default:
        errorMsg += "Unknown error.";
        break;
    }
    throw std::runtime_error(errorMsg );
  }
}

NonCopyable

We now have three classes which are "copy prevented", with a private copy constructor and copy assignment operator. There is a clearer way to document the fact that a class is not intended to be copied. When used by a number of different classes it also reduces the amount of code.

The NonCopyable class, shown below, has a private copy constructor and assignment operator to prevent copying. When another class inherits from NonCopyable, the compiler cannot generate a copy constructor or assignment operator for it, because the base class versions are inaccessible. This both prevents the subclass from being copied and documents the intention. The relationship between NonCopyable and its subclass is not IS-A and therefore the inheritance can be private.

As NonCopyable is intended only to provide behaviour to a derived class, rather than act as a class in its own right, its default constructor is protected, preventing a free NonCopyable object being created. Its destructor, too, is protected to prevent a subclass being deleted via a pointer to NonCopyable. To further document this intention, the destructor is not virtual.

class NonCopyable {
protected:
  NonCopyable() {}
  ~NonCopyable() {}
private:
  NonCopyable(const NonCopyable&);
  NonCopyable& operator=(const NonCopyable&);
};

The NonCopyable class was written by Dave Abrahams for the boost [ boost ] library. I have recreated it here so that a dependency on the boost library is avoided.

Now that the NonCopyable class is in place the copy constructors and assignment operators can be removed from COMRuntimeInit, DOM and DOMImpl. They can then be changed to privately inherit from NonCopyable.

class COMRuntimeInit : private NonCopyable {
  ...
};

class DOM : private NonCopyable {
  ...
};

class DOMImpl : private NonCopyable {
  ...
};
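The effect of the private base class can be checked at compile time. The sketch below assumes a compiler with C++11 support for <type_traits>, which postdates the rest of the code in this article; Widget, Plain and the two helper functions are invented for illustration.

```cpp
#include <cassert>
#include <type_traits>

class NonCopyable {
protected:
  NonCopyable() {}
  ~NonCopyable() {}
private:
  NonCopyable(const NonCopyable&);
  NonCopyable& operator=(const NonCopyable&);
};

// Hypothetical client classes, for illustration only.
class Widget : private NonCopyable {};
class Plain {};

// The compiler cannot generate copy operations for Widget
// because those of its base class are inaccessible.
static_assert(!std::is_copy_constructible<Widget>::value,
              "Widget must not be copy constructible");
static_assert(!std::is_copy_assignable<Widget>::value,
              "Widget must not be copy assignable");

bool WidgetIsCopyable() {
  return std::is_copy_constructible<Widget>::value ||
         std::is_copy_assignable<Widget>::value;
}

bool PlainIsCopyable() {
  return std::is_copy_constructible<Plain>::value &&
         std::is_copy_assignable<Plain>::value;
}
```

A Widget can still be default constructed and destroyed as usual; only copying is ruled out, and an attempt to copy one fails at compile time rather than at link time.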

Loading and Validating the XML

The MSXML DOM has a method that loads and parses an XML file. While parsing the file it is checked to make sure it is well formed and if there is a DTD or Schema specified it is also validated. If the file cannot be opened, is not well formed or cannot be validated the call fails.

The method is called load and takes a single parameter which is the full path to the XML file. To load and parse an XML file, a similar method can be added to DOMImpl and a corresponding forwarding function added to DOM.

class DOMImpl : private NonCopyable {
public:
  ...
  void Load(const std::string& fullPath) {
    xmlDoc_->load(fullPath.c_str());
  }
};

main can then be modified to call the new function with the path to an XML file.

try {
  COMRuntimeInit comRuntime;
  DOM dom;
  dom.Load("poem.xml");
}
catch(const std::runtime_error& e) {
  std::cout << e.what() << "\n";
}

Once again there is no way of detecting failure and the return value of the MSXML DOM load method must be tested to find out if it failed. If a failure has occurred an exception should be thrown.

void Load(const std::string& fullPath) {
  if(!xmlDoc_->load( fullPath.c_str())) {
    throw std::runtime_error(ErrorMessage());
  }
}

The method of extracting an error message from an MSXML DOM is a little fiddly, so I have placed it in its own function, ErrorMessage.

class DOMImpl : private NonCopyable {
public:
  ...
  std::string ErrorMessage() const {
    std::string result = "Failed to extract "
                         "error.";
    MSXML2::IXMLDOMParseErrorPtr pError =
                         xmlDoc_->parseError;
    if(pError->reason.length()) {
      result = pError->reason;
    }
    return result;
  }
};

A parse error is extracted from an MSXML DOM as an IXMLDOMParseError object. The error description is fetched from the reason property. If no description is available, the _bstr_t returned by reason has a length of 0. _bstr_t is a wrapper class for BSTR, COM's native unsigned short* string type. It provides a conversion to const char*, and thus can be assigned to a std::string.

Custom Exception Types

Our main function's body is

try {
  COMRuntimeInit comRuntime;
  DOM dom;
  dom.Load("poem.xml");
}
catch(const std::runtime_error& e) {
  std::cout << e.what() << "\n";
}

Currently the example throws a std::runtime_error if the COM runtime fails to initialise or if there is an XML failure. In both cases the error message is prefixed with a description of the type of error. Exceptions thrown as a result of the COM runtime failing to initialise are probably fatal and it may be appropriate for the program to exit, while for exceptions thrown due to an XML parse fail it might be more appropriate to log the error and move on to the next file.

These different categories of error would be better communicated by the exception's actual type and it is easy to add custom exceptions. Throwing different types of exceptions helps to maintain the context in which the exception was thrown and enables the behaviour of a program to change based on the type of exception that is thrown.

Deriving from std::exception not only means that custom exception types can be caught along with other standard exception types in a single catch statement if necessary, but also provides an implementation for the custom exception object.

class BadCOMRuntime : public std::exception {
public:
  BadCOMRuntime(const std::string& msg)
        : exception(msg.c_str()) {}
};

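The std::exception constructor used here takes a const char*; note that this constructor is an extension provided by the Visual C++ library rather than part of the C++ standard. I know that I will be building exception messages with strings so, following the model of std::runtime_error, BadCOMRuntime's constructor takes a std::string.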

COMRuntimeInit 's constructor must be modified for the new exception:

COMRuntimeInit() {
  HRESULT hr = ::CoInitializeEx(0,
                       COINIT_APARTMENTTHREADED);
  if(FAILED(hr)) {
    UnInitialize();
    std::string errorMsg = "Unknown.";
    switch(hr) {
      case E_INVALIDARG:
        errorMsg = "An invalid parameter was "
                   "passed to the returning "
                   "function.";
        break;
      ...
      default:
        break;
    }
    throw BadCOMRuntime(errorMsg);
  }
}

and main must be modified to catch the new exception:

try {
  COMRuntimeInit comRuntime;
  DOM dom;
  dom.Load("poem.xml");
}
catch(const BadCOMRuntime& e) {
  std::cout << "COM initialisation error: "
            << e.what()
            << "\n";
}
...

The exceptions thrown by DOMImpl are a little more complicated. DOMImpl throws exceptions when two different things happen and therefore requires two different exception types, which should be in some way related. One way to solve this is to have a common exception type for DOMImpl from which two other exception types derive.

DOMImpl is the implementation of DOM and any exception thrown by DOMImpl is most likely to be caught outside DOM. Therefore, to the user of DOM, who is unaware of DOMImpl, it is more logical for DOM to be throwing exceptions of type BadDOM rather than BadDOMImpl.

#include <stdexcept>
#include <string>
class BadDOM : public std::exception {
public:
  BadDOM(const std::string& msg)
        : exception(msg.c_str()) {}
};
class CreateFailed : public BadDOM {
public:
  CreateFailed(const std::string& msg)
        : BadDOM(msg) {}
};
class BadParse : public BadDOM {
public:
  BadParse(const std::string& msg)
        : BadDOM(msg) {}
};

The constructor and Load function in DOMImpl can now be modified to use the new exception types and main modified to catch a BadDOM exception. For completeness' sake, we also need a third catch block: the COM smart pointers generated by #import raise a _com_error if a function call fails.

try {
  COMRuntimeInit comRuntime;
  DOM dom;
  dom.Load("poem.xml");
}
catch(const BadCOMRuntime& e) {
  std::cout << "COM initialisation error: "
            << e.what() << "\n";
}
catch(const BadDOM& e) {
  std::cout << "DOM error: "
            << e.what() << "\n";
}
catch(const _com_error& e) {
  std::cout << "COM error: "
            << e.ErrorMessage() << "\n";
}
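The way the hierarchy lets a caller choose how specific to be can be tried in portable C++. In this sketch the exception classes derive from std::runtime_error instead of using the std::exception(const char*) constructor, which is specific to Visual C++; that substitution, the Failure enumeration and the Fail and Describe functions are all assumptions made for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

class BadDOM : public std::runtime_error {
public:
  explicit BadDOM(const std::string& msg)
        : std::runtime_error(msg) {}
};
class CreateFailed : public BadDOM {
public:
  explicit CreateFailed(const std::string& msg)
        : BadDOM(msg) {}
};
class BadParse : public BadDOM {
public:
  explicit BadParse(const std::string& msg)
        : BadDOM(msg) {}
};

enum Failure { FailCreate, FailParse, FailOther };

// Hypothetical function that fails in the requested way.
void Fail(Failure failure) {
  switch (failure) {
    case FailCreate:
      throw CreateFailed("class not registered");
    case FailParse:
      throw BadParse("unexpected end of file");
    default:
      throw std::runtime_error("something else");
  }
}

// The most derived handler comes first; the BadDOM handler
// picks up any other DOM failure.
std::string Describe(Failure failure) {
  try {
    Fail(failure);
  }
  catch (const BadParse& e) {
    return std::string("skip file: ") + e.what();
  }
  catch (const BadDOM& e) {
    return std::string("DOM error: ") + e.what();
  }
  catch (const std::exception& e) {
    return std::string("other: ") + e.what();
  }
  return "no error";
}
```

A caller that wants to log a parse failure and move on to the next file can catch BadParse alone, while one that does not care about the distinction can catch BadDOM.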

Engineering the Exercise Solution: Part 2

Now that the DOM is loading and validating XML the next part of the exercise is to write the elements to an output stream as an indented tree.

Writing the Element Structure

The first step in enabling the elements to be written to an output stream is to pass one in. The obvious way to do this is to add a function to DOMImpl, and a forwarding function to DOM, which takes a std::ostream reference.

#include <ostream>
class DOMImpl : private NonCopyable {
...
public:
  void WriteTree(std::ostream& out) {}
...
};

Modifying main to call the new function means that results can be seen straight away as the WriteTree implementation is developed.

try {
  COMRuntimeInit comRuntime;
  DOM dom;
  dom.Load("poem.xml");
  dom.WriteTree(std::cout);
}
...

In order to write the complete tree, every element must be visited. Starting with the root element, the rest of the elements can then be visited in a depth-first traversal. I wrote the following function, based on some Delphi written by Adrian Fagg, which gets a pointer to the root element and then calls the function WriteBranch which recurses the rest of the tree.

void WriteTree(std::ostream& out) {
  MSXML2::IXMLDOMElementPtr root =
                     xmlDoc_->documentElement;
  WriteBranch(root, 0, out);
}

The WriteBranch function is also based on Adrian Fagg's Delphi code. The code is largely self-explanatory, but basically it:

  1. Gets the tag name of the element passed to it.

  2. Writes the tag name to the supplied std::ostream, indented by twice the specified indentation level.

  3. Uses the supplied element to get a pointer to its first child.

  4. If the child pointer is not 0, uses it to get the node type.

  5. If the node is of type NODE_ELEMENT, calls WriteBranch again (recursion).

  6. Uses the child pointer to get the next sibling.

  7. Returns when there are no more siblings.

void WriteBranch(
            MSXML2::IXMLDOMElementPtr element, 
            unsigned long indentation,
            std::ostream& out) {
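  // tagName returns a _bstr_t; convert it
  // explicitly to a narrow string for output.
  _bstr_t cbstr = element->tagName;
  out << std::string(2 * indentation, ' ') 
      << static_cast<const char*>(cbstr)
      << std::endl;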
  MSXML2::IXMLDOMNodePtr child =
                          element->firstChild;
  while(child != 0) {
    if(child->nodeType ==
                      MSXML2::NODE_ELEMENT) {
      WriteBranch(child,
                  indentation + 1, out);
    }
    child = child->nextSibling;
  }
}

The result of running the program is now that the following is written to the console:

poem
  line
  line
  line
  line
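The same traversal can be tried without MSXML. The sketch below replaces the DOM with a minimal, invented Node structure (Node, Element, Text and PoemTree are not part of MSXML or of the project code) and applies the same depth-first recursion to an in-memory copy of the poem.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node: either an element
// (name holds the tag name) or text (name holds the text).
struct Node {
  bool isElement;
  std::string name;
  std::vector<Node> children;
};

Node Element(const std::string& tagName) {
  Node node;
  node.isElement = true;
  node.name = tagName;
  return node;
}

Node Text(const std::string& value) {
  Node node;
  node.isElement = false;
  node.name = value;
  return node;
}

// Writes the element's tag name, then recurses into each
// child that is itself an element. Text nodes are skipped.
void WriteBranch(const Node& element,
                 unsigned long indentation,
                 std::ostream& out) {
  out << std::string(2 * indentation, ' ')
      << element.name << '\n';
  for (std::size_t i = 0; i < element.children.size(); ++i) {
    const Node& child = element.children[i];
    if (child.isElement) {
      WriteBranch(child, indentation + 1, out);
    }
  }
}

// Builds the poem document in memory and returns its tree.
std::string PoemTree() {
  const char* lines[] = { "Roses are red,",
                          "Violets are blue.",
                          "Sugar is sweet,",
                          "and I love you" };
  Node poem = Element("poem");
  for (std::size_t i = 0; i < 4; ++i) {
    Node line = Element("line");
    line.children.push_back(Text(lines[i]));
    poem.children.push_back(line);
  }
  std::ostringstream out;
  WriteBranch(poem, 0, out);
  return out.str();
}
```

PoemTree returns the same five lines as shown above: poem at the left margin followed by four line entries indented by two spaces.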

With that the exercise is complete.

Next Step

The logical next step would of course be exercise 2. However, as well as completing the exercises which help the students learn about XML, one of the aims of the ACCU Mentored Developers XML Project is to write a standard interface behind which any parser, such as MSXML or Xerces can be used. Therefore, the next step is to design a common interface to the DOM.

Paul Grenyer and Jez Higgins

Thank You

Thanks to all the members of the ACCU Mentored Developers XML Project, especially Adrian Fagg, Rob Hughes, Thaddaeus Frogley and Alan Griffiths for the proof reading and code suggestions.

References

[boost] The boost library: http://www.boost.org

[DOMRec] W3C Document Object Model (DOM): http://www.w3.org/DOM/

[ECpp] Scott Meyers, Effective C++: 50 Specific Ways to Improve Your Programs and Designs. Addison Wesley: ISBN 0-201-92488-9

[ExCpp] Herb Sutter, Exceptional C++. Addison Wesley: ISBN 0201615622

[InstMsi] Windows Installer 2.0: http://www.microsoft.com/downloads/details.aspx?FamilyID=4b6140f9-2d36-4977-8fa1-6f8a0f5dca8f&displaylang=en

[MDevelopers] ACCU Mentored Developers: http://www.accu.org/mdevelopers/

[MSXML] Microsoft XML parser: http://www.microsoft.com/downloads/details.aspx?FamilyID=3144b72b-b4f2-46da-b4b6-c5d7485f2b42&displaylang=en

[PerfectXML] PerfectXML: http://www.perfectxml.com/msxml.asp

[UsingDOM] Using DOM: http://www.perfectxml.com/CPPMSXML/20020710.asp

[Xerces] Xerces XML parser: http://xml.apache.org/xerces-c

[XMLRec] Extensible Markup Language (XML): http://www.w3.org/XML/





