Chapter 2. Parsers

Like the underlying libxml2 library, libxml++ offers three parsers to suit different needs: the DOM, SAX, and TextReader parsers. This chapter explains their relative advantages and behaviour.

All of the parsers can parse XML documents from a file on disk, from a string, or from a C++ std::istream. Although the libxml++ API uses only xmlpp::ustring, which indicates the UTF-8 encoding, libxml++ can parse documents in any encoding and converts them to UTF-8 automatically. This conversion does not lose any information, because UTF-8 can represent every Unicode character.

Remember that white space is usually significant in XML documents, so the parsers might provide unexpected text nodes that contain only spaces and newlines. The parser does not know whether you care about these text nodes, so it is up to your application to ignore them where appropriate.

DOM Parser

The DOM (Document Object Model) parser parses the whole document at once and stores the structure in memory, available via DomParser::get_document(). With methods such as Document::get_root_node() and Node::get_children(), you may then navigate the hierarchy of XML nodes without restriction, jumping forwards or backwards in the document based on the information that you encounter. Because the whole document is held in memory, the DOM parser uses a relatively large amount of memory.

You should use C++ RTTI (via dynamic_cast<>) to identify the specific node type and to perform actions which are not possible with all node types. For instance, only Elements have attributes. Here is the inheritance hierarchy of node types:

  • xmlpp::Node

    • xmlpp::Attribute

      • xmlpp::AttributeDeclaration

      • xmlpp::AttributeNode

    • xmlpp::ContentNode

      • xmlpp::CdataNode

      • xmlpp::CommentNode

      • xmlpp::EntityDeclaration

      • xmlpp::ProcessingInstructionNode

      • xmlpp::TextNode

    • xmlpp::Element

    • xmlpp::EntityReference

    • xmlpp::XIncludeEnd

    • xmlpp::XIncludeStart

All Nodes created by the DOM parser are leaves in the node type tree. For instance, the DOM parser can create TextNodes and Elements, but it does not create objects whose exact type is ContentNode or Node.

Although you may obtain pointers to the Nodes, these Nodes are always owned by their parent Node. In most cases that means that the Node will exist, and your pointer will be valid, as long as the Document instance exists.

There are also several methods which can create new child Nodes. By using these, and one of the Document::write_*() methods, you can use libxml++ to build a new XML document.

Example

This example looks in the document for expected elements and then examines them. All these examples are included in the libxml++ source distribution.

Source Code

File: main.cc

#include <libxml++/libxml++.h>
#include <iostream>
#include <cstdlib>
#include <optional>
#include <string>

std::ostream& operator<<(std::ostream& o, const std::optional<xmlpp::ustring>& s)
{
  o << s.value_or("{[(no value)]}");
  return o;
}

void print_node(const xmlpp::Node* node, unsigned int indentation = 0)
{
  const std::string indent(indentation, ' ');
  std::cout << std::endl; //Separate nodes by an empty line.

  const auto nodeContent = dynamic_cast<const xmlpp::ContentNode*>(node);
  const auto nodeText = dynamic_cast<const xmlpp::TextNode*>(node);
  const auto nodeComment = dynamic_cast<const xmlpp::CommentNode*>(node);

  if(nodeText && nodeText->is_white_space()) //Let's ignore the indenting - you don't always want to do this.
    return;

  const auto nodename = node->get_name2();

  if(!nodeText && !nodeComment && nodename) //Let's not say "name: text".
  {
    const auto namespace_prefix = node->get_namespace_prefix2();

    std::cout << indent << "Node name = ";
    if(namespace_prefix)
      std::cout << namespace_prefix << ":";
    std::cout << nodename << std::endl;
  }
  else if(nodeText) //Label text nodes explicitly (white-space-only nodes were skipped above).
  {
    std::cout << indent << "Text Node" << std::endl;
  }

  //Treat the various node types differently:
  if(nodeText)
  {
    std::cout << indent << "text = \"" << nodeText->get_content2() << "\"" << std::endl;
  }
  else if(nodeComment)
  {
    std::cout << indent << "comment = " << nodeComment->get_content2() << std::endl;
  }
  else if(nodeContent)
  {
    std::cout << indent << "content = " << nodeContent->get_content2() << std::endl;
  }
  else if(auto nodeElement = dynamic_cast<const xmlpp::Element*>(node))
  {
    //A normal Element node:

    //get_line() works only for Element nodes.
    std::cout << indent << "     line = " << node->get_line() << std::endl;

    //Print attributes:
    for (const auto& attribute : nodeElement->get_attributes())
    {
      const auto namespace_prefix = attribute->get_namespace_prefix2();

      std::cout << indent << "  Attribute ";
      if(namespace_prefix)
        std::cout << namespace_prefix << ":";
      std::cout << attribute->get_name2() << " = "
                << attribute->get_value2() << std::endl;
    }

    const auto attribute = nodeElement->get_attribute("title");
    if(attribute)
    {
      std::cout << indent;
      if (dynamic_cast<const xmlpp::AttributeNode*>(attribute))
        std::cout << "AttributeNode ";
      else if (dynamic_cast<const xmlpp::AttributeDeclaration*>(attribute))
        std::cout << "AttributeDeclaration ";
      std::cout << "title = " << attribute->get_value2() << std::endl;
    }
  }

  if(!nodeContent)
  {
    //Recurse through child nodes:
    for(const auto& child : node->get_children())
    {
      print_node(child, indentation + 2); //recursive
    }
  }
}

int main(int argc, char* argv[])
{
  bool validate = false;
  bool set_throw_messages = false;
  bool throw_messages = false;
  bool substitute_entities = true;
  bool include_default_attributes = false;

  int argi = 1;
  while (argc > argi && *argv[argi] == '-') // option
  {
    switch (*(argv[argi]+1))
    {
      case 'v':
        validate = true;
        break;
      case 't':
        set_throw_messages = true;
        throw_messages = true;
        break;
      case 'e':
        set_throw_messages = true;
        throw_messages = false;
        break;
      case 'E':
        substitute_entities = false;
        break;
      case 'a':
        include_default_attributes = true;
        break;
      default:
        std::cout << "Usage: " << argv[0] << " [-v] [-t] [-e] [-E] [-a] [filename]" << std::endl
                  << "       -v  Validate" << std::endl
                  << "       -t  Throw messages in an exception" << std::endl
                  << "       -e  Write messages to stderr" << std::endl
                  << "       -E  Do not substitute entities" << std::endl
                  << "       -a  Include default attributes in the node tree" << std::endl;
        return EXIT_FAILURE;
    }
    argi++;
  }
  std::string filepath;
  if(argc > argi)
    filepath = argv[argi]; //Allow the user to specify a different XML file to parse.
  else
    filepath = "example.xml";

  try
  {
    xmlpp::DomParser parser;
    if (validate)
      parser.set_validate();
    if (set_throw_messages)
      parser.set_throw_messages(throw_messages);
    //We can have the text resolved/unescaped automatically.
    parser.set_substitute_entities(substitute_entities);
    parser.set_include_default_attributes(include_default_attributes);
    parser.parse_file(filepath);
    if(parser)
    {
      //Walk the tree:
      const auto pNode = parser.get_document()->get_root_node(); //deleted by DomParser.
      print_node(pNode);
    }
  }
  catch(const std::exception& ex)
  {
    std::cerr << "Exception caught: " << ex.what() << std::endl;
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}