Manipulating XML at the command line with xmlstarlet

Linux.Ars returns with a tutorial on how to mess around with XML using …

Tools, Tips, and Tweaks

Manipulating XML at the command line with xmlstarlet

In the world of open-source software, where open data formats are a necessity, XML is poised to become the de facto standard. A number of popular open-source applications already use XML as their primary data format, and many developers utilize it extensively in specialized, personal-use applications. There is a clear need for powerful and effective tools that facilitate dynamic and interactive manipulation of XML content stored in files on the local drive or acquired from remote locations.

Xmlstarlet is a versatile command-line utility that enables users to manipulate, filter, edit, search, validate, and apply stylesheets to XML content. Unfortunately, the versatility of xmlstarlet comes at the expense of usability. It is far from intuitive, and many users struggle with the cryptic command-line parameters and peculiar scripting idiom. There are too many features to cover here, but I would like to introduce this powerful utility and show you a few ways that you can use it to simplify some basic, everyday tasks.

For these examples, I have constructed a simple XML file that contains information about several of the astronaut monkeys launched into space by NASA. Each monkey element contains a name attribute that specifies the name of the individual monkey, a date element that contains the date of the monkey's first flight, and a species element that describes the monkey's species.

monkeys.xml

<spaceapes>
  <monkey name="Gordo">
    <date>12/13/58</date>
    <species>Squirrel</species>
  </monkey>
  <monkey name="Able">
    <date>5/28/59</date>
    <species>Rhesus</species>
  </monkey>
  <monkey name="Baker">
    <date>5/28/59</date>
    <species>Squirrel</species>
  </monkey>
  <monkey name="Sam">
    <date>12/04/59</date>
    <species>Rhesus</species>
  </monkey>
</spaceapes>

The xmlstarlet command enables users to extract information from XML content with simple XPath queries. Xmlstarlet can generate plain text or filtered XML. Let's start with a simple data extraction experiment. We will use xmlstarlet to determine how many monkeys are described in the monkeys.xml file:

$ xmlstarlet sel -t -v "count(//monkey)" monkeys.xml
4

The sel instruction tells xmlstarlet that we plan to extract or filter data. The -t parameter indicates that the following parameters are part of the output template, and the -v parameter outputs the value of an XPath expression. In this case, our XPath expression counts all the monkey element nodes. The XPath syntax is beyond the scope of this brief introduction, but interested readers can learn the XPath language from the helpful tutorial at the Zvon web site.
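For readers who find the template flags opaque, here is a rough sketch of the same count in Python, using only the standard library's xml.etree.ElementTree module. It is not what xmlstarlet does internally. ElementTree supports only a subset of XPath and has no count() function, so the sketch counts the matched nodes with len() instead. The MONKEYS_XML string is the data from monkeys.xml, inlined so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

# The contents of monkeys.xml, inlined so this sketch is self-contained.
MONKEYS_XML = """<spaceapes>
  <monkey name="Gordo"><date>12/13/58</date><species>Squirrel</species></monkey>
  <monkey name="Able"><date>5/28/59</date><species>Rhesus</species></monkey>
  <monkey name="Baker"><date>5/28/59</date><species>Squirrel</species></monkey>
  <monkey name="Sam"><date>12/04/59</date><species>Rhesus</species></monkey>
</spaceapes>"""

root = ET.fromstring(MONKEYS_XML)

# ".//monkey" selects every monkey element below the root, much like "//monkey".
# ElementTree has no count(), so we take the length of the match list.
monkey_count = len(root.findall(".//monkey"))
print(monkey_count)  # prints 4
```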

Now we will generate a table that lists the name of each monkey as well as its species:

$ xmlstarlet sel -t -m "//monkey" -v "species" -o " " -v "@name" -n monkeys.xml
Squirrel Gordo
Rhesus Able
Squirrel Baker
Rhesus Sam

In this example, we iterate over each monkey element in the XML file and display the relevant data. The -m parameter tells xmlstarlet to iterate over all nodes that match the provided XPath expression, which is "//monkey" in this case. The template parameters that follow the XPath expression are evaluated and output once for each matched node. Here, we display the species element of each monkey element, as well as the name attribute. Note that the -v expressions are evaluated relative to the matched node rather than the top level of the XML document, so we write "species" instead of "//monkey[x]/species". The -o parameter tells xmlstarlet to output a literal text string, and it is used in this example to put a space between the two values associated with each monkey. At the end of our template, we include the -n parameter, which tells xmlstarlet to output a newline character. If we omitted the -n parameter in this example, all the data would appear on one line of text.
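The match-then-evaluate pattern may be easier to see written out as an explicit loop. The following Python sketch mirrors the template step by step with xml.etree.ElementTree. The species_table function is a name I made up for illustration, and the comments map each line back to the xmlstarlet flag it stands in for.

```python
import xml.etree.ElementTree as ET

# The contents of monkeys.xml, inlined so this sketch is self-contained.
MONKEYS_XML = """<spaceapes>
  <monkey name="Gordo"><date>12/13/58</date><species>Squirrel</species></monkey>
  <monkey name="Able"><date>5/28/59</date><species>Rhesus</species></monkey>
  <monkey name="Baker"><date>5/28/59</date><species>Squirrel</species></monkey>
  <monkey name="Sam"><date>12/04/59</date><species>Rhesus</species></monkey>
</spaceapes>"""


def species_table(xml_text):
    """Return one 'species name' line per monkey element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    lines = []
    for monkey in root.iter("monkey"):        # stands in for -m "//monkey"
        species = monkey.findtext("species")  # -v "species", relative to the match
        name = monkey.get("name")             # -v "@name"
        lines.append(f"{species} {name}")     # -o " " between the two values
    return "\n".join(lines)                   # -n after each match


print(species_table(MONKEYS_XML))
```

Because the lookups inside the loop start from the matched monkey element, they behave like the relative "species" and "@name" expressions in the template rather than like document-wide queries.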

Xmlstarlet can also operate on remote XML content. Let's abandon our monkey example, and try to extract some content from the Ars Technica RSS feed:

$ xmlstarlet sel --net -t -m "//item" -o "Title: " -v "title" -n \
   -o "Author: " -v "author" -n  http://arstechnica.com/index.ars/rss
Title: Microsoft server software to go 64-bit only
Author: jeremy@arstechnica.com (Jeremy Reimer)
Title: Firefox 1.5 release expected soon
Author: segphault@sbcglobal.net (Ryan Paul)
Title: Online DVD rentals have bright future
Author: eric@arstechnica.com (Eric Bangeman)
...

In this example, we include the --net parameter to tell xmlstarlet to download the XML content from a remote location. The example iterates over every item element in the XML document, and displays the title and author elements for each item.
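As a point of comparison, the same fetch-and-iterate idea can be sketched in Python with urllib.request and xml.etree.ElementTree. This is only an illustration under a few assumptions: fetch_feed and titles_and_authors are names I invented, and SAMPLE_RSS is a made-up stand-in feed so that the example can run without a network connection.

```python
import urllib.request
import xml.etree.ElementTree as ET

# A made-up stand-in for a real feed, so the sketch runs offline.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First headline</title><author>first@example.com (First Author)</author></item>
  <item><title>Second headline</title><author>second@example.com (Second Author)</author></item>
</channel></rss>"""


def fetch_feed(url):
    """Download a feed and return its raw bytes (roughly the job of --net)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def titles_and_authors(rss_text):
    """Return 'Title: ...' and 'Author: ...' lines for every item element."""
    root = ET.fromstring(rss_text)
    lines = []
    for item in root.iter("item"):  # stands in for -m "//item"
        lines.append("Title: " + item.findtext("title", default=""))
        lines.append("Author: " + item.findtext("author", default=""))
    return lines


print("\n".join(titles_and_authors(SAMPLE_RSS)))
```

To run it against a live feed, pass the result of fetch_feed(url) to titles_and_authors in place of SAMPLE_RSS.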

Xmlstarlet can also process remote HTML content. If you use the --html parameter in addition to the --net parameter, you can extract data from web sites. To generate a list of image files used in a web page, simply iterate over each img element and display the src attribute:

$ xmlstarlet sel --net --html -t -m "//img" -v "@src" -n http://xmlstar.sourceforge.net
img/xmlstarlet.png
/img/libxml2-logo.png
http://sourceforge.net/sflogo.php?group_id=66612&type=1
http://sourceforge.net/dbimage.php?id=3426
http://images.sourceforge.net/images/xml.png
http://www.zvon.org/site/graphic/zvon.gif
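Real-world HTML is rarely well-formed XML, which is why a forgiving parser is needed for this kind of extraction. As a rough analogue, Python's standard html.parser module tolerates unclosed tags in a similar spirit. The sketch below collects the src attribute of every img element. The ImgSrcCollector class, the image_sources function, and the SAMPLE_HTML page are all invented for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of each img tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for key, value in attrs:
                if key == "src" and value is not None:
                    self.sources.append(value)


def image_sources(html_text):
    """Return the src values of all img elements, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# A made-up page with an unclosed paragraph and an img that has no src.
SAMPLE_HTML = """<html><body>
<p>Unclosed paragraph
<img src="img/xmlstarlet.png" alt="logo">
<img src="/img/libxml2-logo.png">
<img alt="no source here">
</body></html>"""

print("\n".join(image_sources(SAMPLE_HTML)))
```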

Now let's try a more sophisticated example. As many of you know, the Open Document Format, which is used by OpenOffice 2 and other open-source office applications, is based on XML. With a little bit of clever trickery, you can use xmlstarlet to extract content from your OpenOffice documents right at the command line. Open Document files are essentially zip archives that contain all the relevant files associated with a document. The actual document text is stored in a file called content.xml within the archive. In order to use xmlstarlet to extract data from content.xml, you have to use the unzip command to pipe the contents of content.xml into the xmlstarlet utility.

In our next example, we will list all the headings in the document and the associated heading level values, a technique that could be used to automatically generate outlines of Open Document files. The Open Document format uses many different XML namespaces for different kinds of content. Various text elements use the "urn:oasis:names:tc:opendocument:xmlns:text:1.0" namespace, so we will need to use that one to get the headings. Xmlstarlet allows you to bind a namespace to a prefix with the -N parameter. In our example, we will bind the Open Document text namespace to the prefix text:

$ unzip -p test.odt content.xml | xmlstarlet sel \
  -N text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" \
  -t -m "//text:*[@text:outline-level]" -v "@text:outline-level" -o " " -v . -n

Our example iterates over every element in the text namespace that has an outline-level attribute, and it displays the associated level value and the text of the node itself. Note that we do not tell xmlstarlet which file it should use for this operation, because we pipe in the relevant content.
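The unzip-and-query pipeline can also be sketched in Python with the standard zipfile and xml.etree.ElementTree modules, which may help clarify what the namespace binding is doing. The list_headings and demo_archive functions are names I made up, and the demo builds a tiny in-memory archive containing only a content.xml file, so it is a stand-in for a real .odt document rather than a valid one.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# ElementTree spells namespaced names as "{namespace-uri}local-name".
LEVEL_ATTR = f"{{{TEXT_NS}}}outline-level"


def list_headings(odt_file):
    """Return (level, text) pairs for text-namespace elements with an outline level."""
    with zipfile.ZipFile(odt_file) as archive:
        content = archive.read("content.xml")  # like: unzip -p file.odt content.xml
    root = ET.fromstring(content)
    headings = []
    for element in root.iter():
        in_text_ns = isinstance(element.tag, str) and element.tag.startswith(f"{{{TEXT_NS}}}")
        if in_text_ns and LEVEL_ATTR in element.attrib:
            # itertext() joins all nested text, similar to the string value of "."
            headings.append((element.get(LEVEL_ATTR), "".join(element.itertext())))
    return headings


def demo_archive():
    """Build a minimal in-memory zip holding a made-up content.xml."""
    content = (
        '<office:document-content '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        '<text:h text:outline-level="1">Introduction</text:h>'
        '<text:p>Body text.</text:p>'
        '<text:h text:outline-level="2">Background</text:h>'
        '</office:text></office:body></office:document-content>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


for level, text in list_headings(demo_archive()):
    print(level, text)
```

Passing the path of a real Open Document file to list_headings in place of demo_archive() should work the same way, since zipfile.ZipFile accepts either a path or a file-like object.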

As you can see, xmlstarlet is an extremely useful tool for command-line XML operations. There are many other features that I have not presented here, and interested users should take a look at the documentation for additional examples.

Cool App of the Week

SuperTux

I don't know about you folks, but I'm a hardcore Mario fanatic. My obsession with Super Mario World for the SNES borders on religion; I have played Mario 3 so many times that I can probably beat the first world with my eyes closed; and I have unraveled virtually every hidden feature in Mario RPG. My dreams are filled with plumbers, mushrooms, and funky flying turtle things that inexplicably pursue my destruction. For all those reasons, I have become hopelessly addicted to SuperTux, an outstanding, open-source, Mario-inspired side-scroller for Linux.

SuperTux is essentially a Mario clone with unique art, original audio content, and well-designed levels. The main character is, of course, an adorable penguin that hops his way through tricky levels filled with walking bombs and evil snowballs. The current release contains all the content associated with Milestone 1, which includes 9 different enemies and 26 playable levels that feature the obligatory winter theme. Milestone 2, which is currently under active development, will add new enemies, up to 30 new levels with a forest theme, support for penguin "flapping" (doesn't that sound cute?), and internationalization support.


SuperTux in action

I have now beaten every level in the first world, and about a third of the levels on the bonus island. Despite a few subtle bugs and the amateur-quality art, this game is highly entertaining and woefully addictive. The developers are very creative, and some of the concept art illuminates other features planned for future releases. If you are a Mario fan, or you are looking for a fun way to waste some time on your Linux system, you might want to check out SuperTux. Warning: it will decimate your productivity, so play at your own risk.

/dev/random

  • Gaim-vv developers claim that Google has too much control over Gaim development.
  • Oooh shiny! OSDir has a screenshot tour of KDE 3.5 RC 1.
  • Microsoft's Charles Fitzgerald thinks that open source users are "dorks."
  • MIT turns down free copies of OS X for its US$100 laptop project because Apple isn't willing to distribute its operating system under an open-source license.
  • OSTG announced a patent pledge web site that makes it easy for open source developers to find out which patents companies like IBM have made available for royalty-free usage.
  • Linux.com has a tutorial that introduces netcat, the hacker's Swiss Army knife.
