Perl Parser Performance

September 15, 2004

A few years ago there was one dominant XML parser in Perl; parsing an XML document was synonymous with using the XML::Parser module. The module, written by Larry Wall and Clark Cooper, worked as an interface to James Clark's expat XML parser, and it didn't leave much room for competitors. Traditional Perl modules for XML processing were built on top of XML::Parser.

But times are changing. Other C/C++ parsers, such as libxml2 and Xerces C++, have entered the scene, and so have their Perl extensions. Perl XML folks have developed Perl SAX, a Perlish counterpart of the Java SAX interface. Currently, CPAN contains several parsing modules. This article compares the performance of five PerlSAX 2 parsers freely available from CPAN. The good old XML::Parser is also included to serve as a baseline.

I must state that this article isn't an independent study, as I'm a maintainer of one of the modules involved in this test. The original purpose of the test was to check whether the new parser performs as expected. But the methodology is, to the best of my knowledge, neutral; all data and sources are available, and the results are easily reproducible by anyone.

Modules Involved

Six Perl parsers are included in the test. All of them can be downloaded from CPAN. The C or C++ libraries required by individual Perl modules (expat, libxml2, Xerces C++) are also open source. Since Perl modules have lengthy names with a good deal of colons, I use my own abbreviations to refer to them within this article:

XML::Parser [PARS], v2.34
Expat wrapper, has a specific event-based API different from PerlSAX 2.

XML::SAX::Expat [EXP], v0.37
Built on top of XML::Parser, makes use of XML::SAX::Base.

XML::SAX::ExpatXS [EXPXS], v1.00
Expat wrapper, branched from the sources of XML::Parser, makes use of XML::SAX::Base.

XML::LibXML, v1.58
Contains two PerlSAX 2 parsers, XML::LibXML::SAX [LXML] and XML::LibXML::SAX::Parser [LXMLP]; both are interfaces to libxml2 and both make use of XML::SAX::Base. While LXML is a true streaming parser, the older LXMLP builds a DOM tree internally. In fact, it has been deprecated in favor of LXML. Both LXML and LXMLP are included in the test to compare the performance of the two approaches.

XML::Xerces [XERC], v2.5.0-0
Interface to Xerces C++. It works with PerlSAX 2 handlers, but its API differs from the specs in some respects.

There is one more PerlSAX 2 compliant parser not included in this test. XML::SAX::PurePerl belongs to the XML::SAX package and serves as a pure-Perl fallback parser. Once you install XML::SAX you have a parser, regardless of external libraries installed in your system. However, this parser is considerably slower than those built around C/C++ libraries. I have dropped it from the test as I don't want to compare apples to oranges.

Test Documents

To simplify the selection of test documents, I have reused the documents that Clark Cooper created for his benchmark of XML parsers, published on XML.com in May 1999. REC.xml is the XML version of the XML 1.0 specification (REC-xml-19980210.xml); the other documents (med.xml, chrmed.xml, big.xml, chrbig.xml) are mechanically expanded versions of REC.xml that provide various sizes and markup densities.

What I missed in Clark's selection were the smaller, denser documents typical of the Web. Accordingly, I have added two real-world files: gingerall.xml, an XHTML file downloaded from the gingerall.org site, and rss10.xml, an RSS 1.0 file originating from the recently hibernated xmlhack.com. Table 1 contains the complete set of XML documents used in this test:

Table 1.
Characteristics of Test Documents

	REC	med	chrmed	big	chrbig	gingerall	rss10
Size (bytes)	159339	1264240	893821	5052472	3417181	12400	6005
Markup density	34%	33%	6%	33%	2%	43%	45%

Test Method

All the modules are tested using a single Perl script. EXP, EXPXS, LXML, and LXMLP are treated in exactly the same way; this can be seen as a proof of the Perl SAX2 concept. XERC shares the same handler, but it requires special treatment in the constructor and in the parsing method call. PARS has both an API and a handler of its own.
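As a minimal sketch of what identical treatment can look like, the loop below instantiates each PerlSAX 2 parser class with the same Handler argument and feeds it through parse_string(), both of which parsers derived from XML::SAX::Base provide. The ElementCounter class and the sample document are made up for illustration and are not taken from the actual test script; parser classes that are not installed are simply skipped.

```perl
use strict;
use warnings;

# Hypothetical handler for illustration: counts start_element events only.
package ElementCounter;
sub new           { my ($class) = @_; return bless { elements => 0 }, $class }
sub start_element { my ($self) = @_; $self->{elements}++; return }

package main;

# Tiny sample document (an assumption, not one of the test files).
my $xml = '<doc><item>one</item><item>two</item></doc>';

# The four parsers that the article says are treated identically.
my @parser_classes = qw(
    XML::SAX::Expat
    XML::SAX::ExpatXS
    XML::LibXML::SAX
    XML::LibXML::SAX::Parser
);

my %elements_seen;
for my $class (@parser_classes) {
    (my $file = "$class.pm") =~ s{::}{/}g;
    next unless eval { require $file; 1 };    # skip parsers not installed

    my $handler = ElementCounter->new;
    my $parser  = $class->new(Handler => $handler);    # same call for every parser
    $parser->parse_string($xml);
    $elements_seen{$class} = $handler->{elements};
}

printf "%-26s %d elements\n", $_, $elements_seen{$_}
    for sort keys %elements_seen;
```

Every parser that loads should report the same element count for the same input, since only the class name changes between iterations.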

Handlers are as simple as they can be; each callback function simply counts how many times it has been called. Each parser parses each document 10 times in succession; the parsing time is measured with the Time::HiRes module.
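The method described above can be sketched roughly as follows. This is not the original test script: the CountingHandler class, the time_runs() helper, and the hand-fired events standing in for a real parser call are all assumptions made to keep the example free of CPAN dependencies.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical handler: every callback just bumps a per-event counter.
package CountingHandler;

sub new { my ($class) = @_; return bless { count => {} }, $class }

# Generate one counting method per event name.
for my $event (qw(start_document end_document start_element end_element characters)) {
    no strict 'refs';
    *{"CountingHandler::$event"} = sub { $_[0]{count}{$event}++; return };
}

package main;

# Call $parse_once->() $runs times in a row; return elapsed wall-clock seconds.
sub time_runs {
    my ($parse_once, $runs) = @_;
    my $start = [gettimeofday];
    $parse_once->() for 1 .. $runs;
    return tv_interval($start);
}

my $handler = CountingHandler->new;

# Stand-in for a real parser call such as $parser->parse_uri('REC.xml'):
# it fires the events for <a>text</a> by hand.
my $parse_once = sub {
    $handler->start_document({});
    $handler->start_element({ Name => 'a' });
    $handler->characters({ Data => 'text' });
    $handler->end_element({ Name => 'a' });
    $handler->end_document({});
};

my $elapsed = time_runs($parse_once, 10);
printf "10 runs took %.6f s; characters() was called %d times\n",
    $elapsed, $handler->{count}{characters};
```

In a real run, the code reference would wrap the parse call of the parser under test, and the handler would be passed to that parser's constructor.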

The tests were run in the following environment: RedHat Linux 9 and Perl 5.8.0, on an ancient Pentium III/450 with 256 MB of memory. The versions of the C/C++ libraries involved are expat 1.95.7, libxml2 2.6.9, and Xerces C++ 2.5.0.

Results

The results are broken down by the markup density of the test documents. The density makes much more difference than the size of the documents. This is not really a surprise for streaming parsers. Even the DOM-based LXMLP keeps pace with the others as long as there is enough memory available. Figures 1, 2, and 3 graph the performance of the parser modules for medium, low, and high markup density. The values shown in the figures are proportional processing times; for each document, the fastest parser is set to 100%.
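The scaling to proportional times can be sketched as below. The timings in %seconds are invented for illustration and are not measurements from this test; proportional_times() is likewise a hypothetical helper.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical timings in seconds for one document; NOT measured values.
my %seconds = (PARS => 0.50, EXPXS => 0.95, EXP => 3.00);

# Scale the timings so that the fastest parser reads 100%.
sub proportional_times {
    my (%time) = @_;
    my $fastest = min(values %time);
    return map { ($_ => sprintf('%.0f', 100 * $time{$_} / $fastest)) } keys %time;
}

my %percent = proportional_times(%seconds);

printf "%-6s %4d%%\n", $_, $percent{$_}
    for sort { $percent{$a} <=> $percent{$b} } keys %percent;
```

A value of 190% therefore means that a parser needed 1.9 times as long as the fastest one for that document.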

Figure 1.
Performance Comparison for Medium Markup Density Files

PARS leads by a significant margin. The streaming XS extensions EXPXS, XERC, and LXML follow, some distance ahead of LXMLP and EXP. Most of this can be explained by the architectural approaches used. Event-based processing requires a lot of function calls, and these calls are expensive in Perl. One more function call per event most likely reduces the performance of EXP. LXMLP does pretty well, demonstrating that libxml2 builds and accesses the DOM really fast. The modules using XML::SAX::Base as a base class (EXPXS, EXP, LXML, LXMLP) also have the handicap of an additional Perl function call for each event. This is a common tradeoff between performance and compatibility. Since PARS and EXPXS have a comparable code base, most of the performance difference between the two parsers should be caused by object overhead and subclassing XML::SAX::Base.
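To get a rough feel for the cost of one extra Perl-level call per event, the following sketch times the same trivial callback invoked directly and through a forwarding subroutine. It is only an illustration under simplifying assumptions: the subroutine names and the event count are made up, no real parser or XML::SAX::Base dispatch is involved, and the timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $events = 200_000;    # arbitrary number of simulated events
my ($direct_hits, $forwarded_hits) = (0, 0);

# A handler callback invoked straight from the "parser".
sub on_event_direct { $direct_hits++; return }

# The same work behind one extra Perl-level call, roughly the position a
# handler is in when a dispatching layer sits between it and the parser.
sub on_event_target { $forwarded_hits++; return }
sub forward_event   { return on_event_target(@_) }

my $start = [gettimeofday];
on_event_direct({ Data => 'x' }) for 1 .. $events;
my $direct_time = tv_interval($start);

$start = [gettimeofday];
forward_event({ Data => 'x' }) for 1 .. $events;
my $forwarded_time = tv_interval($start);

printf "direct: %.3f s, forwarded: %.3f s\n", $direct_time, $forwarded_time;
```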

Figure 2.
Performance Comparison for Low Markup Density Files

PARS (and EXP, which builds on PARS) performs significantly worse for documents with higher proportions of text. The reason is simple: expat reports one character() event for each line and one more for each line break. Hence it generates many more events than other parsers. And again the "function calls are expensive" mantra applies; more callbacks mean less performance. The difference in the number of calls can be huge. For example, PARS and EXP generate 142,415 character events for chrbig.xml, while EXPXS and LXMLP need as few as 3,981 events (EXPXS is also expat-based, but it joins consecutive characters before entering the Perl space).
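The idea of joining consecutive character chunks can be illustrated with a small buffering filter. Note the assumption: EXPXS does its joining in C before any Perl code runs, so the pure-Perl CoalescingFilter below (a hypothetical class, as is Counter) only demonstrates the principle and would not deliver the same speedup.

```perl
use strict;
use warnings;

# Hypothetical filter: buffers consecutive characters() events and
# forwards them downstream as a single event.
package CoalescingFilter;

sub new {
    my ($class, %args) = @_;
    return bless { handler => $args{Handler}, buffer => '', pending => 0 }, $class;
}

sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};
    $self->{pending} = 1;
    return;
}

# Deliver buffered text, if any, before the next structural event.
sub _flush {
    my ($self) = @_;
    return unless $self->{pending};
    $self->{handler}->characters({ Data => $self->{buffer} });
    @{$self}{qw(buffer pending)} = ('', 0);
    return;
}

sub start_element { my ($self, $el) = @_; $self->_flush; $self->{handler}->start_element($el); return }
sub end_element   { my ($self, $el) = @_; $self->_flush; $self->{handler}->end_element($el);   return }

# Downstream handler that counts calls and collects the text it receives.
package Counter;
sub new           { return bless { characters => 0, text => '' }, shift }
sub characters    { my ($self, $c) = @_; $self->{characters}++; $self->{text} .= $c->{Data}; return }
sub start_element { return }
sub end_element   { return }

package main;

my $counter = Counter->new;
my $filter  = CoalescingFilter->new(Handler => $counter);

# Simulate expat-style reporting for two lines of text:
# one event per line and one per line break.
my @chunks = ('first line', "\n", 'second line', "\n");
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => $_ }) for @chunks;
$filter->end_element({ Name => 'p' });

printf "incoming characters events: %d, delivered: %d\n",
    scalar @chunks, $counter->{characters};
```

The downstream handler receives the same text in fewer calls, which is the effect the event counts above reflect.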

Figure 3.
Performance Comparison for High Markup Density Files

XERC appears to be somewhat faster for small and dense documents; it beats EXPXS in this category. My guess is that this is due to the initial overhead of XML::SAX::Base being proportionally more significant for small files. The real processing times (see details in Table 2) show that all of the modules taking part in this test are perfectly serviceable for parsing web-sized documents. Table 2 shows the overall results with links to the raw data produced by the test script.

Table 2.
Overall Results

	REC	med	chrmed	big	chrbig	gingerall	rss10	average
PARS	100%	100%	128%	100%	277%	100%	100%	129%
EXPXS	190%	188%	100%	190%	100%	209%	207%	169%
XERC	227%	230%	107%	221%	127%	199%	191%	186%
LXML	251%	240%	118%	241%	126%	223%	218%	202%
LXMLP	467%	478%	209%	461%	149%	564%	425%	393%
EXP	609%	604%	612%	627%	1187%	643%	582%	695%

The average proportional time in the last column has no universal relevance as it strongly depends on the selection of documents. This is simply a way to express how the parsers have performed in this test with a single number, but don't take it too seriously, please.
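For readers who want to check the last column, the averages can be recomputed from the per-document percentages in Table 2. The sketch below assumes the column is a plain arithmetic mean rounded to the nearest integer, which is consistent with the published figures; the variable names are illustrative.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Proportional times from Table 2, in column order:
# REC, med, chrmed, big, chrbig, gingerall, rss10.
my %table2 = (
    PARS  => [100, 100, 128, 100,  277, 100, 100],
    EXPXS => [190, 188, 100, 190,  100, 209, 207],
    XERC  => [227, 230, 107, 221,  127, 199, 191],
    LXML  => [251, 240, 118, 241,  126, 223, 218],
    LXMLP => [467, 478, 209, 461,  149, 564, 425],
    EXP   => [609, 604, 612, 627, 1187, 643, 582],
);

# Arithmetic mean per parser, rounded to the nearest integer.
my %average = map {
    my @row = @{ $table2{$_} };
    ($_ => sprintf('%.0f', sum(@row) / @row));
} keys %table2;

printf "%-6s %4d%%\n", $_, $average{$_}
    for sort { $average{$a} <=> $average{$b} } keys %average;
```

Because each document carries equal weight, a single outlier such as chrbig.xml shifts the average noticeably, which is one reason not to read too much into it.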

I would like to avoid making a final evaluation of the tested parsers. Anyone can draw their own conclusions from the facts above. All the modules perform well enough in common scenarios. Moreover, apart from pure performance, other aspects must be taken into account when choosing a Perl parser, such as compatibility, compliance, stability, and dependencies. Hopefully, the offerings are sufficient for most of us.