
Microformats in Context

April 26, 2006

Uche Ogbuji

There has been a lot of discussion in XML circles as to how far the extensibility revolution promised by XML can take (or has taken) us. Is XML really a tool for creating specialized languages so that information can be expressed in the most natural formats practical? Or is it just a way to reduce the burden on those who write code to consume web content (be strict in what you accept so that you can be liberal with your time spent fly-fishing)? Are schema technologies a way to manage the flexibility that XML brings to the table, or just another weapon to put down users ("You don't validate. Go away")? Of course, the way I've posed these questions reveals my bias. I think that XML should be a tool for expressiveness and controlled diversity on the Web. I disagree strongly with the notion, recently expressed in a few quarters, that there are only a few viable XML formats, and that people should stop creating more. At the center of this controversy is the new Web 2.0 hotness: microformats. If you're not already familiar with this phenomenon, first read "What Are Microformats".

It's a DIV's World

Microformats enshrine the idea that rather than creating whole new vocabularies, developers should piggy-back off existing, widely supported and deployed formats such as XHTML. (In this article I'll focus mostly on microformats with XHTML as a host language.) The problem is that XHTML, at its best, is good for basic document structure but, at its worst, tends to be used for the presentation of documents. Microformats are a lightweight way to express more specialized information within the structure of XHTML without changing its syntax. The idea is that the success of this approach rests on modest (hence "micro") constructs in modules that are mutually independent and focused on very specific domains. Through such simplicity and modularity microformats minimize the strain on the host languages, as well as the implementation effort and overall conceptual load.

Unfortunately, the strain is rarely avoided in practice. Many of the XHTML-based microformats I've seen abuse the semantics of XHTML. a/@rel tends to come in for special abuse. The HTML 4.01 recommendation, whose semantics are adopted by XHTML, says:

This attribute describes the relationship from the current document to the anchor specified by the href attribute. The value of this attribute is a space-separated list of link types.

A microformat such as Google's rel='nofollow' stretches this definition to the breaking point. "Don't follow this link" is an instruction to the user agent (more likely an automated agent such as a search index robot). This is related to what was known as "actuation" in the XLink specification, and it is a very different matter from the conceptual relationship between the two documents. I'll hasten to add that these problems are to some extent understood in the microformats camp, and that there are some quite reasonable uses of a/@rel in microformats, including rel-license and rel-tag. Then again there is rel-enclosure, which is still designated a draft but does perpetuate a/@rel abuse without any apology in the spec. The abuse of a/@rev in the vote-links microformat is an even more heinous example. Before you write off my complaints about abuse of existing XHTML constructs as too rarefied and academic, consider that it leads to a very real problem when microformats collide.

Will the Real rel Please Stand Up

There are only so many XHTML attributes to hitch a ride on, and if you can stretch the semantics of each attribute pretty much to suit yourself, it's inevitable that you will need to use clashing microformats. Imagine you have a weblog that automatically asserts rel='nofollow' on comment links to discourage comment spam. An example comment looks as follows.

<p>Nice blog.  Buy your medz <a href='http://medz.com' rel='nofollow'>here</a></p>

But you have another tool that looks for personnel links within your organization and marks them using a colleague designation in the XFN microformat.

<p>I just want to be sure your readers know we're aware of the stability
problems with the latest release.  I've posted some workarounds on
<a href='http://mf-wizards.com/~jdoe/' rel='colleague'>my own blog</a>.</p>

You now have some sorting out to do. Of course you cannot have two rel attributes on the same element. You could set a priority that XFN annotation overrides rel='nofollow' (this is probably what you'd want in practice), but this means that suddenly your microformats are no longer really independent, and they're certainly not modular. Microformat tools have to be aware of the different specs that might clash, and you introduce a bit of a negative network effect. You could use the NMTOKENS escape hatch, which would mean that after both tools have done their work the comment would look as follows:

<p>I just want to be sure your readers know we're aware of the stability problems with the latest release. I've posted some workarounds on <a href='http://mf-wizards.com/~jdoe/' rel='colleague nofollow'>my own blog</a>.</p>

One problem with this is that when you have a microformat such as XFN, which already allows multiple tokens within a/@rel, you're still inviting clashes because it's not clear which tokens are part of XFN, and which come from other conventions. It also becomes a land grab for terms across microformats. XFN defines rel='date' as a statement that you have a romantic involvement with the person represented by the resource indicated by the href. This could make for some stickiness in a microformat for references to calendar resources, where rel='date' would have a markedly different meaning.
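The token clash can be sketched in a few lines of Python. This is only an illustration under stated assumptions: XFN_TOKENS is a partial list of XFN values chosen for the example, and add_rel_token and classify_rel are hypothetical helper names, not part of any microformat tool.

```python
# Partial list of XFN rel values, for illustration only (an assumption,
# not the complete XFN vocabulary).
XFN_TOKENS = {"friend", "colleague", "co-worker", "met", "date", "sweetheart"}


def add_rel_token(existing, token):
    """Append a token to a space-separated rel value, skipping duplicates."""
    tokens = existing.split()
    if token not in tokens:
        tokens.append(token)
    return " ".join(tokens)


def classify_rel(rel_value):
    """Split rel tokens into those that look like XFN and everything else."""
    tokens = rel_value.split()
    xfn = [t for t in tokens if t in XFN_TOKENS]
    other = [t for t in tokens if t not in XFN_TOKENS]
    return xfn, other


# The second tool adds its token to what the first tool already wrote.
merged = add_rel_token("colleague", "nofollow")
print(merged)                # colleague nofollow
print(classify_rel(merged))  # (['colleague'], ['nofollow'])

# A calendar-oriented convention that also used rel='date' would be
# indistinguishable from XFN's romantic sense at this level.
print(classify_rel("date"))  # (['date'], [])
```

Note that the classifier can only work because it carries a copy of the XFN vocabulary, which is exactly the cross-spec awareness that undermines modularity.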

U. G. L. Y. You Ain't Got No Alibi...!

Another problem that stems from being restricted to a host language is that you often end up with very contorted and ugly constructs to force the fit. XOXO is an eminent example of this problem. I once did an exploration of XOXO as a language for exchanging weblog lists, rather than the more established, but quite awful, OPML. I ended up with something like Listing 1.

Listing 1: XOXO example of a weblogs list

<ol class="xoxo">
  <li>
    <p>Technology</p>
    <ol>
      <li>
        <ul>
          <li>
            <a href="http://weblog.foo" type="text/html">Weblog home</a>
            <a href="http://weblog.foo/atom" type="application/atom+xml">Web feed</a>
            <dl>
              <dt>description</dt>
              <dd>That good ole Weblog</dd>
            </dl>
          </li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

XHTML is not really designed for expressing lists of feeds, so XOXO ends up having to layer on the XHTML scaffolding rather thickly. The result is verbose and hard to read. I track a chamber of XML horrors I've found in my consulting, and one very common absurdity is what I call "markup indirection." Developers sometimes choose to ignore the basic extensibility of XML and design formats where the structure is completely generic, and all the markup essentially becomes content. The usual suspect is just a bloated translation of a CSV file.

<product>
  <property>
    <name>ID</name>
    <value>xyz123</value>
  </property>
</product>

rather than <product xml:id='xyz123'/>. The ultimate reduction of this absurdity is <element name="description">... rather than <description>.... Amazingly enough, XOXO goes one step worse than this joke in the pattern:

<dl>
  <dt>description</dt>
  <dd>My favorite Weblog</dd>
</dl>

The above cries out to be written instead as <description>My favorite Weblog</description>. Beyond the ugliness, another problem with markup indirection is that you're fighting against the design of XML and against general-purpose tools that are designed to look for the keys to structure in elements, and not squirreled away in content. Markup indirection also makes processing harder, and this is a common problem that I see with microformats. Eve Maler pointed out to me in a private discussion that this has been an endemic problem from the early days of SGML and it stems from a perception of false economy where people think fewer tags means less burden.
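To make the processing cost concrete, here is a minimal Python sketch using the standard library's ElementTree. The function names id_from_indirect and id_from_direct are hypothetical, introduced only for this comparison; the two documents are the product examples shown earlier.

```python
import xml.etree.ElementTree as ET

# The "markup indirection" form: the real key ("ID") hides in content.
INDIRECT = """<product>
  <property>
    <name>ID</name>
    <value>xyz123</value>
  </property>
</product>"""

# The direct form: the key is part of the markup itself.
DIRECT = "<product xml:id='xyz123'/>"

# The xml: prefix is bound to this namespace by the XML Namespaces spec.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def id_from_indirect(xml_text):
    """Scan every property and compare its name text to find the ID."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "ID":
            return prop.findtext("value")
    return None


def id_from_direct(xml_text):
    """Read the ID straight from the attribute."""
    return ET.fromstring(xml_text).get(f"{{{XML_NS}}}id")


print(id_from_indirect(INDIRECT))  # xyz123
print(id_from_direct(DIRECT))      # xyz123
```

The indirect form needs a loop and a string comparison on content, where the direct form is a single lookup that any general-purpose XML tool already understands.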

If, instead of XHTML, I start with XBEL, an XML vocabulary that is designed for expressing lists of links, I end up with a much more attractive result, shown in Listing 2.

Listing 2: Translation of Listing 1 (a weblogs list) to XBEL with extensions

<folder>
  <title>Technology</title>
  <bookmark href="http://weblog.foo">
    <title>Example Weblog</title>
    <info>
      <metadata owner="webfeeds">
        <link href="http://weblog.foo/atom" type="application/atom+xml"/>
        <description>That good ole Weblog</description>
      </metadata>
    </info>
  </bookmark>
</folder>

I can do even better if I create a specific vocabulary for weblogs, as in Listing 3.

Listing 3: Translation of Listing 1 (a weblogs list) to a custom XML format

<folder>
  <title>Technology</title>
  <weblog href="http://weblog.foo">
    <title>Example Weblog</title>
    <webfeed href="http://weblog.foo/atom" type="application/atom+xml"/>
    <description>That good ole Weblog</description>
  </weblog>
</folder>

The kicker, of course, is that since I'm using XML as it was intended to be used, it's an easy transform from Listing 3 or 2 to Listing 1, either to that explicit XHTML, using XSLT, or to the equivalent presentation using CSS. Almost all modern browsers support at least XML and CSS so this would be transparent to end users.
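As a rough illustration of how mechanical that transform is, the following Python sketch rebuilds a Listing 1 style XOXO structure from the Listing 3 format. It uses ElementTree as a stand-in for XSLT, which would be the more natural tool; the function name folder_to_xoxo is hypothetical, and the output only loosely follows Listing 1 (for instance, it uses the weblog title as link text).

```python
import xml.etree.ElementTree as ET

# The custom weblogs vocabulary from Listing 3.
WEBLOGS = """<folder>
  <title>Technology</title>
  <weblog href="http://weblog.foo">
    <title>Example Weblog</title>
    <webfeed href="http://weblog.foo/atom" type="application/atom+xml"/>
    <description>That good ole Weblog</description>
  </weblog>
</folder>"""


def folder_to_xoxo(folder):
    """Build an XOXO-style nested list from a <folder> element."""
    outer = ET.Element("ol", {"class": "xoxo"})
    item = ET.SubElement(outer, "li")
    ET.SubElement(item, "p").text = folder.findtext("title")
    entries = ET.SubElement(item, "ol")
    for weblog in folder.findall("weblog"):
        # Each weblog becomes ol/li/ul/li, mirroring the nesting in Listing 1.
        entry = ET.SubElement(ET.SubElement(ET.SubElement(entries, "li"), "ul"), "li")
        home = ET.SubElement(entry, "a", href=weblog.get("href", ""), type="text/html")
        home.text = weblog.findtext("title")
        feed = weblog.find("webfeed")
        if feed is not None:
            link = ET.SubElement(entry, "a", href=feed.get("href", ""),
                                 type=feed.get("type", ""))
            link.text = "Web feed"
        # The description is pushed back into the dl/dt/dd indirection.
        dl = ET.SubElement(entry, "dl")
        ET.SubElement(dl, "dt").text = "description"
        ET.SubElement(dl, "dd").text = weblog.findtext("description")
    return outer


xoxo = folder_to_xoxo(ET.fromstring(WEBLOGS))
print(ET.tostring(xoxo, encoding="unicode"))
```

Going the other way, from XOXO back to the specialized format, would require guessing which dt strings and link types carry meaning, which is the asymmetry at issue here.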

So, does pretty matter? Interestingly enough, the microformats community heavily overlaps the view-source philosophy, which holds that the best Web standards are simple and transparent so that a developer or content author can simply find a page to emulate, view the page source, and imitate the constructs. The paradox is that the limitations of microformats mean that beyond the simplest uses they tend to be ungainly, and thus fail the view-source test. Truly specialized formats, or at least proper extensions to existing formats are generally much easier to comprehend by casually inspecting the markup.

A Search for Meaning

One problem that the microformats technique doesn't address at all is auto-discovery of semantics. You learn the meaning of the conventions in a microformat by reading the format specification. There are no shortcuts. If you come across a pattern in a host format that looks suspiciously like a microformat, you have no way of knowing what the microformat is for, and what its rules are unless you do some sleuthing with the help of your favorite search engine and find the spec. Even once you find the spec you almost always get an informal description of the convention. You don't often get a schema, and you almost never get a schema structured enough to help automate processing of the format.

This is one limitation that I think is the right choice for microformats. Discovery and semantics are very hard problems, and microformats would never have got off the ground trying to solve them any more than XML would have trying to solve the problem of semantic transparency as well as syntactic transparency. Microformats are rooted strictly to the syntactic realm, and those who do need more formality and structure can build these on the basics. In this article, I am sticking as much as possible to syntactic considerations with respect to microformats, but some of these considerations are related to semantics and are informed by how semantics might be mixed into microformats.

The leading effort along these lines is Gleaning Resource Descriptions from Dialects of Languages (GRDDL). GRDDL is an initiative (undertaken mostly by W3C staffers) to bind microformats to RDF models. It's especially interesting because it hinges on a simple idea that my colleagues at Fourthought came up with four years ago and provided as a feature in the 4Suite server and repository (Eric van der Vlist was independently pursuing similar notions at about the same time). The idea is to use XSLT transforms to transform plain old XML to RDF/XML, thus creating a binding from syntax to formal semantic model. But the most important contribution by GRDDL is that of the profile, a convention for a host language that expresses URIs to assert which microformats are actually used in a document instance. The GRDDL profile for XHTML prescribes usage as in Listing 4.

Listing 4: An XHTML document that uses the XFN microformat and GRDDL

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Some Document</title>
    <link rel="transformation"
       href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />
    <link rel="transformation"
       href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokXFN.xsl" />
  </head>
  <body>
  ...
    <div class='blogroll'>
      <a href="http://chimezie.ogbuji.net/" rel="brother met">Chimezie</a>
    </div>
  ...
  </body>
</html>

The profile prescribes the profile="http://www.w3.org/2003/g/data-view" attribute on the head element so that GRDDL processors know that the document follows the convention. The profile also allows for a number of link elements with rel="transformation", each of which defines a transform from syntax manifested within the XHTML to RDF/XML to be parsed into a model. Listing 4 uses XFN and thus asserts a link to a relevant XSLT transform at http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokXFN.xsl. There is also a transform link to http://www.w3.org/2000/06/dc-extract/dc-extract.xsl, which is not related to any microformat, but rather to XHTML itself. It extracts from XHTML readily-accessible Dublin Core metadata, such as the document title (from the title element), description, creator or date (from corresponding meta elements). This underscores that GRDDL is more general than microformats. In fact, if you use extensions to XHTML rather than a microformat, you can use GRDDL just as well to make the extension, and to extract RDF therefrom.
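The discovery step a GRDDL-aware tool performs can be sketched as follows. This is a minimal Python illustration, not a GRDDL implementation: it only locates the transform links and does not fetch or run the XSLT. The function name grddl_transforms is hypothetical, and the document is a trimmed copy of Listing 4.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Trimmed version of Listing 4.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Some Document</title>
    <link rel="transformation"
       href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />
    <link rel="transformation"
       href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokXFN.xsl" />
  </head>
  <body/>
</html>"""


def grddl_transforms(xhtml_text):
    """Return the transform URIs a document advertises, or [] if it does
    not declare the GRDDL profile on its head element."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    return [
        link.get("href")
        for link in head.findall(f"{{{XHTML_NS}}}link")
        if "transformation" in link.get("rel", "").split() and link.get("href")
    ]


for uri in grddl_transforms(DOC):
    print(uri)
```

Without the profile attribute the function returns an empty list, which mirrors the point that the links mean nothing to a processor unless the document opts in to the convention.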

GRDDL imposes an additional burden on a microformat's specification, namely, an XSLT transform to RDF/XML. This is often not a problem since microformat authors are usually sophisticated, and in the worst case they can get a little help from someone else to write the transform. GRDDL also imposes an additional burden on a microformat's user: the profile attribute and transform links in the document heading. This is more problematic since most web authors hate to worry about such details. The idea of GRDDL profiles would help solve the discovery and semantic issues of microformats, although it would be nice to see other sorts of links, such as to the schema or even to the specification of a microformat, which GRDDL doesn't explicitly address at present. It remains to be seen whether web authors can bear the burden of profile information in document headers. Since asserting these links is such a straightforward matter of syntax, it is probably a case of whether GRDDL advocates can convince tool vendors to make the small necessary tweaks.

A More Radical Departure

GRDDL is designed to play nicely with host formats, microformats, whole-sale extensions and just about anything one can cook up in the syntax. RDF/A is a related initiative but represents a more radical departure from microformats. It actually predates microformats and GRDDL. It started out as an RDF syntax that would be more friendly to web authors because it is expressed in XHTML. It has recently changed its name, some of its focus, and has gained a good bit of steam indirectly from the microformats buzz. While you can think of GRDDL as a bridge from microformats to RDF, you can think of RDF/A as microformats done the RDF way in the first place. The rel-license microformat specifies that a link is specifically to the license for the source document.

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Some Document</title>
  </head>
  <body>
  ...
    <p>This document is licensed under a
      <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.5/">
        Creative Commons Non-Commercial License
      </a>.
    </p>
  ...
  </body>
</html>

It takes a fairly light change to turn this into RDF/A:

<html
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:cc="http://creativecommons.org/licenses/">
  <head>
    <title>Some Document</title>
  </head>
  <body>
  ...
    <p>This document is licensed under a
      <a rel="cc:license" href="http://creativecommons.org/licenses/by-nc/2.5/">
        Creative Commons Non-Commercial License
      </a>.
    </p>
  ...
  </body>
</html>

Rather than rel="license" it's rel="cc:license", with the prefix mapping to the added namespace declaration http://creativecommons.org/licenses/. This is another example of the problem-filled practice of QNames in content, but it's based on the RDF/XML legacy and is used to construct RDF predicate links much as such QName constructs are used in RDF/XML.

The qualification of the license relationship in this way provides for discovery and semantic precision. The namespace can be treated as a link and dereferenced to get more information about the usage, and this link relationship would not be confused with any other sort of relationship. The main syntactic problem that afflicts microformats also dogs RDF/A, however. By stretching RDF to fit an XHTML skeleton, the result can be quite ugly. If you care at all about XML design, or even about plain transparency, you should be prepared to do a lot of wincing while going through the examples in the RDF/A primer.
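How a qualified rel value becomes a dereferenceable predicate can be sketched in Python. This is an illustration only, not an RDF/A processor: the function name expand_rel_qnames is hypothetical, prefix scoping is simplified (later declarations simply overwrite earlier ones), and the predicate URI is formed by plain concatenation of namespace and local name, as RDF/XML does.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<html
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:cc="http://creativecommons.org/licenses/">
  <body>
    <p>This document is licensed under a
      <a rel="cc:license" href="http://creativecommons.org/licenses/by-nc/2.5/">
        Creative Commons Non-Commercial License
      </a>.
    </p>
  </body>
</html>"""


def expand_rel_qnames(xml_bytes):
    """Return (predicate URI, target URI) pairs for prefixed rel tokens."""
    prefixes = {}
    pairs = []
    # "start-ns" events report namespace declarations, which ElementTree
    # would otherwise discard; they arrive before the declaring element.
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns", "start"))
    for event, payload in events:
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
            continue
        for token in payload.get("rel", "").split():
            prefix, sep, local = token.partition(":")
            if sep and prefix in prefixes:
                pairs.append((prefixes[prefix] + local, payload.get("href")))
    return pairs


print(expand_rel_qnames(DOC))
```

The need to track prefix declarations separately from the element tree is a small taste of why QNames in content are troublesome: the meaning of the attribute value depends on context that many XML tools throw away.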

As I said, it was right for microformats to start by worrying primarily about syntax with semantics communicated informally. I do think that as microformats take off more, people will start to miss the sorts of help with interchange and transform that can come with more formalized semantics. Microformats look to codify small islands of relatively informal context, whereas GRDDL and RDF/A look to aggregate these islands into distributed models to form the basis of a Semantic Web. It would be nice to have a schema-driven intermediate to these ideas that would allow annotations of the meaning of microformats constructs (Schematron springs to mind as a very fruitful technology in this context), providing processing support if not aggregation, which could then be delegated to a separate RDF layer (perhaps through GRDDL).

Form Is Function

Just as I was wrapping up the first draft of this article, Norm Walsh wrote a weblog entry in which he provided some thought experiments on a means for validating microformats. He believes that "[the validation] problem has to be solved before microformats can be considered a reliable way to encode data." I agree and it's a very interesting read. It's especially interesting in the way it echoes some of my own points above. First of all, to make the document structure more accessible for validation, he wrote a transform to turn the tokens hidden in class attributes and such into the generic identifiers of the XML tags themselves. This is related to my point that microformats' reliance on structure hidden in attributes makes processing more difficult than XML should be. At least one commenter noted how much of an improvement the transformed content was, which echoes my points about readability. Norm also had to contend with cases of semantic clash between constructs in different microformats (in this case, even between two formats created by the same author). His article focuses on validation of syntax, rather than its expressiveness, as I do in this article.
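The general idea of promoting hidden tokens to element names can be sketched in Python. To be clear, this is not Norm Walsh's transform, which is not reproduced in this article; it is a hypothetical illustration that handles only the class attribute, takes the first token as the new name, and assumes that token is a valid XML name. The function name promote_class_to_tag is an assumption, and the input is the blogroll fragment from Listing 4.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div class='blogroll'>
  <a href="http://chimezie.ogbuji.net/" rel="brother met">Chimezie</a>
</div>"""


def promote_class_to_tag(root):
    """Rename each element to its first class token, in place.

    Remaining class tokens, if any, are kept on the element.
    """
    for node in root.iter():
        tokens = node.get("class", "").split()
        if not tokens:
            continue
        node.tag = tokens[0]
        if tokens[1:]:
            node.set("class", " ".join(tokens[1:]))
        else:
            del node.attrib["class"]
    return root


tree = promote_class_to_tag(ET.fromstring(FRAGMENT))
print(ET.tostring(tree, encoding="unicode"))
```

Once the structure lives in element names, ordinary schema languages can address it directly, which is what makes validation tractable.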

G. Ken Holman pointed out to me in private mail that the new standard ISO/IEC 19757-4 Namespace-based Validation Dispatching Language (NVDL) promotes the use of micro vocabularies (small, specialized XML formats). You can embed these in a host language and use NVDL to declare how validation is dispatched to different schemata based on namespace or other patterns. Ken has a good synopsis of this approach in this message on the UBL list.

It's too bad that microformats and RDF/A degenerate to such awful XML design in non-trivial use cases. Good XML design is not just a concern for purists. Readability and transparency matter, and they are fundamental goals of XML. XML support in browsers is just becoming respectable enough to use the technologies as they were meant to be used. There is really no practical reason why modules of specialized XML with associated modules of CSS could not be used with host languages. There is the problem that host languages are not always readily extensible; in the case of XHTML, to be technically correct you would have to go through the significant trouble of creating a DTD module that meets stringent standards. In practice, however, not much validation is done on the Web. If we could mix in profiles from GRDDL to support discovery, and beef the idea up so that one can express more types of links than transforms to RDF, there would be a solid bridge to semi-automated processing. Such a combination might be a real sweet spot where communities of practice can share modest and highly focused conventions while still propagating high-quality markup. It would require hardly anything in the way of new technology. It would just be a matter of top-notch salesmanship to the user community, something in which the microformats revolution has offered a great lesson.