In the last few posts I’ve discussed the use of XHTML5 for marking up books content. That application makes heavy use of custom class and other attributes on the generic HTML5 elements, in particular <article>, <section>, <div> and <span>. Whereas an XML-based markup scheme conveys the structural and/or semantic role of a portion of text through an appropriate element name, here that information is conveyed through combinations of element names, attribute names and attribute values.
Implicit in this idea of placing much more semantic weight on attributes than is conventional in XML applications is the question of how to assess data quality and integrity in the absence of a DTD or W3C Schema. Schematron has an important part to play, of course, but some other ideas occurred to me. The upshot is what I’m referring to as CheckMate: a still-prototype Windows application, with an associated scripting language, that evaluates XML (including XHTML) content against custom sets of rules. Rules consist of assertions relating to specific elements or element/attribute combinations. An example probably elucidates things most readily. Here, then, is how the script I use to check MUP books markup begins:
# MUP books rule for CheckMate #
# Version 1.0 #
* 1.0: ROOT_IS <html>
* 2.0: <html> HAS_CHILD <head>
* 2.1: <html> HAS_CHILD <body>
* 3.0: <head> HAS_ALL_CHILDREN_IN [<link>, <meta>, <title>]
* 3.1: <head> HAS_ALL_CHILDREN_IN [<meta@name('dc.title')>, <meta@name('dc.creator')>, <meta@name('dc.description')>, <meta@name('dc.subject')>, <meta@name('dc.source')>, <meta@name('dc.format')>, <meta@name('dc.publisher')>, <meta@name('dc.created')>, <meta@name('dc.type')>, <meta@name('dc.identifier')>, <meta@name('dc.language')>, <meta@name('dc.copyright')>, <meta@name('dc.rights')>, <meta@name('dc.rightsHolder')>]
* 3.2: <head> HAS_CHILDREN_ONLY_FROM [<link>, <meta>, <title>]
* 4.0: <body> HAS_CHILDREN_ONLY_FROM [<article@class('book')>]
* 4.1: <article@class('book')> HAS_ALL_CHILDREN_IN [<section@class('book-meta')>, <section@class('book-front')>, <section@class('book-body')>, <section@class('book-back')>]
* 5.0: <section@class('book-meta')> HAS_CHILDREN_ONLY_FROM [<section@class('book-series-info-sec')>, <section@class('book-title-page')>, <section@class('book-pub-rights')>]
The structure is clear enough, I think. Comment lines start with a hash (#); rules begin with an asterisk (*). Each rule has an arbitrary identifier. With a few exceptions, a rule takes the form of an element-type object, a predicate, and then either another element-type object or a list of such objects. Currently implemented predicates are:
Still to come are:
(plus a context-specification mechanism, i.e. element *in-context-X* PREDICATE [element|element_list])
(And I’m currently implementing regular expression matching for attribute values.)
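To make the rule syntax described above concrete, here is a minimal Python sketch of how such lines might be parsed. It is not CheckMate's own code: the names `Selector`, `Rule`, `parse_selector` and `parse_rule` are hypothetical, and the grammar is inferred only from the sample script (a hash for comments, an asterisk for rules, an identifier before the colon, an optional subject, an upper-case predicate, then one object or a bracketed list). It accepts straight or typographic quotes around attribute values.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Selector:
    """An element-type object: <name> or <name@attr('value')>."""
    element: str
    attr: Optional[str] = None
    value: Optional[str] = None


@dataclass
class Rule:
    rule_id: str
    subject: Optional[Selector]  # None for subject-less rules such as ROOT_IS
    predicate: str
    objects: List[Selector]


# <name> or <name@attr('value')>; straight or curly quotes are accepted.
SELECTOR_RE = re.compile(r"<(\w+)(?:@(\w+)\([‘'’]([^‘'’]*)[‘'’]\))?>")
# "* identifier: body" (the identifier is arbitrary, so anything up to the colon).
RULE_RE = re.compile(r"^\*\s*([^:\s]+)\s*:\s*(.*)$")
# Assumption: predicates are the only all-upper-case words in a rule.
PREDICATE_RE = re.compile(r"\b[A-Z][A-Z_]+\b")


def parse_selector(text: str) -> Selector:
    match = SELECTOR_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an element-type object: {text!r}")
    return Selector(*match.groups())


def parse_rule(line: str) -> Optional[Rule]:
    """Parse one script line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = RULE_RE.match(line)
    if match is None:
        raise ValueError(f"not a rule line: {line!r}")
    rule_id, body = match.groups()
    pred = PREDICATE_RE.search(body)
    if pred is None:
        raise ValueError(f"no predicate found in rule {rule_id}")
    before, after = body[:pred.start()], body[pred.end():]
    subject = parse_selector(before) if before.strip() else None
    # A single object and a bracketed list are handled the same way.
    objects = [Selector(*m.groups()) for m in SELECTOR_RE.finditer(after)]
    return Rule(rule_id, subject, pred.group(0), objects)
```

For example, `parse_rule("* 2.0: <html> HAS_CHILD <head>")` yields a rule with subject `<html>`, predicate `HAS_CHILD` and a single object `<head>`, while a comment line yields `None`.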
I take it that the predicate names are more-or-less self-explanatory. Already CheckMate is able to ensure that books content conforms to a tight sectional structure, with little ambiguity about what is or isn’t allowed within particular content sections. Rules are relatively compact, readily intelligible, and easily added to. Performance is excellent, since CheckMate works by first indexing the elements and attributes, and carrying out its operations against these indexes rather than against the content directly.
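The index-then-check approach can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than CheckMate's implementation: the function names are hypothetical, the predicate semantics are my reading of the predicate names, attribute values are compared for exact equality, a rule whose subject matches nothing passes vacuously, and the sample document omits the XHTML namespace for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A selector is a (tag, attribute, value) tuple; attribute and value may be None.


def build_index(root):
    """Walk the tree once and map each tag name to the elements carrying it."""
    by_tag = defaultdict(list)
    for elem in root.iter():
        by_tag[elem.tag].append(elem)
    return by_tag


def matches(elem, selector):
    tag, attr, value = selector
    if elem.tag != tag:
        return False
    # Assumption: exact match on the attribute value (no multi-token classes).
    return attr is None or elem.get(attr) == value


def select(index, selector):
    """Look up candidates by tag in the index, then filter on the attribute."""
    return [e for e in index.get(selector[0], []) if matches(e, selector)]


def has_child(index, subject, obj):
    """Every element matching subject has at least one child matching obj."""
    return all(any(matches(child, obj) for child in parent)
               for parent in select(index, subject))


def has_all_children_in(index, subject, objs):
    """Every listed object occurs as a child of every matching subject."""
    return all(has_child(index, subject, obj) for obj in objs)


def has_children_only_from(index, subject, objs):
    """Every child of every matching subject matches one of the listed objects."""
    return all(any(matches(child, obj) for obj in objs)
               for parent in select(index, subject)
               for child in parent)


# Hypothetical sample document, far smaller than real book markup.
DOC = """<html>
  <head><title>T</title><meta name="dc.title" content="T"/></head>
  <body><article class="book">
    <section class="book-meta"/><section class="book-body"/>
  </article></body>
</html>"""

index = build_index(ET.fromstring(DOC))
print(has_child(index, ("html", None, None), ("head", None, None)))
print(has_children_only_from(index, ("body", None, None),
                             [("article", "class", "book")]))
```

Because each check starts from a tag lookup in the index rather than a fresh traversal, the cost of a rule depends on how many elements share the subject's tag, not on the size of the whole document.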
Further work as described will make CheckMate an even more powerful tool for checking content marked up using an XHTML5-based scheme.