Epistemic Systems

[ cognition – information – knowledge – publishing – science – software ]

Introducing CheckMate


In the last few posts I’ve discussed the use of XHTML5 for marking up books content. That application makes heavy use of custom class and other attributes on the generic HTML5 elements, in particular <article>, <section>, <div> and <span>. Whereas an XML-based markup scheme conveys the structural and/or semantic role of a portion of text through an appropriate element name, here that information is conveyed through combinations of element names and attribute names and values.

Implicit in this idea of placing much more semantic weight on attributes than is conventional in XML applications is the question of how to assess data quality and integrity in the absence of a DTD or W3C XML Schema. Schematron has an important part to play, of course, but some other ideas occurred to me. The upshot is what I’m calling CheckMate: a prototype Windows application, with an associated scripting language, that evaluates XML (including XHTML) content against custom sets of rules. Rules consist of assertions about specific elements or element/attribute combinations. Some examples will probably explain things most readily. Here, then, is how the script I use to check MUP books markup begins:

# MUP books rule for CheckMate #
# Version 1.0 #
* 1.0: ROOT_IS <html>
* 2.0: <html> HAS_CHILD <head>
* 2.1: <html> HAS_CHILD <body>
* 3.0: <head> HAS_ALL_CHILDREN_IN [<link>, <meta>, <title>]
* 3.1: <head> HAS_ALL_CHILDREN_IN [<meta@name('dc.title')>, <meta@name('dc.creator')>, <meta@name('dc.description')>, <meta@name('dc.subject')>, <meta@name('dc.source')>, <meta@name('dc.format')>, <meta@name('dc.publisher')>, <meta@name('dc.created')>, <meta@name('dc.type')>, <meta@name('dc.identifier')>, <meta@name('dc.language')>, <meta@name('dc.copyright')>, <meta@name('dc.rights')>, <meta@name('dc.rightsHolder')>]
* 3.2: <head> HAS_CHILDREN_ONLY_FROM [<link>, <meta>, <title>]
* 4.0: <body> HAS_CHILDREN_ONLY_FROM [<article@class('book')>]
* 4.1: <article@class('book')> HAS_ALL_CHILDREN_IN [<section@class('book-meta')>, <section@class('book-front')>, <section@class('book-body')>, <section@class('book-back')>]
* 5.0: <section@class('book-meta')> HAS_CHILDREN_ONLY_FROM [<section@class('book-series-info-sec')>, <section@class('book-title-page')>, <section@class('book-pub-rights')>]

The structure is clear enough, I think. Comment lines start with a hash (#); rules begin with an asterisk (*). Each rule has an arbitrary identifier. With a few exceptions, a rule consists of an element-type object, a predicate, and then either another element-type object or a list of such objects. The currently implemented predicates are:

root_is
exists
has_parent
has_child
descends_from
has_attrib
has_all_children_in
has_children_only_from
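To make the rule shape concrete, here is a minimal Python sketch of how a line of the script might be split into identifier, subject, predicate and object list. It is not CheckMate's own parser (which is a Windows application whose internals aren't shown here); the names `Selector`, `Rule`, `parse_selector` and `parse_rule` are hypothetical, and the grammar is inferred only from the sample rules above.

```python
import re
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    """An element-type object: <tag> or <tag@attr('value')>."""
    tag: str
    attr: Optional[str] = None
    value: Optional[str] = None


class Rule(NamedTuple):
    rule_id: str
    subject: Optional[Selector]  # None for subject-less rules such as ROOT_IS
    predicate: str
    objects: list


# Assumed grammar, inferred from the sample script. Straight and curly
# single quotes are both accepted around the attribute value.
SELECTOR_RE = re.compile(
    r"<(?P<tag>[\w:-]+)(?:@(?P<attr>[\w:-]+)\([‘'](?P<value>[^’']*)[’']\))?>"
)
RULE_RE = re.compile(
    r"^\*\s*(?P<id>[\w.]+):\s*(?P<subject><[^>]*>)?\s*"
    r"(?P<pred>[A-Za-z_]+)\s*(?P<rest>.*)$"
)


def parse_selector(text: str) -> Selector:
    m = SELECTOR_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not an element-type object: {text!r}")
    return Selector(**m.groupdict())


def parse_rule(line: str) -> Optional[Rule]:
    """Return a Rule, or None for blank lines and # comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = RULE_RE.match(line)
    if m is None:
        raise ValueError(f"not a rule: {line!r}")
    subject = parse_selector(m["subject"]) if m["subject"] else None
    # A single object and a bracketed list are handled the same way:
    # collect every selector that appears after the predicate.
    objects = [Selector(**s.groupdict()) for s in SELECTOR_RE.finditer(m["rest"])]
    return Rule(m["id"], subject, m["pred"].upper(), objects)


if __name__ == "__main__":
    print(parse_rule("* 1.0: ROOT_IS <html>"))
    print(parse_rule("* 4.0: <body> HAS_CHILDREN_ONLY_FROM [<article@class('book')>]"))
```

Treating a lone object as a one-item list keeps the evaluation side simpler, at the cost of not distinguishing the two forms syntactically.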

Still to come are:

contains
excludes
follows
precedes
has_parent_in
descends_from_member_of
count
has_one_child_from
(plus a context-specification mechanism, i.e. element *in-context-X* PREDICATE [element|element_list])

(And I’m currently implementing regular expression matching for attribute values.)
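The regular expression matching mentioned above has no rule syntax yet, so the following Python sketch shows only the underlying check, using the standard library's `xml.etree.ElementTree` and `re` modules. The function name `attr_mismatches`, the sample markup and the pattern are all illustrative assumptions rather than anything taken from CheckMate.

```python
import re
import xml.etree.ElementTree as ET


def attr_mismatches(root, tag, attr, pattern):
    """Return the `tag` elements whose `attr` value is missing or does not
    match `pattern` in full."""
    compiled = re.compile(pattern)
    return [
        elem for elem in root.iter(tag)
        if compiled.fullmatch(elem.get(attr, "")) is None
    ]


# Hypothetical sample: one section class breaks the naming convention.
root = ET.fromstring(
    '<article class="book">'
    '<section class="book-meta"/>'
    '<section class="book-front"/>'
    '<section class="Book_Body"/>'
    '</article>'
)
bad = attr_mismatches(root, "section", "class", r"book-[a-z-]+")
print([elem.get("class") for elem in bad])  # ['Book_Body']
```

Using `fullmatch` rather than `search` means the whole attribute value has to conform, which seems the safer default for a validation tool.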

I take it that the predicate names are more or less self-explanatory. CheckMate can already ensure that books content conforms to a tight sectional structure, with little ambiguity about what is or isn’t allowed within particular content sections. Rules are relatively compact, readily intelligible, and easy to add to. Performance is excellent, because CheckMate first indexes the elements and attributes and then carries out its operations against these indexes rather than against the content directly.
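The index-first approach can be sketched in Python as follows. This is a guess at the general technique, not CheckMate's actual data structures: the function names are hypothetical, selectors are plain `(tag, attr, value)` tuples, the sample document omits the XHTML namespace for brevity, and the reading of each predicate ("every element matching the subject must satisfy it") is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(root):
    """Walk the tree once, filing each element under its tag and under each
    (tag, attribute, value) triple, and recording a child -> parent map."""
    by_tag, by_attr, parent_of = defaultdict(list), defaultdict(list), {}
    for elem in root.iter():
        by_tag[elem.tag].append(elem)
        for name, value in elem.attrib.items():
            by_attr[(elem.tag, name, value)].append(elem)
        for child in elem:
            parent_of[child] = elem
    return by_tag, by_attr, parent_of


def lookup(selector, indexes):
    """Fetch the elements matching a selector straight from the indexes."""
    by_tag, by_attr, _ = indexes
    tag, attr, value = selector
    if attr is None:
        return by_tag.get(tag, [])
    return by_attr.get((tag, attr, value), [])


def matches(elem, selector):
    tag, attr, value = selector
    return elem.tag == tag and (attr is None or elem.get(attr) == value)


def has_child(subject, obj, indexes):
    """True if every subject element has at least one child matching obj."""
    parent_of = indexes[2]
    parents = {parent_of[e] for e in lookup(obj, indexes) if e in parent_of}
    return all(s in parents for s in lookup(subject, indexes))


def has_children_only_from(subject, allowed, indexes):
    """True if every child of every subject element matches an allowed selector."""
    parent_of = indexes[2]
    return all(
        any(matches(child, sel) for sel in allowed)
        for child, parent in parent_of.items()
        if matches(parent, subject)
    )


DOC = """<html><head><title>T</title><meta name="dc.title" content="T"/></head>
<body><article class="book"><section class="book-meta"/></article></body></html>"""
indexes = build_indexes(ET.fromstring(DOC))
print(has_child(("html", None, None), ("head", None, None), indexes))  # True
print(has_children_only_from(("body", None, None), [("article", "class", "book")], indexes))  # True
```

Note that both predicates are vacuously true when no element matches the subject, so a real rule set would presumably pair them with an EXISTS check.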

Further work as described will make CheckMate an even more powerful tool for checking content marked up using an XHTML5-based scheme.


Written by Alex Powell

May 7, 2015 at 3:30 pm

Posted in Uncategorized
