Recently I was asked to give a talk about innovation in digital publishing at a major publishing and membership organization in the engineering and technology field. Or rather that was what I deemed the topic to be. What I was actually asked to address was a Steve Jobs quotation (which may have been a slight paraphrase) – “It’s not the customer’s job to know what they want”. The presentation is here.
I don’t think I actually said anything very, well, innovative, but the talk offered a nice chance to rediscover some of the ideas I encountered some years ago when I took a course on ‘Computers and Creativity’ under Professor Margaret Boden at the University of Sussex, as part of my master’s degree in AI and cognitive science.
The line I took in the talk was that innovation often requires that multiple developments and ideas in disparate areas be brought into contact with each other, and for a host of reasons – cognitive, economic, social, technical, organizational – that can be a hard thing to do. It requires deep knowledge, a sometimes serious level of creativity, and no little insight into market needs and customer motivations. The tablet computer is a case in point, depending as it does on the application of developments in display technology, battery technology, circuit integration, UI design, etc. Innovation in digital publishing, I suggest, is substantially similar, in that it requires us to keep abreast of, select from, and integrate and apply, developments in a host of different technology areas.
That sounds like a big challenge, and it is. But maybe there’s a crumb of comfort to be had in the thought that much – perhaps most – innovation is not radical and revolutionary but is instead incremental and evolutionary. In other words, this version might not be great, but find that something extra and the next one might be. (Just hope that in the meantime your competitors’ offerings don’t take things so far forwards that you can’t recapture the initiative!)
In the last few posts I’ve discussed the use of XHTML5 for marking up books content. In that application, heavy use is made of custom class and other attributes on the generic HTML5 elements, in particular <article>, <section>, <div> and <span>. Whereas in an XML-based markup scheme information about the structural and/or semantic role of portions of text is conveyed via appropriate element names, here that information is conferred via combinations of element names and attribute names and values.
Implicit in this idea, of placing much more semantic weight on attributes than is conventional in XML applications, is the issue of how to assess data quality and integrity absent a DTD or W3C Schema. Schematron has an important part to play, of course, but some other ideas occurred to me. The upshot is what I’m referring to as CheckMate, a still-prototype Windows application and associated scripting language that evaluates XML (including XHTML) content against custom sets of rules. Rules consist of assertions relating to specific elements or element/attribute combinations. Some examples will probably elucidate things most readily. Here then is how the script I use to check MUP books markup begins:
# MUP books rule for CheckMate #
# Version 1.0 #
* 1.0: ROOT_IS <html>
* 2.0: <html> HAS_CHILD <head>
* 2.1: <html> HAS_CHILD <body>
* 3.0: <head> HAS_ALL_CHILDREN_IN [<link>, <meta>, <title>]
* 3.1: <head> HAS_ALL_CHILDREN_IN [<meta@name('dc.title')>, <meta@name('dc.creator')>, <meta@name('dc.description')>, <meta@name('dc.subject')>, <meta@name('dc.source')>, <meta@name('dc.format')>, <meta@name('dc.publisher')>, <meta@name('dc.created')>, <meta@name('dc.type')>, <meta@name('dc.identifier')>, <meta@name('dc.language')>, <meta@name('dc.copyright')>, <meta@name('dc.rights')>, <meta@name('dc.rightsHolder')>]
* 3.2: <head> HAS_CHILDREN_ONLY_FROM [<link>, <meta>, <title>]
* 4.0: <body> HAS_CHILDREN_ONLY_FROM [<article@class('book')>]
* 4.1: <article@class('book')> HAS_ALL_CHILDREN_IN [<section@class('book-meta')>, <section@class('book-front')>, <section@class('book-body')>, <section@class('book-back')>]
* 5.0: <section@class('book-meta')> HAS_CHILDREN_ONLY_FROM [<section@class('book-series-info-sec')>, <section@class('book-title-page')>, <section@class('book-pub-rights')>]
The structure is clear enough I think. Comment lines start with a hash (#), rules begin with an asterisk (*). Each rule has an arbitrary identifier. Rules, with a few exceptions, take the form of element-type object, predicate, and then either another element-type object or a list of such objects. Currently implemented predicates are:
Still to come are:
(plus a context-specification mechanism, i.e. element *in-context-X* PREDICATE [element|element_list])
(And I’m currently implementing regular expression matching for attribute values.)
I take it that the predicate names are more-or-less self-explanatory. Already CheckMate is able to ensure that books content conforms to a tight sectional structure, with little ambiguity about what is or isn’t allowed within particular content sections. Rules are relatively compact, readily intelligible, and easily added to. Performance is excellent, since CheckMate works by first indexing the elements and attributes, and carrying out its operations against these indexes rather than against the content directly.
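The index-first approach can be illustrated with a minimal Python sketch. This is not CheckMate itself, which is a Windows application with its own scripting language. The function names (`build_index`, `matches`, `has_child`, `has_children_only_from`), the `(tag, attribute, value)` tuples standing in for objects such as `<article@class('book')>`, and the sample document are all assumptions made for the example. Attribute values are compared by exact equality, which is a simplification.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample document (no namespace, to keep tag names simple).
SAMPLE = """<html>
  <head>
    <link rel="stylesheet" href="book.css"/>
    <meta name="dc.title" content="Sample"/>
    <title>Sample</title>
  </head>
  <body>
    <article class="book">
      <section class="book-meta"/>
    </article>
  </body>
</html>"""


def build_index(root):
    """Walk the tree once and map each tag name to its elements."""
    index = defaultdict(list)
    for elem in root.iter():
        index[elem.tag].append(elem)
    return index


def matches(elem, spec):
    """spec is (tag, attr, value); attr and value are None for a bare tag."""
    tag, attr, value = spec
    if elem.tag != tag:
        return False
    return attr is None or elem.get(attr) == value


def has_child(index, parent, child):
    """True if every element matching `parent` has a child matching `child`.

    Vacuously true when no element matches `parent`.
    """
    parents = [e for e in index[parent[0]] if matches(e, parent)]
    return all(any(matches(c, child) for c in p) for p in parents)


def has_children_only_from(index, parent, allowed):
    """True if every child of each matching parent matches an allowed spec."""
    parents = [e for e in index[parent[0]] if matches(e, parent)]
    return all(any(matches(c, a) for a in allowed) for p in parents for c in p)


HTML, HEAD, BODY = ("html", None, None), ("head", None, None), ("body", None, None)
LINK, META, TITLE = ("link", None, None), ("meta", None, None), ("title", None, None)
BOOK = ("article", "class", "book")

root = ET.fromstring(SAMPLE)
index = build_index(root)

# Rule identifiers mirror those in the script above.
results = {
    "1.0": root.tag == "html",  # ROOT_IS <html>
    "2.0": has_child(index, HTML, HEAD),
    "2.1": has_child(index, HTML, BODY),
    "3.2": has_children_only_from(index, HEAD, [LINK, META, TITLE]),
    "4.0": has_children_only_from(index, BODY, [BOOK]),
}

for rule_id, ok in results.items():
    print(rule_id, "PASS" if ok else "FAIL")
```

Because the index is built in a single pass, each rule only has to look at the elements carrying the relevant tag name rather than re-walking the whole document.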
Further work as described will make CheckMate an even more powerful tool for checking content marked up using an XHTML5-based scheme.
As mentioned in the previous post, I’ve recently been involved in the development of a markup scheme for Manchester University Press’s books publishing programme. That programme is especially strong in the humanities and literature, with the latter including plays and verse.
Having decided that (X)HTML5 was the way to go, the question arose of how to mark up poetry. An obvious first thought is that stanzas (verses) amount to paragraphs, so it seems natural to think in terms of something like this:
<p>Verse 1 line 1<br />
Verse 1 line 2<br />
Verse 1 line 3<br />
Verse 1 line 4</p>
<p>Verse 2 line 1<br />
Verse 2 line 2<br />
Verse 2 line 3<br />
Verse 2 line 4</p>
Blog posts like this one seemed to support this line of thinking. However, further investigation gave pause for thought. The second reply to this question on Stack Overflow seemed especially relevant, for lines of poetry may be indented in essentially unlimited and arbitrary ways. To accommodate this inconvenient fact it is necessary to mark up each line as a block element in its own right.
In the end we marked up each line of verse as its own block element, obtaining the control we require by applying a CSS style attribute directly to each line that needs to be indented.
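To make the idea concrete, here is a small Python sketch that generates markup of this general kind. It is an illustration under stated assumptions, not the MUP specification: the choice of `<div class="stanza">` and `<p class="line">`, the class names, the function name `verse_to_xhtml`, and the use of `margin-left` in em units are all hypothetical.

```python
from xml.sax.saxutils import escape


def verse_to_xhtml(stanzas, indent_unit_em=1.0):
    """Render stanzas as XHTML with one block element per line of verse.

    Each stanza is a list of (indent_level, text) pairs. Only lines with a
    non-zero indent level receive an inline CSS style attribute.
    """
    out = []
    for stanza in stanzas:
        out.append('<div class="stanza">')
        for level, text in stanza:
            style = ""
            if level:
                # Hypothetical convention: indent expressed as a left margin.
                style = ' style="margin-left: {:g}em"'.format(level * indent_unit_em)
            out.append('  <p class="line"{}>{}</p>'.format(style, escape(text)))
        out.append("</div>")
    return "\n".join(out)


poem = [
    [(0, "Verse 1 line 1"), (2, "Verse 1 line 2"),
     (0, "Verse 1 line 3"), (4, "Verse 1 line 4")],
]
print(verse_to_xhtml(poem))
```

Since every line is its own element, any line can be given any indent without affecting its neighbours, which is what the `<br />` approach cannot offer.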
To my embarrassment I see that it’s been nearly three years (!) since I last posted on here, and an eventful time it’s been. In 2012 I returned to STM journal publishing, when I was privileged to learn how BioMed Central (BMC) has made the Open Access model work so well. So well, indeed, that parent company Springer Science+Business Media decided to absorb BMC into the company as a brand. (We shall have to wait and see what the merger of most of Macmillan with Springer will mean.) But for the past six months or so I’ve been working with Manchester University Press – the third largest such press in England after those at Oxford and Cambridge – on the development of a new books production workflow. We’re not quite there yet, but trials are going well. I figure this is something that could be of interest to quite a few people so here I’ll outline some of the thinking behind the new process.
Of course, XML is at the heart of things. XML standards and processes in STM journal publishing are by now quite mature – Elsevier’s journal production operations, for example, have been based on structured markup since even before the advent of XML in 1998 (if memory serves) – and the gotchas, if that’s the word, have by now generally been exposed and ironed out. The situation regarding books markup, however, is rather less settled. Markup options we considered included DocBook, TEI, NLM JATS (in its BITS variant for books), and HTML5. Central to our decision-making were the goals of expressiveness and extensibility. By expressiveness I mean the ability to capture the rich structural and semantic characteristics of content in a way that maximizes future downstream processing possibilities. (Who knows what functionality we might wish to deliver to content users in future?) By extensibility I mean the ability to develop the markup scheme as new requirements become apparent. And as if that wasn’t enough, we also sought relative simplicity, to make life as easy as possible for publishing staff as well as MUP’s suppliers.
In the end, after a lengthy content analysis phase, a certain amount of experimentation and some deliberation, we decided that HTML5, suitably XML-ified, offered the best combination of virtues relative to MUP’s needs. The (X)HTML5-based MUP markup specification is currently at the Version 1.1 stage, and trials with suppliers are under way to identify issues and areas where refinement is needed. Already, however, we have a markup language that handles well the sort of content for which MUP is noted, namely monographs in the humanities, and verse and drama editions.
Working with (X)HTML5 requires something of a gestalt switch if like me you are used to seeing markup, as regards the business of capturing overall content structure and semantics, primarily in terms of elements as opposed to attributes. HTML’s elements are of very broad potential applicability, and are more oriented to the capture of visual layout requirements than to the encoding of abstract structure and semantics. Achieving the latter means making full use of the class attribute (and to a lesser extent the id attribute), which in HTML5 can be applied to any element. The <article> and <section> elements that are new to HTML5, as well as attribute names prefixed by ‘data-’, provide additional expressive power.
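One practical consequence is that downstream tools read structure from attribute values rather than element names. The following Python sketch shows one way that might look. The sample fragment, the placeholder DOI, and the function name `outline` are invented for the example, and the namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample. A real XHTML5 file would declare a namespace, which would
# then form part of each tag name reported below.
SAMPLE = """<article class="book" id="X1" data-doi-book="10.0000/example">
  <section class="book-front">
    <section class="book-toc"/>
    <section class="book-preface"/>
  </section>
  <section class="book-body">
    <article class="chap" id="chap-1" data-chap-num="1"/>
  </section>
</article>"""


def outline(elem, depth=0):
    """Yield (depth, tag, class, data-attributes) for elem and its descendants."""
    data = {k: v for k, v in elem.attrib.items() if k.startswith("data-")}
    yield depth, elem.tag, elem.get("class"), data
    for child in elem:
        yield from outline(child, depth + 1)


rows = list(outline(ET.fromstring(SAMPLE)))
for depth, tag, cls, data in rows:
    print("  " * depth + "{}.{} {}".format(tag, cls, data or ""))
```

The element names alone (`article`, `section`) say very little here; it is the `class` and `data-` values that carry the book's structure.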
In subsequent posts I’ll describe some of the details of the MUP markup scheme, but to give a flavour here is how top-level book structures are handled:
<?xml version="1.0" encoding="UTF-8"?>
<html lang="en" xmlns="…">
  <head>
    <link rel="stylesheet" type="text/css" href="… [CSS stylesheet name] …" />
    <title>Smith and Jones | My First Sample MUP Book | Manchester Medieval Sources Series</title>
  </head>
  <body>
    <article class="book" id="[MUP project identifier]" data-doi-book="[Book DOI]">
      <section class="book-meta">
        <section class="book-series-info-sec"> … </section>
        <section class="book-title-page"> … </section>
        <section class="book-pub-rights"> … </section>
      </section>
      <section class="book-front" id="ABCD0000-front" data-doi-book="[Book DOI]">
        <section class="book-dedication"> … </section>
        <section class="book-pref-quotes"> … </section>
        <section class="book-toc"> … </section>
        <section class="book-inclusion-lists"> … </section>
        <section class="book-series-ed-preface"> … </section>
        <section class="book-preface"> … </section>
        <section class="book-contributors"> … </section>
        <section class="book-acks"> … </section>
        <section class="book-abbrevs"> … </section>
        <section class="book-permissions"> … </section>
        <section class="book-chronology"> … </section>
      </section>
      <section class="book-body">
        <article class="chap" id="chap-0" data-chap-num="-1" data-doi-chap="[Chapter-level DOI]">
          <section class="chap-body"> … </section>
          <section class="chap-footnotes"> … </section>
          <section class="chap-endnotes"> … </section>
        </article>
      </section>
      <section class="book-back" data-doi-back="[Book DOI]">
        <section class="book-glossary"> … </section>
        <section class="book-appendix"> … </section>
        <section class="book-bibliography"> … </section>
        <section class="book-index"> … </section>
      </section>
    </article>
  </body>
</html>
Defining an XML markup scheme that allows us to capture all the structural, stylistic and semantic distinctions we deem to be important in our documents is sometimes easier said than done. If only we could make simultaneous use of several distinct markup schemes within a document, or employ staggered elements, say. In the belief that simply lowering the validation bar by demanding that documents just be well-formed isn’t necessarily the answer when greater flexibility is required, I’ve been musing about the possibilities for more flexible markup languages. One result is Morf (‘more flexible markup language’).
Morf markup is very like XML markup, but some of XML’s constraints are dropped and new ones added in order to expand the space of permissible document structures. … [see technical note (PDF file)]