Recently I was asked to give a talk about innovation in digital publishing at a major publishing and membership organization in the engineering and technology field. Or rather, that was what I deemed the topic to be. What I was actually asked to address was a Steve Jobs quotation (which may have been a slight paraphrase) – “It’s not the customer’s job to know what they want”. The presentation is here.
I don’t think I actually said anything very, well, innovative, but the talk offered a nice chance to rediscover some of the ideas I encountered some years ago when I took a course on ‘Computers and Creativity’ under Professor Margaret Boden at the University of Sussex, as part of my master’s degree in AI and cognitive science.
The line I took in the talk was that innovation often requires that multiple developments and ideas in disparate areas be brought into contact with each other, and for a host of reasons – cognitive, economic, social, technical, organizational – that can be a hard thing to do. It requires deep knowledge, a sometimes serious level of creativity, and no little insight into market needs and customer motivations. The tablet computer is a case in point, depending as it does on the application of developments in display technology, battery technology, circuit integration, UI design, etc. Innovation in digital publishing, I suggest, is substantially similar, in that it requires us to keep abreast of, select from, and integrate and apply, developments in a host of different technology areas.
That sounds like a big challenge, and it is. But maybe there’s a crumb of comfort to be had in the thought that much – perhaps most – innovation is not radical and revolutionary but is instead incremental and evolutionary. In other words, this version might not be great, but find that something extra and the next one might be. (Just hope that in the meantime your competitors’ offerings don’t take things so far forwards that you can’t recapture the initiative!)
In the last few posts I’ve discussed the use of XHTML5 for marking up books content. In that application, heavy use is made of custom class and other attributes on the generic HTML5 elements, in particular <article>, <section>, <div> and <span>. Whereas in an XML-based markup scheme information about the structural and/or semantic role of portions of text is conveyed via appropriate element names, here that information is conferred via combinations of element names and attribute names and values.
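To make the contrast concrete, compare a conventional XML element with its XHTML5 equivalent. (The <front-matter> element below is invented purely for the comparison; the class value book-front is from the MUP scheme.)

```xml
<!-- Conventional XML: the element name itself carries the semantics -->
<front-matter> ... </front-matter>

<!-- XHTML5: a generic element plus a class attribute does the same job -->
<section class="book-front"> ... </section>
```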
Implicit in this idea, of placing much more semantic weight on attributes than is conventional in XML applications, is the issue of how to assess data quality and integrity absent a DTD or W3C Schema. Schematron has an important part to play, of course, but some other ideas occurred to me. The upshot is what I’m referring to as CheckMate, a prototype Windows application and associated scripting language that evaluates XML (including XHTML) content against custom sets of rules. Rules consist of assertions relating to specific elements or element/attribute combinations. Some examples will probably elucidate things most readily. Here, then, is how the script I use to check MUP books markup begins:
# MUP books rule for CheckMate #
# Version 1.0 #
* 1.0: ROOT_IS <html>
* 2.0: <html> HAS_CHILD <head>
* 2.1: <html> HAS_CHILD <body>
* 3.0: <head> HAS_ALL_CHILDREN_IN [<link>, <meta>, <title>]
* 3.1: <head> HAS_ALL_CHILDREN_IN [<meta@name('dc.title')>, <meta@name('dc.creator')>, <meta@name('dc.description')>, <meta@name('dc.subject')>, <meta@name('dc.source')>, <meta@name('dc.format')>, <meta@name('dc.publisher')>, <meta@name('dc.created')>, <meta@name('dc.type')>, <meta@name('dc.identifier')>, <meta@name('dc.language')>, <meta@name('dc.copyright')>, <meta@name('dc.rights')>, <meta@name('dc.rightsHolder')>]
* 3.2: <head> HAS_CHILDREN_ONLY_FROM [<link>, <meta>, <title>]
* 4.0: <body> HAS_CHILDREN_ONLY_FROM [<article@class('book')>]
* 4.1: <article@class('book')> HAS_ALL_CHILDREN_IN [<section@class('book-meta')>, <section@class('book-front')>, <section@class('book-body')>, <section@class('book-back')>]
* 5.0: <section@class('book-meta')> HAS_CHILDREN_ONLY_FROM [<section@class('book-series-info-sec')>, <section@class('book-title-page')>, <section@class('book-pub-rights')>]
The structure is clear enough, I think. Comment lines start with a hash (#); rules begin with an asterisk (*). Each rule has an arbitrary identifier. Rules, with a few exceptions, take the form of an element-type object, a predicate, and then either another element-type object or a list of such objects. The predicates implemented so far are the ones visible above: ROOT_IS, HAS_CHILD, HAS_ALL_CHILDREN_IN and HAS_CHILDREN_ONLY_FROM. Still to come are further predicates, plus a context-specification mechanism (i.e. element *in-context-X* PREDICATE [element|element_list]). I’m also currently implementing regular expression matching for attribute values.
I take it that the predicate names are more or less self-explanatory. Already CheckMate is able to ensure that books content conforms to a tight sectional structure, with little ambiguity about what is or isn’t allowed within particular content sections. Rules are relatively compact, readily intelligible and easily added to. Performance is excellent, since CheckMate works by first indexing the elements and attributes and then carrying out its operations against those indexes rather than against the content directly.
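To illustrate the index-then-check strategy, here is a minimal Python sketch (the names and details are my own, not CheckMate’s): index each element’s observed children once, then evaluate a HAS_CHILDREN_ONLY_FROM assertion as a simple set comparison against the index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_child_index(xml_text):
    """Map each element name to the set of child element names seen under it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        for child in parent:
            index[parent.tag].add(child.tag)
    return index

def has_children_only_from(index, element, allowed):
    """Assert that every child observed under `element` is drawn from `allowed`."""
    return index[element] <= set(allowed)

doc = "<html><head><title>t</title></head><body><p>hi</p></body></html>"
idx = build_child_index(doc)
print(has_children_only_from(idx, "html", ["head", "body"]))  # True
print(has_children_only_from(idx, "head", ["meta", "link"]))  # False: a <title> was seen
```

Real CheckMate rules also qualify objects by attribute (e.g. <section@class('book-meta')>), which would mean indexing (element, attribute-value) pairs rather than bare element names, but the principle is the same.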
Further work as described will make CheckMate an even more powerful tool for checking content marked up using an XHTML5-based scheme.
As mentioned in the previous post, I’ve recently been involved in the development of a markup scheme for Manchester University Press’s books publishing programme. That programme is especially strong in the humanities and literature, with the latter including plays and verse.
Having decided that (X)HTML5 was the way to go, the question arose of how to mark up poetry. An obvious first thought is that stanzas (verses) amount to paragraphs, so it seems natural to think in terms of something like this:
<p>Verse 1 line 1<br />
Verse 1 line 2<br />
Verse 1 line 3<br />
Verse 1 line 4</p>
<p>Verse 2 line 1<br />
Verse 2 line 2<br />
Verse 2 line 3<br />
Verse 2 line 4</p>
Blog posts like this one seemed to support this line of thinking. However, further investigation gave pause for thought. In particular, the second reply to this question on stackoverflow seemed particularly relevant, for lines of poetry may be indented in essentially unlimited and arbitrary ways. To accommodate this inconvenient fact it is necessary to mark up each line as a block element in its own right.
What we ended up with is shown in the example below: we obtain the control we require by applying a CSS style attribute directly to each line that needs to be indented.
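The original example isn’t reproduced here, but in outline the approach looks like this (the class names and indent values are illustrative, not the exact MUP markup):

```xml
<div class="verse">
  <p class="verse-line">Verse 1 line 1</p>
  <p class="verse-line" style="margin-left: 2em;">Verse 1 line 2</p>
  <p class="verse-line">Verse 1 line 3</p>
  <p class="verse-line" style="margin-left: 4em;">Verse 1 line 4</p>
</div>
```

Each line is a block element in its own right, so any line can carry its own indentation, however arbitrary the poet’s layout.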
To my embarrassment I see that it’s been nearly three years (!) since I last posted on here, and an eventful time it’s been. In 2012 I returned to STM journal publishing, when I was privileged to learn how BioMed Central (BMC) has made the Open Access model work so well. So well, indeed, that parent company Springer Science+Business Media decided to absorb BMC into the company as a brand. (We shall have to wait and see what the merger of most of Macmillan with Springer will mean.) But for the past six months or so I’ve been working with Manchester University Press – the third largest such press in England after those at Oxford and Cambridge – on the development of a new books production workflow. We’re not quite there yet, but trials are going well. I figure this is something that could be of interest to quite a few people so here I’ll outline some of the thinking behind the new process.
Of course, XML is at the heart of things. XML standards and processes in STM journal publishing are by now quite mature – Elsevier’s journal production operations, for example, have been based on structured markup since even before the advent of XML in 1998 (if memory serves) – and the gotchas, if that’s the word, have by now generally been exposed and ironed out. The situation regarding books markup, however, is rather less settled. Markup options we considered included DocBook, TEI, NLM JATS (in its BITS variant for books), and HTML5. Central to our decision-making were the goals of expressiveness and extensibility. By expressiveness I mean the ability to capture the rich structural and semantic characteristics of content in a way that maximizes future downstream processing possibilities. (Who knows what functionality we might wish to deliver to content users in future?) By extensibility I mean the ability to develop the markup scheme as new requirements become apparent. And as if that wasn’t enough, we also sought relative simplicity, to make life as easy as possible for publishing staff as well as MUP’s suppliers.
In the end, after a lengthy content analysis phase, a certain amount of experimentation and some deliberation, we decided that HTML5, suitably XML-ified, offered the best combination of virtues relative to MUP’s needs. The (X)HTML5-based MUP markup specification is currently at the Version 1.1 stage, and trials with suppliers are under way to identify issues and areas where refinement is needed. Already, however, we have a markup language that handles well the sort of content for which MUP is noted, namely monographs in the humanities, and verse and drama editions.
Working with (X)HTML5 requires something of a gestalt switch if, like me, you are used to seeing markup primarily in terms of elements rather than attributes where the capture of overall content structure and semantics is concerned. HTML’s elements are of very broad potential applicability, and are more oriented to the capture of visual layout requirements than to the encoding of abstract structure and semantics. Achieving the latter means making full use of the class attribute (and, to a lesser extent, the id attribute), which in HTML5 can be applied to any element. The <article> and <section> elements that are new to HTML5, as well as attribute names prefixed by ‘data-’, provide additional expressive power.
In subsequent posts I’ll describe some of the details of the MUP markup scheme, but to give a flavour here is how top-level book structures are handled:
<?xml version="1.0" encoding="UTF-8"?>
<html lang="en" xmlns="…">
<link rel="stylesheet" type="text/css" href="… [CSS stylesheet name] …" />
<title>Smith and Jones | My First Sample MUP Book | Manchester Medieval Sources Series</title>
<article class="book" id="[MUP project identifier]" data-doi-book="[Book DOI]">
<section class="book-series-info-sec"> … </section>
<section class="book-title-page"> … </section>
<section class="book-pub-rights"> … </section>
<section class="book-front" id="ABCD0000-front" data-doi-book="[Book DOI]">
<section class="book-dedication"> … </section>
<section class="book-pref-quotes"> … </section>
<section class="book-toc"> … </section>
<section class="book-inclusion-lists"> … </section>
<section class="book-series-ed-preface"> … </section>
<section class="book-preface"> … </section>
<section class="book-contributors"> … </section>
<section class="book-acks"> … </section>
<section class="book-abbrevs"> … </section>
<section class="book-permissions"> … </section>
<section class="book-chronology"> … </section>
<article class="chap" id="chap-0" data-chap-num="-1" data-doi-chap="[Chapter-level DOI]">
<section class="chap-body"> … </section>
<section class="chap-footnotes"> … </section>
<section class="chap-endnotes"> … </section>
<section class="book-back" data-doi-back="[Book DOI]">
<section class="book-glossary"> … </section>
<section class="book-appendix"> … </section>
<section class="book-bibliography"> … </section>
<section class="book-index"> … </section>
Defining an XML markup scheme that allows us to capture all the structural, stylistic and semantic distinctions we deem to be important in our documents is sometimes easier said than done. If only we could make simultaneous use of several distinct markup schemes within a document, or employ staggered elements, say. In the belief that simply lowering the validation bar by demanding that documents just be well-formed isn’t necessarily the answer when greater flexibility is required, I’ve been musing about the possibilities for more flexible markup languages. One result is Morf (‘more flexible markup language’).
Morf markup is very like XML markup, but some of XML’s constraints are dropped and new ones added in order to expand the space of permissible document structures. … [see technical note (PDF file)]
Last week a short interview with (Sir) Jon(athan) Ive, senior vice-president of industrial design at Apple, appeared in i, the Independent’s cut-price stablemate. As the force behind such products as the iPod, iPhone and iPad, he presumably knows a thing or two about successful technology-led innovation. Mark Prigg, the interviewer, asked what makes Apple design different. Ive answered by focusing on the design process, rather than on its products, and said this:
We struggle with the right words to describe the design process at Apple, but it is very much about designing and prototyping and making. When you separate those, I think the final result suffers.
Although Ive was talking largely about the creation of physical technology objects, his words have relevance for software developers too. Integration of processes is key, he says, yet the structures and roles defined in many organizations militate strongly against this. A common problem is what could be termed paralysis through analysis. For a certain sort of organizational mind, no problem is quite so alluring as devising a logical org chart, in which each business function is resolved into a set of processes, separate roles are defined to address every stage and functional element of each process, and those roles are then turned without further ado into jobs.
To those of this analytical cast of mind, jobs look well-defined when the key activities with which they are associated are relatively uniform. And indeed often this makes good sense: doubtless it would frequently be counterproductive to combine into a single job a variety of tasks that demand radically different skill sets or personal attributes. But take the approach too far and you end up with dull, narrow jobs that fail to embrace functionally substantive chunks of business process. One consequence is that to accomplish anything at all requires the effective operation of a communication network the size and complexity of which is out of all proportion to the nature of the fundamental business requirement. In addition there is the risk that job holders derive little satisfaction from their work. After all, they are not responsible for very much, and their ability to do their job depends as much – or more – on the performance of others as it does on the skilful exercise of their own abilities.
If you don’t feel responsible for something meaningful, your motivations will tend (unless you are one of those rare types able to remain serenely absorbed in a task, irrespective of how worthwhile the task seems to be) to be extrinsic – and often negative, in the sense that the principal justification for effort is liable to become the avoidance of the disciplinary measures likely to attend poor performance. Thinking about motivation brings me to something else Ive said in the interview. When asked ‘What makes a great designer?’, he replied
It is so important to be light on your feet: inquisitive and interested in being wrong. You have that fascination with the what-if questions, but you also need absolute focus and a keen insight into context and what is important – that is really terribly important. It’s about contradictions you have to navigate.
The flexible, deep engagement Ive describes surely requires an at least partially intrinsic sense of responsibility for an outcome, of ownership of the problem of realizing that outcome. (‘I want to do this, because it seems important to me.’) Ive talks about the need to be both inquisitive and optimistic, and those qualities are quite demanding in terms of the conditions they require. You must have confidence in both your own abilities and those of the organization you belong to. The harder the problem, or the more initially ill-defined the solution, the more you need the support of your organization. You need to know that your efforts will not be in vain, and that support will be available when you need it.
Probably anyone who has worked in software development has encountered situations in which innovation has been difficult if not actually impossible. These situations may be such that they fail to provide the conditions under which individuals can engage in depth with a problem, and/or they frustrate the collaboration needed to ensure that individual efforts sum positively. Commonly the problem is not lack of absolute quantity of resource, but rather lack of resource of the right type at the right times. Indeed, many failures in software development probably stem from having too many people involved with a project, working in roles that are too narrowly and rigidly defined or that are defined in a way that is somehow orthogonal to the natural grain of the business or the project.
The price of functional atomization of project resources is, more often than not, a very high communication overhead. People will spend more time in meetings trying to negotiate requirements, or writing documents, memos and emails as inputs to or outputs from such meetings, than they will in actually trying ideas out and turning them into working solutions. This is where we should heed Jon Ive’s cautionary remarks about the separation of activities. Integrate roles. Combine activities. Focus on user needs, the problem space and the functional area, not on the technology set. Select your staff well, then trust and empower them. Avoid premature commitment to a specific solution; make sure you’re exploring the solution space as fully as possible. Keep things fluid. And don’t over-decompose a project, thinking that with more staff it will get done quicker. (It will just get harder for everyone to see what they’re driving at, or to change direction mid-way.) This is where agile as it’s sometimes practised doesn’t seem to get it quite right. Sometimes there’s a benefit to providing longer periods of time for problem wrestling and solution wrangling, a chance for multiple rethinks and fresh starts of which no one need ever be aware apart from the individual or team involved. Sometimes the best solutions are the ones that look least like what you first thought you needed.
Much has been written of late about the status of peer review in STM publishing [1, 2, 3, 4], and publishers have begun tentatively exploring alternatives to the traditional peer review model [5, 6]. Rather than just review this work I thought it would be interesting to take seriously what people like Jan Velterop have been saying and consider more speculatively how things might look in a hypothetical future from which pre-publication peer review is largely absent. If this is the future that awaits us, I suspect that it will be at least a decade or two before it comes to pass. By then, I assume, the majority of content will be available on an open access basis. How then will things stand with the scientific research article, individually and in aggregate?
In the medium-term scenario that I shall address shortly, scientific communication will undoubtedly have evolved, but the commonalities with today’s publishing model will remain palpable. I say this by way of justification for not considering in any depth certain more esoteric possibilities, e.g. for recording (in real time) every second of activity involving a particular piece of lab equipment or all the work carried out by a particular researcher (headcams anyone?), or even for directly recording a researcher’s brain activity. (How much researcher hostility would there be to such ‘spy-in-the-lab/head’-type scenarios?) One day perhaps reports of specific lines of research activity will indeed centre on a continuous, or a linked collection of, semantically structured, digitized time series of data recording experimental and cognitive activities. But for now I’ll stick with the idea that the definitive research record is the scientific article qua post hoc rational reconstruction of the processes to which these conceivable but as yet largely non-existent, relatively unmediated datastreams would relate. (One can imagine easily enough how today’s article-centric model could give way by degrees to an activity recording model. It would begin with datasets being attached to articles in ever-greater quantities, and with increasing connectedness between datasets, until a tipping point was reached where the (largely non-textual) datasets in some sense outweighed the text itself. This would stimulate a restructuring such that the datasets now formed the spine of the scientific record, with chunks of elucidatory and reflective text forming appendages to that spine. Maybe this would be the prelude to robots submitting research reports!)
OK, anyway, human-written articles remain for now. So what might change? For a start, I think that automated textual analysis and semantic processing will play a far bigger role than at present. I shall assume that on submission, as a matter of course, an author’s words will be automatically and extensively dissected, analysed, classified and compared with a variety of other content and corpora. One result will be the demise of the journal, a possibility that I discussed in a previous post – or possibly the proliferation of the journal, albeit as a virtual, personalizable, object defined by search criteria. What we need for this, whatever the technical basis for its realization, is a single global scientific article corpus (let’s call it the GSC), with unified, standard mechanisms for addressing its contents.
Suppose I’ve written an article, which I upload/post for publication. (To where? A topic for a later post, perhaps.) Immediately it would be automatically scanned to identify key words and phrases. These are the words and combinations of words that occur more often in what I’ve written than would be expected given their frequency in some corpus sufficiently large to be representative of all scientific disciplines – perhaps in the GSC. (To some extent that’s just a summary of some experiments I did in the late 1990s, and I don’t think they were especially original then. Autonomy launched around the same time and they were doing far fancier statistical stuff.) This is The Future we’re talking about, of course, and in The Future, as well as the GSC and associated term frequency database (TFDB) there will be a term associations and relations database (TARDB). (Google is working on something like this, it seems.) Maybe they could be combined into a single database of term frequency and relational data, but for now I’ll refer to two functionally distinct databases. Term-related data will need to be dynamically updated to reflect the growing contents of the GSC. Once an article’s key terms have been derived, it should be possible to classify it using the TARDB. Note that this won’t define rigid, exclusively hierarchical term relationships; we know the problems they can occasion. (For example, when we try to set up a system of folders and sub-folders for storing browser bookmarks. Should we add this post under ‘computing’, ‘publishing’, ‘web’ or ‘journal’? What are the ‘right’ hierarchical relationships between those categories?) Instead we’ll need to capture semantic relations between terms by way of a weighted system of tags, say, or weighted links. (Yes, maybe something like semantic nets.)
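As a toy illustration of the key-term idea (my own sketch, not a description of any existing system): score each term by the ratio of its in-document rate to a smoothed background rate drawn from the corpus frequency data, so that terms over-represented relative to the corpus float to the top.

```python
from collections import Counter

def key_terms(doc_words, corpus_freqs, corpus_size, top_n=5):
    """Rank terms by how far their in-document rate exceeds the corpus rate."""
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for term, count in counts.items():
        doc_rate = count / total
        # Add-one smoothing so terms absent from the corpus don't divide by zero
        corpus_rate = (corpus_freqs.get(term, 0) + 1) / (corpus_size + 1)
        scores[term] = doc_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

doc = "protein folding of the membrane protein in the cell".split()
corpus = {"the": 60000, "of": 30000, "in": 25000, "cell": 200,
          "protein": 150, "folding": 40, "membrane": 80}
print(key_terms(doc, corpus, corpus_size=1_000_000, top_n=3))
# -> ['folding', 'protein', 'membrane']: common function words score low
```

A production system would use more principled statistics (log-likelihood ratios, say) and phrase detection, but the shape of the computation is the same.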
In this way we’ll probably be able to classify content down at the paragraph level, so to some extent we’ll be able automatically to figure out semantic boundaries between different parts of an article. (This part’s about protein folding; this one’s about catalysis; etc.)
Natural language processing (NLP) is developing apace, of course, so extracting key words and phrases on the basis of statistical measures, and using those as a basis for classification and clustering, is probably just the lower limit of what we should be taking for granted. Once we start associating individual words with specific grammatical roles we can do much more – although I suspect that statistics-based approaches will retain great appeal precisely because they ride roughshod over grammatical details and assign chunks of content to broad processing categories. The software complexity required to handle all categories is thus correspondingly relatively low. (So say my highly fallible intuitions!)
Anyway, when we can easily establish roughly what an article is about, why do we need the journal? If a user wants to find all the articles published recently to do with protein folding, no problem. Likewise if they’d rather survey the field of protein engineering and enzyme catalysis, say. What’s that you say? Quality? Oh I see, you want just the good articles. Well, I suppose it was hoping for too much: we’re going to have to think, after all, about how we deal with the issue of quality, post peer review. That’s tricky, and I don’t have all the answers. But it’s interesting to think about what we might be able to do to make post-publication peer evaluation work as a reliable article quality assessment mechanism.
We need a system that encourages ‘users’, i.e. readers, to provide useful feedback on published articles. (Something I should have said earlier too: I’m assuming that all articles that survive certain minimal filters to remove spam and abusive submissions do get published. Those filters might involve automated scanning related to, or as part of, the key terms identification processes outlined above, or human eyeballs may need to be involved.) Does user feedback need to be substantive, or would a simple system based on numerical ratings suffice? The publishing traditionalist in me says the former, but what if I had the reassurance of knowing that the overall rating an article gained reflected the views of respected authorities on the subject addressed by the article? Elsewhere  I briefly outlined (in the comments) a simple scheme based on different user categories, with the ratings of users being weighted according to their category. (I previously assumed the continued existence of journals and editorial boards. Here I want to assume that in The Future we have done away with those.) The ratings of users who were themselves highly rated authors would count for more than those of non-authors. Given what I have just said about article analysis and classification, we can see that it would not be too difficult to compute an author’s areas of expertise. Perhaps an author’s ratings should be highly weighted, relative to those of non-authors, only in relation to articles related to their fields of expertise, or in proportion to the relatedness of their expertise to the topic of the article in question. (We’d need to devise suitable measures of disciplinary relatedness, perhaps based on citation overlaps as well as article term relationships and associations as represented in the TARDB.) To be really useful a hybrid review mechanism would probably be needed, combining simple numerical ratings with provision for making substantive comments. 
The former would enable users to select articles meeting specific assessment criteria, e.g. find me all articles rated 60% or more on average. If users were categorized according to their ‘rating worth’ as discussed above – with the ratings of well-rated authors being weighted most highly – then users could search just for articles rated highly by well-rated authors. (Raters could of course move up or down the scale according to their publishing history.)
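In its simplest form, the weighted rating described above might look like this (illustrative only; the real design question is how the weights themselves are derived from a rater’s authorship record and disciplinary relatedness):

```python
def weighted_article_rating(ratings):
    """
    ratings: list of (score, weight) pairs, where the weight reflects the
    rater's standing -- e.g. a well-rated author working in a related field
    gets a higher weight than a non-author.
    Returns the weighted mean score, or None if there are no ratings.
    """
    total_weight = sum(w for _, w in ratings)
    if total_weight == 0:
        return None
    return sum(s * w for s, w in ratings) / total_weight

# Two highly weighted expert ratings and one low-weight dissenting rating:
# the experts dominate the overall score.
print(weighted_article_rating([(80, 3.0), (70, 3.0), (20, 0.5)]))
```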
A rating system like this would depend on the existence of a trustworthy user identification and authentication system, even if the implied requirement for users to log in would be in tension with the aim of encouraging readers to rate articles. Anonymity is another potential problem area, in that article raters will doubtless sometimes be keen for their ratings to remain anonymous. There is no reason why (with a little ingenuity) it would not be possible to ensure anonymity by default, with the possibility of voluntary anonymity-lifting if desired by all participants in the rating process. (It would be important to translate a rater’s user account identifier into a visible public identifier that was unique to a particular rated article. Otherwise an author who was made aware of the rater’s identity would be able to recognize that rater’s identity when it arose in the context of other articles and communicate it to others.) The area of author–rater relations might in fact be one where a role would exist for a trusted third party who was aware of the identities of article raters. This would enable them to assure authors of a rater’s credentials, should authors deem a particular rating to be suspect or malign in some way, while not revealing the rater’s identity if the rater did not agree to it.
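One way to derive such a per-article public identifier is a keyed hash held by the trusted party (a sketch; the key name is hypothetical and key management is glossed over entirely). The same rater gets a stable pseudonym within one article’s rating thread, but unrelated pseudonyms across articles, so identities cannot be correlated.

```python
import hashlib
import hmac

SECRET_KEY = b"held-by-the-trusted-third-party"  # hypothetical secret

def public_rater_id(rater_account_id: str, article_id: str) -> str:
    """Per-article pseudonym: stable for one article, uncorrelatable across articles."""
    msg = f"{rater_account_id}:{article_id}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:12]

# Same rater, two different articles -> two unrelated public identifiers
print(public_rater_id("rater42", "article-A"))
print(public_rater_id("rater42", "article-B"))
```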
Another issue is how to make it hard to ‘game’ the system. The obvious risk is that a user sets up an account from which to make derogatory assessments of rivals’ work that is distinct from the account they use when submitting their own work. However, the rating weighting scheme can help here, inasmuch as ratings count for more when they come from well-regarded authors. It will thus be important to ensure that authors – well-rated ones especially – are encouraged to rate the work of others, since a malign rater’s views will count for little when they come from a user account not associated with authorship, in comparison with those of a rater who is also a well-regarded author. Assuming an OA model in which authors must meet modest up-front publishing costs (mitigated by knowing that one’s submission is guaranteed to be accepted), it may be that inducements to rate can be offered, in terms of reduced article-processing charges for authors who agree to rate a certain number of articles, say. Not that one would want to discourage constructive critics who happen not to be authors themselves. Perhaps one could allow and encourage authors – again, with a weighting proportional to the quality of the ratings given to their work by others – to rate the comments made by other users. Overall I am optimistic that it is not beyond the wit of man to devise a system that establishes a virtuous dynamic around authorship and criticism, based on a system of article ratings that also allows for substantive comment.
What else might be possible or desirable? One area where it may be possible to use automated intelligence and the resources of the GSC/TFDB/TARDB to effect improvements on existing models and mechanisms falls under the broad heading of assessment of novelty/priority and the discernment of relationships with existing work. Recently I heard of a researcher who came across a newly published article, the bibliography of which listed a number of publications that the researcher had cited in their own earlier work, which had been available online for several years. To their knowledge the works they had cited had not previously been cited by others in the field. The new article did not cite the researcher’s work. It is impossible to know whether the author of the new article was aware of the researcher’s work, and was informed by it at least to the extent that they saw fit to seek out many of the same references. But once possibilities for automated content analysis are exploited more fully, it may become less relevant whether an author is scrupulous in their citation of related work. Publishing systems will be able simply to present links to all the content in the GSC that represent a semantic match with a particular article, and will be able to indicate the order of publication. (When done well this list of related material would amount to something like a supplementary bibliography.) I mentioned citation overlap earlier, in relation to assessing the relatedness of different research areas. But, as the example above indicates, citation overlap might also represent an additional dimension for the automatic estimation of one aspect of originality.
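As a starting point, the citation overlap between two articles could be measured with something as simple as a Jaccard coefficient over their reference lists (a toy sketch; the reference identifiers are invented). A high overlap between a new article and earlier uncited work, as in the anecdote above, is exactly the kind of signal an automated system could surface.

```python
def citation_overlap(refs_a, refs_b):
    """Jaccard overlap of two reference lists: 0 = disjoint, 1 = identical."""
    a, b = set(refs_a), set(refs_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

earlier = ["smith2001", "jones2003", "lee2005", "patel2007"]
later   = ["smith2001", "jones2003", "lee2005", "garcia2010"]
print(citation_overlap(earlier, later))  # 3 shared of 5 distinct references -> 0.6
```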