
Tuesday, July 06, 2010

ZooKeys publishes articles of the future

The open access taxonomic journal ZooKeys has published a special issue with four papers, each available in HTML, PDF, and XML, the latter being extensively marked up. Penev et al. ("Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples", doi:10.3897/zookeys.50.538) describe the process involved in creating these XML files. Two papers (doi:10.3897/zookeys.50.506 and doi:10.3897/zookeys.50.505) were created using authoring tools available in Scratchpads, as outlined by Blagoderov et al. ("Streamlining taxonomic publication: a working example with Scratchpads and ZooKeys", doi:10.3897/zookeys.50.539). When you view the HTML for these articles you can toggle the highlighting of citations, taxonomic names, and geographic co-ordinates on or off. Mousing over a taxonomic name, for example, brings up a popup with links to GBIF, NCBI, EOL, BHL, Wikipedia, etc.:
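To get a sense of how much structure is in these files, here's a minimal sketch (Python and lxml) of pulling the tagged taxonomic names out of one of the TaxPub XML files. The element name "tp:taxon-name" and the namespace URI are assumptions on my part -- check them against the actual TaxPub schema before relying on them.

    # Sketch: extract tagged taxonomic names from a TaxPub/NLM article XML file.
    # Element name and namespace URI are assumptions, not verified against the DTD.
    from lxml import etree

    doc = etree.parse("zookeys-50-505.xml")        # hypothetical local copy of the XML
    ns = {"tp": "http://www.plazi.org/taxpub"}     # assumed TaxPub namespace URI

    for name in doc.findall(".//tp:taxon-name", namespaces=ns):
        # itertext() flattens the genus/species name-part child elements
        print("".join(name.itertext()).strip())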

[Screenshot: popup for a taxonomic name in the Brake & von Tschirnhaus article, with links to GBIF, NCBI, EOL, BHL, and Wikipedia]

I think these papers represent one view of the future of scientific publishing ("article 2.0"), and I'm flattered that Penev et al. cite my Elsevier challenge work (doi:10.1016/j.websem.2010.03.004, preprint at hdl:10101/npre.2009.3173.1) as one of the sources of inspiration (along with the landmark Shotton et al. "Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article" doi:10.1371/journal.pcbi.1000361, which I've discussed previously). It is also good to see the TaxPub XML schema used by a publisher, and Scratchpads being a part of the process of publishing taxonomic information.

Deep linking

My initial impression is that there is huge potential here, although there is still lots to do. I'm not totally convinced that popups are the way to go (although I've dabbled with them as well), and we need to move beyond simply linking to other sites to a deeper form of integration. A ZooKeys article might link to BHL via a taxonomic name, but how about deeper linking? For example, the paper by Brake and von Tschirnhaus (doi:10.3897/zookeys.50.505) contains the following citations:

Biró L (1899) Commensalismus bei Fliegen. Természetrajzi füzetek 22: 198–204.

Kertész K (1899) Verzeichnis einiger, von L. Biró in Neu-Guinea und am Malayischen Archipel gesammelten Dipteren. Természetrajzi füzetek 22: 173–19

Neither reference has any links in the HTML, so the user is under the impression that they aren't available online, but both references have been scanned by BHL. You can see full text for these articles in BioStor (references 52005 and 52004, respectively -- note that the pagination for Biró 1899 is given incorrectly in the paper). This is one area where BHL has a lot to offer publishers, and it would be great to see BHL provide the services publishers need to add these links to their articles.
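One way this could work is an automated check against BioStor at publication time: resolve each literature citation via OpenURL and, where there's a hit, add the link. BioStor does have an OpenURL resolver, but the endpoint, parameters, and response format below are assumptions -- check the current API before using this.

    # Sketch: look up the Kertész 1899 citation in BioStor via OpenURL.
    # Endpoint, parameters, and "format=json" are assumptions about the API.
    import requests

    params = {
        "genre": "article",
        "title": "Természetrajzi Füzetek",   # journal title
        "volume": "22",
        "spage": "173",
        "date": "1899",
        "format": "json",                    # assumed way to ask for machine-readable output
    }
    r = requests.get("http://biostor.org/openurl", params=params, timeout=30)
    print(r.status_code, r.url)
    # A match would give the publisher a BioStor reference id (e.g. 52004 for the
    # Kertész paper) that could be turned into a link in the article HTML.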

This integration should go both ways. It's odd that the paper by Brake and von Tschirnhaus contains the LSID assigned to it by ZooBank (urn:lsid:zoobank.org:pub:DABB03F4-A128-43BB-990C-02F25D656B00, see the <self-uri> tag in the XML), but ZooBank doesn't know about the paper's DOI, so the ZooBank page for this article has no link to the article itself. It's time to join this stuff together.
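Joining them together needn't be hard. Here's a rough sketch of extracting both identifiers from the article XML; the file name is a placeholder, and exactly how the LSID is attached to <self-uri> (text versus attribute) should be checked against the actual ZooKeys markup.

    # Sketch: pull the DOI and the ZooBank LSID from the same article XML so the
    # two records can be cross-linked. <article-id pub-id-type="doi"> is standard
    # NLM/JATS; how the LSID sits in <self-uri> is an assumption to verify.
    from lxml import etree

    XLINK = "{http://www.w3.org/1999/xlink}href"
    doc = etree.parse("zookeys-50-505.xml")        # hypothetical local copy

    doi = doc.findtext(".//article-id[@pub-id-type='doi']")
    lsids = [el.text or el.get(XLINK) for el in doc.findall(".//self-uri")]

    print("DOI:  ", doi)
    print("LSIDs:", [l for l in lsids if l and l.startswith("urn:lsid:")])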

What's next?

What I'd really like to see is article XML repurposed as, say, RDF, and used to populate a database so that we can query it. In this way we can start to atomise the article into useful parts, and recombine them in new and interesting ways. Might be something to play with over the summer.
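As a toy example of what I have in mind, something like rdflib makes it trivial to emit triples for an article, the references it cites, and the taxa it mentions. The vocabulary below (Dublin Core terms plus a made-up namespace for taxon mentions) is my choice, purely for illustration -- nothing here is prescribed by ZooKeys or TaxPub.

    # Toy sketch of "atomising" an article into RDF triples with rdflib.
    from rdflib import Graph, Literal, Namespace, URIRef

    DCTERMS = Namespace("http://purl.org/dc/terms/")
    EX = Namespace("http://example.org/vocab/")        # hypothetical namespace

    g = Graph()
    article = URIRef("http://dx.doi.org/10.3897/zookeys.50.505")

    g.add((article, DCTERMS.references, URIRef("http://biostor.org/reference/52004")))
    g.add((article, DCTERMS.references, URIRef("http://biostor.org/reference/52005")))
    g.add((article, EX.mentionsTaxon, Literal("Milichiidae")))   # placeholder taxon name

    print(g.serialize(format="turtle"))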

On a practical level, I'm somewhat bemused by the variety of XML formats used by open access publishers. PLoS uses version 2.0 of the NLM Journal Archiving and Interchange Tag Suite, and I wrote an XSLT style sheet to transform PLoS articles for viewing on an iPad. TaxPub is based on version 3.0 of the NLM DTD, which breaks quite a bit of my code relating to citations, so I'll have to tweak it to get ZooKeys articles to display correctly. Handling TaxPub itself will also require some additional work. Then there are the BMC journals, which have their own flavour of XML (based on something called the "KETON DTD"). It's all a bit messy. But I guess it'd be no fun if it was too easy...
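For what it's worth, the transform step itself is only a few lines with lxml; the file names below are placeholders, and the real stylesheet is what needs tweaking for the NLM 3.0 citation model used by TaxPub.

    # Sketch: apply an XSLT stylesheet to an article XML file with lxml.
    # Both file names are placeholders for illustration.
    from lxml import etree

    transform = etree.XSLT(etree.parse("nlm-to-html.xsl"))   # hypothetical stylesheet
    article = etree.parse("zookeys-50-505.xml")              # hypothetical article XML

    html = transform(article)
    print(etree.tostring(html, pretty_print=True, method="html").decode("utf-8"))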


Thursday, January 15, 2009

Wikis versus Scratchpads

Yes, I know this is ultimately a case of the "genius of and", but the more I play with the Semantic MediaWiki extension the more I think this is going to be the most productive way forward. I've had numerous conversations with Vince Smith about this. Vince and colleagues at the NHM have been doing a lot of work on "Scratchpads" -- Drupal-based web sites that tend to be taxon-focussed. My worry is that in the long term this is going to create lots of silos that some poor fool will have to aggregate together to do anything synthetic with. This makes inference difficult, and also raises issues of duplication (for example, in bibliographies).

I've avoided wikis for a while because of their reliance on plain text (i.e., little structure) (see this old post of mine on Semant), but Semantic MediaWiki provides a fairly simple way to structure information, and it also provides some basic inference. This makes it possible to create wiki pages that are largely populated by database queries, rather than requiring manual editing. For example, I have queries that will automatically populate a page about a person with that person's publications, and any taxa named after that person. The wiki page itself has hardly any text (basically the name of the person); nobody has to manually edit it to update lists of published papers. Similarly, maps can be generated in situ using queries that aggregate localities mentioned on a wiki page with localities for GenBank sequences and specimens. Very quickly relationships start to emerge without any manual intervention. The combination of templates and Semantic MediaWiki queries seems a pretty powerful way to aggregate information. There are, of course, limitations: the queries are fairly basic, with nothing like the power of SPARQL, but it's a start. Coupled with the ease of editing to fix errors in the contributing databases, and the ease of creating redirects to handle multiple identifiers, I think a wiki-based approach has a lot of promise.
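For example, the publications list on a person's page can be generated by an inline query along these lines; the category and property names are whatever the wiki happens to define, so treat them as purely illustrative.

    {{#ask: [[Category:Publication]] [[Has author::{{PAGENAME}}]]
     | ?Has date
     | sort=Has date
     | format=ul
    }}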

So, I've been teasing Vince that Drupal (or another CMS) is probably the wrong approach, and that semantic wikis are much more powerful (something Gregor Hagedorn has also been arguing). Vince would probably counter that the goal of Scratchpads is to move taxonomists into the digital age by providing them with a customisable platform to store and display their data, hence his mission is to capture data. My focus is more on aggregating and synthesising the large amount of data we already have (and are struggling to do anything exciting with). Hence, the "genius of and". However, I still worry that when we have a world full of Scratchpads with overlapping data, some poor fool will still have to merge them together to make sense of it all.