iPhylo: ZooBank

Roderic D. M. Page

Showing posts with label ZooBank. Show all posts

Monday, November 11, 2013

Names and nomenclators: just do it already!

Quick notes on taxonomic names (again). It's a continuing source of bafflement that the biodiversity community is making a dog's breakfast of names. It seems we are forever making it more complicated than it needs to be, forever minting new acronyms that pollute the landscape without actually contributing anything useful, and forever promising shiny new tools and services without every actually delivering them. Meanwhile people and projects that build upon names are left to deal with a mess.

It seems to me that it would be nice if we had a single place to go to get definitive information on a name, and that place would give us a unique identifier that we could use in our own databases as a way to clean up and reconcile our data. For example, if we have a bibliographic database we can map citations to DOIs and then use those to identify the articles. If we have a list of journal names, we can map those to ISSNs and clean up our data. Likewise, if we have a classification such as GBIF or NCBI, we should be able to map the names in those classifications onto standard identifiers for taxonomic names.

The frustrating thing is we already have standard identifiers for taxonomic names. Since around 2005 we have been serving LSIDs for plant and animal names. We have Index Fungorum, IPNI, ION, and ZooBank, all serving LSIDs, all serving RDF, all using the same TDWG vocabulary.

The nomenclators vary in size and scope, but we have the three major, multicellular eukaryotes covered (circles proportional to number of names in each database):

There is some duplication, both within nomenclators (IPNI and ION I'm looking at you) and between nomenclators (ION and ZooBank have the same scope, although ZooBank is dwarfed by ION, anyone care to explain why we have both...?). All four databases are actively growing, partly through direct registration of new taxonomic names.

So, we're basically done, right? Surely all we need to do is harvest the LSIDs for all these names, put them into a single triple store, and wrap some basic services around them? If the nomenclators provide a list of recent changes (e.g., as an RSS feed) then we could continuously update the store with new names. Then any database or classification could reconcile it's names with those in the nomenclators. They could also then augment their own records by making use of additional data the nomenclators have, such as objective synonomies and links to original descriptions. In other words, we could have a model like this:
Taxonmodel

Classifications represent a view of how taxa are related, the names associated with those taxa are stored in nomenclators. This means that classification databases like GBIF and NCBI are not in the business of managing names, they simply link to the nomenclators (in the same way that a bibliographic database can link to DOI, ISSNs, and author ids such as ORCID and VIAF).

We have almost all of this infrastructure in place already. In one of the unsung triumphs of TDWG we have all the nomenclators serving data in the same format using the same technology. And yet we have singly failed to do anything useful with this extraordinary resource! Instead we seem more interested in contributing more projects to the acronym soup of biodiversity informatics. All around us projects to assign and link identifiers for publications (CrossRef), data (DataCite), and people (ORCID) are taking off. The infrastructure for taxonomic names has been in place since 2005, we could be doing the same sort of things CrossRef, DataCite and ORCID are doing in their domains. Why aren't we?

Wednesday, November 28, 2012

ZooBank data model

I'm trying to get my head around the data model used by ZooBank to store taxonomic names. To do this, I've built a graph for the species Belonoperca pylei described by Baldwin & Smith described in:

Baldwin, C. C., & Smith, W. L. (1998). Belonoperca pylei, a new species of seabass (Teleostei: Serranidae: Epinephelinae: Diploprionini) from the cook islands with comments on relationships among diploprionins. Ichthyological Research, 45(4), 325–339. doi:10.1007/BF02725185

After extracting some data from ZooBank API I created a DOT file connecting the various "taxon name usages" associated with Belonoperca pylei and constructed a graph using GraphViz:
Zoobank

You can grab the DOT file here, and a bigger version of the image is on Flickr. I've labelled taxon names and references with plain text as well as the UUIDs that serve as identifiers in ZooBank. (Update: the original diagram had Belonoperca pylei Baldwin & Smith, 1998 sensu Eschmeyer [9F53EF10-30EE-4445-A071-6112D998B09B] in the wrong place, which I've now fixed.)

This is a fairly simple case of a single species, but it's already starting to look a tad complicated. We have Belonoperca pylei Baldwin & Smith, 1998 linked to its original description (doi:10.1007/BF02725185) and to the genus Belonoperca Fowler & Bean, 1930 (linked to its original publication http://biostor.org/reference/105997) as interpreted by ("sensu") Baldwin & Smith, 1998. Belonoperca Fowler & Bean 1930 sensu Baldwin & Smith 1998 is linked to the original use of that genus (i.e., Belonoperca Fowler & Bean, 1930). Then we have the species Belonoperca pylei Baldwin & Smith, 1998 as understood in Eschmeyer's 2004 checklist.

Notice that each usage of a taxon name gets linked back to a previous usage, and names are linked to higher names in a taxonomic hierarchy. When the species Belonoperca pylei was described it was placed in the genus Belonoperca, when Belonoperca was described it was placed in the family Serranidae, and so on.

Thursday, May 26, 2011

ZooBank on CouchDB: UUIDs, replication, and embedding the literature in taxonomic databases

Last December I released a web site called Australian Faunal Directory on CouchDB, which was part of my ongoing exploration of how to build a simple yet useful database of taxonomic names. In particular, I want to link names directly to the primary taxonomic literature. No longer is it adequate to simply list names, or list names with mangled bibliographic details (I'm looking at you, Catalogue of Life). This is the 21st century, so I expect one click from name to literature, or at the most two (via, say, a DOI). Nothing else will cut it.

The Australian Faunal Directory (AFD) was an eye opener as it was the first serious use I'd made of CouchDB (now CouchBase). I'd played with replicating and forking data in 2010: Catalogue of Life and CouchDB, but the AFD project was bigger, and also inspired me to use web hooks to make the database editable. Suddenly this stuff started to look easy: no schema, simple web services, and tiny amounts of code.

ZooBank
So then my attention turned to ZooBank, which is "the official registry of Zoological Nomenclature, according to the International Commission on Zoological Nomenclature (ICZN)." ZooBank was proposed by Polaszek et al. (2005) in a short piece in Nature ("A universal register for animal names", doi:10.1038/437477a). By providing a registry of names for animals, ultimately it aims to help avoid embarrassing situations such as the example I recount in my paper on BioStor (doi:10.1186/1471-2105-12-187): a recent paper in Nature published the name Leviathan for an extinct sperm whale with a giant bite (doi:10.1038/nature09067), only for authors to have to publish an erratum with a new name (doi:10.1038/nature09381) when it was discovered that Leviathan had already been used for an extinct mammoth.

ZooBank is developed and run by Rich Pyle, and has some nice features, such as RDF export (via LSIDs), but like most taxonomic databases it doesn't link directly to the literature. Where are the DOIs? Where are links to BHL? Where is the ability to add these links? And why is it almost entirely about fish? (OK, I know the answer to that one).

CouchDB
But the thing which really got me thinking about using CouchDB to create a version of ZooBank was Rich Pyle's vision of having a distributed ZooBank, and his insistence on using ugly UUIDs in ZooBank identifiers (e.g., urn:lsid:zoobank.org:act:6BBEF50E-76B4-42EF-97B1-7029DBCD8257). As much as they are ugly, Rich has always argued that they make distributed systems easy because you don't need a centralised system to assign unique identifiers.

Anybody who has played with CouchDB will know that CouchDB uses UUIDs by default to create identifiers for database documents. It also excels at data synchronisation, and can run on platforms large and small (including mobile such as Android and iOS). This means a database could be updated on an iPhone or iPad without an Internet connection, then the data could be synchronised with other databases. Indeed, I developed this CouchDB clone of ZooBank on my MacBook, then pointed it at CouchDB running on my server and within minutes had an exact copy of the database running on the server. This ease of replication, together with the joy of schema-less design makes CouchDB seem an obvious fit to ZooBank.

Demo
You can see the ZooBank on CouchDB demo here. It's not a complete copy of ZooBank, but has most of it. I reuse the UUIDs issued by ZooBank, so that

http://zoobank.org:80/?uuid=6bbef50e-76b4-42ef-97b1-7029dbcd8257

becomes

http://iphylo.org/~rpage/zoobank/6bbef50e-76b4-42ef-97b1-7029dbcd8257

As usual it's all a bit crude, but has some nice features, such as links to BHL content with a built in article viewer I wrote for the AFD project:

What's next?
At present only a fraction of the ZooBank references have external links, I hope to add more in the next few days, using both automatic scripts and the web hook interface. The search interface needs work, and being that ZooBank is about nomenclature and not taxonomy, it might be useful to add a classification (say from the Catalogue of Life) so that users can navigate around the names (and get a sense of how many are *cough* fish).

At present to display a reference I do one of four things:

If reference is in BHL I use my article viewer
If there is a freely available PDF online I display that using Google Docs PDF viewer
If 1 and 2 don't apply, but there is a DOI then I resolve the DOI and display the result in an IFRAME (yuck)
If none of 1-3 apply I display a blank rectangle

There are a couple ways we could improve this. The first is to enhance the display of BHL content by making use of the structure of the source DjVu files. Another is to make use of the XML now being made available by the journal Zookeys (see my blog post, and Pensoft's announcement that ZooKeys is now being archived by PubMed Central, complete with taxonomic markup). There are a lot of ZooKeys articles in ZooBank, so there's a lot of potential for embedding an article viewer that takes Zookeys XML and redisplays it with taxonomic names and references as clickable links that link to other ZooBank content. That way we approach the point where taxonomic literature becomes a first class citizen of a taxonomic database.

Wednesday, May 06, 2009

Integrating and displaying data using RSS

Although I'd been thinking of getting the wiki project ready for e-Biosphere '09 as a challenge entry, lately I've been playing with RSS has a complementary, but quicker way to achieve some simple integration.

I've been playing with RSS on and off for a while, but what reignited my interest was the swine flu timemap I made last week. The neatest thing about the timemap was how easy it was to make. Just take some RSS that is geotagged and you get the timemap (courtesy of Nick Rabinowitz's wonderful Timemap library).

So, I began to think about taking RSS feeds for, say journals and taxonomic and genomic databases and adding them together and displaying them using tools such as timemap (see here for an earlier mock up of some GenBank data). Two obstacles are in the way. The first is that not every data source of interest provides RSS feeds. To address this I've started to develop wrappers around some sources, the first of which is ZooBank.

The second obstacle is that integration requires shared content (e.g., tags, identifiers, or localities). Some integration will be possible geographically (for example, adding geotagged sequences and images to a map), but this won't work for everything. So, I need to spend some time trying to link stuff together. In the case of Zoobank there's some scope for this, as ZooBank metadata sometimes includes DOIs, which enables us to link to the original publication, as well as bookmarking services such as Connotea. I'm aiming to include these links within the feed, as shown in this snippet (see the <link rel="related"...> element):

<entry>
<title>New Protocetid Whale from the Middle Eocene of Pakistan: Birth on Land, Precocial Development, and Sexual Dimorphism</title>
<link rel="alternate" type="text/html" href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
<updated>2009-05-06T18:37:34+01:00</updated>
<id>urn:uuid:c8f6be01-2359-1805-8bdb-02f271a95ab4</id>
<content type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith & Iyad S. Zalmout<br/><a href="http://dx.doi.org/10.1371/journal.pone.0004366">doi:10.1371/journal.pone.0004366</a></content>
<summary type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith & Iyad S. Zalmout<br/><a href="http://dx.doi.org/10.1371/journal.pone.0004366">doi:10.1371/journal.pone.0004366</a></summary>
<link rel="related" type="text/html" href="http://dx.doi.org/10.1371/journal.pone.0004366" title="doi:10.1371/journal.pone.0004366"/>
<link rel="related" type="text/html" href="http://bioguid.info/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C" title="urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
</entry>

What I'm hoping is that there will be enough links to create something rather like my Elsevier Challenge entry, but with a much more diverse set of sources.