iPhylo: links

Roderic D. M. Page


Thursday, September 12, 2013

The spy who loved frogs and taxonomy as a digital backwater

A nice article by Brendan Borrell about the secret life of herpetologist Edward Taylor, and Rafe Brown's efforts to untangle his taxonomic legacy, has appeared in Nature:

Borrell, B. (2013). Taxonomy: The spy who loved frogs. Nature, 501(7466), 150–153. doi:10.1038/501150a

[Image: gecko, Ptychozoon intermedium, Malagos]

Fascinating article, but as always I'm going to skip straight past the content and look at links. The article leads with Ptychozoon intermedium, the Philippine parachute gecko. Naturally, pedant that I am, I wanted to find the original description of this gecko (which wasn't cited in the Nature piece). I turned to BioNames, and got the name but no literature. A bit of Googling revealed that Taylor originally used the name Ptychozoon intermedia (note the ending "a" rather than "um", sigh). OK, BioNames has Ptychozoon intermedia, plus the original description:

Edward H Taylor (1915) New species of Philippine Lizards. Philippine Journal of Science Manila Sect 10(D): 89–109. http://biostor.org/reference/129464

Obviously I need to improve BioNames to handle multiple variants of the species name. Finding this article took a little tracking down, not quite on the level of uncovering a spy, perhaps, but sometimes the amount of detective work involved in tracking down taxonomic literature is tiresome.
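One crude way to make a name lookup tolerate variants like these is to index names by a key that ignores the Latin gender ending of the species epithet. The sketch below is only an illustration: the suffix list and the function names `epithet_stem` and `name_key` are assumptions, not how BioNames works, and a rule this simple will also merge some names that are genuinely different.

```python
# Crude sketch: treat species names that differ only in a common Latin
# gender ending as the same lookup key. The suffix list is an assumption
# and is far from complete.
GENDER_SUFFIXES = ("us", "um", "a", "is", "e")


def epithet_stem(epithet: str) -> str:
    """Strip one common Latin gender ending from a species epithet."""
    epithet = epithet.lower()
    for suffix in GENDER_SUFFIXES:
        # Keep at least three characters so short epithets are not emptied.
        if epithet.endswith(suffix) and len(epithet) - len(suffix) >= 3:
            return epithet[: -len(suffix)]
    return epithet


def name_key(binomial: str) -> str:
    """Build a lookup key from the genus plus the stemmed epithet."""
    genus, epithet = binomial.split()[:2]
    return f"{genus.lower()} {epithet_stem(epithet)}"


# Both spellings of the parachute gecko's name map to the same key.
print(name_key("Ptychozoon intermedium") == name_key("Ptychozoon intermedia"))
```

Searching on the key rather than the exact string would return both spellings, at the cost of occasional false matches that still need a human eye.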

To continue with the theme, in my experience when reading taxonomic papers the literature cited is often given as plain text strings, with no link to where each item can be found. This is in marked contrast to papers in other subjects (say, phylogenetics), where most if not all the literature cited is linked. For the Nature article on Edward Taylor here are the references cited:

Reference list:

Brown, R. M., Ferner, J. W. & Diesmos, A. C. Herpetologica 53, 357–373 (1997).
Webb, R. G. Herpetologica 34, 422–425 (1978).
Inger, R. F. Fieldiana Zool. 33, 183–531 (1954).
Savage, J. M. The Amphibians and Reptiles of Costa Rica (Univ. Chicago. Press, 2002).
Merrill, E. D. Science 101, 401 (1945).
Diesmos, A. C., Brown, R. M. & Gee, G. V. A. Sylvatrop 13, 63–80 (2003).
Taylor, E. H., Leonard, A. B., Smith, H. M. & Pisani, G. R. Monogr. Mus. Nat. Hist. Univ. Kansas 4, 1–160 (1975).
Taylor, E. H. The Caecilians of the World (Univ. Kansas Press, 1968).
Brown, R. M. et al. Check List 8, 469–490 (2012).
Brown, R. M., Siler, C. D., Diesmos, A. C. & Alcala, A. C. Herpetol. Monogr. 23, 1–44 (2009).

Nature has added DOIs to two of them:

Brown, R. M., Ferner, J. W. & Diesmos, A. C. Herpetologica 53, 357–373 (1997).
Webb, R. G. Herpetologica 34, 422–425 (1978).
Inger, R. F. Fieldiana Zool. 33, 183–531 (1954).
Savage, J. M. The Amphibians and Reptiles of Costa Rica (Univ. Chicago. Press, 2002).
Merrill, E. D. Science 101, 401 (1945). DOI: 10.1126/science.101.2623.355
Diesmos, A. C., Brown, R. M. & Gee, G. V. A. Sylvatrop 13, 63–80 (2003).
Taylor, E. H., Leonard, A. B., Smith, H. M. & Pisani, G. R. Monogr. Mus. Nat. Hist. Univ. Kansas 4, 1–160 (1975).
Taylor, E. H. The Caecilians of the World (Univ. Kansas Press, 1968).
Brown, R. M. et al. Check List 8, 469–490 (2012).
Brown, R. M., Siler, C. D., Diesmos, A. C. & Alcala, A. C. Herpetol. Monogr. 23, 1–44 (2009). DOI: 10.1655/09-037.1

So 8 of 10 references have no link (I'm ignoring the ISI link for the first reference). I spent a little time fussing with BioStor, JSTOR, and Google and came up with some more:

Brown, R. M., Ferner, J. W. & Diesmos, A. C. Herpetologica 53, 357–373 (1997). JSTOR: 3893345
Webb, R. G. Herpetologica 34, 422–425 (1978). JSTOR: 3891519
Inger, R. F. Fieldiana Zool. 33, 183–531 (1954). BioStor: 99995
Savage, J. M. The Amphibians and Reptiles of Costa Rica (Univ. Chicago. Press, 2002).
Merrill, E. D. Science 101, 401 (1945). DOI: 10.1126/science.101.2623.355
Diesmos, A. C., Brown, R. M. & Gee, G. V. A. Sylvatrop 13, 63–80 (2003).
Taylor, E. H., Leonard, A. B., Smith, H. M. & Pisani, G. R. Monogr. Mus. Nat. Hist. Univ. Kansas 4, 1–160 (1975). DOI: 10.5962/bhl.title.4250
Taylor, E. H. The Caecilians of the World (Univ. Kansas Press, 1968).
Brown, R. M. et al. Check List 8, 469–490 (2012). PDF
Brown, R. M., Siler, C. D., Diesmos, A. C. & Alcala, A. C. Herpetol. Monogr. 23, 1–44 (2009). DOI: 10.1655/09-037.1

Not perfect, but better. My concern is that the lack of linked literature citations simply seems to confirm taxonomy's status as an intellectual backwater. In other subjects the reader can quickly visit the literature cited and navigate the web of papers relevant to the article. But in taxonomy we have to resort to Google and/or specialised tools such as JSTOR, BioStor and BHL to find the literature. This needs to change, unless we are happy with taxonomy being a digital backwater.
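Once a citation carries an identifier, turning it into a clickable link is the easy part. As a minimal sketch, assuming each annotated citation ends with a label such as "DOI:", "JSTOR:" or "BioStor:" followed by the identifier (as in the list above), and assuming the URL patterns in `RESOLVERS` are the right ones for each service, the conversion might look like this. The names `RESOLVERS` and `link_for` are made up for the example.

```python
import re
from typing import Optional

# Assumed URL patterns for each identifier type; check them against the
# services themselves before relying on them.
RESOLVERS = {
    "DOI": "https://doi.org/{}",
    "JSTOR": "https://www.jstor.org/stable/{}",
    "BioStor": "http://biostor.org/reference/{}",
}

# Matches a trailing "LABEL: identifier" at the end of a citation string.
TRAILING_ID = re.compile(r"(DOI|JSTOR|BioStor):\s*(\S+)\s*$")


def link_for(citation: str) -> Optional[str]:
    """Return a URL for the identifier at the end of a citation, if any."""
    match = TRAILING_ID.search(citation)
    if match is None:
        return None
    scheme, identifier = match.groups()
    return RESOLVERS[scheme].format(identifier)


print(link_for("Webb, R. G. Herpetologica 34, 422–425 (1978). JSTOR: 3891519"))
print(link_for("Taylor, E. H. The Caecilians of the World (Univ. Kansas Press, 1968)."))
```

The hard work is finding the identifier in the first place; citations without one simply come back as `None`.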

Wednesday, January 16, 2013

Megascience platforms for biodiversity information: what's wrong with this picture?

The journal MycoKeys has published the following paper:

Triebel, D., Hagedorn, G., & Rambold, G. (2012). An appraisal of megascience platforms for biodiversity information. MycoKeys, 5(0), 45–63. doi:10.3897/mycokeys.5.4302

This paper contains a diagram that seems innocuous enough but which I find worrying:

[Figure: MycoKeys 5: 45, fig. 1]

The nodes in the graph are "biodiversity megascience platforms", the edges are "cross-linkages and data exchange". What bothers me is that if you view biodiversity informatics through this lens then the relationships among these projects become the focus. Not the data, not the users, nor the questions we are trying to tackle. It is all about relationships between projects.

I want a different view of the landscape. For example, below is a very crude graph of the kinds of things I think about, namely kinds of data and their interrelationship:

Biodiversity

What tends to happen is that this data landscape gets carved up by different projects, so we get separate databases of taxonomic names, images, publications, and specimens (these are the "megascience platforms" such as CoL, EOL, GBIF). This takes care of the nodes, but what about the edges, the links between the data? Typically what happens is lots of energy is expended on what to call these links, in other words, the development of the vocabularies and ontologies such as those curated by TDWG. This is all valuable work, but this doesn't tackle what for me is the real obstacle to progress, which is creating the links themselves. Where are the "megascience platforms" devoted to linking stuff together?

When we do have links between different kinds of data these tend to be within databases. For example, GenBank explicitly links sequences to publications in PubMed, and taxa in the NCBI taxonomy database. All three (sequence, publication, taxon) have identifiers (accession number, PubMed id, taxon id, respectively) that are widely used outside GenBank (and, indeed, are the de facto identifiers for the bioinformatics community). Part of the reason these identifiers are so widely used is that GenBank is the only real "megascience platform" in the list studied by Triebel et al. It's the only one that we can readily do science with (think BLAST searches, think of the number of databases that have repurposed GenBank data, or build on NCBI services).

Genbank

Many of the questions we might ask can be formulated as paths through a diagram like the one above. For example, if I want to do phylogeography, then I want the path phylogeny -> sequence -> specimen -> locality. If I'm lucky the phylogeny is in a database and all the sequences have been georeferenced, but often the phylogeny isn't readily available digitally, and I need to map the OTUs in the tree to sequences, track down the vouchers for those sequences, and obtain the localities for those vouchers from, say, GBIF. Each step involves some degree of pain as we try to map identifiers from one database to those in another.
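To make the "question as a path" idea concrete, here is a small sketch that stores kinds of data as nodes in a graph and uses breadth-first search to find the chain of links a question needs. The edge list in `LINKS` is hypothetical and covers only links mentioned in this post; the function name `find_path` is likewise made up for the example.

```python
from collections import deque

# Hypothetical directed links between kinds of data. Only the links
# mentioned in the text are included; a real graph would have many more.
LINKS = {
    "phylogeny": ["sequence"],
    "sequence": ["specimen", "publication", "taxon"],
    "specimen": ["locality"],
    "publication": [],
    "taxon": [],
    "locality": [],
}


def find_path(start, goal, links=LINKS):
    """Breadth-first search for the shortest chain of links from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no chain of links connects the two kinds of data


# The phylogeography question from the text.
print(" -> ".join(find_path("phylogeny", "locality")))
```

Every arrow in the returned path is a place where identifiers from one database have to be mapped to another, so a missing edge means the question cannot be answered without manual detective work.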

Phylogeography

If I want to do classical alpha taxonomy I need information on taxonomic names, concepts, publications, attributes, and specimens. The digital links between these are tenuous at best (where are the links between GBIF specimen records and the publications that cite those specimens, for example?).

Focussing on so-called "platforms" is unfortunate, in my opinion, because it means that we focus on data and how we carve up responsibility for managing it (never mind what happens to data that lacks an obvious constituency). The platforms aren't what we should be focussing on; it is the relationships between data (and no, these are not the same as the relationships between the "platforms").

If I'd like to see one thing in biodiversity informatics in 2013 it is the emergence of a "platform" that makes the links the centre of its efforts. Because without the links we are not building "platforms", we are building silos.

Wednesday, February 29, 2012

Making biodiversity data sticky: it's all about links

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier they will stick together. Given enough identifiers and enough data, we could rapidly assemble a "ball" of interconnected data. The diagram below, published as part of my Elsevier Challenge entry (preprint, published version), summarises some of the links between diverse kinds of biological data:
Model
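The burr analogy can be sketched in a few lines of code: records that share any identifier get merged into the same cluster, and clusters grow as more shared identifiers turn up. This is a minimal union-find sketch, not a description of any existing tool; the record names and identifiers in the example are invented, and `stick_together` is a made-up function name.

```python
def stick_together(records):
    """Group records that share at least one identifier.

    records maps a record name to the set of identifiers it carries.
    Returns a list of sets of record names, one set per cluster.
    """
    parent = {name: name for name in records}

    def find(name):
        # Follow parent pointers to the cluster root, shortening as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_seen = {}  # identifier -> first record that carried it
    for name, identifiers in records.items():
        for identifier in identifiers:
            if identifier in first_seen:
                # Shared identifier: the two records stick together.
                parent[find(name)] = find(first_seen[identifier])
            else:
                first_seen[identifier] = name

    clusters = {}
    for name in records:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Invented records and identifiers, purely for illustration.
records = {
    "sequence record": {"pmid:111", "taxon:222"},
    "publication record": {"pmid:111", "doi:10.1000/example"},
    "specimen record": {"specimen:ABC 123"},
}
print(stick_together(records))
```

Here the sequence and the publication stick because they share a PubMed id, while the specimen floats free because nothing else uses its identifier.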

While in principle many of these links should be trivial to create, in practice they aren't. One major obstacle is the lack of globally unique identifiers, or if such identifiers exist they aren't being used. As a result, our data is anything but sticky. In the absence of identifiers, creating links between different data sets can be a significant undertaking. One way to tackle this is to focus on just one kind of link at a time and create a database of those links. The diagram below shows some of the links I've been working on:
Links

For example, the iPhylo Linkout project creates links between taxon concepts in NCBI and Wikipedia. The iTaxon project is a mapping between taxonomic names and publications. I've briefly explored mapping host-parasite relationships using GenBank, and I'm currently exploring the links between publications and specimens. This list certainly doesn't exhaust the set of possible links, but it's a start. The challenge is to create sufficient links for biodiversity data to finally coalesce and for us to be able to ask questions that span multiple sources and types of data.
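A "database of links" can be very small indeed. The sketch below uses SQLite to store one row per link: a source identifier, a relation, and a target identifier. The table layout, the relation name `published_in`, the identifier prefixes, and the helper names are all assumptions made for illustration; none of them describe the actual schema of iPhylo Linkout, iTaxon, or BioNames.

```python
import sqlite3


def make_link_db():
    """Create an in-memory database holding one row per link."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE link (
               source_id TEXT NOT NULL,
               target_id TEXT NOT NULL,
               relation  TEXT NOT NULL,
               PRIMARY KEY (source_id, target_id, relation)
           )"""
    )
    return db


def add_link(db, source_id, relation, target_id):
    """Record a link; adding the same link twice has no effect."""
    db.execute(
        "INSERT OR IGNORE INTO link (source_id, target_id, relation) VALUES (?, ?, ?)",
        (source_id, target_id, relation),
    )


def targets(db, source_id, relation):
    """List everything a given identifier links to via one relation."""
    rows = db.execute(
        "SELECT target_id FROM link WHERE source_id = ? AND relation = ? "
        "ORDER BY target_id",
        (source_id, relation),
    )
    return [row[0] for row in rows]


db = make_link_db()
# A name-to-publication link, using the BioStor reference mentioned earlier.
add_link(db, "name:Ptychozoon intermedia", "published_in", "biostor:129464")
print(targets(db, "name:Ptychozoon intermedia", "published_in"))
```

Keeping each kind of link in the same three-column shape means separate link databases can later be merged without redesigning anything.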