
Tuesday, November 29, 2011

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:

http://iphylo.org/~rpage/itaxonAlpheus

This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed, URLs) as well as links to PDFs where these are freely available. Lots still to do, as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers, and a lot of the earlier names are linked to the literature by "microcitations", and I'll need to handle those (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).
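
To give a flavour of the parsing problem, here is a minimal sketch (in Python) of the kind of pattern matching involved in handling a microcitation of the form "journal volume: page". The regular expression and function names are hypothetical, not the actual code behind the site:

    # Hypothetical sketch: split "Bull. Amer. Mus. Nat. Hist. 37: 889"
    # into journal, volume, and starting page. Real citations are far
    # messier, which is why a third of them remain unparsed.
    import re

    MICROCITATION = re.compile(
        r'^(?P<journal>.+?)\s+(?P<volume>\d+)\s*:\s*(?P<page>\d+)$')

    def parse_microcitation(text):
        match = MICROCITATION.match(text.strip())
        if match is None:
            return None  # flag for manual attention
        return match.groupdict()

    print(parse_microcitation('Bull. Amer. Mus. Nat. Hist. 37: 889'))
    # {'journal': 'Bull. Amer. Mus. Nat. Hist.', 'volume': '37', 'page': '889'}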

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example primary taxonomic sources for taxa currently on the IUCN Red List, or for taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNii URLs all serve RDF (albeit not without a few problems).
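
For the linked data crowd, here is a minimal sketch of fetching that RDF by content negotiation against the DOI resolver (CrossRef DOIs honour an Accept: application/rdf+xml header); the DOI used is one that appears elsewhere on this page, purely for illustration:

    # Ask the DOI resolver for RDF rather than a web page. A sketch,
    # assuming the resolver supports content negotiation for this DOI.
    import urllib.request

    def fetch_rdf(doi):
        request = urllib.request.Request(
            'https://doi.org/' + doi,
            headers={'Accept': 'application/rdf+xml'})
        with urllib.request.urlopen(request) as response:
            return response.read().decode('utf-8')

    print(fetch_rdf('10.1093/bib/bbn022')[:200])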

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if its original description has been digitised.

Tuesday, April 21, 2009

GBIF and Handles: admitting that "distributed" begets "centralized"

The problem with this ... is that my personal and unfashionable observation is that “distributed” begets “centralized.” For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.).
--Geoffrey Bilder interviewed by Martin Fenner

Thinking about the GUID mess in biodiversity informatics, stumbling across some documents about the PILIN (Persistent Identifier Linking INfrastructure) project, and still smarting from problems getting hold of specimen data, I thought I'd try and articulate one solution.

Firstly, I think biodiversity informatics has made the same mistake as digital librarians in thinking that people care where they get information from. We don't, in the sense that I don't care whether I get the information from Google or my local library, I just want the information. In this context local is irrelevant. Nor do I care about individual collections. I care about particular taxa, or particular areas, but not collections (likewise, I may care about philosophy, but not philosophy books at Glasgow University Library). I think the concern for local has led to an emphasis on providing complex software to each data provider that supports operations (such as search) that don't scale (live federated search simply doesn't work), at the expense of focussing on simple solutions that are easy to use.

In a (no doubt unsuccessful) attempt to think beyond what I want, let's imagine we have several people/organisations with interests in this area. For example:

Imagine I am an occasional user. I see a specimen referred to, say a holotype, and I want to learn more about that specimen. Is there some identifier I can use to find out more? I'm used to using DOIs to retrieve papers, so what about specimens? So, I want:
  1. identifiers for specimens so I can retrieve more information

Imagine I am a publisher (which can be anything from a major commercial publisher to a blogger). I want to make my content more useful to my readers, and I've noticed that others are doing this, so I'd better get on board. But I don't want to clutter my content with fragile links -- and if a link breaks I want it fixed, or I want a cached copy (hence the use of WebCite by some publishers). If I want a link fixed I don't want to have to chase up individual providers, I want one place to go (as I do for references if a DOI breaks). So, I want:
  1. stable links with some guarantee of persistence
  2. somebody who will take responsibility to fix the broken ones

Imagine I am a data provider. I want to make my data available, but I want something simple to put in place (I have better things to do with my time, and my IT department keep a tight grip on the servers). I would also like to be able to show my masters that this is a good thing to do, for example by being able to present statistics on how many times my data has been accessed. I'd like identifiers that are meaningful to me (maybe carry some local "branding"). I might not be so keen on some central agency serving all my data as if it was theirs. So, I want:
  1. simplicity
  2. option to serve my own data with my own identifiers

Imagine I am a power user. I want lots of data, maybe grouped in ways that the data providers hadn't anticipated. I'm in a hurry, so I want to get this stuff quickly. So I want:
  1. convenient, fast APIs to fetch data
  2. flexible search interfaces would be nice, but I may just download the data because it's probably quicker if I do it myself

Imagine I am an aggregator. I want data providers to have a simple harvesting interface so that I can grab the data. I don't need a search interface to their data because I can do it much faster if I have the data locally (federated search sucks). So I want:
  1. the ability to harvest all the data ("all your data are belong to me")
  2. a simple way to update my copy of provider's data when it changes


It's too late in the evening for me to do this justice, but I think a reasonable solution is this:
  1. Individual data providers serve their data via URLs, ideally serving a combination of HTML and RDF (i.e., linked data), but XML would be OK
  2. Each record (e.g., specimen) has an identifier that is locally unique, and the identifier is resolvable (for example, by simply appending it to a URL)
  3. Each data provider is encouraged to reuse existing GUIDs wherever possible (e.g., for literature (DOIs) and for taxonomic names) to make their data "meshable"
  4. Data providers can be harvested, either completely, or for records modified after a given date
  5. A central aggregator (e.g., GBIF) aggregates all specimen/observation data. It uses Handles (or DOIs) to create GUIDs, comprising a naming authority (one for each data provider) and an identifier (supplied by the data provider, which may carry branding, e.g. "antweb:casent0100367"), so an example would be "hdl:1234567/antweb:casent0100367" or "doi:10.1234/antweb:casent0100367". Note that this avoids labeling these GUIDs as, say, http://gbif.org/1234567/antweb:casent0100367
  6. Handles resolve to the data provider's URL, but the aggregator's cached copy of the metadata may be used if the data provider is offline
  7. Publishers use "hdl:1234567/antweb:casent0100367" (i.e., authors use this when writing manuscripts), as they can harass the central aggregator if links break
  8. The central aggregator is responsible for generating reports to providers on how their data has been used, e.g. how many times it has been "cited" in the literature

So, GBIF (or whoever steps up to the plate) would use Handles (or DOIs). This gives them the tools to manage the identifiers, plus tells the world that we are serious about this. Publishers can trust that links to millions of specimen records won't disappear. Providers don't have complex software to install, removing one barrier to making more data available.
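
To make the proposal concrete, here is a minimal sketch of how a publisher or aggregator might resolve one of these Handles programmatically, using the Handle.Net proxy's JSON API. The handle used (2246/4613) is a real AMNH handle discussed in an older post below; the "antweb" examples above are made up:

    # Resolve a handle via the Handle.Net proxy's REST API and pull out
    # the URL(s) it points to. A sketch; error handling omitted.
    import json
    import urllib.request

    def resolve_handle(handle):
        url = 'https://hdl.handle.net/api/handles/' + handle
        with urllib.request.urlopen(url) as response:
            record = json.load(response)
        # A handle record is a list of typed values; the URL type is
        # where the data provider (or a cached copy) lives.
        return [v['data']['value']
                for v in record['values'] if v['type'] == 'URL']

    print(resolve_handle('2246/4613'))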

I think it's time we made a serious effort to address these issues.

Thursday, July 10, 2008

OpenDOI


Brian de Alwis has written a cool AppleScript called OpenDOI that adds support for resolving doi: and hdl: URLs using Safari on a Mac. With it installed, links such as hdl:10101/npre.2008.1760.1 and doi:10.1093/bib/bbn022 become clickable, without having to stick an HTTP proxy in front of them.

Seems that an obvious extension to this would be to add support for LSIDs. Firefox can support LSIDs through the LSID Browser for Firefox, but this won't work with Safari. Something for the to-do list.
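
The rewriting involved is tiny. Here it is sketched in Python rather than AppleScript; the doi: and hdl: proxies are the standard ones, while the LSID resolver address is an assumption:

    # Rewrite doi:, hdl: and (the proposed extension) urn:lsid: URIs to
    # HTTP URLs a browser can open. A sketch of the idea, not OpenDOI itself.
    def to_http(uri):
        if uri.startswith('doi:'):
            return 'http://dx.doi.org/' + uri[len('doi:'):]
        if uri.startswith('hdl:'):
            return 'http://hdl.handle.net/' + uri[len('hdl:'):]
        if uri.startswith('urn:lsid:'):
            return 'http://lsid.tdwg.org/' + uri  # assumed resolver
        return uri

    print(to_http('hdl:10101/npre.2008.1760.1'))
    # http://hdl.handle.net/10101/npre.2008.1760.1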

Wednesday, May 30, 2007

AMNH, DSpace, and OpenURL

Hate my tribe. Hate them for even asking why nobody uses library standards in the larger world, when “brain-dead inflexibility in practice” is one obvious and compelling reason, and “incomprehensibility” is another.

... $DEITY have mercy, OpenURL is a stupid spec. Great idea, and useful in spite of itself. But astoundingly stupid. Ranganathan preserve us from librarians writing specs! - Caveat Lector


OK, we're on a roll. After adding the Journal of Arachnology and Psyche to my OpenURL resolver, I've now added the American Museum of Natural History's Bulletins and Novitates.

In an act of great generosity, the AMNH has placed its publications on a freely accessible DSpace server. This is a wonderful resource provided by one of the world's premier natural history museums (and one others should follow), and is especially valuable given that post-1999 volumes of the Bulletins and Novitates are also hosted by BioOne (and hence have DOIs), but those versions are not free.

As blogged earlier on SemAnt, getting metadata from DSpace in an actually usable form is a real pain. I ended up writing a script to pull everything off via the OAI interface, extract metadata from the resulting XML, do a DOI look-up for post-1999 material, then dump this into the MySQL server so my OpenURL service can find it.
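
For anyone wanting to do the same, here is a minimal sketch in Python of walking an OAI-PMH interface with resumption tokens; the AMNH endpoint path shown is a guess:

    # Harvest Dublin Core records page by page via OAI-PMH ListRecords.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = '{http://www.openarchives.org/OAI/2.0/}'

    def harvest(base_url):
        params = {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}
        while True:
            url = base_url + '?' + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + 'record'):
                yield record
            token = tree.find('.//' + OAI + 'resumptionToken')
            if token is None or not (token.text or '').strip():
                break  # no more pages
            params = {'verb': 'ListRecords',
                      'resumptionToken': token.text.strip()}

    # Endpoint path is a guess at the AMNH's DSpace OAI interface:
    # for record in harvest('http://digitallibrary.amnh.org/dspace-oai/request'):
    #     print(record)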


Apart from the tedium of having to find the OAI interface (why oh why do people make this harder than it needs to be?), the metadata served up by the AMNH is, um, a little ropey. They use Dublin Core, which is great, but the AMNH makes a hash of using it. Dublin Core provides quite a rich set of terms for describing a reference, and guidelines on how to use it. The AMNH uses the same tag for different things. Take date, for example:

<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>1946</dc:date>

Now, one of these dates is the date of publication; the others are dates the metadata was uploaded (or so I suspect). So, why not use the appropriate terms? Like, for instance, <dcterms:created>. Why do I have to parse three fields, and intuit that the third one is the date of publication? Likewise, why have up to three <dc:title> fields, and why include an abbreviated citation in the title? And why, for the love of God, format that citation differently for different articles!? Why have multiple <dc:description> fields, one of which is the abstract (and for which <dcterms:abstract> is available)? It's just a mess, and it's very annoying (as you can probably tell). I can see why some hate library standards.
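
In the meantime, disambiguating those dates comes down to a heuristic: treat full ISO timestamps as record housekeeping and keep the bare year. A sketch in Python, not the actual script I used:

    # Pick the publication year out of muddled <dc:date> values.
    import re

    def publication_year(dates):
        for value in dates:
            if re.fullmatch(r'\d{4}', value.strip()):
                return int(value)
        return None  # no bare year; needs a human

    print(publication_year(
        ['2005-10-05T22:02:08Z', '2005-10-05T22:02:08Z', '1946']))
    # 1946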

Anyway, after much use of Perl regular expressions, and some last-minute finessing with Excel, I think we now have the AMNH journals available through OpenURL.

For a demo, go to David Shorthouse's list of references for spiders, pick a letter (say, P) and click on the bioGUID symbol by a paper by Norm Platnick in American Museum Novitates.

Monday, May 07, 2007

Catalogue of Life, OpenURL, and taxonomic literature


Playing with the recently released "Catalogue of Life" CD, and pondering Charles Hussey's recent post to TAXACOM about the "European Virtual Library of Taxonomic Literature (E-ViTL)" (part of EDIT), has got me thinking more and more about how primitive our handling of taxonomic literature is, and how it cripples the utility of taxonomic databases such as the Catalogue of Life. For example, none of the literature listed in the Catalogue of Life is associated with any digital identifier (such as a DOI, Handle, SICI, or even a URL). In the digital age, this renders the literature section nearly useless -- a user has to search for the reference in Google. Surely we want identifiers, not (often poorly formed) bibliographic citations? For example, I think hdl:2246/4613 is more useful than
Schmidt, K. P. 1921. New species of North American lizards of the genera Holbrookia and Uta. American Museum Novitates (22)

Given the Handle hdl:2246/4613, we get straight to the bibliographic resource, and in this case, a PDF of the paper. In the digital age this is what we need.

So, how to get there? Well, I think we need to focus on developing services to associate references with identifiers. Imagine a service that takes a bibliographic record and returns a globally unique identifier for that reference. This, of course, is part of what CrossRef provides through its OpenURL resolver.
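
To give a flavour of how such a lookup works, here is a minimal sketch of querying CrossRef's OpenURL resolver for a DOI. The pid (a registered email address) and the article details are placeholders, and in practice you would parse the XML that comes back:

    # Query CrossRef's OpenURL interface for a DOI matching a citation.
    import urllib.parse
    import urllib.request

    def lookup_doi(journal, volume, first_page, year):
        params = {
            '[email protected]',  # hypothetical registered account
            'title': journal,
            'volume': volume,
            'spage': first_page,
            'date': year,
            'redirect': 'false',  # return XML instead of redirecting
        }
        url = ('http://www.crossref.org/openurl?'
               + urllib.parse.urlencode(params))
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8')  # XML with the DOI, if found

    # Illustrative values only:
    print(lookup_doi('American Museum Novitates', '3384', '1', '2002'))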

OpenURL has been around a while, and despite the fact that it is probably overcomplicated (see I hate library standards for more on the seeming desire of librarians to make things harder than they need to be), I think it is a useful way to think about the task of linking taxonomic names to literature, especially if we keep things simple (see Rethinking OpenURL). In particular, drop the obsession with local context -- I don't care what my local library has; my library is the cloud.

So, what if we had an OpenURL service that took a bibliographic citation and queried local and remote sources for a digital identifier, such as a DOI or a Handle, for that citation? If there is no such identifier, then the next step is to create one. For example, the service could create a SICI (see my del.icio.us bookmarks for sici) for that item. Ideally, for those items that were digitised, we could have a database that associated SICIs with the resource location. For example, most of the journal Psyche is available free online as PDFs, and has XML files for each volume providing full bibliographic details (including URLs). It would be trivial to harvest these and add this information to an OpenURL service.
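
For the "create one" step, here is a sketch of minting a simplified SICI-style identifier from basic metadata. Real SICIs (ANSI/NISO Z39.56) also encode a title code and a check character, both omitted here; the Psyche article details are made up:

    # Build a simplified SICI-like string: ISSN(year)volume:issue<page>.
    def simple_sici(issn, year, volume, issue, start_page):
        return '%s(%s)%s:%s<%s>' % (issn, year, volume, issue, start_page)

    # Psyche's ISSN is 0033-2615; the rest is illustrative.
    print(simple_sici('0033-2615', '1906', '13', '1', '1'))
    # 0033-2615(1906)13:1<1>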

These ideas need a little more fleshing out, but I think it's time the taxonomic community started thinking seriously about digital identifiers for literature, and how they would be used. CrossRef is a great example of what can be done with some simple services (Handles + OpenURL), and it's a tragedy that every time DOIs come up people get blinded by cost, and don't spend time trying to understand how CrossRef works. If you want a good demonstration of what can be done with CrossRef, just look at Connotea, which builds much of its functionality on top of CrossRef web services.

It is also interesting that CrossRef is much simpler to use than repositories such as DSpace (used by the AMNH's digital library) -- each DSpace installation has its own hooks to retrieve metadata (in some cases, such as the AMNH, appallingly badly formed), and as a result there is no easy way to discover what metadata is associated with a given handle, nor, given a citation, whether a handle exists for that citation.

So, when projects such as EDIT start talking about taxonomic libraries, I think they need to think in terms of simple web services that will serve as the building blocks for other tools. An OpenURL service would be a major boon, and would speed us towards the day when databases such as the Catalogue of Life contain not (often inconsistently formed) text records of bibliographic works, but actionable identifiers. Anything less and we remain in the dark ages.