Showing posts with label possible project. Show all posts

Wednesday, April 27, 2016

Possible project: Biodiversity dashboard

Despite the well-deserved scepticism about dashboards voiced by Shannon Mattern @shannonmattern (see Mission Control: A History of the Urban Dashboard, which I discovered by reading Ignore the Bat Caves and Marketplaces: lets talk about Zoning by Leigh Dodds @ldodds), I'm intrigued by the idea of a dashboard for biodiversity. We could have several different kinds of information, displayed in a single place.

Immediate information

There are sites such as Global Forest Watch Fires that track events that affect biodiversity and which are happening right now. Some of this data can be harvested (e.g., from the NASA Fire Information for Resource Management System) to show real-time forest fires. Below is an image for the last 24 hours:

We could also have Twitter feeds of these sorts of events.

Historical trends

We could have longer-term trends, such as changes in forest cover, or changes in abundance of species over time.

Trends in information

We could have feeds that show us how our knowledge is changing. For example, we could have a map of data from the newest datasets uploaded to GBIF, the latest DNA barcodes, etc.

As an example, @wikiredlist tweets every time an article about a species from the IUCN Red List is edited on the English language Wikipedia.

Imagine several such streams, both as lists and as maps. As another example, a while ago I created a visualisation of new species discoveries:

Summary

I'm aware of the irony of drawing inspiration from a critique of dashboards, but I still think there is value in having an overview of global biodiversity. But we shouldn't lose sight of the fact that such views will be biassed and constrained, and in many cases it will be much easier to visualise what is going on (or, at least, what our chosen sources reveal) than to effect change on those trends that we find most alarming.

Friday, August 14, 2015

Possible project: NameStream - a stream of new taxonomic names

Yet another barely thought out project, although this one has some crude code. If some 16,000 new taxonomic names are published each year, then that is roughly 40 per day. We don't have a single place that aggregates these, so any major biodiversity project is by definition out of date. GBIF itself hasn't had an updated list of fungi or plant names for several years, and at present doesn't have an up-to-date list of animal names. You just have to follow the Twitter feeds of ZooKeys and Zootaxa to feel swamped in new names.

And yet, most nomenclators are pumping out RSS feeds of new names, or have APIs that support time-based queries (i.e., send me the names added in the last month). Wouldn't it be great to have a single aggregator that took these "name streams", augmented them by adding links to the literature (it could, for example, harvest RSS feeds and Twitter streams of the relevant journals), and provided the biodiversity community with a feed of new names and associated supporting information? We could watch discoveries of new biodiversity unfold in near real time, as well as provide a stream of data for projects such as GBIF and others to ingest and keep their databases up to date.
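
As a rough illustration of the aggregation step, here is a minimal sketch that merges several RSS 2.0 feeds into one stream of new names. It assumes each feed item carries the name in `title` and a `pubDate` in the usual RFC 822 format; real nomenclator feeds may well differ. The function names, the feed labels, and the placeholder names ("Aus bus", etc.) are invented for the example, and it works on feed text that has already been fetched rather than calling any real service.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_rss_items(source, rss_text):
    """Return the dated items of one RSS 2.0 document as a list of dicts."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        published = item.findtext("pubDate")
        if not published:
            continue  # without a date we cannot place the item in the stream
        items.append({
            "source": source,
            "name": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": parsedate_to_datetime(published),
        })
    return items


def merge_name_streams(feeds, since):
    """Merge feeds (label -> RSS text) into one list, newest first.

    Items older than `since` are dropped. If the same name appears in more
    than one feed, the first one seen is kept (a deliberately crude rule).
    """
    merged = {}
    for source, rss_text in feeds.items():
        for entry in parse_rss_items(source, rss_text):
            if entry["published"] >= since:
                merged.setdefault(entry["name"], entry)
    return sorted(merged.values(), key=lambda e: e["published"], reverse=True)


# Hypothetical feed content, standing in for two nomenclators.
FEED_A = """<rss version="2.0"><channel>
<item><title>Aus bus</title><link>http://example.org/a/1</link>
<pubDate>Mon, 10 Aug 2015 09:00:00 +0000</pubDate></item>
<item><title>Aus cus</title><link>http://example.org/a/2</link>
<pubDate>Wed, 01 Jul 2015 09:00:00 +0000</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><title>Aus bus</title><link>http://example.org/b/1</link>
<pubDate>Wed, 12 Aug 2015 09:00:00 +0000</pubDate></item>
<item><title>Xus yus</title><link>http://example.org/b/2</link>
<pubDate>Thu, 13 Aug 2015 09:00:00 +0000</pubDate></item>
</channel></rss>"""

stream = merge_name_streams(
    {"feed-a": FEED_A, "feed-b": FEED_B},
    since=datetime(2015, 8, 1, tzinfo=timezone.utc),
)
for entry in stream:
    print(entry["published"].date(), entry["name"], entry["source"])
```

Deduplicating on the name string alone is only a starting point; a real aggregator would need to reconcile spelling variants and authorship strings before deciding that two feeds are announcing the same name.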

Possible project: A PubMed Central for taxonomy

I need more time to sketch this out fully, but I think a case can be made for a taxonomy-centric (or, perhaps more usefully, a biodiversity-centric) clone of PubMed Central.

Why? We already have PubMed Central, and a European version, Europe PubMed Central, and the content of Open Access journals such as ZooKeys appears in both, so, again, why?

Here are some reasons:

  1. PubMed Central has pioneered the use of XML to archive and publish scientific articles, specifically JATS XML. But the biodiversity literature comes in all sorts of formats, including several flavours of XML (such as SciELO XML, XML from OCR literature such as DjVu, ABBYY, and TEI, etc.).
  2. While Europe PMC is doing nice things with ORCIDs and entity extraction, it doesn't deal with the kinds of entities prevalent in the taxonomic literature (such as geographic localities, specimen codes, micro citations, etc.). Nor does it deal with extracting the core scientific statements that a taxonomic paper makes.
  3. Given that much of the taxonomic literature will be derived from OCR we need mechanisms to be able to edit and fix OCR errors, as well as markup errors in more recent XML-native documents. For example, we could envisage having XML stored on GitHub and open to editing.
  4. We need to embed taxonomic literature in our major biodiversity databases, rather than have it consigned to ghettos of individual small-scale digitisation projects, or swamped by biomedical literature whose funding and goals may not align well with the biodiversity community (Europe PMC is funded primarily by medical bodies).

I hope to flesh this out a bit more later on, but I think it's time we started treating the taxonomic literature as a core resource that we, as a community, are responsible for. The NIH did this with biomedical research; shouldn't we be doing the same for biodiversity?

Monday, August 10, 2015

Possible project: mapping authors to Wikipedia entries using lists of published works

One of the less glamorous but necessary tasks of data cleaning is mapping "strings to things", that is, taking strings such as "George A. Boulenger" and mapping them to identifiers, such as ISNI: 0000 0001 0888 841X. In the case of authors such as George Boulenger, one way to do this would be through Wikipedia, which has entries for many scientists, often linked to identifiers for those people (see the bottom of the Wikipedia page for George A. Boulenger and look at the "Authority Control" section).

How could we make these mappings? Simple string matching is one approach, but it seems to me that a more robust approach could use bibliographic data. For example, if I search for George A. Boulenger in BioStor, I get lots of publications. If at least some of these were listed on the Wikipedia page for this person, together with links back to BioStor (or some other external identifier, such as DOIs), then we could do the following:

  1. Search Wikipedia for names that matched the author name of interest
  2. If one or more matches are found, grab the text of the Wikipedia pages, extract any literature cited (e.g., in {{cite}} templates), get the bibliographic identifiers, and see if they match any in our search results.
  3. If we get one or more hits, then it's likely that the Wikipedia page is about the author of the papers we are looking at, and so we link to it.
  4. Once we have a link to Wikipedia, extract any external identifier for that person, such as ISNI or ORCID.

For this to work, it requires that the Wikipedia page cites works by the author in a way that we can harvest, and uses identifiers that we can match to those in the relevant database (e.g., BioStor, CrossRef, etc.). We might also have to look at Wikipedia pages in multiple languages, given that English-language Wikipedia may be lacking information on scholars from non-English speaking countries (this will be a significant issue for many early taxonomists).
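
The matching in steps 2 and 3 could be sketched roughly as follows. This assumes the candidate pages' wikitext has already been retrieved, and that citations carry a `doi=` parameter inside their citation templates; pages that list works as free text would need a different approach. The function names, the page titles, and the DOIs are all made up for illustration.

```python
import re

# Matches a `doi = ...` parameter inside a citation template.
# It will not match related parameters such as `doi-access`.
DOI_PARAM = re.compile(r"\|\s*doi\s*=\s*([^|}\s]+)", re.IGNORECASE)


def cited_dois(wikitext):
    """Return the set of DOIs (lower-cased) cited in a page's wikitext."""
    return {doi.lower() for doi in DOI_PARAM.findall(wikitext)}


def matching_pages(candidate_pages, author_dois):
    """Map page title -> DOIs shared with the author's known publications.

    candidate_pages: dict of page title -> wikitext (step 1's search results)
    author_dois: DOIs from the bibliographic database for the author string
    Pages with no shared DOI are left out.
    """
    known = {doi.lower() for doi in author_dois}
    hits = {}
    for title, wikitext in candidate_pages.items():
        shared = cited_dois(wikitext) & known
        if shared:
            hits[title] = shared
    return hits


# Hypothetical data: two pages returned by a name search, one of which
# cites a paper we already hold for the author.
pages = {
    "Page one": "{{cite journal |last=Author |doi=10.1234/Example.1 |year=1900}}",
    "Page two": "{{cite journal |last=Author |doi=10.1234/other.9 |doi-access=free}}",
}
print(matching_pages(pages, ["10.1234/example.1", "10.1234/example.2"]))
```

A single shared identifier is treated as a hit here; in practice it might be safer to require more than one, or to weight the match by how distinctive the author name is.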

Based on my limited browsing of Wikipedia, there seems to be little standardisation of entries for people, certainly little in how their published works are listed (the section heading, format, how many, etc.). The project I'm proposing would benefit from a consistent set of guidelines for how to include a scholar's output.

What makes this project potentially useful is that it could help flesh out Wikipedia pages by encouraging people to add lists of published works, it could aid bibliographic repositories like my own BioStor by increasing the number of links they get from Wikipedia, and if the Wikipedia page includes external identifiers then it helps us go from strings to things by giving us a way to locate globally unique identifiers for people.

Sunday, August 09, 2015

Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact

Imagine a web site where researchers can go, log in (easily) and get a list of all the species they have described (with pretty pictures and, say, a GBIF map), and a list of all DNA sequences/barcodes (if any) that they've published. Imagine that this is displayed in a colourful way (e.g., badges), and the results tweeted with the hashtag #itaxonomist.

Imagine that you are not a taxonomist, but if you have worked with one (e.g., published a paper), you can go to the site, log in, and discover that you “know” a taxonomist. Imagine if you are a researcher who has cited taxonomic work, you can log in and discover that your work depends on a taxonomist (think six degrees of Kevin Bacon).

Imagine that this is run as a campaign (hashtag #itaxonomist), with regular announcements leading up to the release date. Imagine if #itaxonomist trends. Imagine the publicity for the work taxonomists do, and the newfound ability for them to quantitatively demonstrate this.

How does it work?

#itaxonomist relies on three things:

  1. People having an ORCID
  2. People having publications with DOIs (or otherwise easily identifiable) in their ORCID profile
  3. A map between DOIs (etc.) and the names in the nomenclators (ION, IPNI, Index Fungorum, ZooBank)

Implementation

Under the hood this builds part of the “biodiversity knowledge graph”, and uses ideas I and others have been playing around with (e.g., see David Shorthouse’s neat proof of concept http://collector.shorthouse.net/agent/0000-0002-7260-0350 and my now defunct Mendeley project http://iphylo.blogspot.co.uk/2011/12/these-are-my-species-finding-taxonomic.html).

For a subset of people and names we could build this very quickly. Some taxonomists already have ORCIDs, and some nomenclators have limited numbers of DOIs. I am currently building lists of DOIs for primary taxonomic literature, which could be used to seed the database.

The “i am a taxonomist” query is simply a map from ORCID to DOI to name in a nomenclator. The “i know a taxonomist” query is a map between your ORCID and a DOI that you share with a taxonomist, where there are no names associated with that DOI (e.g., a paper you have co-authored with a taxonomist that wasn’t on taxonomy, or at least didn’t describe a new species). The “six degrees of taxonomy” query relies on the existence of open citation data, which is trickier, but some is available in PubMed Central and/or could be harvested from Pensoft publications.
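
The first two queries are simple enough to sketch with two lookup tables: one from ORCID to the DOIs in that person's profile, and one from DOI to the names a nomenclator links to it. Everything below (the function names, the ORCIDs, the DOIs, and the placeholder name) is invented for illustration; a real version would sit on top of the ORCID and nomenclator data.

```python
def names_described(orcid, dois_by_orcid, names_by_doi):
    """'i am a taxonomist': names published in works listed under this ORCID."""
    names = set()
    for doi in dois_by_orcid.get(orcid, set()):
        names |= names_by_doi.get(doi, set())
    return names


def taxonomists_known(orcid, dois_by_orcid, names_by_doi):
    """'i know a taxonomist': taxonomists sharing a name-free DOI with this ORCID.

    A shared DOI counts only if no names are attached to it, i.e. the
    co-authored paper did not itself publish a new name.
    """
    mine = dois_by_orcid.get(orcid, set())
    known = set()
    for other, dois in dois_by_orcid.items():
        if other == orcid:
            continue
        shared_without_names = {d for d in mine & dois if not names_by_doi.get(d)}
        if shared_without_names and names_described(other, dois_by_orcid, names_by_doi):
            known.add(other)
    return known


# Hypothetical data: person 1 described a species in paper "a" and
# co-authored the non-taxonomic paper "b" with person 2.
dois_by_orcid = {
    "0000-0000-0000-0001": {"10.1234/a", "10.1234/b"},
    "0000-0000-0000-0002": {"10.1234/b"},
    "0000-0000-0000-0003": {"10.1234/c"},
}
names_by_doi = {"10.1234/a": {"Aus bus"}}

print(names_described("0000-0000-0000-0001", dois_by_orcid, names_by_doi))
print(taxonomists_known("0000-0000-0000-0002", dois_by_orcid, names_by_doi))
```

The “six degrees” query would need a third table of citation links between DOIs, which is the part that depends on open citation data.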

Tuesday, August 04, 2015

Possible project: extract taxonomic classification from tags (folksonomy)

Note to self about a possible project. This PLoS ONE paper:

Tibély, G., Pollner, P., Vicsek, T., & Palla, G. (2013, December 31). Extracting Tag Hierarchies. (P. Csermely, Ed.) PLoS ONE. Public Library of Science (PLoS). http://doi.org/10.1371/journal.pone.0084133
describes a method for inferring a hierarchy from a set of tags (and cites related work that is of interest). I've grabbed the code and data from http://hiertags-beta.elte.hu/home/ and put it on GitHub.

Possible project

Use the Tibély et al. method (or others) on taxonomic names extracted from BHL text (or other sources) and see if we can reconstruct taxonomic classifications. How do these classifications compare to those in databases? Can we enhance existing databases using this technique (e.g., extract classifications from literature for groups poorly represented in existing databases)? This could be part of a larger study of what we can learn from co-occurrence of taxonomic names, e.g. Automatically extracting possible taxonomic synonyms from the literature.
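
To make the idea concrete, here is a deliberately naive co-occurrence heuristic. It is not the Tibély et al. algorithm, just a toy baseline of the same general flavour: treat each page as a bag of names, and give each name as its parent the more frequent name it co-occurs with most often. The function name and the three example "pages" are made up.

```python
from collections import Counter
from itertools import combinations


def infer_parents(pages):
    """Guess a parent for each name from page-level co-occurrence.

    pages: iterable of lists of names found together (e.g. on one page).
    A name's parent is the strictly more frequent name it co-occurs with
    most often; ties go to the less frequent (more specific) candidate,
    then alphabetically. Names with no such candidate are treated as roots
    and are absent from the result.
    """
    freq = Counter()   # number of pages each name appears on
    cooc = Counter()   # number of pages each ordered pair shares
    for names in pages:
        unique = set(names)
        freq.update(unique)
        for a, b in combinations(sorted(unique), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    parents = {}
    for name in freq:
        candidates = [
            (-cooc[(name, other)], freq[other], other)
            for other in freq
            if other != name
            and cooc[(name, other)] > 0
            and freq[other] > freq[name]
        ]
        if candidates:
            parents[name] = min(candidates)[2]
    return parents


pages = [
    ["Animalia", "Chordata", "Rana"],
    ["Animalia", "Chordata", "Bufo"],
    ["Animalia", "Arthropoda"],
]
for child, parent in sorted(infer_parents(pages).items()):
    print(child, "->", parent)
```

On real BHL text this would be far too crude (common names would swallow everything), which is where a method that corrects for overall frequency, such as the one in the paper, should do better.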

Note to anyone reading this: if this project sounds interesting, by all means feel free to do it. These are just notes about things that I think would be fun/interesting/useful to do.