
Thursday, October 07, 2021

Reflections on "The Macroscope" - a tool for the 21st Century?

This is a guest post by Tony Rees.

It would be difficult to encounter a scientist, or anyone interested in science, who is not familiar with the microscope, a tool for making visible objects that are otherwise too small to be properly seen by the unaided eye, or for revealing otherwise invisible fine detail in larger objects. A select few with a particular interest in microscopy may also have encountered the Wild-Leica "Macroscope", a specialised type of benchtop microscope optimised for low-power macro-photography. However, in this overview I discuss the "macroscope" in a different sense: as the antithesis of the microscope, namely a method for visualizing subjects too large to be encompassed by a single field of vision, such as the Earth or some subset of its phenomena (the biosphere, for example), or, conceptually, the universe.

My introduction to the term was via addresses given by Jesse Ausubel in the formative years of the 2001-2010 Census of Marine Life, for which he was a key proponent. In Ausubel's view, the Census would perform the function of a macroscope, permitting a view of everything that lives in the global ocean (or at least, that subset which could realistically be sampled in the time frame available) as opposed to the more limited subsets available via previous data collection efforts. My view (which could, of course, be wrong) was that his thinking had been informed by a work entitled "Le macroscope, vers une vision globale" published in 1975 by the French thinker Joël de Rosnay, who had expressed such a concept as being globally applicable in many fields, including the physical and natural worlds but also extending to human society, the growth of cities, and more. Again, some ecologists may have encountered the term, sometimes in the guise of "Odum's macroscope", as an approach for obtaining "big picture" analyses of macroecological processes suitable for mathematical modelling, typically by elimination of fine detail so that only the larger patterns remain, as initially advocated by Howard T. Odum in his 1971 book "Environment, Power, and Society".

From the standpoint of the 21st century, it seems that we are closer to achieving a "macroscope" (or possibly, multiple such tools) than ever before, based on the availability of existing and continuing new data streams, improved technology for data assembly and storage, and advanced ways to query and combine these large streams of data to produce new visualizations, data products, and analytical findings. I devote the remainder of this article to examples where either particular workers have employed "macroscope" terminology to describe their activities, or where potentially equivalent actions are taking place without the explicit "macroscope" association, but are equally worthy of consideration. To save space, most or all of the references cited can be found via a Wikipedia article entitled "Macroscope (science concept)" that I authored on the subject around a year ago, and have continued to add to on occasion as new thoughts or information come to hand (see the edit history for the article).

First, one can ask, what constitutes a macroscope, in the present context? In the Wikipedia article I point to a book "Big Data - Related Technologies, Challenges and Future Prospects" by Chen et al. (2014) (doi:10.1007/978-3-319-06245-7), in which the "value chain of big data" is characterised as divisible into four phases, namely data generation, data acquisition (aka data assembly), data storage, and data analysis. To my mind, data generation (which others may term acquisition, differently from the usage by Chen et al.) is obviously the first step, but does not in itself constitute the macroscope, except in rare cases - such as Landsat imagery, perhaps - where on its own, a single co-ordinated data stream is sufficient to meet the need for a particular type of "global view". A variant of this might be a coordinated data collection program - such as that of the ten year Census of Marine Life - which might produce the data required for the desired global view; but again, in reality, such data are collected in a series of discrete chunks, in many and often disparate data formats, and must be "wrangled" into a more coherent whole before any meaningful "macroscope" functionality becomes available.

Here we come to what, in my view, constitutes the heart of the "macroscope": an intelligently organized (i.e. indexable and searchable), coherent data store or repository (where "data" may include not just numbers but imagery and much else besides). Taking the Census of Marine Life example, the data repository for that project's data (plus other available sources as inputs) is the Ocean Biodiversity Information System or OBIS (previously the Ocean Biogeographic Information System), which according to this view forms the "macroscope" for which the Census data is a feed. (For non habitat-specific biodiversity data, GBIF is an equivalent, and more extensive, operation). Other planetary scale "macroscopes", by this definition (which may or may not have an explicit geographic, i.e. spatial, component) would include inventories of biological taxa such as the Catalogue of Life and so on, all the way back to the pioneering compendia published by Linnaeus in the eighteenth century; while for cartography and topographic imagery, the current "blockbuster" Google Earth and its predecessors are also well within public consciousness.

In the view of some workers and/or operations, both of these phases are precursors to the real "work" of the macroscope which is to reveal previously unseen portions of the "big picture" by means either of the availability of large, synoptic datasets, or fusion between different data streams to produce novel insights. Companies such as IBM and Microsoft have used phraseology such as:

"By 2022 we will use machine-learning algorithms and software to help us organize information about the physical world, helping bring the vast and complex data gathered by billions of devices within the range of our vision and understanding. We call this a "macroscope" – but unlike the microscope to see the very small, or the telescope that can see far away, it is a system of software and algorithms to bring all of Earth's complex data together to analyze it by space and time for meaning." (IBM)
"As the Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors, we will gain a better understanding of our environment via a virtual, distributed whole-Earth "macroscope"... Massive-scale data analytics will enable real-time tracking of disease and targeted responses to potential pandemics. Our virtual "macroscope" can now be used on ourselves, as well as on our planet." (Microsoft) (references available via the Wikipedia article cited above).

Whether the analytical capabilities described here are viewed as an integral part of the "macroscope" concept or as an optional add-on is ultimately a question of semantics and, perhaps, personal opinion. Continuing the Census of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization and summary tools, but also makes its data available for download to users wishing to analyse it further according to their own particular interests; using OBIS data in this manner, Mark Costello et al. in 2017 were able to demarcate a finite number of data-supported marine biogeographic realms for the first time (Costello et al. 2017: Nature Communications 8: 1057. doi:10.1038/s41467-017-01121-2), a project which I was able to assist in a small way in an advisory capacity. In a case such as this, perhaps the final function of the macroscope, namely data visualization and analysis, was outsourced to the authors' own research institution. Similarly, at an earlier phase, "data aggregation" can be virtual rather than actual, i.e. avoiding the use of a single physical system to hold all the data, by using the open web mapping standards WMS (Web Map Service) and WFS (Web Feature Service) to access a set of distributed data stores, e.g. as implemented on the portal for the Australian Ocean Data Network.
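
To make the "virtual aggregation" idea concrete, here is what a query against one node of such a distributed store looks like. This is a generic WFS 2.0 GetFeature request; the endpoint, layer name and JSON output format (a GeoServer convention) are placeholders, not the actual AODN services:

# Fetch ten features from a hypothetical occurrence layer via WFS:
curl -s "https://example.org/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeNames=ns:occurrences&count=10&outputFormat=application/json"

A portal can issue requests like this against many endpoints and merge the responses, so no single system ever has to hold all the data.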

So, as we pass through the third decade of the twenty-first century, what developments await us in the "macroscope" area? In the biodiversity space, one can reasonably presume that the existing "macroscopic" data assembly projects such as OBIS and GBIF will continue, and hopefully slowly fill current gaps in their coverage - although in the marine area, strategic new data collection exercises may be required (Census 2020, or 2025, anyone?), while (again hopefully) the Catalogue of Life will continue its progress towards a "complete" species inventory for the biosphere. The Landsat project, with imagery dating back to 1972, continues with the launch of its latest satellite, Landsat 9, just this year (21 September 2021) with a planned mission duration of 5 years, so the "macroscope" functionality of that project seems set to continue for the medium term at least. Meanwhile the ongoing development of sensor networks, both on land and in the ocean, offers an exciting new method of "instrumenting the earth" to obtain much more real-time data than has ever been available in the past, offering scope for many more use case-specific "macroscopes" to be constructed that can fuse (e.g.) satellite imagery with much more that is happening at a local level.

So, the "macroscope" concept appears to be alive and well, even though the nomenclature can change from time to time (IBM's "Macroscope", foreshadowed in 2017, became the "IBM Pairs Geoscope" on implementation, and is now simply the "Geospatial Analytics component within the IBM Environmental Intelligence Suite" according to available IBM publicity materials). In reality this illustrates a new dichotomy: even if "everyone" in principle has access to huge quantities of publicly available data, maybe only a few well funded entities now have the computational ability to make sense of it, and can charge clients a good fee for their services...

I present this account partly to give a brief picture of "macroscope" concepts today and in the past, for those who may be interested, and partly to present a few personal views which would be out of scope in a "neutral point of view" article such as is required on Wikipedia; also to see if readers of this blog would like to contribute further to discussion of any of the concepts traversed herein.

Friday, September 11, 2020

Darwin Core Million reminder, and thoughts on bad data

The following is a guest post by Bob Mesibov.

No winner yet in the second Darwin Core Million for 2020, but there are another two and a half weeks to go (to 30 September). For details of the contest see this iPhylo blog post. And please don’t submit a million RECORDS, just (roughly) a million DATA ITEMS. That’s about 20,000 records with 50 fields in the table, or about 50,000 records with 20 fields, or something arithmetically similar.


The purpose of the Darwin Core Million is to celebrate high-quality occurrence datasets. These are extraordinarily rare in biodiversity informatics.

I’ll unpick that. I’m not talking about the accuracy of the records. For most records, the “what”, “where”, “when” and “by whom” are probably correct. An occurrence record is a simple fact: Wilma Flintstone collected a flowering specimen of an Arizona Mountain Dandelion 5 miles SSE of Walker, California on 27 June 2019. More technically, she collected Agoseris parviflora at 38.4411 –119.4393, as recorded by her handheld GPS.

What could possibly go wrong in compiling a dataset of simple records like that in a spreadsheet or database? Let me count a few of the ways (a rough command-line check for some of these follows the list):

  • data items get misspelled or misnumbered
  • data items get put in the wrong field
  • data items are put in a field for which they are invalid or inappropriate
  • data items that should be entered get left out
  • data items get truncated
  • data items contain information better split into separate fields
  • data items contain line breaks
  • data items get corrupted by copying down in a spreadsheet
  • data items disagree with other data items in the same record
  • data items refer to unexplained entities (“habitat type A”)
  • paired data items don’t get paired (e.g. latitude but no longitude)
  • the same data item appears in different formats in different records
  • missing data items are represented by blanks, spaces, “?”, “na”, “-”, “unknown”, “not recorded” etc, all in the same data table
  • character encoding failures create gibberish, question marks and replacement characters (�)
  • weird control characters appear in data items, and parsing fails
  • dates get messed up (looking at you, Excel)
  • records get duplicated after minor edits
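
For the command-line inclined, here is a minimal sketch of checks for a few of the problems above. The file name, tab-separated layout and field numbers are hypothetical; adjust them to suit your own table:

# Mixed missing-value conventions in the same table:
awk -F"\t" 'NR>1 {for (i=1; i<=NF; i++) if (tolower($i) ~ /^(\?|na|n\/a|-|unknown|not recorded)$/) n[tolower($i)]++} END {for (v in n) print n[v], "\"" v "\""}' occurrences.tsv

# Latitude without longitude (decimalLatitude in field 5, decimalLongitude in field 6):
awk -F"\t" 'NR>1 && $5 != "" && $6 == "" {print "latitude but no longitude at line " NR}' occurrences.tsv

# Control characters that make parsing fail (GNU grep):
grep -cP '[\x00-\x08\x0b\x0c\x0e-\x1f]' occurrences.tsv

# Exact duplicate records:
sort occurrences.tsv | uniq -d | head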

In previous blog posts (here and here) I’ve looked at explanations for poor-quality data at the project, institution and agency level — data sources I referred to collectively as the “PIA”. I don’t think any of those explanations are controversial. Here I’m going to be rude and insulting and say there are three further obstacles to creating good, usable and shareable occurrence data:

Datasets are compiled as though they were family heirlooms.

The PIA says “This database is OUR property. It’s for OUR use and WE understand the data, even if it’s messy and outsiders can’t figure out what we’ve done. Ambiguities? No problem, we’ll just email Old Fred. He retired a few years back but he knows the system back to front.”

Prising data items from these heirlooms, mapping them to new fields and cleaning them are complicated exercises best left to data specialists. That’s not what happens.

Datasets are too often compiled by people with inadequate computer skills. Their last experience of data management was building a spreadsheet in a “digital learning” class. They’re following instructions but they don’t understand them. Both the data enterers and their instructors are hoping for a good result, which is truly courageous optimism.

The (often huge) skills gap between the compilers of digital PIA data and the computer-savvy people who analyse and reformat/repackage the data (users and facilitators-for-users) could be narrowed programmatically, but isn’t. Hands up all those who use a spreadsheet for data entry by volunteers and have comprehensive validation rules for each of the fields? Thought so.

People confuse software with data. This isn’t a problem restricted to biodiversity informatics, and I’ve ranted about this issue elsewhere. The effect is that data compilers blame software for data problems and don’t accept responsibility for stuff-ups.

Sometimes that blaming is justified. As a data auditor I dread getting an Excel file, because I know without looking that the file will have usability and shareability issues on top of the usual spreadsheet errors. Excel isn’t an endpoint in a data-use pipeline, it’s a starting point and a particularly awful one.

Another horror is the export option. Want to convert your database of occurrence records to format X? Just go to the “Save as” or “Export data” menu item and click “OK”. Magic happens and you don’t need to check the exported file in format X to see that all is well. If all is not well, it’s the software’s fault, right? Not your problem.

In view of these and the previously blogged-about explanations for bad data, it’s a wonder that there are any high-quality datasets, but there are. I’ve audited them and it’s a shame that for ethical reasons I can’t enter them myself in the current Darwin Core Million.

Monday, August 10, 2020

Australian museums and ALA

The following is a guest post by Bob Mesibov.

The Atlas of Living Australia (ALA) adds "assertions" to Darwin Core occurrence records. "Assertions" are indicators of particular data errors, omissions and questionable entries, such as "Coordinates are transposed", "Geodetic datum assumed WGS84" and "First [day] of the century".

Today (8 August 2020) I looked at assertions attached to records in ALA for non-fossil animals in the Australian State museums. There were 62 occurrence record collections from the seven museums (I lumped the two Tasmanian museums together), with 45 different assertions. I then calculated assertions per record for each collection. The worst performer was the Queensland Museum Porifera collection (3.84 ass/rec), and tied for best were the Museums Victoria Herpetology and Ichthyology collections (1.09 ass/rec).
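
For anyone who wants to repeat the arithmetic, here is a sketch of the tally. It assumes a hypothetical tab-separated download with one record per line and that record's assertions as a pipe-separated list in field 12; the real ALA download layout may differ:

# Assertions per record for one collection:
awk -F"\t" 'NR>1 {recs++; if ($12 != "") total += split($12, a, "|")} END {printf "%.2f ass/rec (%d assertions, %d records)\n", total/recs, total, recs}' collection.tsv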

I also aggregated museum collections to build a kind of league table by State:

The clear winner is Museums Victoria.

But how well do ALA's assertions measure the quality of data records? Not all that well, actually.

  • The tests used to make the assertions generate false positives and false negatives, although at a low rate
  • The tests aren't independent, so that a single data error can "smear" across several assertions
  • The tests ignore errors and omissions in DwC fields that many data users would consider important

ALA's assertions also have a strong spatial/geographical bias, with 23 of the 45 assertions in my sample dataset saying something about the "where" of the occurrence. Looking just at those 23 "where" assertions, the museums league table again shows Museums Victoria ahead, this time by a wide margin:

ALA is currently working on better ways for users to filter out records with selected assertions, in what's misleadingly called a "Data Quality Project". The title is misleading because the overall quality of ALA's holdings doesn't improve one bit. Getting data providers to fix their data issues would be a more productive way to upgrade data quality, but I haven't seen any evidence that Australian museums (for example) pay much attention to ALA's assertions. (There are no or minimal changes in assertion totals between data updates.)

It's been pointed out to me that museum and herbarium records amount to only a small fraction of ALA's ca 90 million records, and that citizen scientists are growing the stock of occurrence records far faster than institutions do. True, and those citizen science records are often of excellent quality (see https://www.datafix.com.au/BASHing/2020-02-05.html). However, citizen science observations are strongly biased towards widespread and common species. ALA's records for just six common Australian birds (5,072,599 as of 8 August 2020; https://dashboard.ala.org.au/) outnumber all the museum animal records I looked at in the assertion analysis (4,669,508).

In my humble view, the longer ALA's institutional data providers put off fixing their mistakes, the less valuable ALA becomes as a bridge between biodiversity informatics and biodiversity science.


Wednesday, January 24, 2018

Guest post: The Not problem

The following is a guest post by Bob Mesibov.

Nico Franz and Beckett Sterner created a stir last year with a preprint in bioRxiv about expert validation (or the lack of it) in the "backbone" classifications used by aggregators. The final version of the paper was published this month in the OUP journal Database (doi:10.1093/database/bax100).

To see what effect "backbone" taxonomies are having on aggregated occurrence records, I've recently been auditing datasets from GBIF and the Atlas of Living Australia. The results are remarkable, and I'll be submitting a write-up of the audits for formal publication shortly. Here I'd like to share the fascinating case of the genus Not Chan, 2016.

I found this genus in GBIF. A Darwin Core record uploaded by the New Zealand Arthropod Collection (NZAC02015964) had the string "not identified on slide" in the scientificName field, and no other taxonomic information.

GBIF processed this record and matched it to the genus Not Chan, 2016, which is noted as "doubtful" and "incertae sedis".
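
Anyone can reproduce this kind of backbone matching with GBIF's public species-match web service, although the response for this particular string may have changed since the record was indexed:

# Ask the GBIF backbone how it interprets the supplied name string:
curl -s "https://api.gbif.org/v1/species/match?verbose=true&name=not%20identified%20on%20slide"
# The matchType and confidence fields in the JSON response show whether the
# match was exact, fuzzy or made at a higher rank.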

There are 949 other records of this genus around the world, carefully mapped by GBIF. The occurrences come from NZAC and nine other datasets. The full scientific names and their numbers of GBIF records are:

Number  Name
2       Not argostemma
14      not Buellia
1       not found, check spelling
1       Not given (see specimen note) bucculenta
1       Not given (see specimen note) ortoni
1       Not given (see specimen note) ptychophora
1       Not given (see specimen note) subpalliata
1       not identified on slide
1       not indentified
1       Not known not known
1       Not known sp.
1       not Lecania
4       Not listed
873     Not naturalised in SA sp.
18      Not payena
5       not Punctelia
18      not used
6       Not used capricornia Pleijel & Rouse, 2000

GBIF cites this article on barnacles as the source of the genus, although the name should really be Not Chan et al., 2016. A careful reading of this article left me baffled, since the authors nowhere use "not" as a scientific name.

Next I checked the Catalogue of Life. Did CoL list this genus, and did CoL attribute it to Chan? No, but "Not assigned" appears 479 times among the names of suprageneric taxa, and the December 2017 CoL checklist includes the infraspecies "Not diogenes rectmanus Lanchester,1902" as a synonym.

The Encyclopedia of Life also has "Not" pages, but these have in turn been aggregated on the "EOL pages that don't represent real taxa" page, and under the listing for the "Not assigned36" page someone has written:

This page contains a bunch of nodes from the EOL staff Scratchpad. NB someone should go in and clean up that classification.

"Someone should go in and clean up that classification" is also the GBIF approach to its "backbone" taxonomy, although they think of that as "we would like the biodiversity informatics community and expert taxonomists to point out where we've messed up". Franz and Sterner (2018) have also called for collaboration, but in the direction of allowing for multiple taxonomic schemes and differing identifications in aggregated biodiversity data. Technically, that would be tricky. Maybe the challenge of setting up taxonomic concept graphs will attract brilliant developers to GBIF and other aggregators.

Meanwhile, Not Chan, 2016 will endure and aggregated biodiversity records will retain their vast assortment of invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. In a post last November on the GitHub CoL+ pages I wrote:

Being old and cynical, I can speculate that in the time spent arguing the "politics" of aggregation in recent years, a competent digital librarian or data scientist would have fixed all the CoL issues and would be halfway through GBIF's. But neither of those aggregators employ digital librarians or data scientists, and I'm guessing that CoL+ won't employ one, either.

Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in Darwin Core downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as a beetle look-up table or tables, but in many other cases their GBIF Backbone Taxonomy mirrors the Catalogue of Life.

To give you some idea of the magnitude of the changes, of ca 85000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below, and note that replacement ALA and GBIF taxon names at the same rank are often different:

[Summary table of "taxonRank" changes ("Generalised" records) not reproduced here]

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.
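
The comparison itself is straightforward at the command line. A sketch, assuming (hypothetically) that both downloads have been reduced to tab-separated files keyed on catalogNumber in field 1 with the genus+species combination in field 2:

# Sort both files on the shared key, join, and count changed combinations:
sort -t $'\t' -k1,1 nmv.tsv > nmv_sorted.tsv
sort -t $'\t' -k1,1 gbif.tsv > gbif_sorted.tsv
join -t $'\t' -j 1 nmv_sorted.tsv gbif_sorted.tsv | awk -F"\t" '$2 != $3 {n++} END {print n " records with changed combinations"}'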

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst, e.g. "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted in ALA as "null|null|null|null".

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.

Friday, September 30, 2016

Guest post: It's 2016 and your data aren't UTF-8 encoded?

The following is a guest post by Bob Mesibov.

According to w3techs, seven out of every eight websites in the Alexa top 10 million are UTF-8 encoded. This is good news for us screenscrapers, because it means that when we scrape data into a UTF-8 encoded document, the chances are good that all the characters will be correctly encoded and displayed.

It's not quite good news for two reasons.

In the first place, one out of eight websites is encoded with some feeble default like ISO-8859-1, which supports even fewer characters than the closely related windows-1252. Those sites will lose some widely-used punctuation when read as UTF-8, unless the webpage has been carefully composed with the HTML equivalents of those characters. You're usually safe (but see below) with big online sources like Atlas of Living Australia (ALA), APNI, CoL, EoL, GBIF, IPNI, IRMNG, NCBI Taxonomy, The Plant List and WoRMS, because these declare a UTF-8 charset in a meta tag in webpage heads. (IPNI's home page is actually in ISO-8859-1, but its search results are served as UTF-8 encoded XML.)

But a second problem is that just because a webpage declares itself to be UTF-8, that doesn't mean every character on the page sings from the Unicode songbook. Very odd characters may have been pulled from a database and written onto the page as-is. In ALA I recently found an ancient rune — the High Octet Preset control character (HOP, hex 81):

http://biocache.ala.org.au/occurrences/6191ca90-873b-44f8-848d-befc29ad7513
http://biocache.ala.org.au/occurrences/5077df1f-b70a-465b-b22b-c8587a9fb626

HOP replaces ü on these pages and is invisible in your browser, but a screenscrape will capture the HOP and put "SchHOPrhoff" (for Schürhoff) in your UTF-8 document.
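
Stowaways like this are easy to find from the command line. With GNU grep in a UTF-8 locale (the file name here is a placeholder):

# List lines that are not entirely valid UTF-8 (-a treats the file as text,
# -x matches whole lines, -v inverts the match):
grep -naxv '.*' scraped.txt
# Hunt for the HOP byte (hex 81) specifically:
grep -nP '\x81' scraped.txt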

Another example of ALA's fidelity to its sources is its coding of the degree symbol, which is a single-byte character (hex b0) in windows-1252, e.g. in Excel spreadsheets, but a two-byte character (hex c2 b0) in Unicode. In this record, for example:

http://biocache.ala.org.au/occurrences/5e3a2e05-1e80-4e1c-9394-ed6b37441b20

the lat/lon was supplied (says ALA) as 37Â°56'9.10"S 145Â° 0'43.74"E. Or was it? The lat/lon could have started out as 37°56'9.10"S 145°0'43.74"E in UTF-8. Somewhere along the line the UTF-8 bytes were re-read as windows-1252 and the Â° characters were generated, resulting in geospatial gibberish.
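
The mechanics are easy to demonstrate. The degree sign is bytes c2 b0 in UTF-8; decode those two bytes as windows-1252 and you get "Â°":

# The literal ° below is UTF-8 (bytes c2 b0); decoding it as windows-1252
# produces the familiar mojibake:
printf '145°' | iconv -f WINDOWS-1252 -t UTF-8
# -> 145Â°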

When a program fails to understand a character's encoding, it usually replaces the mystery character with a ?. A question mark is a perfectly valid character in commonly used encodings, which means the interpretation failure gets propagated through all future re-uses of the text, both on the Web and in data dumps. For example,

http://biocache.ala.org.au/occurrences/dfbbc42d-a422-47a2-9c1d-3d8e137687e4

gives N?crophores for Nécrophores. The history of that particular character failure has been lost downstream, as is the case for myriads of other question marks in online biodiversity data.

In my experience, the situation is much worse in data dumps from online sources. It's a challenge to find a dump without question marks acting as replacement characters. Many of these question marks appear in author and place names. Taxonomists with eastern European names seem to fare particularly badly, sometimes with more than one character variant appearing in the same record, as in the Australian Faunal Directory (AFD) offering of Wêgrzynowicz, W?grzynowicz and Węgrzynowicz for the same coleopterist. Question marks also frequently replace punctuation, such as n-dashes, smart quotes and apostrophes (e.g. O?Brien (CoL) and L?Échange and d?Urville (AFD)).
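
A crude but effective way to spot likely replacement question marks in a dump is to look for a "?" wedged between letters, where genuine punctuation almost never puts one:

# Tally "?" flanked by word characters, the classic replacement pattern:
grep -oP '\w\?\w' dump.tsv | sort | uniq -c | sort -nr | head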

Character encoding issues create major headaches for data users. It would be a great service to biodiversity informatics if data managers compiled their data in UTF-8 encoding or took the time to convert to UTF-8 and fix any resulting errors before publishing to the Web or uploading to aggregators.
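
The conversion step itself is a one-liner; it is the checking and fixing of the resulting errors that takes the time. With placeholder file names:

# Convert a windows-1252 dump to UTF-8; iconv stops with an error on bytes
# it cannot map (such as the undefined HOP byte, hex 81), which is exactly
# what you want to hear about:
iconv -f WINDOWS-1252 -t UTF-8 dump.csv > dump_utf8.csv
# Quick sanity check on the result:
file -bi dump_utf8.csv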

This may be a big ask, given that at least one data manager I've talked to had no idea how characters were encoded in the institution's database. But as ALA's Miles Nicholls wrote back in 2009, "Note that data should always be shared using UTF-8 as the character encoding". Biodiversity informatics is a global discipline and UTF-8 is the global standard for encoding.

Readers needing some background on character encoding will find this and especially this helpful, and a very useful tool to check for encoding problems in small blocks of text is here.

Wednesday, September 07, 2016

Guest post: Absorbing task or deranged quest: an attempt to track all genus names ever published

This guest post by Tony Rees describes his quest to track all genus names ever published (plus a subset of the species…).

A “holy grail” for biodiversity informatics is a suitably quality controlled, human- and machine-queryable list of all the world’s species, preferably arranged in a suitable taxonomic hierarchy such as kingdom-phylum-class-order-family-genus or other. To make it truly comprehensive we need fossils as well as extant taxa (dinosaurs as well as dinoflagellates) and to cover all groups from viruses to vertebrates (possibly prions as well, which are, well, virus-like). Linnaeus had some pretty good attempts in his day, and in the internet age the challenge has been taken up by a succession of projects such as the “NODC Taxonomic Code” (a precursor to ITIS, the Integrated Taxonomic Information System - currently 722,000 scientific names), the Species 2000 consortium, and the combined ITIS+SP2000 product “Catalogue of Life”, now in its 16th annual edition, with current holdings of 1,635,250 living and 5,719 extinct valid (“accepted”) species, plus an additional 1,460,644 synonyms (information from http://www.catalogueoflife.org/annual-checklist/2016/info/ac). This looks pretty good until one realises that as well as the estimated “target” of 1.9 million valid extant species there are probably a further 200,000-300,000 described fossils, all with maybe as many synonyms again, making a grand total of at least 5 million published species names to acquire into a central “quality assured” system, a task which will take some time yet.

Ten years ago, in 2006, the author participated in a regular meeting of the steering committee for OBIS, the Ocean Biogeographic Information System which, like GBIF, aggregates species distribution data (for marine species in this context) from multiple providers into a single central search point. OBIS was using the Catalogue of Life (CoL) as its "taxonomic backbone" (method for organising its data holdings) and, again like GBIF, had come up against the problem of what to do with names not recognised in the then-latest edition of CoL, which was at the time less than 50% complete (information on 884,552 species). A solution occurred to me: since genus names are maybe only 10% as numerous as species names, and every species name includes its containing genus as the first portion of its binomial name (thanks, Linnaeus!), an all-genera index might be a tractable task (within a reasonable time frame) where an all-species index was not, and would still be useful for allocating incoming "not previously seen" species names to an appropriate position in the taxonomic hierarchy. OBIS, in particular, also wished to know whether species (or more exactly, their parent genera) were marine (to be displayed) or nonmarine (hidden), and similarly whether they were extant or fossil. Sensing a challenge, I offered to produce such a list, estimating that it might require 3 months full-time, or the equivalent 6 months of part-time effort, to complete and supply back to OBIS for their use.
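
The mechanics behind that insight are trivial, which is what made the task tractable. A toy sketch with hypothetical file names:

# The genus is the first word of each incoming binomial:
cut -d' ' -f1 unmatched_species.txt | sort -u > genera_needed.txt
# Which of these are already in the all-genera index (one sorted name per line)?
comm -12 genera_needed.txt irmng_genera.txt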

To cut a long story short… the project, which I christened the Interim Register of Marine and Nonmarine Genera or IRMNG (originally at CSIRO in Australia, now hosted on its own domain “www.irmng.org” and located at VLIZ in Belgium) has successfully acquired over 480,000 published genus names, including valid names, synonyms and a subset of published misspellings, all allocated to families (most) or higher ranks (remainder) in an internally coherent taxonomic structure, most with marine/nonmarine and extant/fossil flags, all with the source from which I acquired them, sources for the flags, and more; also for perhaps 50% of genera, lists of associated species from wherever it has been convenient to acquire them (Catalogue of Life 2006 being a major source, but many others also used). My estimated 6 months has turned into 10 years and counting, but I do figure that the bulk of the basic “names acquisition” has been done for all groups (my estimate: over 95% complete) and it is rare (although not completely unknown) for me to come across genus names not yet held, at least for the period 1753-2014 which is the present coverage of IRMNG; present effort is therefore concentrated on correcting internal errors and inconsistencies, and upgrading the taxonomic placement (to family) for the around 100,000 names where this is not yet held (also establishing the valid name/synonym status of a similar number of presently “unresolved” generic names).

With the move of the system from Australia to VLIZ, completed within the last couple of months, there is the facility to utilise all of the software and features presently developed at VLIZ that currently run WoRMS, the World Register of Marine Species, and its many associated subsidiary databases, as well as (potentially) look at forming a distributed editing network for IRMNG in the future, as is already the case for WoRMS, presuming that others see a value in maintaining IRMNG as a useful resource, e.g. for taxonomic name resolution, detection of potential homonyms both within and across kingdoms, and generally acting as a hierarchical view of "all life" to at least genus level. A recently implemented addition to IRMNG is to hold ION identifiers (also used in BioNames) for the subset of names where ION holds the original publication details, enabling "deep links" to both ION and BioNames wherein the original publication can often be displayed, as previously described elsewhere in this blog. Similar identifiers for other groups are not yet held in the system but could be (for example, Index Fungorum identifiers for fungi), for cases where the potential linked system adds value in giving, for example, original publication details and onward links to the primary literature.

All in all I feel that the exercise has been of value not only to OBIS (the original “client”) but also to other informatics systems such as GBIF, Encyclopedia of Life, Atlas of Living Australia, Open Tree of Life and others who have all taken advantage of IRMNG data to add to their systems, either for the marine/nonmarine and extant/fossil flags or as an addition to their primary taxonomic backbones, or both. In addition it has allowed myself, the founding IRMNG compiler, to “scratch the taxonomic itch” and finally flesh out what is meant by statements that a certain group contains x families or y genera, and what these actually might be. Finally, many users of the system via its web interface have commented over time on how useful it is to be able to input “any” name, known or unknown, with a good chance that IRMNG can tell them something about the genus (or genus possible options, in the case of homonyms) as well as the species, in many cases, as well as discriminate extant from fossil taxon names, something not yet offered to any significant extent by the current Catalogue of Life.

Readers of iPhylo are encouraged to try IRMNG as a “taxonomic name resolution service” by visiting www.irmng.org and of course, welcome to contact me with suggestions of missed names (concentrating at genus level at the present time) or any other ideas for improvement (which I can then submit for consideration to the team at VLIZ who now provide the technical support for the system).

Friday, April 08, 2016

Guest post: 10 explanations for messy data, by Bob Mesibov

The following is a guest post by Bob Mesibov, who has contributed to iPhylo before.

Like many iPhylo readers, I deal with large, pre-existing compilations of biodiversity data. The compilations come from museums, herbaria, aggregation projects and government agencies. For simplicity in what follows and to avoid naming names, I'll lump all these sources into a single fictional entity, the PAI (for Projects, Agencies and Institutions).

The datasets I get from the PAI typically contain duplicate records, inconsistencies in content and format, unexplained data gaps, data in wrong fields, fields improperly used, no flagging of doubtful data, etc. Data cleaning consumes most of the time I spend on a data project. Cleaning can take weeks, analysing the cleaned data takes minutes, reporting the results of the analysis takes hours or days. (Example: doi:10.3897/BDJ.2.e1160)

I can understand how datasets get messy. Data entry errors account for a lot of the problems I see, and I make data entry errors myself. But the causes of messiness are one thing and its cure is another. The custodians of those data compilations don't seem to have done much (or any) data checking. Why not?

When I'm brave enough to ask that question, I usually get a polite response from the PAI. Here are 10 explanations I've heard for inadequate data checking and cleaning:

(1) The data are fit for use, as-is. No cleaning is needed, because the data are fit for some use, and the PAI is satisfied with that. One data manager wrote to me in an email: '...even records with lower certainty, in this case an uncertain identification, can be useful at a coarser resolution. Although we have no idea as to the reliability of the identification to the species or even genus they are likely correctly identify[ing] something as at least an animal, arthropod and possibly to class so the record is suitable for analysis at that level.'

(2) The PAI is exposing its data online. The crowd will spot any problems and tell the PAI about them.

I've previously pointed out (doi:10.3897/zookeys.293.5111) how lame this explanation is. As a strategy for data cleaning it's slow, piecemeal and wildly optimistic. At best, it accumulates data-cleaning 'tickets' with no guarantee that any will ever be closed. What I hear from the PAI is 'We're aware of problems of that kind and are hoping to find a general solution, rather than deal with a multitude of individual cases'. Years pass and the individual cases don't get fixed, so interested members of the crowd lose faith in the process and stop reporting problems.

(3) No one outside the PAI is allowed to look at the whole dataset, and no one inside the PAI has the time (or skills) to do data checking and cleaning.

This is a particularly nice Catch-22. I once offered to check a portion of the PAI's data holdings for free, and was told that PAI policy was that the dataset was not to be shared with anyone outside the PAI. The same data were freely available on the PAI's website in bits and pieces through a database search page.

(4) The PAI is migrating to new database software next year. Data cleaning will be part of the migration.

No, it won't. Note that this response isn't always simple procrastination, because it's sometimes the case that the PAI's database has only limited capabilities for data checking and editing. PAI staff are hopeful that checking and editing will be easier with the new software. They'll be disappointed.

(5) The person who manages data is on leave / was seconded to another project / resigned and hasn't been replaced yet / etc.

This is another way of saying that no one inside the PAI has the time to do data checking and cleaning. When the data manager returns to work or gets replaced, data checking and cleaning will have the same low priority it had before. That's why it didn't get done.

(6) Top management says any data cleaning would have to be done by outside specialists, but there's not enough money in the current budget to hire such people.

Not only a Catch-22, but a solid, long-term excuse, applicable in any financial year. It would cost less to train PAI staff to do the job in-house.

(7) The PAI would prefer to use a specialist data tool to clean data, like OpenRefine, but hasn't yet got up to speed on its use.

The PAI believes in magic. OpenRefine will magically clean the data without any thought required on the part of PAI staff. The magic will have to be applied repeatedly, because the sources of the duplications, gaps and errors haven't been found and squashed.

(8) The PAI staff best qualified to check and clean the data aren't allowed to do so.

IT policy strictly forbids anyone but IT staff from tinkering with the PAI database, whose integrity is sacrosanct. A very specific request from biodiversity staff may be ticketed by IT staff for action, but global checking and editing is out of the question. IT staff are not expected to understand biodiversity studies, and biodiversity staff are not expected to understand databases.

This explanation is interesting because it implies a workaround. If a biodiversity staffer can get a dump from the database as a simple text file, she can do global checking and editing of the data using the command line or a spreadsheet. The cleaned data can then be passed to IT staff for incorporation into the database as replacement data items. The day that happens, pigs will be seen flying outside the PAI windows.

(9) The PAI datasets have grown so big that global data checking and editing is no longer possible.

Harder, yes; impossible, no. And the datasets didn't suddenly appear, they grew by accretion. Why wasn't data checking and editing done as data was added?

(10) All datasets are messy and data users should do their own data cleaning.

The PAI shrugs its shoulders and says 'That's just the way it is, live with it. Our data are no messier than anyone else's'.

I've left this explanation for last because it begs the question. Yes, users can do their own data cleaning — because it's not that hard and there are many ways to do it. So why isn't it done by highly qualified, well-paid PAI data managers?

Tuesday, December 01, 2015

Guest post: 10 years of global biodiversity databases: are we there yet?

This guest post by Tony Rees explores some of the themes from his recent talk 10 years of Global Biodiversity Databases: Are We There Yet?.

A couple of months ago I received an invitation to address the upcoming 2015 meeting of the Malacological Society of Australasia (Australian and New Zealand mollusc persons for the uninitiated) on some topic associated with biodiversity databases, and I decided that a decadal review might be an interesting exercise, both for my potential audience (perhaps) and for my own interest (definitely). Well, the talk is delivered and the slides are available on the web for viewing if interested, and Rod has kindly invited me to present some of its findings here, and possibly stimulate some ongoing discussion since a lot of my interests overlap his own quite closely. I was also somewhat influenced in my choice of title by a previous presentation of Rod's from some 5 years back, "Why aren't we there yet?" which provides a slightly pessimistic counterpoint to my own perhaps more optimistic summary.

I decided to construct the talk around 5 areas: compilations of taxonomic names and associated valid/accepted taxa; links to the literature (original citations, descriptions, more); machine-addressable lists of taxon traits; compilations of georeferenced species data points such as OBIS and GBIF; and synoptic efforts in the environmental niche modelling area (all or many species, so as to be able to produce global biodiversity as well as single-species maps). Without recapping the entire content of my talk (which you can find on SlideShare), I thought I would share with readers of this blog some of the more interesting conclusions, many of which are not readily available elsewhere, at least not without some effort to chase down and/or some educated guesswork.

In the area of taxonomic names, for animals (sensu lato) ION has gone up from 1.8m to 5.2m names (2.8m to 3.5m indexed documents) from all ranks (synonyms not distinguished) over the cited period 2005-2015, while Catalogue of Life has gone up from 0.5m species names + ?? synonyms to 1.6m species names + 1.3m synonyms over the same period; for fossils, BioNames database is making some progress in linking ION names to external resources on the web but, at less than 100k such links, is still relatively small scale and without more than a single-operator level of resourcing. A couple of other "open access" biological literature indexing activities are still at a modest level (e.g. 250k-350k citations, as against an estimated biological literature of perhaps 20m items) at present, and showing few signs of current active development (unless I have missed them of course).

Comprehensive databases of taxon traits (in machine addressable form) appear to have started with the author’s own "IRMNG" genus- and species- level compendium which was initially tailored to OBIS needs for simply differentiating organisms into extant vs. fossil, marine vs. nonmarine. More comprehensive indexes exist for specific groups and recently, Encyclopedia of Life has established "TraitBank" which is making some headway although some of the "traits" such as geographic distribution (a bounding box from either GBIF or OBIS) and "number of GenBank sequences" stretch the concept of trait a little (just my two cents' worth, of course), and the newly created linkage to Google searches is to be applauded.

With regard to aggregating georeferenced species data (specimens and observations), both OBIS (marine taxa only) and GBIF (all taxa) have made quite a lot of progress over the past ten years, OBIS increasing its data holdings ninefold from 5.6m to 44.9m (from 38 to 1,900+ data providers) and GBIF more than tenfold from 45m to 577m records over the same period, from 300+ to over 15k providers. While these figures look healthy there are still many data gaps in holdings e.g. by location sampled, year/season, ocean depth, distance to land etc. and it is probably a fair question to ask what is the real "end point" for such projects, i.e. somewhere between "a record for every species" and "a record for every individual of every species", perhaps...

Global / synoptic niche modelling projects known to the author basically comprise Lifemapper for terrestrial species and AquaMaps for marine taxa (plus some freshwater). Lifemapper claims "data for over 100,000 species" but it is unclear whether this corresponds to the number of completed range maps available at this time, while AquaMaps has maps for over 22,000 species (fishes, marine mammals and invertebrates, with an emphasis on fishes) each of which has a point data map, a native range map clipped to where the species is believed to occur, an "all suitable habitat map" (the same unclipped) and a "year 2100 map" showing projected range changes under one global warming scenario. Mapping parameters can also be adjusted by the user using an interactive "create your own map" function, and stacking all completed maps together produces plots of computed ocean biodiversity plus the ability to undertake web-based "what [probably] lives here" queries for either all species or for particular groups. Between these two projects (which admittedly use different modelling methodologies but both should produce useful results as a first pass) the state of synoptic taxon modelling actually appears quite good, especially since there are ongoing workshops e.g. the recent AMNH/GBIF workshop Biodiversity Informatics and Modelling Species Distributions at which further progress and stumbling blocks can be discussed.

So, some questions arising:

  • Who might produce the best "single source" compendium of expert-reviewed species lists, for all taxa, extant and fossil, and how might this happen? (my guess: a consortium of Catalogue of Life + PaleoBioDB at some future point)
  • Will this contain links to the literature, at least citations but preferably as online digital documents where available? (CoL presently no, PaleoBioDB has human-readable citations only at present)
  • Will EOL increasingly claim the "TraitBank" space, and do a good enough job of it? (also bearing in mind that EOL is still an aggregator, not an original content creator, i.e. somebody still has to create it elsewhere)
  • Will OBIS and/or GBIF ever be "complete", and how will we know when we’ve got there (or, how complete is enough for what users might require)?
  • Same for niche modelling/predicted species maps: will all taxa eventually be covered, and will the results be (generally) reliable and useable (and at what scale); or, what more needs to be done to increase map quality and reliability.

Opinions, other insights welcome!

Tuesday, December 09, 2014

Guest post: Top 10 species names and what they mean

The following is a guest post by Bob Mesibov.

The i4Life project has very kindly liberated Catalogue of Life (CoL) data from its database, and you can now download the latest CoL as a set of plain text, tab-separated tables here.

One of the first things I did with my download was check the 'taxa.txt' table for species name popularity*. Here they are, the top 10 species names for animals and plants, with their frequencies in the CoL list and their usual meanings:

Animals


2732 gracilis = slender
2373 elegans = elegant
2231 bicolor = two-coloured
2066 similis = similar
1995 affinis = near
1937 australis = southern
1740 minor = lesser
1718 orientalis = eastern
1708 simplex = simple
1350 unicolor = one-coloured

Plants


1871 gracilis = slender
1545 angustifolia = narrow-leaved
1475 pubescens = hairy
1336 parviflora = few-flowered
1330 elegans = elegant
1324 grandiflora = large-flowered
1277 latifolia = broad-leaved
1155 montana = (of a) mountain
1124 longifolia = long-leaved
1102 acuminata = pointed

Take the numbers cum grano salis. The first thing I did with the CoL tables was check for duplicates, and they're there, unfortunately. It's interesting, though, that gracilis tops the taxonomists' poll for both the animal and plant kingdoms.

*With the GNU/Linux commands


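# taxa.txt fields used below: $11 = kingdom, $8 = taxonomic rank, $20 = species epithet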
awk -F"\t" '($11 == "Animalia") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
awk -F"\t" '($11 == "Plantae") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
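
and one rough version of the duplicate check mentioned above is equally short:

# Whole-record duplicates in the CoL dump:
sort taxa.txt | uniq -d | wc -l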

Tuesday, August 19, 2014

Guest post: Response to the discussion on Red List assessments of East African chameleons

This is a guest post by Angelique Hjarding in response to discussion on this blog about the paper below.
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Thank you for highlighting our recent publication and for the very interesting comments. We wanted to take the opportunity to address some of the issues brought up in both your review and from reader comments.

One of the most important issues that has been raised is the sharing of cleaned and vetted datasets. It has been suggested that the datasets used in our study be uploaded to a repository that can be cited and shared. This is possible for data that was downloaded from GBIF as they have already done the legwork to obtain data sharing agreements with the contributing organizations. So as long as credit is properly given to the source of the data, publicly sharing data accessed through GBIF should be acceptable. At the time the manuscript was submitted for publication, we were unaware of sites such as http://figshare.com where the data could be stored and shared with no additional cost to the contributor. The dataset used in the study that used GBIF data has now been made available in this way.
Angelique Hjarding. (2014). Endemic Chameleons of Kenya and Tanzania. Figshare. doi:10.6084/m9.figshare.1141858


It starts to get tricky with doing the same for the expert-vetted data. This dataset consists primarily of data gathered by the expert from museum records and published literature. So in this case it is not a question of why the expert doesn't share. The question is why the museum data and any additional literature records are not on GBIF already. As has been pointed out in our analysis (and confirmed by Rod) most of these museums do not currently have data sharing agreements with GBIF. Therefore, the expert who compiled the data does not have the permission of the museums to share their data second hand. Bottom line, all of the data used in this study that was not accessed through GBIF is currently available from the sources directly. That is, for anyone who wants to take the time to contact the museums for permission to use their data for research and to compile it. We also do not believe there is blame on museums that have not yet shared their data with forums such as GBIF. Mobilisation of data is an enormous task, and near impossible if funds and staff are not available. With regards to the particular comment regarding the lack of data sharing by NHML and other museums, we need to recognise what the task at hand would mean, and rather address ways such a monumental, and valuable, collection could be mobilised. A further issue should be raised around literature records that are not necessarily encapsulated in museum collections, but are buried in old and obscure manuscripts. To our knowledge, there is no way to mobilise those records either, because they are not attached to a specimen. Further, because there are no specimens, extreme care must be taken if such records were to be mobilised in order to ensure quality control. Again, assistance of expert knowledge would be highly beneficial, yet these things take time and require funds.

Another issue that was raised is why we didn't go directly to GBIF to fix the records. The point of our research was not to clean and update GBIF/museum data, but to evaluate the effect of expert vetting and museum data mobilisation in an applied conservation setting. As has been pointed out, the lead author was working at GBIF during the course of the research. An effort was made to provide a checklist of the updated taxonomy to GBIF at the time, but there was no GBIF mechanism for accepting updates. This appears to still be the case. In addition, two GBIF staff provided comments on the paper and were acknowledged for their input. We are happy to provide an updated taxonomy to help improve data quality, should a submission tool for updates be made available.
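As an aside, anyone can check how the GBIF backbone currently resolves a given name via GBIF's public species-matching service. A minimal sketch using the present-day GBIF API, with von Höhnel's chameleon as an illustrative name (substitute any taxon of interest):

# Ask the GBIF backbone taxonomy how it matches a scientific name;
# verbose=true also returns alternative candidate matches.
curl -s "https://api.gbif.org/v1/species/match?verbose=true&name=Trioceros%20hoehnelii"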

Finally, we would like to address the question: why use GBIF data if we know it needs some work before it can be used? We believe this is a very important debate, for at least two reasons. First, when data is made public, we believe many researchers work under the assumption that the data is ready for use with minimal further work: that the taxonomy is up to date, that the records are in the right place, and that the records provided relate to the name attached to them. Many of the papers that have used GBIF data have undertaken broad-scale macroecological analyses where, perhaps, the errors we have shown matter little. But some of these synthetic studies have also proposed that their results can be used for decision making by companies, which starts to raise concerns, especially if a company wants to know the exact species that its activities could impact. As we have shown, for chameleons at least, such advice would be hard to provide using the raw GBIF data.

Second, we are aware that there is another group of researchers using GBIF data who "know that to use GBIF's data you need to do a certain amount of previous work and run some tests, and if the data does not pass the tests, you don't use it." We are not sure which tests are run, and it would be useful to have these spelled out for broader debate, and potentially for the development of agreed protocols for data cleaning for various uses.
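To make the idea concrete, here is a minimal sketch of the kind of record-level tests such a protocol might start from, assuming a tab-delimited GBIF occurrence download (occurrence.txt) with the standard decimalLatitude and decimalLongitude columns; it is illustrative only, not an agreed protocol:

# Drop occurrence records with missing, (0,0), or out-of-range
# coordinates; column positions are read from the header row.
awk -F"\t" '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
{
  if ($(col["decimalLatitude"]) == "" || $(col["decimalLongitude"]) == "") next
  lat = $(col["decimalLatitude"]) + 0
  lon = $(col["decimalLongitude"]) + 0
  if (lat == 0 && lon == 0) next                        # suspicious 0,0 point
  if (lat < -90 || lat > 90 || lon < -180 || lon > 180) next
  print                                                 # record passes all tests
}' occurrence.txt > occurrence_cleaned.txt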

Our underlying reason for writing the paper was not to enter into a debate over which data are better, GBIF's or an expert-compiled dataset. We are extremely pleased that GBIF data exist and are freely available for all to use. This certainly has to be part of the future of 'better data for better decisions', but we should not simply accept that the data is the best we can get; we should instead look for ways to improve it, for all kinds of purposes. As such, we would like to suggest that the discussion focus some energy on ways to address the shortcomings of the present system, and also that the community who would benefit from the data address ways to assist the data holders in mobilising their information, in terms of accessing the resources required to digitise data, make it available, and maintain an updated taxonomy for their holdings. In an era of declining funding for museum-based taxonomy in many parts of the world, this is certainly a challenge that needs to be addressed.

We welcome further discussion as this is a very important topic, not only for conservation but also in terms of improved access to biodiversity knowledge, which is critical for many reasons.

Angelique Hjarding http://orcid.org/0000-0002-9279-4893
Krystal Tolley
Neil Burgess

Thursday, December 12, 2013

Guest post: response to "Putting GenBank Data on the Map"

The following is a guest post by David Schindel and colleagues, responding to the paper by Antonio Marques et al. in Science, doi:10.1126/science.341.6152.1341-a.

Marques, Maronna and Collins (1) rightly call on the biodiversity research community to include latitude/longitude data in database and published records of natural history specimens. However, they have overlooked an important signal that the community is moving in the right direction. The Consortium for the Barcode of Life (CBOL) developed a data standard for DNA barcoding (2) that was approved and implemented in 2005 by the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, ENA and DDBJ) and revised in 2009. All data records that meet the requirements of the data standard include the reserved keyword 'BARCODE'. The required elements include: (a) information about the voucher specimen from which the DNA barcode sequence was derived (e.g., species name, unique identifier in a specimen repository, country/ocean of origin); (b) a sequence from an approved gene region with minimum length and quality; and (c) primer sequences and the forward and reverse trace files. Participants in the workshops that developed the data standard decided to include latitude and longitude as strongly recommended elements, but not as strict requirements, for two reasons. First, many voucher specimens from which BARCODE records are generated may have been collected before GPS devices were available. Second, barcoding projects such as the Barcode of Wildlife Project (4) are concentrating on rare and endangered species; publishing the GPS coordinates of collecting localities would facilitate illegal collecting and trafficking that could contribute to biodiversity loss.

The BARCODE data standard is promoting precisely the trend toward georeferencing called for by Marques, Maronna and Collins. Table 1 shows that there are currently 346,994 BARCODE records in INSDC (3), of which 83% include latitude/longitude data. Despite georeferencing not being a required element of the data standard, this level is much higher than for all records of the cytochrome c oxidase I gene (COI, the BARCODE region), of 16S rRNA, or of cytochrome b (cytb), another mitochondrial region that was used for species identification prior to the growth of barcoding. Data are also presented on the numbers and percentages of data records that include information on the voucher specimen from which the nucleotide sequence was obtained. In an increasing number of cases, these voucher specimen identifiers in INSDC are hyperlinked to the online specimen data records in museums, herbaria and other biorepositories. Table 2 provides the same data for the time interval used in the Marques et al. letter (1). These tables indicate the clear effect that the BARCODE data standard is having on the community's willingness to provide more complete data documentation.
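For readers unfamiliar with how this metadata appears in practice, the georeference and voucher fields are ordinary source-feature qualifiers in the GenBank flatfile, and can be inspected for any record via NCBI's E-utilities. A minimal sketch (ACCESSION is a placeholder for any BARCODE record's accession number):

# Fetch one GenBank flatfile and show its specimen_voucher, lat_lon
# and country qualifiers (replace ACCESSION with a real accession).
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=ACCESSION&rettype=gb&retmode=text" | grep -E '/(specimen_voucher|lat_lon|country)'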

Table 1. Summary of metadata for GPS coordinates and voucher specimens associated with all data records.
Categories of data records | Total number of GenBank records | With latitude/longitude | With voucher or culture collection specimen IDs
BARCODE | 347,349 | 286,975 (83%) | 347,077 (~100%)
All COI | 751,955 | 365,949 (49%) | 531,428 (71%)
All 16S | 4,876,284 | 461,030 (9%) | 138,921 (3%)
All cytb | 239,796 | 7,776 (3%) | 84,784 (35%)

Table 2. Summary of metadata for GPS coordinates and voucher specimens associated with data records submitted between 1 July 2011 and 15 June 2013.
Categories of data records | Total number of GenBank records | With latitude/longitude | With voucher or culture collection specimen IDs
BARCODE | 160,615 | 132,192 (82%) | 160,615 (100%)
All COI | 302,507 | 166,967 (55%) | 231,462 (77%)
All 16S | 1,535,364 | 232,567 (15%) | 49,150 (3%)
All cytb | 74,631 | 2,920 (4%) | 24,386 (33%)
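These counts will, of course, continue to grow. As a rough sketch of how they can be reproduced at any time, NCBI's E-utilities can return the current total of records carrying the reserved BARCODE keyword:

# Count nucleotide records with the reserved BARCODE keyword;
# rettype=count returns just the total rather than the record IDs.
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=barcode%5Bkeyword%5D&rettype=count"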


The DNA barcoding community's data standard is demonstrating two positive trends: better documentation of specimens in natural history collections, and new connectivity between databases of species occurrences and DNA sequences. We believe that these trends will become standard practice in the coming years as more researchers, funders, publishers and reviewers acknowledge the value of, and begin to enforce compliance with, the BARCODE data standard and related minimum-information standards for marker genes (5).

DAVID E. SCHINDEL1, MICHAEL TRIZNA1, SCOTT E. MILLER1, ROBERT HANNER2, PAUL D. N. HEBERT2, SCOTT FEDERHEN3, ILENE MIZRACHI3
  1. National Museum of Natural History, Smithsonian Institution, Washington, DC 20013–7012, USA.
  2. University of Guelph, Ontario, Canada
  3. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

References

  1. Marques, A. C., Maronna, M. M., & Collins, A. G. (2013). Putting GenBank Data on the Map. Science, 341(6152), 1341. doi:10.1126/science.341.6152.1341-a
  2. Consortium for the Barcode of Life, http://www.barcodeoflife.org/sites/default/files/DWG_data_standards-Final.pdf (2009)
  3. Data in Tables 1 and 2 were drawn from GenBank (http://www.ncbi.nlm.nih.gov/genbank/) [data as of 1 October 2013]
  4. Barcode of Wildlife Project, http://www.barcodeofwildlife.org (2013)
  5. Yilmaz, P., Kottmann, R., Field, D., Knight, R., Cole, J. R., Amaral-Zettler, L., Gilbert, J. A., et al. (2011). Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29(5), 415–420. doi:10.1038/nbt.1823

Thursday, July 12, 2012

Dimly lit taxa - guest post by Bob Mesibov

The following is a first for iPhylo: a guest post by Bob Mesibov.

Rod Page introduced 'dark taxa' here on iPhylo in April 2011. He wrote:

The bulk of newly added taxa in GenBank are what we might term "dark taxa", that is, taxa that aren't identified to a known species. This doesn't necessarily mean that they are species new to science, we may already have encountered these species before, they may be sitting in museum collections, and have descriptions already published. We simply don't know. As the output from DNA barcoding grows, the number of dark taxa will only increase, and macroscopic biology starts to look a lot like microbiology.

Rod suggested that 'quite a lot' of biology can be done without taxonomic names. For the dark taxa in GenBank, that might well mean doing biology without organisms – a surprising thought if you're a whole-organism biologist.

Non-taxonomists may be surprised to learn that a lot of taxonomy is also done, in fact, without taxonomic names. Not only is there a 'dark taxa' gap between putative species identified genetically and Linnaean species described by specialists, there's a 'dimly lit taxa' gap between the diversity taxonomists have already discovered, and the diversity they've named.

Dimly lit taxa range from genera and species given code names by a specialist or a group of collaborators, and listed by those codes in publications and databases, to potential type specimens once seen and long remembered by a specialist who plans to work them up in future, time and workload permitting.

In that phrase 'time and workload permitting' is a large part of the explanation for dimly lit taxa. Over the past month I created 71 species of this kind myself. Each has been code-named, diagnostically imaged, databased and placed in code-labelled bottles on museum shelves. The relevant museums have been given digital copies of the images and data.

The 71 are 'species-in-waiting'. They aren't formally named and described, but specialists like myself can refer to the images and data for identifying new specimens, building morphological and biogeographical hypotheses, and widening awareness of diversity in the group to which the 71 belong.

'Time and workload permitting'. Many of the 71 are known only from poor-quality or fragmented museum specimens from which important morphological data, let alone sequences, cannot be obtained. Fresh specimens are needed, and fieldwork is neither quick nor easy. In my special corner of zoology, as in most such corners in zoology and botany, the widespread and abundant species are all, or nearly all, named. The unnamed rump consists of a huge diversity of geographically restricted and uncommon species. There are more than 71 in that group of mine; those are just the rare species I know about, so far.

'Time and workload permitting'. A non-taxonomist might ask, 'Why don't you just name and describe the 71 briefly, so that the names are at least available, and the gap between what's known and what's named is narrowed?' The answer is simple: inadequate descriptions are the bane of taxonomy. There are hundreds of species in my special group that were named and inadequately described long ago, and which wind up on checklists of names as 'nomen dubium' and 'incertae sedis'. Clearing up the mysteries means locating the types (which hopefully still exist) and studying them. That slow and tedious study would better have been done by the first describer.

Cybertaxonomic tools can help bring dimly lit taxa into full light, but not much. The rate-limiting steps in lighting up taxa are in the minds and lives of human taxonomists coping with the huge and bewilderingly complex diversity of life. It's not the tools used after the observing and thinking are done; it's the observing and thinking.

In their article 'Ramping up biodiversity discovery via online quantum contributions' (http://dx.doi.org/10.1016/j.tree.2011.10.010), Maddison et al. argue that the pace of naming and description can be increased if information about what I've called dimly lit taxa is publicly posted, piece by piece, 'publish as you go', on the Internet. In my case, I would upload images and data for my 71 'species-in-waiting' to suitable sites and make them freely available.

Excited by these discoveries, amateurs and professionals would rush to search for fresh specimens. Specialists would drop whatever else they were doing, borrow the existing specimens of the 71 from their repositories and do careful inventories of the morphological features I haven't documented. Aroused from their humdrum phylogenetic analyses of other organisms, molecular phylogeny labs would apply for extra funding to work on my 71 dimly lit taxa. In no time at all, a proud team of amateurs and specialists would be publishing the results of their collaboration, with 71 names and descriptions.

Shortly afterwards, flocks of pigs would slowly circle the 71 type localities, flapping their wings in unison.

Memo to Maddison et al. and other would-be reformers: the rate of taxonomic discovery and documentation is very largely constrained by the supply of taxonomists. You want more names? Find more namers.