Showing posts with label data mining. Show all posts

Sunday, November 23, 2014

Automatically extracting possible taxonomic synonyms from the literature

Quick notes on an experimental feature I've added to BioNames. It attempts to identify possible taxonomic synonyms by extracting pairs of names with the same species name that appear together on the same page of text. The text could be full text for an open access article, OCR text from BHL, or the title and abstract for an article. For example, the following paper creates a new combination, Hadwenius tursionis, for a parasite of the bottlenose dolphin. This name is a synonym of Synthesium tursionis.

Fernández, M., Balbuena, J. A., & Raga, J. A. (1994, July). Hadwenius tursionis (Marchi, 1873) n. comb. (Digenea, Campulidae) from the bottlenose dolphin Tursiops truncatus (Montagu, 1821) in the western Mediterranean. Syst Parasitol. Springer Science + Business Media. doi:10.1007/bf00009519

The taxonomic position of Synthesium tursionis (Marchi, 1873) (Digenea, Campulidae) is revised, based on material from 147 worms from four bottlenose dolphins Tursiops truncatus stranded off the Comunidad Valenciana (Spanish western Mediterranean). The species is transferred to Hadwenius, as H. tursionis n. comb., and characterised by a high length/width ratio of the body, spinose cirrus and unarmed metraterm. Synthesium, a monotypic genus, becomes a synonym of Hadwenius. The intraspecific variation of some morphological traits is briefly discussed.

If we extract taxonomic names from the title and abstract we have the pair (Synthesium tursionis, Hadwenius tursionis). If we do this across all the text currently in BioNames, we discover other pairs of names that include Synthesium tursionis. Joining these pairs together creates a graph of co-occurring names that may be synonyms (see Synthesium tursionis).

[Graph of co-occurring names: Synthesium tursionis, Hadwenius tursionis, Dicrocoelium tursionis, Distomum tursionis, Orthosplanchnus tursionis, Synthesium (Orthosplanchnus) tursionis]

These graphs are computed automatically, and there is inevitably scope for error. Taxa that are not synonyms may have the same specific name (e.g., parasites and hosts may have the same specific name), and some of the names extracted from the text may be erroneous. At the same time, anecdotally it is a useful way to discover links between names. Even better, this approach means that we have the associated evidence for each pair of names. The interface in BioNames lists the references that contain the pairs of names, so you can evaluate the evidence for synonymy. It would be useful to try and evaluate the automatically detected synonyms by comparisons with existing lists of synonyms (e.g., from GBIF).
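
As a rough illustration of the pairing step (a minimal sketch, not the BioNames code), the following assumes a name finder has already produced a list of names for each page or reference. The function name `cooccurrence_graph` and the input structure are hypothetical. It keeps the page identifiers for each pair, so the evidence for a possible synonym travels with the edge.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(pages):
    """Map each pair of names sharing a species epithet to the pages they share.

    `pages` maps a page or reference identifier to the names found on it.
    """
    edges = defaultdict(set)
    for page_id, names in pages.items():
        # Group the names on this page by their last word (the species epithet).
        by_epithet = defaultdict(set)
        for name in names:
            by_epithet[name.split()[-1]].add(name)
        # Every pair within a group is a candidate synonym pair.
        for group in by_epithet.values():
            for a, b in combinations(sorted(group), 2):
                edges[(a, b)].add(page_id)
    return dict(edges)


# Names taken from the title and abstract quoted above.
pages = {
    "doi:10.1007/bf00009519": [
        "Hadwenius tursionis",
        "Synthesium tursionis",
        "Tursiops truncatus",
    ],
}
graph = cooccurrence_graph(pages)
```

Matching on the epithet alone is what lets hosts and parasites with the same specific name slip through, which is why keeping the supporting references matters.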

Thursday, January 26, 2012

Extracting museum specimen codes from text

Quick note about a tool I've cobbled together as part of the phyloinformatics course, which addresses a long-standing need I and others have to extract specimen codes from text. I've had this code kicking around for a while (as part of various never-finished data mining projects), but never got around to releasing it, until now. It is very crude (basically a bunch of regular expressions), and there's a lot which could be done to improve it (not least starting with a complete list of museum specimen codes, rather than just those I've come across in, say, Zootaxa and BioStor).

You can try the tool at http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php. Paste in some text and it will try and extract museum codes. The tool tries to handle ranges of specimens (e.g., MHNSM 1808-09), and some of the more common specimen numbering schemes.
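
To give a flavour of the regular-expression approach, here is a minimal sketch. It is not the tool's actual pattern set. It assumes a code is a short uppercase acronym followed by a number, and that a range like "1808-09" means the trailing digits of the first number are replaced. The pattern, the function name, and the acronym "XYZ" in the sample text are all invented for illustration.

```python
import re

# Assumed shape: 2-6 uppercase letters, a space, a number, optional "-NN" range end.
SPECIMEN = re.compile(r"\b([A-Z]{2,6}) (\d+)(?:-(\d+))?\b")


def extract_specimens(text):
    """Return specimen codes found in text, expanding simple numeric ranges."""
    codes = []
    for acronym, start, end in SPECIMEN.findall(text):
        if not end:
            codes.append(f"{acronym} {start}")
            continue
        # A short range end such as "09" replaces the last digits of the start.
        if len(end) < len(start):
            end_full = start[: len(start) - len(end)] + end
        else:
            end_full = end
        first, last = int(start), int(end_full)
        if last < first:
            # Not a plausible range, so keep the raw string untouched.
            codes.append(f"{acronym} {start}-{end}")
            continue
        for number in range(first, last + 1):
            # Pad to the width of the original number to preserve leading zeros.
            codes.append(f"{acronym} {number:0{len(start)}d}")
    return codes


sample = "Holotype MHNSM 1808-09; paratype XYZ 42."
codes = extract_specimens(sample)
```

A pattern this loose will also match things that are not specimen codes, which is one reason a proper list of museum acronyms would help.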

Comments welcome. If you are looking for a source of text, papers in ZooKeys or Zootaxa are a good place to start (especially papers on vertebrates, where specimen numbers are often used). BioStor is also a good source: if you're looking at a paper in BioStor, click on the "Text" link to get the OCR text for an article and paste that into the form. For example, the text for Systematics of the Bufo coccifer complex (Anura: Bufonidae) of Mesoamerica is available at http://biostor.org/reference/97426.text.

The extraction tool can also be called as a web service using POST to get back the results in JSON.
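
A call to the service might look something like the sketch below. The endpoint is the address given above, but the form field name `text` and the shape of the JSON response are assumptions, as neither is documented in this post. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Address of the tool as given in the post.
SERVICE_URL = "http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php"


def build_request(text, field="text"):
    """Build a form-encoded POST request; the field name is a guess."""
    body = urllib.parse.urlencode({field: text}).encode("utf-8")
    return urllib.request.Request(SERVICE_URL, data=body, method="POST")


def extract_codes(text):
    """Send the text to the service and decode whatever JSON comes back."""
    with urllib.request.urlopen(build_request(text), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("MHNSM 1808-09")
```

Building the request separately from sending it makes it easy to check the payload before touching the network.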

Friday, March 25, 2011

Visualising the symbiome: hosts, parasites, and the Tree of Life

Back in 2006 in a short post entitled "Building the encyclopedia of life" I wrote that GenBank is a potentially rich source of information on host-parasite relationships. Often sequences of parasites will include information on the name of the host (the example I used was sequence AF131710 from the platyhelminth Ligophorus mugilinus, which records the host as the Flathead mullet Mugil cephalus).

I've always wanted to explore this idea a bit more, and have finally made a start, in part inspired by the recent VIZBI 2011 meeting. I've grabbed a large chunk of GenBank, mined the sequences for host records, and created some simple visualisations of what I'm terming (with tongue firmly in cheek) the "symbiome". Jonathan Eisen will not be happy, but I need a word that describes the complete set of hosts, mutualists, symbionts with which an organism is associated, and "symbiome" seems appropriate.

Human symbiome
To illustrate the idea, below is the human "symbiome". This diagram shows all the taxa in GenBank arranged in a circle, with lines connecting those organisms that have DNA sequences where humans are recorded as their host.

[Figure: human symbiome]

At a glance, we have a lot of bacteria (the gray bar with E. coli) and fungi (blue bar with Yeast), and a few nematodes and arthropods.

Fig tree symbiome
Next up are organisms collected from fig trees (genus Ficus).

[Figure: fig tree (Ficus) symbiome]
Fig trees have wasp pollinators (the dark line landing near the honey bee Apis), as well as nematodes (dark line landing near Caenorhabditis elegans). There are also some associations with fungi and other arthropods.

Which taxa host insects?
Next up is a plot of all associations involving insects and a host.

[Figure: hosts of insects]
The diagram is dominated by insect-flowering plant interactions, followed by insect-vertebrate associations (most likely bird and mammal lice).

Which taxa are hosted by insects?
We can reverse the question and ask what organisms are hosted by insects:

[Figure: taxa hosted by insects]
Lots of associations between insects and fungi, as well as bacteria, and a few other organisms, such as nematodes, and Plasmodium (the organism which causes malaria).

Frog symbiome
Lastly, below is the symbiome of frogs. "Worms" feature prominently, as well as the fungus that causes chytridiomycosis.

[Figure: frog symbiome]

How the visualisation was made

The symbiome visualisations were made as follows. First, DNA sequences were downloaded from EMBL and run through a script that extracted as much metadata as possible, including the contents of the host field (where present). I then took the NCBI taxonomy and generated an ordered list of taxa by walking the tree in postorder; a taxon's place in that list determines where it lies on the circumference of the circle. Pairs of taxa in an association are connected by a quadratic Bezier curve. The illustration was created using SVG.
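
The layout step can be sketched roughly as follows. This is an illustration rather than the script used for the figures: the tree is a toy one (not the NCBI taxonomy), every node is given a position, and using the circle's centre as the Bezier control point is an assumed choice. All function names are hypothetical.

```python
import math


def postorder(tree, root):
    """Return nodes in postorder; `tree` maps a node to its list of children."""
    order = []

    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


def circle_positions(order, cx=200.0, cy=200.0, radius=180.0):
    """Spread the ordered taxa evenly around a circle."""
    positions = {}
    for i, taxon in enumerate(order):
        angle = 2 * math.pi * i / len(order)
        positions[taxon] = (cx + radius * math.cos(angle),
                            cy + radius * math.sin(angle))
    return positions


def association_path(positions, a, b, cx=200.0, cy=200.0):
    """SVG path joining two taxa with a quadratic Bezier through the centre."""
    (x1, y1), (x2, y2) = positions[a], positions[b]
    return (f'<path d="M {x1:.1f} {y1:.1f} Q {cx:.1f} {cy:.1f} {x2:.1f} {y2:.1f}" '
            'fill="none" stroke="black"/>')


# Toy tree for illustration only.
tree = {
    "cellular organisms": ["Metazoa", "Fungi"],
    "Metazoa": ["Ligophorus mugilinus", "Mugil cephalus"],
}
order = postorder(tree, "cellular organisms")
positions = circle_positions(order)
path = association_path(positions, "Ligophorus mugilinus", "Mugil cephalus")
```

Because postorder visits all of a taxon's descendants consecutively, related taxa end up next to each other on the circle, so lines from one clade tend to bundle together.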


Next steps
There are several ways this visualisation could be improved. It's based on only a subset of data (I haven't run all of the sequence databases through the parser yet), and the matching of host taxa is based on exact string matching. All manner of weird and wonderful things get entered in the host field, so we'll need some more sophisticated parsing (see "LINNAEUS: A species name identification system for biomedical literature" doi:10.1186/1471-2105-11-85 for a more general discussion of this issue).
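
To show why exact matching falls short, here is a small sketch under stated assumptions. It pulls `/host="..."` qualifiers out of flat-file text and looks each value up in a name dictionary. The sample lines are made up for illustration (they are not copied from a real record), and the identifier "taxon-1" is a placeholder rather than a real NCBI taxon id.

```python
import re

# The host qualifier as it appears in sequence flat files: /host="..."
HOST_QUALIFIER = re.compile(r'/host="([^"]+)"')


def host_values(flatfile_text):
    """Return host qualifier values with internal whitespace normalised."""
    return [" ".join(value.split()) for value in HOST_QUALIFIER.findall(flatfile_text)]


def match_host(host, name_to_taxon):
    """Exact string match: returns None for anything not spelled as in the lookup."""
    return name_to_taxon.get(host)


# Illustrative input only.
sample = '''
     /host="Mugil cephalus"
     /host="flathead mullet (Mugil cephalus)"
'''
name_to_taxon = {"Mugil cephalus": "taxon-1"}

hosts = host_values(sample)
matches = [match_host(host, name_to_taxon) for host in hosts]
```

The second value names the same fish but fails the lookup, which is the kind of case a species name recogniser would be needed for.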

The visualisation is fairly crude at this stage. Circle plots like this are fairly simple to create, and pop up in all sorts of situations (e.g., RNA secondary structure methods, which I did some work on years ago). Of course, Circos would be an obvious tool to use to create the visualisations, but the overhead of installing it and learning how to use it meant I took a shortcut and wrote some SVG from scratch.

Although I've focussed on GenBank as a source of data, this visualisation could also be applied to other data. I briefly touched on this in Tag trees: displaying the taxonomy of names in BHL, where a page in the Biodiversity Heritage Library contains the names of a flea and its mammalian hosts. I think these circle plots would be a great way to highlight possible ecological associations mentioned in a text.