Showing posts with label Graphviz. Show all posts

Friday, May 28, 2021

Maximum entropy summary trees to display higher classifications

How to cite: Page, R. (2021). Maximum entropy summary trees to display higher classifications https://doi.org/10.59350/af01t-6sw74

A challenge in working with large taxonomic classifications is how you display them to the user, especially if the user probably doesn't want all the gory details. For example, the Field Guide app to Victorian Fauna has a nice menu of major animal groups:

This includes both taxonomic and ecological categories (e.g., terrestrial, freshwater, etc.) and greatly simplifies the animal tree of life, but it is a user-friendly way to start browsing a larger database of facts about animals. It would be nice if we could automate constructing such lists, especially for animal groups where the choice of what to display might not be obvious (everyone wants to see birds, but which insect groups would you prioritise?).

One way to help automate these sorts of lists is to use summary trees (see also Karloff, H., & Shirley, K. E. (2013). Maximum Entropy Summary Trees. Computer Graphics Forum, 32(3pt1), 71–80. doi:10.1111/cgf.12094). A summary tree takes a large tree and produces a small summary with k nodes, where k is a number that you supply. In other words, if you want your summary to have 10 nodes then k=10. The diagram below summarises an organisation chart for 43,134 employees.

Summary trees show only a subset of the nodes in the complete tree. All the nodes with a given parent that aren't displayed get aggregated into a newly created "others" node that is attached to that parent. Hence the summary tree alerts the user that there are nodes which exist in the full tree but which aren't shown.
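To make the "others" aggregation concrete, here is a minimal Python sketch of that one step for a single parent. It is not the maximum entropy algorithm itself (which decides which nodes to keep); the `keep` set is simply given, and the function name and the weights in the example are made up for illustration.

```python
def collapse_siblings(children, keep):
    """Show only the children named in `keep`; fold the rest into one node.

    children: list of (label, weight) pairs sharing the same parent.
    keep: set of labels that should stay visible.
    """
    shown = [(label, w) for label, w in children if label in keep]
    hidden = [(label, w) for label, w in children if label not in keep]
    if hidden:
        # The aggregate node carries the count and combined weight of what it hides.
        shown.append((f"{len(hidden)} others", sum(w for _, w in hidden)))
    return shown


# Placeholder weights, not real species counts.
insects = [("Coleoptera", 5), ("Diptera", 4), ("Odonata", 1), ("Orthoptera", 1)]
collapsed = collapse_siblings(insects, keep={"Coleoptera", "Diptera"})
print(collapsed)
```

The "2 others" entry that results is what tells the reader that two further orders exist under the same parent but were left out of the summary.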

Code for maximum entropy summary trees is available in C and R from https://github.com/kshirley/summarytrees, so I've been playing with it a little (I don't normally use R but there was little choice here). As an example I created a simple tree for animals, based on the Catalogue of Life. I took a few phyla and classes and built a tree as a CSV file (see the gist). The file lists each node (uniquely numbered), its parent node (the parent of the root of the tree is "0"), a label, and a weight. For an internal node the weight is always 0; for a leaf the weight can be assigned in various ways. By default you could assign each leaf a weight of 1, but if the "leaf" node represents more than one thing (for example, the class Mammalia) then you can give it the number of species in that class (e.g., 5939). You could also assign weights based on some other measure, such as "popularity". In the gist I got bored and only added species counts for a few taxa; everything else was set to 1.
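As a rough illustration of that file layout, the Python sketch below reads a tiny tree in the node, parent, label, weight format and totals the weight under each node. The rows are hypothetical (only the Mammalia count of 5939 comes from the text above; the other weights are the default of 1), and `subtree_weights` is a helper invented for this example, not part of the summarytrees package.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the layout described above: the root's parent is 0,
# internal nodes have weight 0, leaves carry the weight.
CSV_TEXT = """node,parent,label,weight
1,0,Animalia,0
2,1,Chordata,0
3,2,Mammalia,5939
4,2,Aves,1
5,1,Porifera,1
"""


def subtree_weights(rows):
    """Return {node: own weight plus the weight of all its descendants}."""
    children = defaultdict(list)
    weight = {}
    for row in rows:
        node = int(row["node"])
        weight[node] = int(row["weight"])
        children[int(row["parent"])].append(node)

    totals = {}

    def total(node):
        totals[node] = weight[node] + sum(total(c) for c in children[node])
        return totals[node]

    for root in children[0]:  # nodes whose parent is 0 are roots
        total(root)
    return totals


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
totals = subtree_weights(rows)
print(totals)
```

Totals like these show how much of the tree each branch accounts for under a given weighting, which is the information a summary has to work from.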

I then loaded the tree into R and found a summary tree for k=30 (the script is in the gist):

This doesn't look too bad (note as I said above, I didn't fill in all the actual species counts because reasons). If I wanted to convert this into a menu such as the one the Field Guide app to Victorian Fauna uses I would simply list the leaf nodes in order, skipping over those labelled "n others", which would give me:

  • Mammalia
  • Amphibia
  • Reptilia
  • Aves
  • Actinopterygii
  • Hemiptera
  • Hymenoptera
  • Lepidoptera
  • Diptera
  • Coleoptera
  • Arachnida
  • Acanthocephala
  • Nemertea
  • Rotifera
  • Porifera
  • Platyhelminthes
  • Nematoda
  • Mollusca

These 18 taxa are not a bad starting point for a menu, especially if we added pictures from PhyloPic to liven it up. There are probably a couple of animal groups that could be added to make it a bit more inclusive.
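The menu-building step (list the leaves in order, skip the "n others" nodes) is simple enough to sketch in Python. The row layout of (node, parent, label) and the sample tree are assumptions made for this example; the real summary.csv written by the R script may have different columns.

```python
import re

OTHERS = re.compile(r"^\d+ others$")


def menu_from_summary(rows):
    """Return leaf labels in file order, skipping aggregated 'n others' nodes.

    rows: list of (node, parent, label) tuples describing a summary tree.
    """
    parents = {parent for _, parent, _ in rows}
    return [
        label
        for node, _, label in rows
        if node not in parents and not OTHERS.match(label)
    ]


# Hypothetical summary tree, much smaller than the k=30 one above.
rows = [
    (1, 0, "Animalia"),
    (2, 1, "Chordata"),
    (3, 2, "Mammalia"),
    (4, 2, "Aves"),
    (5, 2, "3 others"),
    (6, 1, "Mollusca"),
]
menu = menu_from_summary(rows)
print(menu)
```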

Because the technique is automated and fast, it would be straightforward to create submenus for major taxa, with the added advantage that you don't need to make decisions based on whether you know anything about that taxonomic group: it can be driven entirely by species counts (for example). We could also use other measures for weights, such as number of Google search hits, size of pages on Wikipedia, etc. So far I've barely scratched the surface of what could be done with this tool.

P.S. The R code is:

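# Install the summarytrees package from GitHub (needs devtools)
library(devtools)
install_github("kshirley/summarytrees", build_vignettes = TRUE)

library(summarytrees)

# Read the tree: one row per node, with columns node, parent, label, weight
data <- read.table('/Users/rpage/Development/summarytrees/animals.csv', header=TRUE, sep=",")

# Build summary trees of up to K = 30 nodes with the greedy algorithm
g <- greedy(node = data[, "node"],
            parent = data[, "parent"],
            weight = data[, "weight"],
            label = data[, "label"],
            K = 30)

# Save the 30-node summary tree
write.csv(g$summary.trees[[30]], '/Users/rpage/Development/summarytrees/summary.csv')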

The gist has the data file, and a simple PHP program to convert the output into a dot file to be viewed with GraphViz.
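The PHP converter itself lives in the gist and isn't reproduced here. As a stand-in, this is a minimal Python sketch of the same idea: turn rows of (node, parent, label) into a dot file. The row layout is an assumption for illustration rather than the exact columns of summary.csv.

```python
def to_dot(rows):
    """Render (node, parent, label) rows as a Graphviz digraph."""
    lines = ["digraph summary {", "  node [shape=box, fontsize=10];"]
    for node, _, label in rows:
        escaped = label.replace('"', '\\"')  # keep quotes in labels legal in dot
        lines.append(f'  {node} [label="{escaped}"];')
    for node, parent, _ in rows:
        if parent != 0:  # parent 0 marks the root, which has no incoming edge
            lines.append(f"  {parent} -> {node};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-node summary tree.
dot_source = to_dot([(1, 0, "Animalia"), (2, 1, "Chordata"), (3, 1, "12 others")])
print(dot_source)
```

Saving the output as, say, summary.dot and running `dot -Tpng summary.dot -o summary.png` would then draw the tree.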

Wednesday, February 22, 2012

Clustering strings

Revisiting an old idea (Clustering taxonomic names) I've added code to cluster strings into sets of similar strings to the phyloinformatics course site.

This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters. For example, given the names


Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821


the service finds three clusters, displayed here using Google images:



(Note to self, investigate canviz as an alternative for displaying graphviz graphs.)

If you are curious, these strings are taxonomic authorities associated with the name Helicella, and based on this clustering there are three taxonomic names, one of which has three different variations of the author's name.
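The post doesn't spell out how the service measures similarity, so the following Python sketch is only one plausible approach: compare every pair of strings with `difflib.SequenceMatcher`, link pairs whose similarity ratio reaches a threshold, and report the connected components as clusters. The threshold of 0.8 is an arbitrary assumption, and with it the abbreviation "Fer." is too short to join the other Ferussac spellings, so the result need not match what the service returns.

```python
from difflib import SequenceMatcher


def cluster_strings(strings, threshold=0.8):
    """Group strings into clusters of transitively similar strings."""
    parent = list(range(len(strings)))  # union-find forest over string indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            ratio = SequenceMatcher(None, strings[i], strings[j]).ratio()
            if ratio >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


names = ["Ferrusac 1821", "Bonavita 1965", "Ferussa 1821",
         "Fer.", "Lamarck 1812", "Ferussac 1821"]
clusters = cluster_strings(names)
for cluster in clusters:
    print(cluster)
```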

Tuesday, September 20, 2011

Orwellian metadata: making journals disappear

I've been spending a lot of time recently mapping bibliographic citations for taxonomic names to digital identifiers (such as DOIs). This is tedious work at the best of times (despite lots of automation), but it is not helped by the somewhat Orwellian practices of some publishers. Occasionally when an established journal gets renamed the publisher retrospectively applies that name to the previous journal. For example, in 2000 the journal Entomologica Scandinavica (ISSN 0013-8711) became Insect Systematics & Evolution (ISSN 1399-560X):


(diagram based on WorldCat xISSN history tool, rendered using Google Charts.)

Content for both Entomologica Scandinavica and Insect Systematics & Evolution is available from Ingenta's web site, but every article is listed as being in Insect Systematics & Evolution, and this is reflected in the metadata CrossRef has for each DOI.

For example, the paper
Andersen, N.M. & P.-p. Chen, 1993. A taxonomic revision of pondskater genus Gerris Fabricius in China, with two new species (Hemiptera: Gerridae). – Entomologica Scandinavica 24: 147-166

has the DOI doi:10.1163/187631293X00262 which resolves to a page saying this article was published in Insect Systematics & Evolution. The XML for the DOI says the same thing:



<issn type="print">1399560X</issn>
<issn type="electronic">1876312X</issn>
<journal_title>Insect Systematics & Evolution</journal_title>


In one sense this is no big deal. If you know the DOI then that's all you need to use to refer to the article (and the sooner we abandon fussing with citation styles and just use DOIs the better).

But if you haven't yet found the DOI then this is a problem, because if I search CrossRef using the original journal name (Entomologica Scandinavica) I get nothing. As far as CrossRef is concerned the DOI doesn't exist. If, however, I happen to know that Entomologica Scandinavica is now Insect Systematics & Evolution, I rewrite the query and I retrieve the DOI.

It's bad enough dealing with taxonomic name changes without having to deal with journal name changes as well! It would be great if publishers didn't indulge in wholesale renaming of old journals, or if CrossRef had a mechanism (perhaps based on WorldCat's xISSN History Visualization Tool) to handle retrospectively renamed journals.
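In the meantime, the query rewriting described above can be done on the client side with a lookup table of renamed journals. The Python sketch below is hypothetical throughout: the table holds only the one rename discussed in this post, and `journal_names_to_try` is an invented helper that returns the names to search under. It doesn't call CrossRef.

```python
# Hypothetical table: lower-cased former title -> title the publisher now uses.
RENAMED_JOURNALS = {
    "entomologica scandinavica": "Insect Systematics & Evolution",
}


def journal_names_to_try(name):
    """Return the cited journal name, followed by its current name if renamed."""
    names = [name]
    current = RENAMED_JOURNALS.get(name.strip().lower())
    if current and current != name:
        names.append(current)
    return names


print(journal_names_to_try("Entomologica Scandinavica"))
```

A search would first use the name as cited and fall back to the later title if nothing is found.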

Tuesday, April 18, 2006

Render DOT files on the fly on Mac OS X

Webdot isn't available for Mac OS X, and as I use an iBook running Panther for all my development work (before moving to a Linux box to host the results) I wanted to have the same functionality on my iBook. This can be achieved by hacking a simplified version of webdot. This Perl script creates a virtual web browser to serve the image. I've simplified things somewhat, but it works.

The two things you need to set in the script dot.cgi are the path to your copy of the Graphviz program dot, and a directory where dot can write temporary files (I use /tmp).

You can get a copy of the script here.

To render an image of a graph on the fly you insert an img tag with the src attribute comprising:


  1. the path to the CGI script, e.g. /cgi-bin/dot.cgi

  2. a '/' delimiter

  3. the URL of the graph file, e.g. http://localhost/~rpage/dot/leda.7.46.gml.dot

  4. the extension of the image format you want (e.g., png, svg, etc.) preceded by a dot "."



As an example, here is the dot file http://localhost/~rpage/dot/leda.dot as a PNG image, using the HTML:

<img src="/cgi-bin/dot.cgi/http://localhost/~rpage/dot/leda.dot.png" />
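The four parts of the src attribute can also be assembled programmatically. This small Python sketch only builds the string from the pieces listed above; `webdot_src` is a hypothetical helper, not part of webdot or dot.cgi, and the URL is written exactly as it appears in this post.

```python
def webdot_src(cgi_path, graph_url, image_format):
    """Join the CGI path, a '/' delimiter, the graph URL and '.<format>'."""
    return f"{cgi_path}/{graph_url}.{image_format}"


src = webdot_src(
    "/cgi-bin/dot.cgi",
    "http://localhost/~rpage/dot/leda.dot",
    "png",
)
print(f'<img src="{src}" />')
```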



The source file for this graph looks like this:

graph G {
node [width=.2,height=.2,fontsize=10];
edge [fontsize=10,len=2];
0 [label="0"];
1 [label="3"];
2 [label="4"];
3 [label="5"];
4 [label="6"];
5 [label="7"];
0 -- 1 [label="13"];
0 -- 2 [label="12"];
0 -- 5 [label="8"];
0 -- 4 [label="71"];
1 -- 5 [label="84"];
1 -- 4 [label="8"];
2 -- 5 [label="18"];
2 -- 4 [label="11"];
2 -- 3 [label="51"];
3 -- 4 [label="87"];
}