Category Archives: biodiversity informatics

The New Nominomania

Roger Hyam’s blog post Calling time on biological nomenclature and the comments it received, also on Taxacom, make me wonder whether biodiversity informatics is the enemy rather than the servant of science. What some of my colleagues argue for are empty name lists, including artificial constructs like barcode species. Erecting the haplotype as the focal point of taxonomy is apparently the next thing lying in ambush.

For taxonomists, names are abstractions of scientific knowledge, and consequently cannot be managed in a formalised top-down system. To call for science to be published in only certain journals, or to advocate that certain kinds of “species” should be the only ones permitted, is not a friendly proposal to rationalise information flows, but a denial of the process of free information gathering. It plainly denies that taxonomic papers are primarily contributions to science, and name machines only secondarily. Taxonomy must remain a scientific exercise, and cannot be a mechanical process.

The idol brought forth is the International Code of Nomenclature of Bacteria, where there is a Committee to decide, a single place to register names, and (most importantly, and forgotten by the supporters) fewer than 10 000 diagnosable units are included. Since bacteria are so different from other organisms, and the named units so few (at least those that have been admitted by this Committee …), the ICNB simply cannot be used as a model for the several million species of multicellular organisms, most of which have not been named yet.

Whereas I am a friend of registration of names, and hold that scientific names as defined in the Botanical and Zoological Codes are as good markers as can be (human-friendly they are) of the scientific processes of elucidating the characteristics, whereabouts, and history of pieces of biodiversity, I cannot be positive about registration replacing the scientific procedure of testing hypotheses of phylogenetic distinctness labelled with scientific names. Certainly no committee should be involved here. And whereas barcodes can probably be an interesting tool for the food industry and similar, I don’t see much use for them in taxonomy, where we have species concepts based on evolutionary theory, type specimens, and diagnoses that are compatible with scientific theory and hypotheses. In taxonomy, in contrast to the barcode shop, we also have flexible systems to classify biological units other than “species”.

Whereas taxonomists must be more collaborative with biodiversity informatics in, e.g., voluntary registration in ZooBank, and show more effort to make their work and naming visible, it is the task of biodiversity informatics to find the methods to discover, assemble, and present the objects of biodiversity. We must not adapt science to fit the index.

The concerted effort of GBIF and Encyclopedia of Life to build a Global Names Architecture (GNA), providing a Global Names Index (GNI), seems to me to be a way out of the dilemma that biodiversity informatics is entangled in: information about biodiversity cannot be extracted because there are too many names (with misspellings, synonyms, homonyms, etc.) out there, and the approximate (it can never be exact) meaning of a name may vary from one mention to another. Certain related efforts, such as transparently tagging names with identifiers, as is being done in Zootaxa and ZooKeys, are bridging the gap between computerised and human-mediated names. Thus the technology is there, it is evolving, and taxonomy should be able to continue as a science.
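To make the problem of misspellings, synonyms, and homonyms concrete, here is a minimal Python sketch of a name index that maps name strings to identifiers. It is not how GNI works internally. The identifiers (`example:1` and so on), the `NAME_INDEX` table, the `resolve` function, and the similarity cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical index: name string -> identifiers (the IDs are made up).
NAME_INDEX = {
    "Astyanax kullanderi": ["example:1"],
    "Oncorhynchus mykiss": ["example:2"],
    "Salmo gairdneri": ["example:2"],      # synonym: shares an identifier
    "Ficus": ["example:3", "example:4"],   # homonym: a plant genus and a gastropod genus
}


def resolve(name, cutoff=0.85):
    """Return (matched_name, identifiers), or (None, []) if nothing is close enough."""
    # An exact hit needs no fuzzy matching.
    if name in NAME_INDEX:
        return name, NAME_INDEX[name]
    # Otherwise take the single closest spelling above the cutoff, if any.
    close = get_close_matches(name, list(NAME_INDEX), n=1, cutoff=cutoff)
    if close:
        return close[0], NAME_INDEX[close[0]]
    return None, []


if __name__ == "__main__":
    for query in ["Astyanax kulanderi", "Salmo gairdneri", "Ficus", "Homo sapiens"]:
        print(query, "->", resolve(query))
```

A misspelling resolves to the nearest indexed spelling, a synonym leads to the same identifier as the accepted name, and a homonym returns more than one identifier, which signals that the name string alone does not settle which taxon is meant.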

The real difference between the mega-name-consumers and taxonomy is that mega-name-consumers wish to have everything in one place, which is probably of zero interest to taxonomy. They are also not interested in metadata such as diagnoses, type specimens, etc., and they do not want taxon concepts to change, which they inevitably must do in science. In taxonomy, only small sets of taxa (and names) are handled at any given time, and of these, all have a definite function in the particular study, be it a revision, a field guide, a phylogeny, or a classification. In such contexts, the name domain is self-contained, and all named units are related to each other by the hypothesis or scope of the study. Everything else is of zero interest. For a study of cichlid fishes, it is of no interest whatsoever whether New Zealand Lepidoptera exist. Enter mega-name-consumers, who will need both in the same list, because those lists are not based on any scientific criterion and it is absolutely not known what the list is for. If consumers could define their precise needs from study to study, it might be easier to design the tools to extract the names and concepts actually needed. To maintain lists of millions of names, even in a database, for no specific purpose does not make much sense. Indeed, most checklists of smaller scale, especially when produced by non-specialists, are equally meaningless anachronisms of an apparently undefeatable listmania.

So, we must ask of biodiversity informatics:

  1. Proper specification of what their taxonomic units (text-names or LSIDs) are going to be used for. Map species occurrences, make phylogenetic hypotheses, sort out homonyms, …?
  2. Design systems that can effectively detect, maintain, and trace name usage and relevant metadata, compatible with taxonomic objectives and procedures.
  3. Provide voluntary registration systems, and other tools facilitating the exchange of names and metadata between taxonomists and consumers.

Whereas 2 and 3 may be underway, I am beginning to doubt that anyone can give a good answer to 1…
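As a rough illustration of point 2 in the list above, tracing name usage could start from something as small as the following Python sketch. The `NameUsage` record, its fields, the `UsageIndex` class, and the source labels are hypothetical. Real systems would carry far more metadata, such as type specimens and diagnoses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NameUsage:
    name: str      # the name string as it appears in the source
    source: str    # where the name was used (hypothetical label)
    purpose: str   # kind of study: revision, field guide, phylogeny, ...


class UsageIndex:
    """Keeps every recorded usage of a name so that it can be traced later."""

    def __init__(self):
        self._by_name = defaultdict(list)

    def record(self, usage):
        self._by_name[usage.name].append(usage)

    def trace(self, name):
        # .get avoids creating an empty entry for names never recorded.
        return list(self._by_name.get(name, []))


if __name__ == "__main__":
    index = UsageIndex()
    index.record(NameUsage("Astyanax kullanderi", "hypothetical revision", "revision"))
    index.record(NameUsage("Astyanax kullanderi", "hypothetical checklist", "checklist"))
    for usage in index.trace("Astyanax kullanderi"):
        print(usage)
```

Keeping the purpose alongside each usage is the design point here: it ties every mention of a name to the scope of the study it came from, which is what point 1 asks consumers to specify.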

For those who cannot embrace taxonomy fully, I recommend stamp collecting. It has all the flavors of registration, codes, hybridisation, phylogeography, central committees, misidentifications, rare haplotypes, identical reissues, fakes, top-down standards, and stasis. It is a totally unscientific enterprise with no limits to organisational options suitable for old frustrated men obsessed with control. Ooops, does it sound like DNA barcoding …?

Image: Wikimedia Commons, public domain

New toy in town – GNI

Some days ago – well, maybe weeks then – I touched on the usefulness of ZooBank, Catalog of Fishes, and friends. The biggest of them all is, however, GNI, a pronounceable acronym, a component of the GNA (the Global Names Architecture), but unrelated to GNU (GNU’s Not Unix). The Global Names Index is a name aggregator for scientific names of organisms. It contains 12 million names. You will now know why it takes a special category of wizards to practice taxonomy. These gentle people are managing 12 million names, and of course they will love this new toy brought to them by GBIF and EoL. Them, because GNI seems not to have much appeal beyond the professional taxonomist and biodiversity informatician.

GNI is a necessity when trying to build large systems of biological information, because everything is indexed against names of organisms. To be sure, specialized systems like FishBase realized this many, many years ago and have systems that are superior within their domain. In the long run, however, a common approach may be the only one to endorse.

GNI can already be searched. Try Astyanax kullanderi, a fish you have not heard of before. Does it exist? One chance in 12 million. Enter and be confirmed.

It is there, in uBio, with one NameBank record drawn from Catalogue of Life and ultimately FishBase. It has an LSID there, but this is not the ZooBank LSID. We do not want to be confused, so we click back to find two GBIF records, neither georeferenced. One is the holotype, catalogued as NRM 21000 and served by GBIF-Sweden, and it is also in the GBIF edition of FishBase, which happens to be served by GBIF-Sweden as well, although the entry says it is served by FishBase Philippines. And at NRM this catalog number refers to four paratypes.

Amazing, no?

Of course this tool is needed more for machine use than for humans to click around in. Or, as David Remsen, the architect behind this construction, puts it:

GNI was developed because of the central importance of the names of organisms in the management of data about organisms. The primary users of this site are not people, but other machines, so please don’t complain because the site is boring.

As a tool for testing the existence of names, it is already worth being bored a bit. If the result is positive, that is reassuring. If negative, apply the precautionary principle and ask your favorite taxonomist.

YouTube has this video of David Remsen explaining how the GNA works.

This is not Astyanax kullanderi, but a species of Synbranchus from Brazil,
closely related to Monopterus albus from Asia. Photo A. Kullander, CC-BY-NC

From fishes to ZooBank

Fishes are among the most informatized organisms that I know. There may be a number of reasons for that, but reasons aside, the fact is that Daniel Pauly and Rainer Froese created FishBase independently of Bill Eschmeyer’s Catalog of Fishes, and ichthyologist Julian Humphries created the museum collection database with the collection management system MUSE back in the late 1980s, giving fish collections a head start in informatics. Since 1976 Joseph Nelson has published the Fishes of the World, now in its fourth edition, as an index to systematic ichthyology and with an eclectic classification.

Whereas FishBase has a reported hit rate of 20 million per month and so is doing fine, the Catalog of Fishes is maybe less well known. It started as a catalog of genera of fishes, but was expanded and eventually published as three huge volumes with scientific names of fishes, over 50 000, complete with type locality, current status, and literature reference. It is presently a web resource and updated frequently. To the layman it may look just too boring, but for scientists it is a goldmine, saving enormous amounts of time in finding information about specific species and their names.

This kind of compilation is important, because biodiversity research is now facing an enormous problem with names. There may be 1.8 million named species, and many more millions out there to be found, but only a million or so species have been secured in databases. And every year at least 16 000 new species are described.

GBIF have an initiative called the GNI (Global Names Index) to harvest all names, and a structure, the GNA (Global Names Architecture), to manage them. They will do this together with other acronyms such as PESI. In the meantime, the Catalogue of Life, a collaboration including the US ITIS and the global consortium Species2000, have a checklist of the world’s species with just a little over 1 million species in it, of which FishBase is one of the best parts.

But we cannot have it like this, endlessly chasing names that people drop here and there in more or less obtainable publications. Zoological and Botanical nomenclature have to go modern and collaborate with the information society. There has to be a registration system for names, and the habit of paper publication has to go away in favor of digital publication.

To those not familiar with nomenclature, the situation is the following: For a name to be available and thus accepted for use as a scientific name for a species, genus, or family, it has to be published on paper, with a few more simple conditions such as a certain number of copies and a degree of obtainability. It is perfectly OK to publish 25 copies of a species description and give them away to 25 people, all of whom except one throw their copy away within 24 hours. The single surviving copy is now the globally accepted token for the name of that species. Not surprisingly, many taxonomists spend most of their time searching for publications instead of doing real research.

Digital-only publishing is not permitted. Well, there is an exception for CD-ROMs with deposition in libraries, but it is a bit awkward and it may be difficult to find those CD-ROMs.

The International Commission on Zoological Nomenclature is the body that writes the rules for Zoological Nomenclature – of course with the needs and well-being of the taxonomic community taken good care of. The Commission is now seriously considering digital-only (what we call e-only) publishing of zoological names, and seriously considering a registration system for old and new names.

Both proposals are controversial. Concerning e-only publishing there is now a proposed amendment to the Code, and the Commission has invited comments and discussions. Some of the discussion is now published, and worth reading.

Formalisation of ZooBank as a registry for new names is maybe a bit further away, but unavoidable. In contrast to a Code amendment, it requires an infrastructure and running funds that are not immediately available. Nevertheless, Richard Pyle, ichthyologist at the Bernice P. Bishop Museum in Hawai’i, is working day and night to build up the structure for ZooBank. You can already get a glimpse of the future from the development site. There are already many more than 5 000 nomenclatural acts registered.

ZooBank will have a healthy starter boost from Catalog of Fishes, so from a fish perspective this is perhaps no big step forward. But notice, there is an ichthyologist programming ZooBank!

Yes, Ichthyology rules biodiversity informatics …

In the beginning …

This is a fast start blog post to introduce myself (only a glimpse) and what kind of writing can be expected here.

As an ichthyologist, I will write mainly about fish. I manage two e-mail lists, my twitter, and blogs for two projects. Let’s see if there is more to say.

As a biodiversity informatician, I will try to connect fish with computers. I already post biodiversity informatics news on a Swedish language blog. Let’s see if there is more to say.

Naturally, I must first introduce you to those wonderful resources.

cichlid-l is the discussion list for professionals and others interested in cichlids. Cichlids are freshwater fishes found in Africa, South and Central America, Madagascar and parts of Asia. They form the second or third most speciose family of fishes (and vertebrates). This list is fairly old, started in January 1995.

eurofish-l is the discussion list for all other ichthyologists, but with an intended focus on Europe and particularly the activities of the European Ichthyological Society.

I will come back to the question of access to these lists, since they currently seem to have been locked up behind the firewall.

The FishBase Blog is the news blog in FishBase. FishBase contains information about all the world’s fishes, available for free on the web, e.g., on the Swedish FishBase server.

The Swedish Fishbase team also maintains its own news blog, in the Swedish language.

And finally, GBIF-Sweden serves news from the biodiversity informatics world in the form of a blog.

Ah, twitter, somewhat neglected: http://twitter.com/svenok