Finding data gaps in the GBIF backbone taxonomy
When publishers supply GBIF with a scientific name, this name is sometimes not found in the GBIF taxonomic backbone. In these cases, the occurrence record gets a data quality flag called taxon match higher rank. This means that GBIF was only able to match the name to a higher rank (genus, family, order …).
At GBIF, we would always like to match the name supplied by the publisher to the lowest rank possible, so that when a user comes to GBIF looking for a certain name, they will have access to the largest amount of occurrence data.
In the graph below, I divide the names (not occurrences) supplied by publishers that received the taxon match higher rank flag into the following categories:
- other : means that I could not find a good reason for why this name did not match to the backbone. This could be a misspelling or the name could be missing from the GBIF backbone. These are names which might reflect data gaps.
- unmatchable name : is a catch-all group for poorly formatted or unmatchable names (see below).
- hybrid (hybrid formula) : means the name refers to a hybrid. We expect poor checklist coverage for hybrid names.
- below species : means a name at a taxonomic rank below the species level could not be matched at that level. Usually we expect less checklist coverage for subspecies and varieties.
- too many choices : GBIF has two or more names with different authorship (homonyms), but the publisher does not provide authorship and/or higher taxonomy, so the name cannot be matched unambiguously.
I have processed some names from select groups to see if there are gaps for GBIF to fill.
Here we see that GBIF is probably missing many names from Coleoptera (Beetles) and Lepidoptera (Butterflies/Moths). There are also many potential missing names within birds, but this might be due to the large number of occurrence records we get from this group (Passeriformes).
If we break down Beetles by family we see that Chrysomelidae (Leaf beetles) are responsible for a large portion of the missing names in Beetles.
Too many choices
Sometimes publishers do not provide enough information for GBIF to choose between names in the backbone. For example, if a publisher only supplied us with Glocianus punctiger, we would not be able to choose between the two candidate names, and the record would get moved to the higher rank (genus Glocianus).
Providing authorship would allow GBIF to correctly match these occurrences to the backbone.
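To see how a name behaves with and without authorship, it can be sent to the public GBIF species match endpoint. The block below is a minimal sketch, not the workflow used for this post: the helper names (`build_match_url`, `fetch_match`, `matched_to_higher_rank`) are my own, and I am assuming that authorship is simply appended to the name string and that a higher-rank match is reported as `matchType` equal to `HIGHERRANK` in the response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GBIF species match endpoint.
MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name, authorship=None, **classification):
    """Return a match URL; authorship, if given, is appended to the name string."""
    full_name = f"{name} {authorship}" if authorship else name
    params = {"name": full_name, "verbose": "true"}
    # Optional higher-taxonomy hints, e.g. kingdom="Animalia" or family="...".
    params.update(classification)
    return f"{MATCH_ENDPOINT}?{urlencode(params)}"


def fetch_match(url):
    """Call the API and return the parsed JSON response (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


def matched_to_higher_rank(match):
    """True if the response reports a higher-rank match rather than the name itself."""
    return match.get("matchType") == "HIGHERRANK"
```

For example, `fetch_match(build_match_url("Glocianus punctiger"))` could be compared with the same call that also passes the `authorship` string from your source data. If only the first call comes back as a higher-rank match, missing authorship is the likely cause.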
Unmatchable names
Publishers supply GBIF with a variety of what I call unmatchable names, which are names that are impossible to match to the GBIF backbone. Sometimes these are acceptable names that are still missing from the backbone, such as hybrids or OTUs. Other names are simply bad names that we can’t expect to fix. Some examples below:
name not matched | reason | link |
---|---|---|
Mystery mystery | bad name | records |
Sonus naturalis | bad name | records |
Bambusoideae spec. | subfamily name | records |
Coleoptera indet. | order name | records |
Astarte juv. | genus name with life stage | records |
Gen. sp. | bad name | records |
Astarte sp. BIOUG14667-B01 | genus name with id | records |
Phoneutria depilata (Strand 1909) sp. reval. | species name with remark | records |
Anisoptera Unknown Dragonfly Species | infra-order name with remarks | records |
Zygoptera | suborder name | records |
Philodromus Philodromus albidus / rufus | doubtful identification (alternative) | records |
Certhia brachydactyla/Certhia familiaris | doubtful identification (alternative) | records |
Corvus corone x C. cornix | missing hybrid | records |
BOLD:ADV7315 | OTU | records |
BOLD:ADX5419 | OTU | records |
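Several of the reasons in the table above leave a recognisable trace in the name string itself. The block below is a rough sketch of how such names could be screened with regular expressions. It is not the method used to build the table: the function name `classify_unmatchable`, the category labels and the patterns are all illustrative assumptions, and names such as Mystery mystery or Zygoptera cannot be caught this way.

```python
import re

# Illustrative patterns only; each one is an assumption based on the examples above.
OTU_PATTERN = re.compile(r"^BOLD:[A-Z0-9]+$")
HYBRID_PATTERN = re.compile(r"\s[x×]\s")
PLACEHOLDER_PATTERN = re.compile(
    r"\b(?:sp|spec|indet|juv|gen)\.|\bunknown\b", re.IGNORECASE
)


def classify_unmatchable(name):
    """Return a rough reason why a name string is unmatchable, or None if no pattern fits."""
    name = name.strip()
    if OTU_PATTERN.match(name):
        return "OTU"
    if "/" in name:
        return "alternative identification"
    if HYBRID_PATTERN.search(name):
        return "hybrid formula"
    if PLACEHOLDER_PATTERN.search(name):
        return "placeholder or remark"
    return None


for example in ["BOLD:ADV7315", "Corvus corone x C. cornix", "Coleoptera indet."]:
    print(example, "->", classify_unmatchable(example))
```

A `None` result only means that none of these simple patterns matched. The name may still be misspelled, a higher-rank name or genuinely missing from the backbone.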
If a name is truly missing from the GBIF backbone, GBIF would like to fill that data gap. This can be done by updating the checklists that feed into the backbone construction process.
Other (possibly missing)
As a non-expert, I find it hard to tell whether a name is a data gap, just a misspelling or something else. So how many possibly missing names are actual data gaps? To find out, I randomly sampled five possibly missing names from each group in the graph above and checked whether I could manually locate a source outside GBIF with the name.
Around 50% (44 of 86) of the possibly missing names appear to be genuinely missing from the GBIF backbone. We can therefore conservatively assume that there are thousands of missing names in the GBIF backbone. Keep in mind, however, that many missing names are missing synonyms, meaning that they are not unique taxon concepts. Halving this rate to 25%, we can make a conservative minimum missing names table:
group | friendly name | min estimated missing names* |
---|---|---|
Coleoptera | Beetles | 26,600 |
Lepidoptera | Butterflies/Moths | 17,700 |
Passeriformes | Bird order | 4,200 |
Fabales | Plant order | 4,100 |
Asterales | Plant order | 4,000 |
Agaricales | Mushrooms | 1,600 |
Araneae | Spiders | 1,200 |
Rodentia | Rodents | 1,100 |
Carditida | Bivalves | 700 |
Anura | Frogs | 600 |
Carnivora | Carnivores | 300 |
Odonata | Dragonflies | 300 |
Chiroptera | Bats | 200 |
Cyatheales | Ferns | 100 |
Primates | Primates | 100 |
Neuroptera | Insect order | <100 |
Percopsiformes | Fish order | <100 |
*Based on conservative judgment that 25% of potentially missing names are genuinely missing from GBIF backbone. Download a full table of possibly missing names from the groups above here.
This table, of course, gives us no indication about how well described a certain group is; it just reflects the relative completeness of the GBIF backbone. This table also includes fossils, which might inflate some groups (such as primates). Also, these counts do not necessarily represent unique species concepts but just unique name strings that GBIF receives.
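The arithmetic behind the minimum estimates is simple enough to write down. The block below is only a sketch of it: the sample figures (44 of 86) and the 25% rate come from the text above, while the function name, the rounding down to the nearest hundred and the input count of 10,000 are my own assumptions for illustration, not values from the analysis.

```python
# Figures from the manual check described above.
SAMPLED_NAMES = 86
CONFIRMED_MISSING = 44

observed_rate = CONFIRMED_MISSING / SAMPLED_NAMES  # roughly one half
conservative_rate = 0.25  # the observed rate, halved and rounded down


def minimum_missing(possibly_missing_count, rate=conservative_rate):
    """Estimate the minimum number of missing names, rounded down to the nearest hundred."""
    return int(possibly_missing_count * rate // 100 * 100)


# Hypothetical group with 10,000 possibly missing names (not a real count).
print(f"observed rate: {observed_rate:.2f}")
print("minimum estimate:", minimum_missing(10_000))
```

Rounding down keeps the figure a lower bound, which fits the idea of a minimum estimate.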
GBIF data publisher advice
As a data publisher, there are a few things that can be done to improve name matching to the GBIF backbone.
- Run your dataset through the data validator and examine any names that get the data quality flag taxon match higher rank.
- Attempt to match your names to the GBIF backbone before publishing using the name matcher or rgbif.
- Check if your names all have authorship (if possible).
- Fill in known higher taxonomy.
- Avoid working-name placeholders in dwc:scientificName; instead, fill in the name at the lowest known rank.
- Do not use ALL CAPS for names.
- Do not put identification qualifiers in the dwc:scientificName field but rather use the dwc:identificationQualifier field.
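Some of the points above can be checked automatically before publishing. The block below is a minimal sketch of such a check on a single record represented as a Python dictionary keyed by Darwin Core term names. The function name `check_record`, the messages and the qualifier pattern are assumptions of mine; this is not part of the GBIF data validator, and it does not replace matching against the backbone.

```python
import re

# Darwin Core higher-taxonomy terms checked by this sketch.
HIGHER_TAXONOMY_TERMS = ("kingdom", "phylum", "class", "order", "family")

# Common identification qualifiers; the list is illustrative, not exhaustive.
QUALIFIER_PATTERN = re.compile(r"\b(?:cf|aff)\.|\?", re.IGNORECASE)


def check_record(record):
    """Return a list of warnings for one record (a dict keyed by Darwin Core terms)."""
    warnings = []
    name = record.get("scientificName", "").strip()

    if not name:
        warnings.append("scientificName is empty")
        return warnings
    if name.isupper():
        warnings.append("scientificName is in ALL CAPS")
    if QUALIFIER_PATTERN.search(name):
        warnings.append("move the qualifier to identificationQualifier")
    if not record.get("scientificNameAuthorship"):
        warnings.append("check whether authorship is available")
    if not any(record.get(term) for term in HIGHER_TAXONOMY_TERMS):
        warnings.append("fill in known higher taxonomy")
    return warnings


record = {"scientificName": "Certhia cf. familiaris", "kingdom": "Animalia"}
for warning in check_record(record):
    print(warning)
```

The authorship check is deliberately loose: a record may carry authorship inside dwc:scientificName itself, so that warning is a prompt to look, not an error.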