GBIF checklist datasets and data gaps

A checklist dataset is a catch-all term for any dataset that primarily contains a list of taxonomic names. The line between a checklist dataset and an occurrence dataset can be blurry. GBIF classifies at least six types of datasets as checklists: (1) national (or regional) lists of species; (2) taxonomic lists of species; (3) species descriptions; (4) checklists made up of other checklists, such as the GBIF backbone taxonomy and Catalogue of Life; (5) checklists with occurrences; and (6) checklists made from occurrences. The first two are probably what most people imagine when they think of a checklist dataset.

Sequence-based data on GBIF - What you need to know before analyzing data

As I mentioned in my previous post, a lot more sequence-based data has been made available on GBIF this past year. MGnify alone published 295 datasets for a total of 13,285,109 occurrences. Even though most of these occurrences are Bacteria or Chromista, more than a million of them are animals and more than 300,000 are plants. So chances are that, even if you are not interested in bacteria, you might encounter sequence-based data on GBIF.

Sequence-based data on GBIF - Sharing your data

GBIF is trying to make it easier to share sequence-based data. In fact, this past year alone, we worked with UNITE to integrate species hypotheses for fungi and with EMBL-EBI to publish 295 metagenomics datasets. Unfortunately, documentation is not as quick to follow. Although we now have an FAQ on the topic, I thought that anyone could use a blog post with some advice and examples. Note that this blog post is not intended to be documentation.

Gridded Datasets Update

Gridded datasets are now flagged on the GBIF registry. This update builds on work from a previous blog post. Gridded datasets are broadly datasets that have low coordinate precision due to rasterized sampling or rounding. This can be a data quality issue because a user might assume an occurrence record has more precision than it actually does. Current statistics: 572 datasets are currently flagged as gridded or rasterized on the registry.
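To make the idea of a gridded dataset more concrete, here is a minimal Python sketch of one possible heuristic. It is not the method GBIF uses to flag datasets on the registry. It simply assumes that if most gaps between neighbouring unique latitudes (and longitudes) share one spacing, the coordinates probably sit on a regular grid. The function names, the rounding to 4 decimals and the 0.8 threshold are all assumptions made for illustration.

```python
from collections import Counter


def grid_spacing_share(values, decimals=4):
    """Return (most common gap, share of gaps equal to it).

    Gaps are measured between neighbouring sorted unique values,
    rounded to `decimals` places (an assumed tolerance).
    """
    unique = sorted(set(round(v, decimals) for v in values))
    gaps = [round(b - a, decimals) for a, b in zip(unique, unique[1:])]
    if not gaps:
        # Fewer than two distinct values: no spacing to measure.
        return None, 0.0
    gap, count = Counter(gaps).most_common(1)[0]
    return gap, count / len(gaps)


def looks_gridded(lats, lons, min_share=0.8):
    """Heuristic only: True when both axes show one dominant spacing."""
    _, lat_share = grid_spacing_share(lats)
    _, lon_share = grid_spacing_share(lons)
    return lat_share >= min_share and lon_share >= min_share


# Made-up coordinates on a half-degree grid.
grid_lats = [50.0, 50.5, 51.0, 51.5, 52.0]
grid_lons = [4.0, 4.5, 5.0, 5.5]
print(looks_gridded(grid_lats, grid_lons))
```

A heuristic like this will miss grids with irregular gaps or datasets that mix gridded and precise records, so it is best treated as a first screening step.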

Country Centroids

Country centroids are a known data quality issue within the GBIF network. Sometimes data publishers will not know the exact lat-lon location of a record and will enter the lat-lon center of the country instead. This is a data issue because users might be unaware that an observation is pinned to a country center and assume it is a precise location. Below I plot the top country centroids found on GBIF, counting records within 1 km of a centroid.
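As a rough illustration of the kind of check this implies, the Python sketch below flags a record when it lies within 1 km of a known centroid, using the haversine great-circle distance. This is an assumed approach, not necessarily how the plot in the post was produced. The centroid table, the name "Exampleland" and its coordinates are invented for the example; a real check would need an actual list of country centroids.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat-lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_centroid(lat, lon, centroids, max_km=1.0):
    """Return the name of the first centroid within max_km, else None."""
    for name, (clat, clon) in centroids.items():
        if haversine_km(lat, lon, clat, clon) <= max_km:
            return name
    return None


# Hypothetical centroid table, for illustration only.
centroids = {"Exampleland": (10.0, 20.0)}

print(near_centroid(10.005, 20.0, centroids))  # about half a km away
print(near_centroid(10.1, 20.0, centroids))    # roughly 11 km away
```

The 1 km tolerance matters because centroid coordinates are often rounded or taken from different reference lists, so an exact equality test would miss many pinned records.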