Frictionless data is about removing the friction in working with data through the creation of tools, standards, and best practices for publishing data using the Data Package standard, a containerization format for any kind of data. It offers specifications and software around data publication, transport and consumption. Similarly to Darwin Core, data resources are presented in CSV files while the data model is described in a JSON structure.
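As a rough illustration of the "CSV plus JSON description" idea, here is a minimal sketch of a Data Package descriptor built in Python. The dataset name, file path, and field names are all invented for the example; only the general `resources` / `schema` / `fields` layout is taken from the Data Package specification as I understand it.

```python
import json

# Hypothetical descriptor for one CSV resource. The package name, the
# file path, and every field name below are made up for illustration.
descriptor = {
    "name": "example-occurrences",
    "resources": [
        {
            "name": "occurrences",
            "path": "occurrences.csv",
            "schema": {
                "fields": [
                    {"name": "scientificName", "type": "string"},
                    {"name": "eventDate", "type": "date"},
                    {"name": "decimalLatitude", "type": "number"},
                    {"name": "decimalLongitude", "type": "number"},
                ]
            },
        }
    ],
}


def field_names(package, resource_name):
    """Return the column names declared for one resource of the package."""
    for resource in package["resources"]:
        if resource["name"] == resource_name:
            return [field["name"] for field in resource["schema"]["fields"]]
    raise KeyError(f"no resource named {resource_name!r}")


# This JSON text is what would be saved next to the CSV file.
print(json.dumps(descriptor, indent=2))
```

The CSV file holds the rows, while the descriptor tells a consumer which columns to expect and how to type them.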
If you are a (first-time) publisher on GBIF and you are trying to decide which type of dataset would best fit your data, this blog post is for you. All the records shared on GBIF are organized into datasets. Each dataset is associated with some metadata describing its content (the classic “what, where, when, why, how”). The dataset’s content depends strongly on the dataset’s class. GBIF currently supports four types of dataset:
Recently a user noticed that there were Asian Red Pandas (Ailuridae) occurring in North America, and wondered if someone had made a mistake. When an occurrence observation comes from a zoo or botanical garden, it is usually considered a living specimen, but when it comes from a museum it is usually called a preserved specimen. This label helps users filter out records from zoos, which they might not want.
Finding and accessing data. There is a lot of GBIF-mediated data available: more than 1.3 billion occurrence records covering hundreds of thousands of species in all parts of the world. All free, open and available at the touch of a button. Users can download data through the GBIF.org portal, via the GBIF API, or one of the third-party tools available for programmatic access, e.g. rgbif. If there is one area in which GBIF has been immensely successful, it’s making the data available to users.
This past week our informatics team has been updating the Backbone taxonomy on GBIF.org. This is a fairly disruptive process which sometimes involves massive taxonomic changes but DON’T PANIC. This update is a good thing. It means that some of the taxonomic issues reported have been addressed (see for example this issue concerning the Xylophagidae family) and that new species are now visible on GBIF. Plus, it gives me an excellent opportunity to talk about the GBIF backbone taxonomy and answer some of the questions you might have.
It is now possible to download up to 100,000 names on GBIF! Until recently, it was not possible to download occurrences for more than a few hundred species at the same time, but download requests can now include up to 100,000(!) taxon keys. That should cover most use cases :) For such large requests, however, you will need to POST your query to the Occurrence Download API service: https://t.
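To give an idea of what such a POST body might look like, here is a minimal sketch that only builds the JSON payload and does not send anything. The field names (`creator`, `notificationAddresses`, `sendNotification`, `format`, `predicate`) follow my understanding of the download API and should be checked against the official documentation; the function name, user name, email address, and taxon keys are placeholders.

```python
import json

MAX_TAXON_KEYS = 100_000  # limit quoted in the post


def build_download_request(taxon_keys, creator, email, fmt="SIMPLE_CSV"):
    """Build a download request body for many taxon keys (assumed layout)."""
    keys = sorted({int(k) for k in taxon_keys})  # de-duplicate and validate
    if not keys:
        raise ValueError("at least one taxon key is required")
    if len(keys) > MAX_TAXON_KEYS:
        raise ValueError(f"too many taxon keys: {len(keys)}")
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "sendNotification": True,
        "format": fmt,
        # An "in" predicate matches records whose taxon key is in the list.
        "predicate": {
            "type": "in",
            "key": "TAXON_KEY",
            "values": [str(k) for k in keys],
        },
    }


# Placeholder user and arbitrary example keys.
body = build_download_request([212, 359, 212], "some_user", "user@example.org")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of an authenticated POST request to the download service.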
Citizen Science datasets on GBIF plotted with all other (gray) GBIF datasets (>100K occurrences). There are many citizen science datasets with millions of occurrences (eBird, (Swedish) Artportalen), and the top 3 datasets on GBIF are all citizen science datasets. But in terms of number of unique species, only iNaturalist competes with large museum datasets like Smithsonian NMNH. Because of very large datasets like eBird and Artportalen, Citizen Science makes up a large percentage of the total occurrence records on GBIF.
It has been suggested that GBIF could make es50 maps similar to what organizations like OBIS are already doing. I decided to make one for land animals (graph above). es50 (the Hurlbert index) is the statistically expected number of unique species in a random sample of 50 occurrence records, and is an indicator of biodiversity richness. The score can be computed analytically, without random sampling; averaging over infinitely many random samples would give the same result.
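For readers who want to see the closed-form calculation, here is a minimal sketch using the standard Hurlbert formula, ES(n) = Σ [1 − C(N − Nᵢ, n) / C(N, n)], where N is the total number of records and Nᵢ the number of records for species i. The function name and the choice to raise an error for cells with fewer than n records are my own assumptions, not necessarily what the map code does.

```python
from math import comb


def expected_species(counts, n=50):
    """Expected number of species in a random sample of n records.

    counts: number of occurrence records per species in one cell.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total < n:
        # Assumption: the index is undefined when there are fewer than
        # n records, so refuse rather than guess.
        raise ValueError(f"need at least {n} records, got {total}")
    all_samples = comb(total, n)
    # For each species: 1 minus the probability that a sample of n
    # records contains none of that species' records.
    return sum(1 - comb(total - c, n) / all_samples for c in counts)


# 100 species with one record each: half of them are expected in a
# sample of 50 records.
print(expected_species([1] * 100))
```

Because `math.comb` works with exact integers, no random sampling is needed, even for cells with many records.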
Recently we were asked on GitHub whether there was a way to get all animal occurrences that are not birds. This seems like an easy enough request, but unfortunately, there is currently no way to exclude groups from a download search and get everything but a certain group. A user can get all birds, but they can’t get no birds! I thought this was an interesting question and probably useful for other people wanting smaller downloads, since there are currently around half a billion occurrence records for birds.
Here I plot the total names in checklists published on GBIF linked to a single country, grouped on the map as big (15-300K total names), medium (5-15K total names), and small (0-5K total names). A checklist dataset is a term for any dataset that contains primarily a list of taxonomic names. National species checklists are lists of species recorded from a country usually through some organized effort. GBIF has published a guide on best practices for making national checklist datasets, which advises making national checklists as big as possible.
A checklist dataset is a catch-all term describing any dataset that contains primarily a list of taxonomic names. The lines between a checklist dataset and an occurrence dataset can be blurry. GBIF classifies at least 6 types of datasets as checklists: (1) national (or regional) lists of species; (2) taxonomic lists of species; (3) species descriptions; (4) checklists made up of other checklists (GBIF backbone taxonomy & Catalogue of Life); (5) checklists with occurrences; (6) checklists made from occurrences. The first two are probably what most people imagine when they think of a checklist dataset.
As I mentioned in my previous post, a lot more sequence-based data has been made available on GBIF this past year. MGnify alone published 295 datasets for a total of 13,285,109 occurrences. Even though most of these occurrences are Bacteria or Chromista, more than a million of them are animals and more than 300,000 are plants. So chances are that, even if you are not interested in bacteria, you might encounter sequence-based data on GBIF.
GBIF is trying to make it easier to share sequence-based data. In fact, this past year alone, we worked with UNITE to integrate species hypotheses for fungi and with EMBL-EBI to publish 295 metagenomics datasets. Unfortunately, documentation is not as quick to follow. Although we now have an FAQ on the topic, I thought that anyone could use a blog post with some advice and examples. Note that this blog post is not intended to be documentation.
Gridded datasets are now flagged on the GBIF registry. This update builds on work from a previous blog post. Gridded datasets are broadly datasets that have low coordinate precision due to rasterized sampling or rounding. This can be a data quality issue because a user might assume an occurrence record has more precision than it actually does. Current statistics: 572 datasets are currently flagged as gridded or rasterized on the registry.
Country centroids are a known data quality issue within the GBIF network. Sometimes data publishers will not know the exact lat-lon location of a record and will enter the lat-lon center of the country instead. This is a data issue because users might be unaware that an observation is pinned to a country center and assume it is a precise location. Below I plot the top country centroids found on GBIF, matching records to within 1 km.
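One simple way to check whether a record sits on a country centroid is to measure its great-circle distance to a list of known centroids. The sketch below uses the haversine formula with a 1 km cutoff; the function names, the country code, and the centroid coordinates are hypothetical placeholders, and this is not necessarily how the plot was produced.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat-lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # min() guards against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def near_centroid(lat, lon, centroids, max_km=1.0):
    """Return the code of the first centroid within max_km, else None."""
    for code, (clat, clon) in centroids.items():
        if haversine_km(lat, lon, clat, clon) <= max_km:
            return code
    return None


# Hypothetical centroid table: code -> (latitude, longitude).
centroids = {"XX": (10.0, 20.0)}

print(near_centroid(10.005, 20.0, centroids))  # about 0.56 km from the centroid
print(near_centroid(10.02, 20.0, centroids))   # about 2.2 km from the centroid
```

In practice the centroid table would come from a gazetteer, and different sources can place a country's centroid in slightly different locations.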
Where are we missing biodiversity data? A hunger map is a map of missing biodiversity data (a biodiversity data gap). The main challenge with hunger mapping is proving that a species does not exist but should exist in a region. Hunger maps are important because they could be used to prioritize funding and digitization efforts. Currently, GBIF has no way of telling what species are missing from where. In this blog post I review some potential ways GBIF could make global biodiversity hunger maps.
Citizen science. Citizen science is scientific research conducted, in whole or in part, by amateur (or non-professional) scientists. Biodiversity observations by citizen scientists have become significant in the last 10 years thanks to projects like eBird, iNaturalist, Artportalen (Sweden), Artsdatabanken (Norway), the Southern African Bird Atlas, BirdLife Australia, Dansk Ornitologisk Forening, and the Great Backyard Bird Count.
Not all filters are born equal. It sometimes happens that users need GBIF data that fall within specific boundaries. The GBIF Portal provides a location filter where it is possible to draw a rectangle or a polygon on the map and get the occurrence records within this shape. However, these tools have limited precision, and occasionally the job calls for more complex shapes than the GBIF Portal currently supports.
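When the shape is too complex for the portal, one option is to download a larger area and filter the records locally. Below is a minimal sketch of a point-in-polygon test using the classic ray-casting technique. It treats longitude and latitude as flat x and y coordinates, so it assumes small areas that do not cross the antimeridian, and it is only an illustration, not the approach taken in the post. The polygon and function name are invented.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon: list of (lon, lat) vertices in order, without repeating
    the first vertex at the end.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the east flips the state
    return inside


# Hypothetical L-shaped area, something a single rectangle cannot express.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]

print(point_in_polygon(1, 3, l_shape))
print(point_in_polygon(3, 3, l_shape))
```

For real boundaries, a dedicated geometry library is usually a better choice than hand-written tests like this one.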
This blog post covers the publication of multimedia on GBIF. However, it is not intended to be documentation. For more information, please check the references below. NB: GBIF does not host original multimedia files and there is no way to upload pictures to the platform. For more information, please read the “how to publish” paragraphs. Media displayed on the GBIF portal: let’s say that you are looking for pictures of otters, or perhaps the call of a sea eagle.
Can we automatically label citizen science datasets? The short answer is yes, partially. Why label GBIF datasets as “citizen science”? What is citizen science? Citizen science is scientific research conducted, in whole or in part, by amateur (or non-professional) scientists. Citizen science is sometimes described as “public participation in scientific research,” participatory monitoring, and participatory action research (Wikipedia definition). Citizen science on GBIF: a 2016 study showed that nearly half of all occurrence records shared through the GBIF network come from datasets with significant volunteer contributions (for more information, see our “citizen science” page on gbif.
The GBIF maps API is an under-used but powerful web service provided by GBIF. The maps API is used by the main GBIF portal to create its maps, including the big map used on gbif.org. We can make a simple call to the API by pasting the link below into a web browser. https://firstname.lastname@example.org?style=purpleYellow.point You should end up with an image like this. This API call is essentially composed of two elements
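To make the structure of such a call more concrete, here is a minimal sketch that assembles a tile URL from a path and a query string. The base path `https://api.gbif.org/v2/map/occurrence/density` and the `{z}/{x}/{y}@{scale}x.png` tile pattern are assumptions based on my understanding of the v2 maps API, and the function name is made up; only the `style=purpleYellow.point` parameter appears in the text above.

```python
from urllib.parse import urlencode

# Assumed base path for density tiles; check the API documentation.
BASE = "https://api.gbif.org/v2/map/occurrence/density"


def map_tile_url(z, x, y, scale=1, **params):
    """Build a map tile URL from tile coordinates and query parameters."""
    path = f"{BASE}/{z}/{x}/{y}@{scale}x.png"  # which tile to draw
    query = urlencode(params)                  # how to draw it
    return f"{path}?{query}" if query else path


# Zoom level 0, tile (0, 0), styled with the style named in the post.
print(map_tile_url(0, 0, 0, style="purpleYellow.point"))
```

Pasting the printed URL into a browser should return a PNG image if the assumed pattern is right.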
Figure: EBCC Atlas of European Breeding Birds (gridded) and Naturalis Biodiversity Center (NL) - Aves (not gridded). Gridded data in GBIF: gridded datasets are a known problem at GBIF. Many datasets have equally-spaced points in a regular pattern. These datasets are usually systematic national surveys or data taken from some atlas (so-called “rasterized collection designs”). In this blog post I will describe how I found gridded datasets in GBIF.
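As a rough illustration of what "equally-spaced points" means in practice, here is a minimal heuristic sketch. It checks how often the gap between neighbouring unique coordinate values equals the most common gap. This is not necessarily the method used in the post; the function names, the rounding to 4 decimals, and the 0.8 threshold are all arbitrary choices of mine.

```python
from collections import Counter


def dominant_gap_fraction(values, decimals=4):
    """Share of gaps between sorted unique values equal to the commonest gap."""
    unique = sorted({round(v, decimals) for v in values})
    if len(unique) < 3:
        return 0.0  # too few distinct values to say anything
    gaps = [round(b - a, decimals) for a, b in zip(unique, unique[1:])]
    commonest_count = Counter(gaps).most_common(1)[0][1]
    return commonest_count / len(gaps)


def looks_gridded(lats, lons, threshold=0.8):
    """Flag a dataset when both axes show one dominant, regular spacing."""
    return (dominant_gap_fraction(lats) >= threshold
            and dominant_gap_fraction(lons) >= threshold)


# Synthetic half-degree grid versus irregular points.
grid_lats = [50 + 0.5 * i for i in range(10)]
grid_lons = [4 + 0.5 * i for i in range(10)]
free_lats = [50.1234, 50.2311, 50.9017, 51.3372, 52.0001, 52.4449]
free_lons = [4.0021, 4.3310, 4.4127, 5.1905, 5.2044, 6.0718]

print(looks_gridded(grid_lats, grid_lons))
print(looks_gridded(free_lats, free_lons))
```

A real dataset would need more care, for example handling grids with missing cells or grids defined in a projected coordinate system.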
Explanation of tool: this tool plots the downloads through time for species or other taxonomic groups with more than 25 downloads at GBIF. Downloads at GBIF most often occur through the web interface. In a previous post, we saw that most users are downloading data from GBIF via filtering by scientific name (aka Taxon Key). Since the GBIF index currently sits at over 1 billion records (a 400+ GB CSV), most users will simply filter by their taxonomic group of interest and then generate a download.