Making collection content discoverable when you don’t have occurrences published on GBIF

This blog post is a tutorial on how to upload collection descriptors to the Global Registry of Scientific Collections (GRSciColl). If you work with a physical collection, or with people who do, you are probably interested in making these collections findable. Ideally, the content of these collections would be digitized and made available online on relevant platforms such as GBIF.org, iDigBio.org, ALA.org.au, etc. Sharing digital specimen records is a great way to ensure the discoverability of collection content.

GBIF SQL Downloads

GBIF has an experimental feature that allows users to query the GBIF occurrence store with SQL and download the result. Access is granted on request: contact helpdesk@gbif.org. See https://techdocs.gbif.org/en/data-use/api-sql-downloads for details.

If your download can be formulated as a traditional predicate download, the regular download API is usually faster.

The experimental Occurrence SQL Download API allows users to query GBIF occurrences using SQL. In contrast to the Predicate Download API, the SQL API allows selection of the columns of interest and generation of summary views of GBIF data.

SQL downloads, like predicate downloads, require you to have a GBIF user account.
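As a rough illustration, here is what a request for an SQL download might look like from Python. The payload shape follows the API documentation linked above, but the column names and the example query are only illustrative and should be checked against the current docs.

```python
import requests

# Placeholder credentials -- SQL downloads require a GBIF account
# that has been granted access by helpdesk@gbif.org.
GBIF_USER = "your_gbif_username"
GBIF_PWD = "your_gbif_password"

# A simple summary query: occurrence counts per country for one taxon.
# Column names are illustrative; check the SQL download documentation
# for the columns that are actually available.
payload = {
    "sendNotification": True,
    "format": "SQL_TSV_ZIP",
    "sql": (
        "SELECT countrycode, COUNT(*) "
        "FROM occurrence "
        "WHERE taxonkey = 2435099 "
        "GROUP BY countrycode"
    ),
}

response = requests.post(
    "https://api.gbif.org/v1/occurrence/download/request",
    json=payload,
    auth=(GBIF_USER, GBIF_PWD),
)
# On success the response body contains a download key you can poll or cite.
print(response.status_code, response.text)
```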

Working with extinct species on GBIF

GBIF users often want occurrences of extant (non-extinct) species only. This seemingly straightforward request is often difficult to fulfill in practice.

Currently, there is no occurrence level filter for removing extinct species from an occurrence search. Additionally, there is no authoritative global extinct species list available from GBIF. This article discusses the difficulties and complexities of working with extinct species and some possible solutions.

GBIF attempts to improve identifier stability by monitoring changes of occurrenceIDs

Since 2022, GBIF has been monitoring changes in occurrenceIDs between dataset versions to improve the stability of GBIF identifiers. We pause data ingestion when we detect that more than half of the occurrence records in the latest version have occurrenceIDs that differ from the previous version (the one live on GBIF.org). This identifier validation process automatically creates issues on GitHub, and the GBIF helpdesk contacts publishers to verify the changes. Our aim is to minimize changes to GBIF identifiers and to support citation and data-linking needs.

Accessing GBIF-mediated occurrence data to conserve EDGE species

Evolutionarily Distinct and Globally Endangered or EDGE species have few close relatives on the tree of life and are often extremely unusual in the way they look, live and behave, as well as in their genetic make-up. They represent a unique and irreplaceable part of the world’s natural heritage. After seeing an open call for projects that extend existing protected area networks for the conservation of EDGE species through the EDGE Protected Area and Conserved Area Fund, I wanted to see what data is currently available through GBIF that could help guide decisions on conservation financing.

Exploring Related Records in the Flowering Plant Genus Senegalia in Brazil

In 2020, GBIF released a news item, “New data-clustering feature aims to improve data quality and reveal cross-dataset connections.” Essentially, we run an algorithm across the datasets shared with GBIF to look for similarities in occurrence fields such as locations, identifiers and dates. Please read the blog post by Marie Grosjean and Tim Robertson for more details on how it works. In general, this lets us identify links between specimens, DNA sequences and literature citations.

Checklist publishing on GBIF - some explanations on taxonID, scientificNameID, taxonConceptID, acceptedNameUsageID, nameAccordingTo

When data publishers publish checklists, they use a Darwin Core Archive with a Taxon core. Although the Taxon core terms are already described here, what exactly to put in which field can be confusing, and there is a lot to read (see for example https://github.com/tdwg/tnc/issues/1). Here I am sharing a summary of an email conversation we had with some data publishers on the helpdesk concerning some of the Taxon core fields.
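As a small illustration of how these identifiers relate to each other, here are two hypothetical Taxon core rows (shown as Python dictionaries rather than CSV). A synonym’s acceptedNameUsageID must point to the taxonID of the accepted name within the same file; the names and identifiers below are invented.

```python
# Two hypothetical rows of a Darwin Core Archive Taxon core, shown as dicts.
# taxonID is the dataset-local identifier for each name; a synonym points to
# its accepted name through acceptedNameUsageID, which must match the
# accepted row's taxonID.
taxon_core = [
    {
        "taxonID": "t1",
        "scientificName": "Aus bus L.",
        "taxonRank": "species",
        "taxonomicStatus": "accepted",
        "acceptedNameUsageID": "",    # empty: this name is itself accepted
    },
    {
        "taxonID": "t2",
        "scientificName": "Aus cus Smith",
        "taxonRank": "species",
        "taxonomicStatus": "synonym",
        "acceptedNameUsageID": "t1",  # points to the accepted name above
    },
]
```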

Which data can be shared through GBIF and what cannot

Preparing a dataset to be shared on GBIF.org can be a daunting task, and many publishers realize that not all their data fits in the Darwin Core standard (DwC) and the extensions GBIF uses to structure, standardize and display biodiversity data. This blog post will cover what data fits in GBIF, give examples of data that does not fit in the current format of GBIF, and provide guidance on how you can share relevant data in a metadata-only dataset or through a third party.

Identifying potentially related records - How does the GBIF data-clustering feature work?

Many data users may suspect they’ve discovered duplicated records in the GBIF index. You download data from GBIF, analyze them and realize that some records have the same date, scientific name, catalogue number and location but come from two different publishers or have slightly different attributes. There are many valid reasons why these duplicates appear on GBIF. Sometimes an observation was recorded in two different systems, sometimes several records correspond to herbarium duplicates (you can check the work of Nicky Nicolson on the topic), sometimes a specimen was digitized twice, sometimes a record has been enriched with genetic information and republished via a different platform…

What are the flags "Collection match fuzzy", "Collection match none", "Institution match fuzzy", "Institution match none" and how to remove them?

You are a data publisher of occurrence data through GBIF.org, you care about your data quality, and you wonder what to do about the issue flags that show up on your occurrences. You might have noticed a new type of flag this year relating to collection and institution codes and identifiers. These flags are the result of our attempt to link specimen records to the Registry of Scientific Collections (GRSciColl).
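If you want to inspect the affected records yourself, the occurrence search API can filter on these issue values. A minimal sketch, assuming the requests library and a placeholder datasetKey:

```python
import requests

# Retrieve a few occurrences flagged with COLLECTION_MATCH_FUZZY for a
# given dataset (the datasetKey below is a placeholder).
params = {
    "datasetKey": "your-dataset-key",
    "issue": "COLLECTION_MATCH_FUZZY",
    "limit": 5,
}
r = requests.get("https://api.gbif.org/v1/occurrence/search", params=params)
for occ in r.json()["results"]:
    print(occ.get("key"), occ.get("institutionCode"), occ.get("collectionCode"))
```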

GBIF API beginners guide

This is a beginner’s guide to the GBIF API.

The GBIF API technical documentation can be confusing if you have never used an API before. The goal of this guide is to introduce the GBIF API to semi-technical users who may never have worked with one.

The purpose of the GBIF API is to give users safe, structured access to the GBIF databases; it is also what GBIF.org and rgbif are built on.
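For example, a single unauthenticated GET request is enough to ask how many occurrence records match a free-text search; the species name below is just an example:

```python
import requests

# Ask how many GBIF-mediated occurrence records match a free-text search.
# limit=0 means we only want the total count, not the records themselves.
r = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"q": "Calopteryx splendens", "limit": 0},
)
print(r.json()["count"])
```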

Derived datasets

You’ve finished an analysis using GBIF-mediated data, you’re writing up your manuscript and checking all the references, but you’re unsure how to cite GBIF. If you Google it, you’ll probably end up reading our citation guidelines and quickly realize that GBIF is all about DOIs. Datasets have their own DOIs, and downloads of aggregated data have their own DOIs as well.

But maybe you didn’t download data through the GBIF.org portal. Maybe you relied on an R package like rgbif or dismo that retrieved data synchronously from the GBIF API? Maybe a grad student downloaded it for you? Maybe you accessed and analyzed the data using a cloud computing service, like Microsoft Azure or Amazon Web Services? In any case, which DOI do you cite if you don’t have one?
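Whatever the access route, a derived dataset citation is built from a table of how many records you used from each source dataset. Here is a minimal sketch, assuming your exported records still contain GBIF’s datasetKey column (the file name and format are placeholders):

```python
import pandas as pd

# Count how many of the records you analysed came from each source dataset.
# This datasetKey / record-count table is the basis of a derived dataset
# registration and its citable DOI.
df = pd.read_csv("my_gbif_records.tsv", sep="\t", usecols=["datasetKey"])
counts = df["datasetKey"].value_counts().rename("n_records")
counts.to_csv("derived_dataset_counts.csv")
print(counts.head())
```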

GBIF and Apache-Spark on Microsoft Azure tutorial

GBIF now has a snapshot of 1.3 billion occurrence records on Microsoft Azure.

It is hosted by the Microsoft AI for Earth program, which hosts geospatial datasets that are important to environmental sustainability and Earth science. This hosting is convenient because you can now use the occurrences in combination with other environmental layers without uploading any of it to Azure yourself. You can read previous discussions about GBIF and cloud computing here. The main reason you would want to use cloud computing is to run big data queries that are slow or impractical on a local machine.
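To give an idea of what this looks like in practice, here is a minimal PySpark sketch reading the occurrence snapshot from Azure storage; the snapshot path is a placeholder that you would replace with the current location listed in the AI for Earth data catalogue.

```python
from pyspark.sql import SparkSession

# Placeholder path -- look up the current GBIF snapshot location in the
# Microsoft AI for Earth data catalogue before running.
SNAPSHOT = (
    "wasbs://<container>@<storage-account>.blob.core.windows.net/"
    "occurrence/<snapshot-date>/occurrence.parquet"
)

spark = SparkSession.builder.appName("gbif-azure-demo").getOrCreate()

occ = spark.read.parquet(SNAPSHOT)

# An example of a query that would be slow locally: record counts per kingdom.
occ.groupBy("kingdom").count().orderBy("count", ascending=False).show()
```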

The Global Registry of Scientific Collections (GRSciColl) in 2021

The Global Registry of Scientific Collections, also known as GRSciColl, has been available on GBIF.org since 2019, but it recently got some more attention when we connected it to GBIF occurrences. Now is the perfect time to share a bit of GRSciColl history and what we plan for its future.

A brief history of GRSciColl

First of all, here are a few facts about GRSciColl today, at the start of 2021:

Common things to look out for when post-processing GBIF downloads

This post was updated on 20 April 2022 to accommodate changes in how the dwc:establishmentMeans vocabulary is handled.

Here I present a checklist for filtering GBIF downloads.

In this guide, I will assume you are familiar with R. The guide is also somewhat general, so your solution might differ; it is intended to give you a checklist of common things to look out for when post-processing GBIF downloads.
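The original guide works through these checks in R; purely as an illustration, here is an equivalent sketch of a few common filters in Python/pandas, assuming a GBIF simple (tab-separated) download:

```python
import pandas as pd

# A few of the usual post-download filters on a GBIF simple (TSV) download.
df = pd.read_csv("occurrence.txt", sep="\t", on_bad_lines="skip", low_memory=False)

# Drop records with missing coordinates.
df = df[df["decimalLatitude"].notna() & df["decimalLongitude"].notna()]
# Drop records sitting exactly at (0, 0).
df = df[~((df["decimalLatitude"] == 0) & (df["decimalLongitude"] == 0))]
# Drop fossils, if they are not wanted for the analysis.
df = df[df["basisOfRecord"] != "FOSSIL_SPECIMEN"]
# Drop very imprecise points (threshold is just an example).
df = df[df["coordinateUncertaintyInMeters"].fillna(0) < 100_000]

print(len(df), "records remaining")
```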

(Almost) everything you want to know about the GBIF Species API

Today, we are talking about the GBIF Species API. Although you might not use it directly, you probably encountered it while using the GBIF web portal:

This API is what allows us to navigate the names available on GBIF. I will try to avoid repeating what you can already find in its documentation. Instead, I will attempt to give an overview and answer some of the questions we have received in the past.
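One of the most used calls is the name-matching endpoint, which matches a verbatim name string against the GBIF Backbone Taxonomy; for example:

```python
import requests

# Match a verbatim name string against the GBIF Backbone Taxonomy.
r = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Vulpes vulpes"},
)
match = r.json()
# usageKey is the backbone taxon key; matchType indicates how the name matched.
print(match.get("usageKey"), match.get("scientificName"), match.get("matchType"))
```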

GBIF Issues & Flags

Publishers share datasets, but also manage data quality. GBIF provides access to biodiversity data, but also flags suspicious or missing content. Users analyse the data, but also clean and remove records. Each plays an important role in managing and improving data quality.

GBIF Regional Statistics - 2020

I was asked to prepare some statistics for the GBIF regional meetings being held virtually this year. This blog post is a companion for those meetings.

You can watch a video presentation of the preparation of these meetings here. The presentation of this blog post starts here.

GBIF introduced a regional framework across the GBIF Network a little more than a decade ago, with groups based on clusters of national participants. Soon after the publication of Brooks et al. (2016), GBIF adopted their structure to provide a consistent approach to regional reporting and assessment processes.

This post covers countries and areas with some political status; it does not cover other GBIF Affiliates or Antarctica.

Which tools can I use to share my data on GBIF?

As you probably already know, GBIF.org doesn’t host any data. The system relies on each data provider making their data available online in a GBIF-supported format. It also relies on organizations letting GBIF know where to find these data (in other words, registering the data). But how do you do just that? The good news is that there are several GBIF-compatible systems: they will export or expose your data in the correct format, and several also provide a way to register them as datasets on GBIF.

GBIF occurrence license processing

GBIF is now processing occurrence licenses record by record. Previously, all occurrence licenses defaulted to the dataset license provided by the publisher.

**iNaturalist research-grade observations**

Does Biodiversity Informatics 💘 Wikidata?

Open online APIs are fantastic! You can use someone else’s infrastructure to create workflows, do research and create products without giving anything in return, except acknowledgement. But wait a minute! Why isn’t everyone using them? Why do we create our own data sources and absorb the costs in time and money? Not to mention the duplication of effort.

Frictionless Data and Darwin Core

Frictionless Data is about removing the friction in working with data by creating tools, standards and best practices for publishing data with the Data Package standard, a containerization format for any kind of data. It offers specifications and software for data publication, transport and consumption. As with Darwin Core, the data resources are CSV files, while the data model is described in a JSON structure.
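For orientation only, here is what a minimal Data Package descriptor might look like (normally stored as datapackage.json); the resource and field names below are invented:

```python
import json

# A minimal, hypothetical Data Package descriptor describing one CSV resource
# with two fields. In practice this JSON lives in a datapackage.json file
# next to the CSV it describes.
descriptor = {
    "name": "example-occurrence-package",
    "resources": [
        {
            "name": "occurrences",
            "path": "occurrences.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "scientificName", "type": "string"},
                    {"name": "eventDate", "type": "date"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```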

How to choose a dataset class on GBIF?

If you are a (first-time) publisher on GBIF and you are trying to decide which type of dataset would best fit your data, this blog post is for you. All the records shared on GBIF are organized into datasets. Each dataset is associated with some metadata describing its content (the classic “what, where, when, why, how”). The dataset’s content depends strongly on the dataset’s class. GBIF currently supports four types of dataset: metadata-only, checklist, occurrence and sampling-event datasets.

Understanding basis of record - a living specimen becomes a preserved specimen

Recently a user noticed that there were Asian red pandas (Ailuridae) occurring in North America and wondered if someone had made a mistake. When an occurrence comes from a zoo or botanical garden, it is usually considered a living specimen, but when it comes from a museum it is usually called a preserved specimen. This label helps users filter out records they might not want, such as those coming from zoos.

Search, download, analyze and cite (repeat if necessary)

Finding and accessing data

There is a lot of GBIF-mediated data available: more than 1.3 billion occurrence records covering hundreds of thousands of species in all parts of the world, all free, open and available at the touch of a button. Users can download data through the GBIF.org portal, via the GBIF API, or with one of the third-party tools available for programmatic access, e.g. rgbif. If there is one area in which GBIF has been immensely successful, it is making the data available to users.

Six questions answered about the GBIF Backbone Taxonomy

This past week our informatics team has been updating the Backbone taxonomy on GBIF.org. This is a fairly disruptive process which sometimes involves massive taxonomic changes but DON’T PANIC. This update is a good thing. It means that some of the taxonomic issues reported have been addressed (see for example this issue concerning the Xylophagidae family) and that new species are now visible on GBIF. Plus, it gives me an excellent opportunity to talk about the GBIF backbone taxonomy and answer some of the questions you might have.

Citizen Science on GBIF - 2019

(Figure: citizen science datasets on GBIF plotted with all other (gray) GBIF datasets with more than 100K occurrences.) There are many citizen science datasets with millions of occurrences (eBird, the Swedish Artportalen), and the top three datasets on GBIF are all citizen science datasets. But in terms of number of unique species, only iNaturalist competes with large museum datasets like the Smithsonian NMNH.

GBIF checklist datasets and data gaps

“Checklist dataset” is a catch-all term describing any dataset that primarily contains a list of taxonomic names. The line between a checklist dataset and an occurrence dataset can be blurry.

Sequence-based data on GBIF - What you need to know before analyzing data

As I mentioned in my previous post, a lot more sequence-based data has been made available on GBIF this past year. MGnify alone published 295 datasets totalling 13,285,109 occurrences. Even though most of these occurrences are Bacteria or Chromista, more than a million of them are animals and more than 300,000 are plants. So chances are that, even if you are not interested in bacteria, you will encounter sequence-based data on GBIF.

Sequence-based data on GBIF - Sharing your data

[Edit 2021-09-16]

Important: To find guidance on how to publish Sequence-based data on GBIF, please consult the following guide:

Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Svenningsen C & Schigel D (2020) Publishing DNA-derived data through biodiversity data platforms. v1.0 Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22.

[End edit 2021-09-16]

GBIF is trying to make it easier to share sequence-based data. In fact, this past year alone, we worked with UNITE to integrate species hypotheses for fungi and with EMBL-EBI to publish 295 metagenomics datasets.

Unfortunately, documentation has not kept pace. Although we now have an FAQ on the topic, I thought a blog post with some advice and examples could be useful.

Note that this blog post is not intended to be documentation. The information here is subject to change; feel free to leave a comment if you have any questions or suggestions.

Hunger mapping

Where are we missing biodiversity data?

A hunger map is a map of missing biodiversity data (a biodiversity data gap). The main challenge with hunger mapping is showing that a species should occur in a region even though it has not been recorded there. Hunger maps are important because they could be used to prioritize funding and digitization efforts. Currently, GBIF has no way of telling which species are missing from where. In this blog post I review some potential ways GBIF could make global biodiversity hunger maps.

Will citizen science take over?

Citizen science is scientific research conducted, in whole or in part, by amateur (or non-professional) scientists. Biodiversity observations by citizen scientists have become significant in the last 10 years thanks to projects like:

Using shapefiles on GBIF data with R

Not all filters are born equal

It sometimes happens that users need GBIF data that fall within specific boundaries. The GBIF Portal provides a location filter where it is possible to draw a rectangle or a polygon on the map and get the occurrence records within that shape. However, these tools have limited precision, and occasionally the job calls for more complex shapes than the GBIF Portal currently supports.
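The post itself walks through this in R; as an equivalent sketch, the same filtering can be done in Python with geopandas, assuming a shapefile and a tab-separated GBIF download (file names are placeholders):

```python
import geopandas as gpd
import pandas as pd

# Read the boundary polygons (placeholder shapefile name).
boundary = gpd.read_file("boundary.shp")

# Turn a GBIF download into a GeoDataFrame of points in WGS84.
occ = pd.read_csv("occurrences.tsv", sep="\t")
points = gpd.GeoDataFrame(
    occ,
    geometry=gpd.points_from_xy(occ["decimalLongitude"], occ["decimalLatitude"]),
    crs="EPSG:4326",
)

# Keep only the occurrences that fall inside the shapefile polygons.
inside = gpd.sjoin(points, boundary.to_crs("EPSG:4326"), predicate="within")
print(len(inside), "records inside the boundary")
```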

Sharing images, sounds and videos on GBIF

This blog post covers the publication of multimedia on GBIF. However, it is not intended to be documentation; for more information, please check the references below. NB: GBIF does not host original multimedia files and there is no way to upload pictures to the platform (see the how-to-publish paragraphs).

Media displayed on the GBIF portal

Let’s say that you are looking for pictures of otters, or perhaps the call of a sea eagle.
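For context, multimedia is usually shared through the Simple Multimedia extension of a Darwin Core Archive. A hypothetical row might look like the following (all values invented); note that the identifier must be a direct, publicly accessible link, since GBIF does not host the file itself:

```python
# A hypothetical row of the Simple Multimedia extension, shown as a dict.
# The "identifier" must be a direct, stable URL to the media file; GBIF only
# links to it and displays it, it never stores the file.
multimedia_row = {
    "type": "StillImage",
    "format": "image/jpeg",
    "identifier": "https://example.org/images/otter-001.jpg",
    "title": "Eurasian otter, left flank",
    "creator": "Jane Doe",
    "license": "http://creativecommons.org/licenses/by/4.0/",
}
```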

Finding citizen science datasets on GBIF

Can we automatically label citizen science datasets?

The short answer is yes, partially.

Why label GBIF datasets as “citizen science”?

What is citizen science?

Citizen science is scientific research conducted, in whole or in part, by amateur (or non-professional) scientists. Citizen science is sometimes described as “public participation in scientific research,” participatory monitoring, and participatory action research (wikipedia definition).

Finding gridded datasets

Gridded datasets are a known problem at GBIF: many datasets contain equally spaced points in a regular pattern. These are usually systematic national surveys or data taken from an atlas (so-called “rasterized collection designs”).
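One rough way to spot such a dataset, sketched below under the assumption of a tab-separated GBIF download, is to look at the differences between consecutive unique coordinate values: a single dominant spacing suggests a regular grid.

```python
import pandas as pd

# Within one dataset, look at the differences between consecutive unique
# latitude values. A strong single spike (e.g. exactly 0.1 degrees) hints
# at a regular grid; scattered, irregular differences do not.
df = pd.read_csv("occurrence.txt", sep="\t", usecols=["decimalLatitude"])

lats = df["decimalLatitude"].dropna().round(4).sort_values().unique()
diffs = pd.Series(lats).diff().dropna()
diffs = diffs[diffs > 0]

print(diffs.value_counts().head())  # one dominant spacing suggests gridding
```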

GBIF download trends

Link to app

Explanation of the tool

This tool plots downloads through time for species or other taxonomic groups with more than 25 downloads at GBIF. Downloads at GBIF most often occur through the web interface. In a previous post, we saw that most users download data from GBIF by filtering on scientific name (i.e. taxon key). Since the GBIF index currently sits at over 1 billion records (a 400+ GB CSV), most users simply filter by their taxonomic group of interest and then generate a download.