Exploring Related Records in the Flowering Plant Genus Senegalia in Brazil
In 2020 GBIF released a news item, “New data-clustering feature aims to improve data quality and reveal cross-dataset connections.” Basically, we run an algorithm across datasets shared with GBIF to search for similarities in occurrence data fields such as location, identifiers and dates. Please read this blog post by Marie Grosjean and Tim Robertson for more details on how it works. In general, we can identify linkages between specimens, DNA sequences and literature citations.
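The exact matching rules live inside GBIF's clustering pipeline, but the general idea can be sketched in a few lines: score a pair of records on shared fields and keep pairs above a threshold. Here is a minimal, illustrative Python sketch — the weights and threshold are invented for illustration and are not GBIF's actual rules:

```python
# Illustrative sketch of pairwise record clustering: score two occurrence
# records on shared Darwin Core fields and flag them as potentially related.
# The weights and threshold below are made up for illustration.

def similarity(rec_a, rec_b):
    score = 0
    # identical catalogue numbers are strong evidence of a shared specimen
    if rec_a.get("catalogNumber") and rec_a.get("catalogNumber") == rec_b.get("catalogNumber"):
        score += 2
    # same collecting date
    if rec_a.get("eventDate") and rec_a.get("eventDate") == rec_b.get("eventDate"):
        score += 1
    # coordinates within roughly a kilometre of each other
    try:
        if (abs(rec_a["decimalLatitude"] - rec_b["decimalLatitude"]) < 0.01
                and abs(rec_a["decimalLongitude"] - rec_b["decimalLongitude"]) < 0.01):
            score += 1
    except KeyError:
        pass  # one of the records has no coordinates
    return score

def potentially_related(rec_a, rec_b, threshold=3):
    return similarity(rec_a, rec_b) >= threshold
```

In the real feature, passing pairs are surfaced on GBIF.org as “related records” on the occurrence page.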
Checklist publishing on GBIF - some explanations on taxonID, scientificNameID, taxonConceptID, acceptedNameUsageID, nameAccordingTo
Which data can be shared through GBIF and what cannot
Finding data gaps in the GBIF backbone taxonomy
When publishers supply GBIF with a scientific name, this name is sometimes not found in the GBIF taxonomic backbone. In these cases, the occurrence record gets a data quality flag called taxon match higher rank. This means that GBIF was only able to match the name to a higher rank (genus, family, order …).
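You can check how a name matched using the species match service (`https://api.gbif.org/v1/species/match?name=...`), which reports the outcome in a `matchType` field. A small helper, using an abbreviated, illustrative version of the JSON the endpoint returns:

```python
def is_higher_rank_match(match_response):
    """True if GBIF could only match the name at a higher rank (genus, family, ...)."""
    return match_response.get("matchType") == "HIGHERRANK"

# Abbreviated, illustrative response for an unknown species name that
# could only be matched to its genus:
sample = {
    "matchType": "HIGHERRANK",
    "rank": "GENUS",
    "genus": "Senegalia",
}
```

Exact and fuzzy matches come back with `matchType` `EXACT` or `FUZZY` instead, so the same field drives the “Taxon match fuzzy” flag as well.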
The World Checklist of Vascular Plants (Fabaceae)
The World Checklist of Vascular Plants (WCVP): Fabaceae is a new GBIF-mediated checklist that drastically increases the coverage of the family Fabaceae in the GBIF backbone.
Using Apache Arrow and Parquet with GBIF-mediated occurrences
As written about in a previous blog post, GBIF now has database snapshots of occurrence records on AWS. This allows users to access large tables of GBIF-mediated occurrence records from Amazon S3 remote storage. This access is free of charge.
Identifying potentially related records - How does the GBIF data-clustering feature work?
What are the flags "Collection match fuzzy", "Collection match none", "Institution match fuzzy", "Institution match none" and how to remove them?
You are a data publisher of occurrence data through GBIF.org, you care about your data quality, and you wonder what to do about the issue flags that show up on your occurrences. You might have noticed a new type of flag this year relating to collection and institution codes and identifiers. These flags are the result of our attempt at linking specimen records to our Registry of Scientific Collections (GRSciColl).
GBIF API beginners guide
This is a GBIF API beginners guide.
The GBIF API technical documentation might be a bit confusing if you have never used an API before, so the goal of this guide is to introduce the GBIF API to a semi-technical user who may never have used one.
The purpose of the GBIF API is to give users safe access to the GBIF databases. The GBIF API is also what allows GBIF.org and rgbif to function.
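As a first taste of what the guide covers, the simplest useful call is asking the occurrence search endpoint for a count without returning any records (`limit=0`). A minimal sketch using only Python's standard library — the parameter values here are just examples:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.gbif.org/v1"

def occurrence_count_url(**params):
    """Build an occurrence-search URL that only asks for the record count."""
    params["limit"] = 0  # return no records, just the total
    return f"{API}/occurrence/search?{urlencode(params)}"

def parse_count(payload):
    """Extract the total count from an occurrence-search JSON response."""
    return json.loads(payload)["count"]

# To run a live query (requires internet access):
# with urlopen(occurrence_count_url(country="BR")) as response:
#     print(parse_count(response.read()))
```

Every occurrence search on GBIF.org is ultimately a URL like the one this helper builds, which is why learning the API pays off quickly.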
Did you know that...? - some of the lesser known functionalities around GBIF.org
During the first-ever virtual GBIF 2021 Global Nodes Meeting, GBIFS hosted a “game show”: a one-hour “battle of Nodes vs. helpdesk”. The not-so-hidden goal was to demonstrate some of the lesser known functionalities of GBIF.org through a fun, interactive session.
GBIF and Apache-Spark on AWS tutorial
GBIF now has a snapshot of 1.3 billion occurrence records on Amazon Web Services (AWS). This guide will take you through running Spark notebooks on AWS. The GBIF snapshot is documented here.
You can read previous discussions about GBIF and cloud computing here. The main reason you would want to use cloud computing is to run big data queries that are slow or impractical on a local machine.
You’ve finished an analysis using GBIF-mediated data, you’re writing up your manuscript and checking all the references, but you’re unsure of how to cite GBIF. If you Google it, you’ll probably end up reading our citation guidelines and quickly realize that GBIF is all about DOIs. Datasets have their own DOIs, and downloads of aggregated data also have their own DOIs.
But maybe you didn’t download data through the GBIF.org portal. Maybe you relied on an R package like rgbif or dismo that retrieved data synchronously from the GBIF API? Maybe a grad student downloaded it for you? Maybe you accessed and analyzed the data using a cloud computing service, like Microsoft Azure or Amazon Web Services? In any case, which DOI do you cite if you don’t have one?
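GBIF's answer for these cases is the derived-dataset tool, which mints a new, citable DOI from a list of contributing datasets and the number of records you used from each. Whatever route the data took, that list can be rebuilt from the `datasetKey` field of the records you kept; a minimal Python sketch:

```python
from collections import Counter

def dataset_counts(records):
    """Tally how many of the records used came from each GBIF dataset.

    This per-datasetKey tally is the information the GBIF derived-dataset
    tool asks for when minting a citable DOI for data that was not
    obtained as a regular GBIF.org download.
    """
    return Counter(r["datasetKey"] for r in records if r.get("datasetKey"))
```

You then register these counts (together with your own metadata) to receive a single DOI that credits every contributing dataset.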
GBIF and Apache-Spark on Microsoft Azure tutorial
GBIF now has a snapshot of 1.3 billion occurrence records on Microsoft Azure.
It is hosted by the Microsoft AI for Earth program, which hosts geospatial datasets that are important to environmental sustainability and Earth science. Hosting is convenient because you can now use occurrences in combination with other environmental layers without needing to upload any of it to Azure. You can read previous discussions about GBIF and cloud computing here. The main reason you would want to use cloud computing is to run big data queries that are slow or impractical on a local machine.
The GBIF Registry of Scientific Collections (GRSciColl) in 2021
Common things to look out for when post-processing GBIF downloads
Post was updated on April 20 2022 to accommodate changes to dwc:establishmentMeans vocabulary handling.
Here I present a checklist for filtering GBIF downloads.
In this guide, I will assume you are familiar with R. The guide is also somewhat general, so your solution might differ. It is intended to give you a checklist of common things to look out for when post-processing GBIF downloads.
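The original checklist is written in R; as an illustration, some of the most common filters (missing coordinates, (0, 0) coordinates, fossil and living specimens, absence records) can be sketched in Python like this — the exact set of filters you need depends on your analysis:

```python
# basisOfRecord values that many analyses choose to exclude
EXCLUDED_BASIS = {"FOSSIL_SPECIMEN", "LIVING_SPECIMEN"}

def keep_record(rec):
    """Apply common GBIF post-download filters to one occurrence record:
    drop records without usable coordinates, records at (0, 0) (a frequent
    sign of missing data), unwanted basisOfRecord values and absences."""
    lat, lon = rec.get("decimalLatitude"), rec.get("decimalLongitude")
    if lat is None or lon is None:
        return False
    if lat == 0 and lon == 0:
        return False
    if rec.get("basisOfRecord") in EXCLUDED_BASIS:
        return False
    if rec.get("occurrenceStatus") == "ABSENT":
        return False
    return True
```

In R the same idea is usually a chain of `dplyr::filter()` calls over the download table.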
(Almost) everything you want to know about the GBIF Species API
Today, we are talking about the GBIF Species API. Although you might not use it directly, you have probably encountered it while using the GBIF web portal:
- Typing a scientific name in the GBIF Occurrence search.
- Seeing a “Taxon Match Fuzzy” flag.
- Using the Species Name matching tool.
This API is what allows us to navigate the names available on GBIF. I will try to avoid repeating what you can already find in its documentation. Instead, I will attempt to give an overview and answer some of the questions we have received in the past.
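For example, the species endpoints return one field per Linnean rank (`kingdom`, `phylum`, … `species`) in each name-usage record, which makes it easy to pull out a classification. A small illustrative helper in Python:

```python
# Linnean ranks as they appear as fields in GBIF name-usage JSON records
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def classification(name_usage):
    """Return the (rank, name) pairs present in a GBIF name-usage record,
    ordered from kingdom down to species. Ranks the record lacks are
    simply skipped."""
    return [(rank, name_usage[rank]) for rank in RANKS if rank in name_usage]
```

The sample record below is abbreviated for the example; real responses also carry keys such as `usageKey` and per-rank `*Key` identifiers.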
GBIF Issues & Flags
Publishers share datasets, but also manage data quality. GBIF provides access to biodiversity data, but also flags suspicious or missing content. Users use the data, but also clean it and remove records. Each plays an important role in managing and improving data quality.
GBIF Regional Statistics - 2020
I was asked to prepare some statistics for the GBIF regional meetings being held virtually this year. This blog post is a companion for those meetings.
You can watch a video presentation of the preparation of these meetings here. The presentation of this blog post starts here.
- The North American virtual nodes meeting 2020 was on 5 - 6 May 2020
- The Europe and Central Asia-virtual nodes meeting 2020 was on 11 - 12 May 2020
- The Latin America and Caribbean virtual nodes meeting 2020 will be on 18 - 20 May 2020
- The Africa virtual nodes meeting 2020 will be on 10 - 12 June 2020
GBIF introduced a regional framework across the GBIF Network a little more than a decade ago, with groups based on clusters of national participants. Soon after the publication of Brooks et al. (2016), GBIF adopted their structure to provide a consistent approach to regional reporting and assessment processes.
This post will cover countries/areas with some political status; it will not cover other GBIF Affiliates or Antarctica.
Which tools can I use to share my data on GBIF?
GBIF occurrence license processing
GBIF is now processing occurrence licenses record-by-record.
**iNaturalist research-grade observations**
Previously, all occurrence licenses defaulted to the license of their dataset (provided by the publisher).
Does Biodiversity Informatics 💘 Wikidata?
Open online APIs are fantastic! You can use someone else’s infrastructure to create workflows, do research and create products without giving anything in return, except acknowledgement. But wait a minute! Why is everyone not using them? Why do we create our own data sources and suck up the costs in time and money? Not to mention the duplication of effort.
Frictionless Data and Darwin Core
Frictionless Data is about removing the friction in working with data through the creation of tools, standards, and best practices for publishing data using the Data Package standard, a containerization format for any kind of data. It offers specifications and software around data publication, transport and consumption. As with Darwin Core, the data are presented in CSV files while the data model is described in a JSON structure.
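As a minimal illustration, a Data Package descriptor (`datapackage.json`) for a single Darwin Core style CSV resource could look like the dictionary below — the file name and field list are invented for the example:

```python
import json

# Minimal, illustrative datapackage.json descriptor: one CSV resource
# whose columns are described by a Table Schema, much like a Darwin Core
# meta.xml describes the columns of an archive's data files.
descriptor = {
    "name": "example-occurrences",
    "resources": [
        {
            "name": "occurrences",
            "path": "occurrences.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "occurrenceID", "type": "string"},
                    {"name": "scientificName", "type": "string"},
                    {"name": "eventDate", "type": "date"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```

The parallel with Darwin Core Archives is direct: CSV files carry the data, a JSON (rather than XML) document carries the structure.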
How to choose a dataset class on GBIF?
Understanding basis of record - a living specimen becomes a preserved specimen
Recently a user noticed that there were Asian red pandas (Ailuridae) occurring in North America and wondered if someone had made a mistake. When an occurrence record comes from a zoo or botanical garden, it is usually considered a living specimen, but when it comes from a museum it is usually called a preserved specimen. This label helps users remove records they might not want, such as those from zoos.
Search, download, analyze and cite (repeat if necessary)
Six questions answered about the GBIF Backbone Taxonomy
Downloading occurrences from a long list of species in R and Python
It is now possible to download up to 100,000 names on GBIF!
Until recently it was not possible to download occurrences for more than a few hundred species at the same time, but you can now request up to 100,000 taxonKeys in a single download.
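Such large requests go through the occurrence download API, whose request body contains a predicate: you match each name on your list to a taxonKey, then place all the keys in a single `in` predicate. A Python sketch of building that request body — the keys, creator and address are placeholders:

```python
def taxon_key_download_request(taxon_keys, creator, email):
    """Build a GBIF occurrence-download request body asking for all
    occurrences whose TAXON_KEY is in the given list (up to 100,000 keys).
    The body is POSTed as JSON to the occurrence download endpoint."""
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "format": "SIMPLE_CSV",
        "predicate": {
            "type": "in",
            "key": "TAXON_KEY",
            "values": [str(k) for k in taxon_keys],
        },
    }
```

In practice rgbif (`occ_download`) and pygbif wrap exactly this kind of request for you.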
Citizen Science on GBIF - 2019
Citizen science datasets on GBIF plotted with all other (gray) GBIF datasets with >100K occurrences. There are many citizen science datasets with millions of occurrences (eBird, the Swedish Artportalen), and the top 3 datasets on GBIF are all citizen science datasets. But in terms of number of unique species, only iNaturalist competes with large museum datasets like the Smithsonian NMNH.
Exploring es50 for GBIF
It has been suggested that GBIF could make es50 maps similar to those that organizations like OBIS are already producing. I decided to make one for land animals (graph above).
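es50 is Hurlbert's expected number of species in a random sample of 50 occurrence records. Given the number of records per species in a grid cell, it can be computed directly from the standard rarefaction formula; a minimal Python version:

```python
from math import comb

def es50(counts, n=50):
    """Hurlbert's expected species count in a random sample of n records.

    counts: number of occurrence records per species in a grid cell.
    Each species contributes the probability that it appears at least
    once when n records are drawn without replacement:
        1 - C(N - N_i, n) / C(N, n)
    (math.comb returns 0 when N - N_i < n, so that case needs no
    special handling: the species is then certain to be sampled).
    """
    total = sum(counts)
    if total < n:
        raise ValueError("need at least n records to compute ES(n)")
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)
```

Mapping es50 then just means computing this per grid cell, which is why it is popular for comparing cells with very different sampling effort.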
Big National Checklists
GBIF checklist datasets and data gaps
A checklist dataset is a catch-all term describing any dataset that contains primarily a list of taxonomic names. The lines between a checklist dataset and an occurrence dataset can be blurry.
Sequenced-based data on GBIF - What you need to know before analyzing data
Sequence-based data on GBIF - Sharing your data
Important: To find guidance on how to publish Sequence-based data on GBIF, please consult the following guide:
Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Svenningsen C & Schigel D (2020) Publishing DNA-derived data through biodiversity data platforms. v1.0 Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22.
GBIF is trying to make it easier to share sequence-based data. In fact, this past year alone, we worked with UNITE to integrate species hypotheses for fungi and with EMBL-EBI to publish 295 metagenomics datasets.
Unfortunately, documentation has not been as quick to follow. Although we now have an FAQ on the topic, I thought that anyone could use a blog post with some advice and examples.
Note that this blog post is not intended to be documentation. The information here is subject to change, so feel free to leave comments if you have any questions or suggestions.
Gridded Datasets Update
Where are we missing biodiversity data?
A hunger map is a map of missing biodiversity data (a biodiversity data gap). The main challenge with hunger mapping is showing that a species is absent from the data for a region where it should occur. Hunger maps are important because they could be used to prioritize funding and digitization efforts. Currently, GBIF has no way of telling which species are missing from where. In this blog post I review some potential ways GBIF could make global biodiversity hunger maps.
Will citizen science take over?
Citizen science is scientific research conducted, in whole or in part, by amateur (or non-professional) scientists. Biodiversity observations by citizen scientists have become significant in the last 10 years thanks to projects like:
Using shapefiles on GBIF data with R
Sharing images, sounds and videos on GBIF
Finding citizen science datasets on GBIF
Can we automatically label citizen science datasets?
The short answer is yes, partially.
Why label GBIF datasets as “citizen science”?
What is citizen science?
Citizen science is scientific research conducted, in whole or in part, by amateur (or non-professional) scientists. Citizen science is sometimes described as “public participation in scientific research,” participatory monitoring, and participatory action research (wikipedia definition).
Finding gridded datasets
Gridded datasets are a known problem at GBIF. Many datasets have equally-spaced points in a regular pattern. These datasets are usually systematic national surveys or data taken from some atlas (“so-called rasterized collection designs”).
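One simple way to detect a gridded dataset (in the spirit of the approach in the original post) is to look at the gaps between unique latitude values: in a regular grid, one gap dominates. An illustrative Python sketch — the cutoff for calling a dataset “gridded” is up to you:

```python
from collections import Counter

def dominant_spacing_fraction(latitudes, decimals=4):
    """Fraction of gaps between consecutive unique latitude values that
    equal the most common gap. Values near 1 suggest the points sit on
    a regular grid; the same check can be run on longitudes."""
    uniq = sorted(set(round(lat, decimals) for lat in latitudes))
    gaps = [round(b - a, decimals) for a, b in zip(uniq, uniq[1:])]
    if not gaps:
        return 0.0
    _, freq = Counter(gaps).most_common(1)[0]
    return freq / len(gaps)
```

Running this per dataset, and flagging those where a large fraction of records share one spacing, is roughly how gridded (atlas-style) datasets can be surfaced automatically.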