Here I present a checklist for filtering GBIF downloads.

In this guide, I will assume you are familiar with R. The guide is also somewhat general, so your solution might differ; it is intended to give you a checklist of common things to look out for when post-processing GBIF downloads.

Here is an example of a filtering checklist script that should work for most users. Individual users might want to add or remove some steps.

After the script, I discuss each of these steps in more detail. You can get the Calopteryx xanthostoma.csv file here: gbif_download.

library(dplyr)
library(CoordinateCleaner)

gbif_download = readr::read_tsv("Calopteryx xanthostoma.csv")

gbif_download %>%
  setNames(tolower(names(.))) %>% # set lowercase column names to work with CoordinateCleaner
  filter(occurrencestatus == "PRESENT") %>%
  filter(!is.na(decimallongitude)) %>%
  filter(!is.na(decimallatitude)) %>%
  filter(!basisofrecord %in% c("FOSSIL_SPECIMEN","LIVING_SPECIMEN")) %>%
  filter(!establishmentmeans %in% c("MANAGED", "INTRODUCED", "INVASIVE", "NATURALISED")) %>%
  filter(year >= 1900) %>%
  filter(coordinateprecision < 0.01 | is.na(coordinateprecision)) %>% # smaller value = more precise
  filter(coordinateuncertaintyinmeters < 10000 | is.na(coordinateuncertaintyinmeters)) %>%
  filter(!coordinateuncertaintyinmeters %in% c(301,3036,999,9999)) %>% # known inaccurate defaults
  filter(!(decimallatitude == 0 | decimallongitude == 0)) %>% # remove points on equator or prime meridian
  cc_cen(buffer = 2000) %>% # remove country centroids within 2km
  cc_cap(buffer = 2000) %>% # remove capitals centroids within 2km
  cc_inst(buffer = 2000) %>% # remove zoo and herbaria within 2km
  cc_sea() %>% # remove from ocean
  distinct(decimallongitude, decimallatitude, specieskey, datasetkey, .keep_all = TRUE) %>%
  glimpse() # look at results of pipeline
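
The pipeline above ends in glimpse() and does not save anything. If you want to keep the cleaned table, assign the result instead and write it out; a minimal sketch (the object and file names here are placeholders, not from the original script):

# same pipeline as above, assigned instead of glimpsed
gbif_clean = gbif_download %>%
  setNames(tolower(names(.))) %>%
  filter(occurrencestatus == "PRESENT") # ... and so on through distinct()

readr::write_tsv(gbif_clean, "Calopteryx_xanthostoma_cleaned.tsv")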

It is usually a good idea to move as much of your data filtering as possible into the download stage. Some of the filters above, however, are not available at that stage.

Here is a script that moves most of the filters above into the download API.

library(rgbif)

user="" # GBIF user name
pwd="" # GBIF password
email="" # your email

gbif_download = occ_download(
  type = "and",
  pred("taxonKey", 5052020),
  pred("hasGeospatialIssue", FALSE),
  pred("hasCoordinate", TRUE),
  pred_gte("year", 1900),
  pred_not(pred("basisOfRecord", "FOSSIL_SPECIMEN")),
  pred_not(pred("basisOfRecord", "LIVING_SPECIMEN")),
  pred_or(
    pred_not(pred("establishmentMeans", "MANAGED")),
    pred_not(pred_notnull("establishmentMeans"))
  ),
  pred_or(
    pred_not(pred("establishmentMeans", "INTRODUCED")),
    pred_not(pred_notnull("establishmentMeans"))
  ),
  pred_or(
    pred_not(pred("establishmentMeans", "INVASIVE")),
    pred_not(pred_notnull("establishmentMeans"))
  ),
  pred_or(
    pred_not(pred("establishmentMeans", "NATURALISED")),
    pred_not(pred_notnull("establishmentMeans"))
  ),
  pred_or(
    pred_lt("coordinateUncertaintyInMeters", 10000),
    pred_not(pred_notnull("coordinateUncertaintyInMeters"))
  ),
  format = "SIMPLE_CSV",
  user = user, pwd = pwd, email = email
)

# rest of the pipeline after the download
# these filters are not yet possible at the download stage

# wait for the download to finish, then fetch and import it
occ_download_wait(gbif_download)

gbif_data = occ_download_get(gbif_download) %>%
  occ_download_import()

gbif_data %>%
  setNames(tolower(names(.))) %>% # lowercase column names to work with CoordinateCleaner
  filter(coordinateprecision < 0.01 | is.na(coordinateprecision)) %>%
  filter(!coordinateuncertaintyinmeters %in% c(301,3036,999,9999)) %>%
  filter(!(decimallatitude == 0 | decimallongitude == 0)) %>%
  cc_cen(buffer = 2000) %>% # remove country centroids within 2km
  cc_cap(buffer = 2000) %>% # remove capitals centroids within 2km
  cc_inst(buffer = 2000) %>% # remove zoo and herbaria within 2km
  cc_sea() %>% # remove from ocean
  distinct(decimallongitude, decimallatitude, specieskey, datasetkey, .keep_all = TRUE) %>%
  glimpse() # look at results of pipeline

GBIF default geospatial issues

GBIF can remove common geospatial issues for you at the download stage if you request only records with coordinates and without known geospatial issues.

The following things will be removed:

  1. Zero coordinate : coordinates are exactly (0,0), also known as Null Island.
  2. Country coordinate mismatch : the coordinates fall outside of the given country's polygon.
  3. Coordinate invalid : GBIF is unable to interpret the coordinates.
  4. Coordinate out of range : the coordinates are outside the valid range for decimal latitude and longitude ((-90,90), (-180,180)).

You can do this on the web portal before downloading or in rgbif.

library(rgbif)

# Calopteryx xanthostoma Charpentier, 1825
gbif_download = occ_download(
  pred("taxonKey", 1427020),
  pred("hasGeospatialIssue", FALSE),
  pred("hasCoordinate", TRUE),
  format = "SIMPLE_CSV",
  user = "", pwd = "", email = ""
)

Absence data

GBIF now has a field called occurrenceStatus, which will tell you whether a record represents a presence or an absence.

gbif_download %>%
  filter(occurrenceStatus == "PRESENT")

You can also do this on the web portal before downloading.
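
If you prefer to handle this in an rgbif download request instead, the same filter can be expressed as a predicate (a sketch; add it alongside the other predicates in your occ_download() call):

library(rgbif)

# inside occ_download(...):
pred("occurrenceStatus", "PRESENT")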

Fossils, living specimens, and established by humans

You might want to remove fossils and living specimens, as well as records of species that were not naturally established where they were recorded.

gbif_download %>%
  filter(!basisOfRecord %in% c("FOSSIL_SPECIMEN","LIVING_SPECIMEN")) %>%
  filter(!establishmentMeans %in% c("MANAGED", "INTRODUCED", "INVASIVE", "NATURALISED"))

Often zoos and botanical gardens will fill in the establishmentMeans column with "MANAGED".

Old records

You might also want to remove old records.

gbif_download %>%
  filter(year >= 1900)

You can also do this on the web portal before downloading.

High uncertainty

There are two fields that come with simple CSV downloads that describe location uncertainty.

  1. coordinatePrecision : a decimal representation of the precision of the coordinates.
  2. coordinateUncertaintyInMeters : the uncertainty of the coordinates in meters.

These fields are not frequently filled in by publishers (around 600M occurrences are missing both), so filter with caution. I have kept missing values in my example.

If you want to be sure that a point has an acceptable level of uncertainty or precision for your study, you can remove records with missing values, but this will throw away a lot of otherwise "OK" records.

# I keep missing values here
gbif_download %>%
  filter(
    coordinatePrecision < 0.01 | is.na(coordinatePrecision)
  ) %>%
  filter(
    coordinateUncertaintyInMeters < 10000 | is.na(coordinateUncertaintyInMeters)
  )
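
If you do decide to require explicit values, the stricter variant simply drops the missing ones; a sketch, assuming you can afford the loss of records:

# stricter variant: require both fields to be filled in
gbif_download %>%
  filter(!is.na(coordinatePrecision), coordinatePrecision < 0.01) %>%
  filter(!is.na(coordinateUncertaintyInMeters), coordinateUncertaintyInMeters < 10000)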

You also want to remove records with known default values for coordinateUncertaintyInMeters. These can be GeoLocate centroids or some other default. It is good to remove them because the true uncertainty is usually larger than the stated value.

gbif_download %>%
  filter(!coordinateUncertaintyInMeters %in% c(301,3036,999,9999))
# known inaccurate default values
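
To spot other suspicious defaults in your own download, it can help to tally the most common values first; a minimal sketch:

# the most frequent uncertainty values; oddly popular round numbers are suspect
gbif_download %>%
  count(coordinateUncertaintyInMeters, sort = TRUE)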

Points along equator or prime meridian

Points plotted exactly along the prime meridian or the equator are suspect. As of the writing of this guide, there are 37K occurrences along the equator and 28K along the prime meridian.

gbif_download %>%
  filter(!(decimalLatitude == 0 | decimalLongitude == 0))

# see also CoordinateCleaner::cc_zero()
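
If you want to check how many of your own records sit on either line before removing them, a quick tally works; a minimal sketch:

# count records lying exactly on the equator or the prime meridian
gbif_download %>%
  summarise(
    on_equator  = sum(decimalLatitude == 0, na.rm = TRUE),
    on_meridian = sum(decimalLongitude == 0, na.rm = TRUE)
  )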

Country centroids

You can remove country centroids and province centroids using CoordinateCleaner.

library(CoordinateCleaner)

gbif_download %>%
  cc_cen(
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    buffer = 2000, # radius in meters around each centroid within which records are flagged
    value = "clean",
    test = "both") # check both country and province centroids

Centroids tend to be more common for geocoded museum collections (basisOfRecord "PRESERVED_SPECIMEN"), so you might want to filter centroids only for preserved specimens, since flagged records from other sources are more likely to be false positives.
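
One way to do that, sketched here on the assumption that your columns still have their original camelCase names, is to run the centroid test only on the preserved-specimen subset and bind the other records back afterwards:

library(dplyr)
library(CoordinateCleaner)

specimens <- gbif_download %>% filter(basisOfRecord == "PRESERVED_SPECIMEN")
others    <- gbif_download %>% filter(basisOfRecord != "PRESERVED_SPECIMEN")

# apply the centroid filter only to preserved specimens
specimens_clean <- specimens %>%
  cc_cen(lon = "decimalLongitude", lat = "decimalLatitude", buffer = 2000, value = "clean")

gbif_filtered <- bind_rows(specimens_clean, others)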

You might also want to filter country capitals.

gbif_download %>%
  cc_cap(
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    buffer = 2000, # radius in meters around each capital within which records are flagged
    value = "clean")

Zoo and herbaria locations

Publishers do not always fill in establishmentMeans = "MANAGED" or basisOfRecord = "LIVING_SPECIMEN", so it is usually a good idea to also filter out known zoo and herbarium locations.

library(CoordinateCleaner)

gbif_download %>%
  cc_inst(
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    buffer = 2000,
    value = "clean",
    verbose = TRUE
  )

In the ocean

Obviously this filter should not be used with marine species. For marine taxa you might want to do the opposite and keep only the points at sea (see the sketch after the example below).

library(CoordinateCleaner)

gbif_download %>%
  cc_sea(
    lon = "decimalLongitude",
    lat = "decimalLatitude"
  )
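
For marine taxa, one option is to keep only the records that cc_sea would have removed; a sketch, relying on value = "flagged" returning TRUE for records that pass the test (on land) and FALSE for flagged ones (at sea):

library(CoordinateCleaner)

flags <- cc_sea(
  gbif_download,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  value = "flagged" # TRUE = on land, FALSE = flagged as in the sea
)

marine_points <- gbif_download[!flags, ] # keep only the at-sea records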

Location duplicates

Some of you will want to remove potential location duplicates.

gbif_download %>%
  distinct(decimalLongitude, decimalLatitude, speciesKey, datasetKey, .keep_all = TRUE)

It is probably a good idea to keep datasetKey in this step so you can still cite the download properly later, for example if you post a derived dataset on Zenodo or something similar.
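
For example, the per-dataset record counts that a derived-dataset citation needs can be tallied directly from the filtered data; a minimal sketch:

# remaining records per source dataset, for citing a derived dataset
gbif_download %>%
  count(datasetKey, sort = TRUE)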

In general, removing duplicates based on location is not difficult. GBIF also has an experimental feature for identifying related records. It is, however, not yet optimized for data filtering.

R packages for filtering data

There are currently three R packages for filtering GBIF occurrences:

  1. CoordinateCleaner
  2. biogeo
  3. scrubr

CoordinateCleaner is probably the most complete.

Additional filters

There are some additional things you might want to check for that involve more of a judgement call:

  • outliers
  • metagenomics
  • outside native ranges
  • gridded datasets
  • automated identifications

I have written some companion R scripts for handling these issues as well: link to scripts.