Common things to look out for when post-processing GBIF downloads
Post was updated on April 20 2022 to accommodate changes to dwc:establishmentMeans vocabulary handling.
Here I present a checklist for filtering GBIF downloads.
In this guide, I will assume you are familiar with R. The guide is somewhat general, so your solution might differ. It is intended to give you a checklist of common things to look out for when post-processing GBIF downloads.
Below is an example of a filtering checklist script that would work for most users. Individual users might want to add or remove some steps. After the script, I discuss each of the steps in more detail. I would recommend looking at these two articles before continuing:
library(rgbif)
library(dplyr)
library(CoordinateCleaner)

taxonkey <- name_backbone("Calopteryx xanthostoma")$usageKey

# set up gbif credentials first
# https://docs.ropensci.org/rgbif/articles/gbif_credentials.html
gbif_download <- occ_download(
  pred("taxonKey", taxonkey),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE), # remove GBIF default geospatial issues
  format = "SIMPLE_CSV")

occ_download_wait(gbif_download)

# filtering pipeline
gbif_download %>%
  occ_download_get() %>%
  occ_download_import() %>%
  setNames(tolower(names(.))) %>% # set lowercase column names to work with CoordinateCleaner
  filter(occurrencestatus == "PRESENT") %>%
  filter(!basisofrecord %in% c("FOSSIL_SPECIMEN","LIVING_SPECIMEN")) %>%
  filter(year >= 1900) %>%
  filter(coordinateprecision < 0.01 | is.na(coordinateprecision)) %>%
  filter(coordinateuncertaintyinmeters < 10000 | is.na(coordinateuncertaintyinmeters)) %>%
  filter(!coordinateuncertaintyinmeters %in% c(301,3036,999,9999)) %>%
  filter(decimallatitude != 0 & decimallongitude != 0) %>% # remove points on the equator or prime meridian
  # lon/lat are passed explicitly so the calls do not depend on the package's default column names
  cc_cen(lon = "decimallongitude", lat = "decimallatitude", buffer = 2000) %>% # remove country centroids within 2km
  cc_cap(lon = "decimallongitude", lat = "decimallatitude", buffer = 2000) %>% # remove capitals centroids within 2km
  cc_inst(lon = "decimallongitude", lat = "decimallatitude", buffer = 2000) %>% # remove zoo and herbaria within 2km
  cc_sea(lon = "decimallongitude", lat = "decimallatitude") %>% # remove from ocean
  distinct(decimallongitude,decimallatitude,specieskey,datasetkey, .keep_all = TRUE) %>%
  glimpse() # look at results of pipeline
GBIF default geospatial issues
GBIF removes common geospatial issues by default if you choose to have data with a location.
The following things will be removed:
- Zero coordinate : Coordinates are exactly (0,0), also known as "null island".
- Country coordinate mismatch : The coordinates fall outside of the given country’s polygon.
- Coordinate invalid : GBIF is unable to interpret the coordinates.
- Coordinate out of range : The coordinates are outside of the valid range for decimal lat/lon values (latitude -90 to 90, longitude -180 to 180).
You can do this on the web portal.
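If you already have a download that was made without the hasGeospatialIssue filter, something similar can be approximated afterwards. The sketch below assumes that the issue column of a simple csv download holds the flag names separated by semicolons. The data frame `occ` is made up for illustration.

```r
library(dplyr)

# Toy stand-in for an imported download (made-up rows).
occ <- tibble(
  gbifID = 1:4,
  issue = c("",
            "ZERO_COORDINATE;GEODETIC_DATUM_ASSUMED_WGS84",
            "COUNTRY_COORDINATE_MISMATCH",
            "GEODETIC_DATUM_ASSUMED_WGS84")
)

# The four flags described in the list above.
geo_flags <- c("ZERO_COORDINATE", "COUNTRY_COORDINATE_MISMATCH",
               "COORDINATE_INVALID", "COORDINATE_OUT_OF_RANGE")
pattern <- paste(geo_flags, collapse = "|")

# Keep records whose issue string contains none of the four flags.
occ_clean <- occ %>% filter(!grepl(pattern, issue))
occ_clean
```

Records with other flags, such as an assumed datum, are kept because only the four geospatial flags are matched.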
GBIF now has a field called occurrence status, which will tell you whether an occurrence is a presence or absence.
gbif_download %>% filter(occurrenceStatus == "PRESENT")
You can also do this on the web portal before downloading.
Fossils and living specimens
You might want to remove fossils and living specimens, and non-naturally established species.
gbif_download %>% filter(!basisOfRecord %in% c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))
You might also want to remove old records.
gbif_download %>% filter(year >= 1900)
You can also do this on the web portal before downloading.
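One thing to keep in mind with the year filter is that dplyr's filter() drops rows where the condition evaluates to NA, so records without a year are removed as well. A small sketch on made-up data shows the difference. Whether undated records should be kept is a judgement call for your study.

```r
library(dplyr)

# Toy data: record 3 has no year.
occ <- tibble(
  gbifID = 1:4,
  year = c(1850L, 1950L, NA, 2020L)
)

# Undated records are silently dropped here.
strict <- occ %>% filter(year >= 1900)

# Undated records are kept explicitly here.
lenient <- occ %>% filter(year >= 1900 | is.na(year))

strict$gbifID
lenient$gbifID
```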
There are two fields that come with simple csv downloads that give uncertainty.
- coordinatePrecision : A decimal representation of the precision of the coordinates.
- coordinateUncertaintyInMeters : the uncertainty of the coordinates in meters.
These fields are not frequently used by publishers (around 600M occurrences do not fill in either), so filter with caution. I have kept missing values in my example.
If you want to be sure that a point has an acceptable level of uncertainty or precision for your study, you can remove those with missing values, but this will remove a lot of “ok” records.
# I keep missing values here
gbif_download %>%
  filter(
    coordinatePrecision < 0.01 |
    is.na(coordinatePrecision)
  ) %>%
  filter(
    coordinateUncertaintyInMeters < 1000 |
    is.na(coordinateUncertaintyInMeters)
  )
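Before deciding whether to keep or drop missing values, it can help to check how often the two fields are actually filled in for your own download. This is a minimal sketch on a made-up data frame `occ`. With real data you would run the summarise() step on the imported download.

```r
library(dplyr)

# Toy data with gaps in both fields.
occ <- tibble(
  coordinatePrecision = c(NA, 0.001, NA, NA),
  coordinateUncertaintyInMeters = c(50, NA, NA, 999)
)

# Share of records missing each field, and share missing both.
missing_summary <- occ %>%
  summarise(
    n = n(),
    missing_precision = mean(is.na(coordinatePrecision)),
    missing_uncertainty = mean(is.na(coordinateUncertaintyInMeters)),
    missing_both = mean(is.na(coordinatePrecision) &
                        is.na(coordinateUncertaintyInMeters))
  )
missing_summary
```

If most records are missing both fields, removing missing values would discard most of the download.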
You also want to remove records with known default values for coordinateUncertaintyInMeters. These can be GeoLocate centroids or some other default. It is good to remove them because the true uncertainty is usually larger than the stated value.
gbif_download %>% filter(!coordinateUncertaintyInMeters %in% c(301,3036,999,9999)) # known inaccurate default values
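To see how much these default values matter for a particular download, you can tally them before removing them. The sketch below uses made-up data. The vector `defaults` simply repeats the values from the filter above.

```r
library(dplyr)

# Toy data: two records with 301 and one with 9999.
occ <- tibble(
  coordinateUncertaintyInMeters = c(301, 301, 50, 9999, NA)
)

defaults <- c(301, 3036, 999, 9999)

# Number of records carrying each default value.
default_counts <- occ %>%
  filter(coordinateUncertaintyInMeters %in% defaults) %>%
  count(coordinateUncertaintyInMeters, name = "n_records")
default_counts
```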
Points along equator or prime meridian
Some points are plotted exactly along the prime meridian or equator.
As of the writing of this guide there are 37K occurrences along the equator and 28K along the prime meridian.
gbif_download %>% filter(decimalLatitude != 0 & decimalLongitude != 0) # see also CoordinateCleaner::cc_zero()
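Negated conditions make it easy to confuse dropping only the point (0,0) with dropping every record that sits on either axis. The sketch below uses made-up coordinates to show both variants side by side.

```r
library(dplyr)

# Toy data: null island, one point on the equator,
# one on the prime meridian, and one ordinary point.
occ <- tibble(
  gbifID = 1:4,
  decimalLatitude = c(0, 0, 45.2, 45.2),
  decimalLongitude = c(0, 12.5, 0, 12.5)
)

# Drops only the record at exactly (0,0).
only_null_island_removed <- occ %>%
  filter(!(decimalLatitude == 0 & decimalLongitude == 0))

# Drops every record on the equator or the prime meridian.
no_zero_axis <- occ %>%
  filter(decimalLatitude != 0 & decimalLongitude != 0)

only_null_island_removed$gbifID
no_zero_axis$gbifID
```

Only the second variant removes points along the equator and the prime meridian.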
You can remove country centroids and province centroids using CoordinateCleaner.
library(CoordinateCleaner)

gbif_download %>%
  cc_cen(
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    buffer = 2000, # radius of circle around centroid to look for centroids
    value = "clean",
    test = "both")
Centroids tend to be more common for geocoded museum collections (PRESERVED_SPECIMEN), so you might want to only filter centroids for preserved specimens, since other occurrences might be false positives.
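One way to apply a filter only to preserved specimens is to split the data, clean the specimen part, and bind the two parts back together. The helper `clean_only_specimens()` below is hypothetical and not part of CoordinateCleaner. So that the sketch runs without the package, it is demonstrated with a made-up stand-in cleaner, `drop_lat_10()`.

```r
library(dplyr)

# Hypothetical helper: run `cleaner` on preserved specimens only
# and leave all other records untouched.
clean_only_specimens <- function(occ, cleaner) {
  # A missing basisOfRecord counts as "not a specimen".
  is_specimen <- occ$basisOfRecord %in% "PRESERVED_SPECIMEN"
  specimens <- occ[is_specimen, ]
  others <- occ[!is_specimen, ]
  if (nrow(specimens) > 0) specimens <- cleaner(specimens)
  bind_rows(specimens, others)
}

# Toy data: records 1, 3 and 4 share the same "suspicious" location.
occ <- tibble(
  gbifID = 1:4,
  basisOfRecord = c("PRESERVED_SPECIMEN", "PRESERVED_SPECIMEN",
                    "HUMAN_OBSERVATION", NA),
  decimalLatitude = c(10, 20, 10, 10),
  decimalLongitude = c(10, 20, 10, 10)
)

# Stand-in cleaner for the demo; it drops records at latitude 10.
drop_lat_10 <- function(d) d %>% filter(decimalLatitude != 10)

result <- clean_only_specimens(occ, drop_lat_10)
result$gbifID
```

In real use, the stand-in would be replaced by a wrapper such as `function(d) cc_cen(d, lon = "decimalLongitude", lat = "decimalLatitude", buffer = 2000, value = "clean", test = "both")`. Note that the row order changes, since specimens come first in the result.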
You might also want to filter country capitals.
gbif_download %>%
  cc_cap(
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    buffer = 2000, # radius of circle around capital to look for records
    value = "clean")
Zoo and herbaria locations
Publishers do not always fill in the establishmentMeans=“MANAGED” or basisOfRecord=“LIVING_SPECIMEN”, so it is usually good to filter known zoo and herbaria locations.
library(CoordinateCleaner)

gbif_download %>%
  cc_inst(
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    buffer = 2000,
    value = "clean",
    verbose = TRUE
  )
In the ocean
This filter should obviously not be used with marine species. For marine species, you might want to do the opposite and keep only records in the ocean.
library(CoordinateCleaner)

gbif_download %>%
  cc_sea(
    lon = "decimalLongitude",
    lat = "decimalLatitude"
  )
Some of you will want to remove potential location duplicates.
gbif_download %>% distinct(decimalLongitude,decimalLatitude,speciesKey,datasetKey, .keep_all = TRUE)
It is probably a good idea to keep the datasetKey for citing the download later, if you post a derived dataset on Zenodo or something similar.
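As a rough illustration, the number of remaining records per dataset can be tallied after removing duplicates. My understanding is that a derived dataset citation is built from dataset keys and record counts, but treat that as an assumption and check the current GBIF guidance. The keys and coordinates below are made up.

```r
library(dplyr)

# Toy data: rows 1 and 2 are location duplicates within one dataset.
occ <- tibble(
  decimalLongitude = c(1.5, 1.5, 2.0, 2.0),
  decimalLatitude = c(43.1, 43.1, 44.0, 44.0),
  speciesKey = c(1L, 1L, 1L, 1L),
  datasetKey = c("dataset-a", "dataset-a", "dataset-a", "dataset-b")
)

deduped <- occ %>%
  distinct(decimalLongitude, decimalLatitude, speciesKey, datasetKey,
           .keep_all = TRUE)

# Records contributed by each dataset after filtering.
citation_counts <- deduped %>% count(datasetKey, name = "count")
citation_counts
```

Because datasetKey is part of the distinct() call, the same location reported by two different datasets is kept twice, as with rows 3 and 4 here.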
In general, removing duplicates based on location is not difficult. GBIF also has an experimental feature for identifying related records. It is, however, not optimized for data filtering yet.
R packages for filtering data
There are currently 3 R packages for filtering GBIF occurrences:
CoordinateCleaner is probably the most complete.
There are some additional things you might want to check for which involve more judgement calls:
- outside native ranges
- gridded datasets
- automated identifications
I have written some companion R scripts for handling these issues as well.