It is now possible to download up to 100,000 names on GBIF!

Until recently, it was not possible to download occurrences for more than a few hundred species at a time. Now a single download request can include up to 100,000 names (taxonkeys).

NOTE: If your request can be summarized by a higher taxon, it still makes more sense to download that taxon group directly. For example, downloading all dragonflies, all mammals, or all vascular plants requires nothing special.

Downloads through the web interface are still limited to around 200 names (taxonkeys) at any rank (species, genus, family, kingdom). This is due to browser limitations: a search for 300 bird species, for example, will fail.

Downloading occurrences for 60,000 tree species using rgbif and taxize

NOTE: This post was updated on Feb 9 2022. Please use the latest version of rgbif.

One good reason to download data using a long list of names is that your group of interest is non-monophyletic. Trees are a non-monophyletic group comprising plant species that have independently evolved a trunk and branches. Botanic Gardens Conservation International maintains a list of more than 60,000 tree species. You can download a csv here, or the one I used for the example below here.

Matching and downloading takes around 30 minutes, so run with fewer names if you don’t want to wait that long. This requires the latest version of rgbif.

install.packages("rgbif")

The important part here is to use rgbif::occ_download with pred_in, and to fill in your GBIF credentials (or follow the instructions here).

# fill in your gbif.org credentials 
user <- "" # your gbif.org username 
pwd <- "" # your gbif.org password
email <- "" # your email 
library(dplyr)
library(readr)  
library(rgbif) # for occ_download

# The 60,000 tree names file I downloaded from BGCI
file_url <- "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"

# match the names 
gbif_taxon_keys <- 
  readr::read_csv(file_url) %>%
  head(1000) %>% # use only the first 1000 names for testing; remove for the full list
  pull("Taxon name") %>%
  name_backbone_checklist() %>% # match to the GBIF backbone
  filter(matchType != "NONE") %>% # keep only matched names
  pull(usageKey) # get the gbif taxonkeys

# gbif_taxon_keys should be a long vector like this c(2977832,2977901,2977966,2977835,2977863)

# !!very important here to use pred_in!!
occ_download(
  pred_in("taxonKey", gbif_taxon_keys),
  format = "SIMPLE_CSV",
  user = user, pwd = pwd, email = email
)

The results should now be on your downloads user page https://www.gbif.org/user/download.
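You can also check a download's progress programmatically: occ_download returns a download key, and the GBIF API exposes each download's metadata (status, DOI, record count) at a public endpoint. A minimal Python sketch using only the standard library; the helper names are mine:

```python
import json
from urllib.request import urlopen

API = "https://api.gbif.org/v1/occurrence/download/"

def status_url(download_key):
    # Metadata endpoint for one download (status, doi, totalRecords, ...)
    return API + download_key

def download_status(download_key):
    # Poll the API until this returns "SUCCEEDED"; other values include
    # "PREPARING" and "RUNNING" (network call, sketched here untested)
    with urlopen(status_url(download_key)) as response:
        return json.load(response)["status"]
```

Once the status is SUCCEEDED, the zip can be fetched from the link on your downloads page (or, in R, with rgbif's occ_download_get).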

More complex downloads are also possible with the new rgbif occ_download interface:

In the example below, I download tree occurrences that:

  • match the tree species from above: pred_in("taxonKey", gbif_taxon_keys)
  • have one of these basis-of-record values: pred_in("basisOfRecord", c('PRESERVED_SPECIMEN','HUMAN_OBSERVATION','OBSERVATION','MACHINE_OBSERVATION'))
  • are above 5000 m elevation: pred_gt("elevation", 5000)
  • are found in the United States: pred("country", "US")
  • have coordinates: pred("hasCoordinate", TRUE)
  • have no GBIF geospatial issues: pred("hasGeospatialIssue", FALSE)
  • are from the year 1990 or later: pred_gte("year", 1990)

# use the matched gbif_taxon_keys from above 
occ_download(
  pred_in("taxonKey", gbif_taxon_keys),
  pred_in("basisOfRecord", c('PRESERVED_SPECIMEN','HUMAN_OBSERVATION','OBSERVATION','MACHINE_OBSERVATION')),
  pred_gt("elevation", 5000),
  pred("country", "US"),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred_gte("year", 1990),
  format = "SIMPLE_CSV",
  user = user, pwd = pwd, email = email
)

Building a complex download request without rgbif in R

You can also make complex downloads without using rgbif. In this example, I download all tree species that occur in Chile (CL). Some other predicate examples can be found at https://www.gbif.org/developer/occurrence#download.

Fill in your gbif.org credentials

user <- "" # your gbif.org username 
pwd <- "" # your gbif.org password
email <- "" # your email 
library(taxize)
library(purrr)
library(tibble)
library(dplyr)
library(magrittr) # for the %T>% pipe
library(roperators) # for %+% string operator

# pipeline for processing sci names -> downloads 

# The 60,000 tree names file I downloaded from BGCI
file_url <- "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"

# match the names 
readr::read_csv(file_url) %>% 
  pull("Taxon name") %>% # use fewer names if you just want to test 
  taxize::get_gbifid_(method = "backbone") %>% # match names to the GBIF backbone to get taxonkeys
  imap(~ .x %>% mutate(original_sciname = .y)) %>% # add the original name back into each data.frame
  bind_rows() %T>% # combine all data.frames into one
  readr::write_tsv(file = "all_matches.tsv") %>% # save as a side effect for you to inspect if you want
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>% # keep only exactly matched, accepted names
  filter(kingdom == "Plantae") %>% # remove anything that might have matched to a non-plant
  pull(usagekey) %>% # get the gbif taxonkeys
  paste(collapse = ",") %>% 
  paste('{
  "creator": "' %+% user %+% '",
  "notification_address": [
    "' %+% email %+% '"
  ],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "in",
        "key": "COUNTRY",
        "values": ["CL"]
      },
      {
        "type": "in",
        "key": "TAXON_KEY",
        "values": [', ., ']
      }
    ]}}', sep = "") %>% # create the json string 
  writeLines(file("request.json")) # save the json request to use in httr::POST

request.json will look like this but with many more values for TAXON_KEY:


{
  "creator": "jwaller",
  "notification_address": [
    "jwaller@gbif.org"
  ],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "in",
        "key": "COUNTRY",
        "values": ["CL"]
      },
      {
        "type": "in",
        "key": "TAXON_KEY",
        "values": [2977832, 2977901, 2977966, 2977835, 2977863, 2977814, 8322626]
      }
    ]
  }
}
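Pasting JSON together from strings works, but a stray quote can easily break it. If you have Python available, the same request body can be assembled as a plain dict and serialized with json.dumps, which handles quoting and escaping for you. A sketch with placeholder credentials and a short key list:

```python
import json

def build_request(user, email, taxon_keys, country="CL"):
    # Same download request as the JSON above, built as a dict;
    # taxon keys are serialized as strings here
    query = {
        "creator": user,
        "notification_address": [email],
        "sendNotification": True,
        "format": "SIMPLE_CSV",
        "predicate": {
            "type": "and",
            "predicates": [
                {"type": "in", "key": "COUNTRY", "values": [country]},
                {"type": "in", "key": "TAXON_KEY",
                 "values": [str(k) for k in taxon_keys]},
            ],
        },
    }
    return json.dumps(query, indent=2)

# write the request body for httr::POST (or requests) to upload
with open("request.json", "w") as f:
    f.write(build_request("jwaller", "jwaller@gbif.org", [2977832, 2977901]))
```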


Now run the download job.

url <- "https://api.gbif.org/v1/occurrence/download/request"

library(httr)

POST(url = url, 
     config = authenticate(user, pwd), 
     add_headers("Content-Type" = "application/json"),
     body = upload_file("request.json"), # path to your local file
     encode = 'json') %>% 
  content(as = "text")

The results should now be on your downloads user page https://www.gbif.org/user/download.

Example using Python

The same example is also available in Python here. Note that this particular code doesn’t use the pygbif library, but requests and the GBIF API directly. It calls two functions available in the same folder; just download this file before running the following code:

import pandas as pd
import requests
import json
import io
from functions_query_from_species_list import *

login = ""
password = ""
URL_species_file = "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"

# Get Taxon Keys
species_list = pd.read_csv(URL_species_file, encoding='latin-1')
taxon_keys = match_species(species_list, "Taxon name")

# filter keys however you see fit
key_list = taxon_keys.loc[(taxon_keys["matchType"]=="EXACT") & (taxon_keys["status"]=="ACCEPTED")].usageKey.tolist()

# Make download query
download_query = {}
download_query["creator"] = ""
download_query["notificationAddresses"] = [""]
download_query["sendNotification"] = False # if set to True, don't forget to add a notification address above
download_query["format"] = "SIMPLE_CSV"
download_query["predicate"] = {
    "type": "in",
    "key": "TAXON_KEY",
    "values": key_list
}

# Generate download
create_download_given_query(login, password, download_query)

Citing your download

If you end up using your download in a research paper, you will want to cite the download’s DOI. Please see these citation guidelines for properly citing your download. Good citation practices ensure scientific transparency and reproducibility by guiding other researchers to the original sources of information.
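The DOI is shown on the download page and is also returned in the download's API metadata (the "doi" field). The suggested citation for a download follows a simple pattern; a small illustrative helper (the example DOI below is a made-up placeholder, and the citation guidelines above have the exact wording):

```python
def download_citation(doi, accessed):
    # Citation string of the general shape recommended for
    # GBIF occurrence downloads
    return f"GBIF.org ({accessed}) GBIF Occurrence Download https://doi.org/{doi}"

# e.g. download_citation("10.15468/dl.example", "9 February 2022")
```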

Acknowledgements

GBIF thanks the project Tracking Invasive Alien Species (TrIAS) for funding the work required to provide these enhancements to all users of GBIF. TrIAS is a Belgian open science project that makes extensive use of GBIF services, see e.g. the methodology for their national checklist of alien species.