It is now possible to download up to 100,000 names on GBIF!

Until recently it was not possible to download occurrences for more than a few hundred species at a time, but you can now request up to 100,000 species names (taxonkeys) in a single download.

NOTE: If your request can easily be summarized by a higher taxon group, it still makes more sense to download just that taxon group. For example, if you just want to download all dragonflies, all mammals, or all vascular plants, these requests don't require anything special.
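A higher-taxon download like that needs only a single taxonKey. Here is a minimal sketch using the same rgbif::occ_download interface as in the example below; 359 is assumed to be the GBIF backbone key for Mammalia (verify with rgbif::name_backbone("Mammalia") if unsure):

library(rgbif) # for occ_download

# one higher-taxon download; no long species list needed
# 359 is assumed here to be the backbone taxonKey for Mammalia
occ_download(
  "taxonKey = 359",
  format = "SIMPLE_CSV",
  user = user, pwd = pwd, email = email) # your gbif.org credentials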

Downloads through the web interface are still limited to around 200 names (taxonkeys of any rank: species, genus, family, kingdom) due to limitations in the browser. A search for 300 bird species, for example, fails.

Downloading occurrences for 60,000 tree species using rgbif and taxize

One good reason to download data using a long list of names is if your group of interest is non-monophyletic. Trees are a non-monophyletic group that includes a variety of plant species that have independently evolved a trunk and branches. Botanic Gardens Conservation International maintains a list of more than 60,000 tree species. You can download a CSV here or the one I used for the example below here.

Matching and downloading take around 30 minutes, so run with fewer names if you don't want to wait that long. This requires the latest development version of rgbif.

install.packages("devtools")
devtools::install_github("ropensci/rgbif")
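If you want to confirm that the development version installed correctly, check the package version:

packageVersion("rgbif") # print the installed rgbif version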

The important part here is to use rgbif::occ_download with type = "in" and to fill in your GBIF credentials.

# fill in your gbif.org credentials 
user <- "" # your gbif.org username 
pwd <- "" # your gbif.org password
email <- "" # your email 
library(dplyr)
library(purrr)
library(readr)  
library(magrittr) # for %T>% pipe
library(rgbif) # for occ_download
library(taxize) # for get_gbifid_

# The 60,000 tree names file I downloaded from BGCI
file_url <- "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"

# match the names 
gbif_taxon_keys <- 
  readr::read_csv(file_url) %>% 
  pull("Taxon name") %>% # use fewer names if you want to just test 
  taxize::get_gbifid_(method = "backbone") %>% # match names to the GBIF backbone to get taxonkeys
  imap(~ .x %>% mutate(original_sciname = .y)) %>% # add original name back into data.frame
  bind_rows() %T>% # combine all data.frames into one
  readr::write_tsv(path = "all_matches.tsv") %>% # save as side effect for you to inspect if you want
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>% # get only accepted and exactly matched names
  filter(kingdom == "Plantae") %>% # remove anything that might have matched to a non-plant
  pull(usagekey) # get the gbif taxonkeys

# !!very important here to use "type=in"!!
# make the download request to GBIF 
occ_download(
  sprintf("taxonKey = %s", paste0(gbif_taxon_keys, collapse = ",")),
  "hasGeospatialIssue = FALSE",
  type = "in", 
  format = "SIMPLE_CSV",
  user = user, pwd = pwd, email = email)

The results should now be on your downloads user page https://www.gbif.org/user/download.
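You can also check on and retrieve the download from within R using rgbif's own helpers. A quick sketch; the key below is a placeholder for the download key that occ_download() returned:

# poll the download and import the result once it is ready
key <- "0001234-150116162929234" # placeholder; use your real download key
occ_download_meta(key) # repeat until the status is SUCCEEDED
occ_download_get(key, overwrite = TRUE) %>% 
  occ_download_import() # read the SIMPLE_CSV result into a data.frame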

NOTE: As of the writing of this post, complex downloads with extra options like "hasGeospatialIssue = FALSE" or "basisOfRecord = OBSERVATION" will not work with rgbif::occ_download, but this is being worked on. See the discussions here.

Building a complex request yourself in R

While rgbif is being improved, you can still make complex downloads (with more than one predicate) using very large taxon lists if you build the JSON request yourself. In this example I download all tree species that occur inside Chile (country code CL). Some other predicate examples can be found at https://www.gbif.org/developer/occurrence#download.
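For reference, each predicate in such a request is a small JSON object. An "equals" predicate, for example, looks like this (following the predicate style documented at the link above):

{
  "type": "equals",
  "key": "BASIS_OF_RECORD",
  "value": "OBSERVATION"
}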

Fill in your gbif.org credentials

user <- "" # your gbif.org username 
pwd <- "" # your gbif.org password
email <- "" # your email 
library(taxize)
library(purrr)
library(tibble)
library(dplyr)
library(magrittr) # for the %T>% pipe
library(roperators) # for %+% string operator

# pipeline for processing sci names -> downloads 

# The 60,000 tree names file I downloaded from BGCI
file_url <- "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"

# match the names 
readr::read_csv(file_url) %>% 
  pull("Taxon name") %>% # use fewer names if you want to just test 
  taxize::get_gbifid_(method = "backbone") %>% # match names to the GBIF backbone to get taxonkeys
  imap(~ .x %>% mutate(original_sciname = .y)) %>% # add original name back into data.frame
  bind_rows() %T>% # combine all data.frames into one
  readr::write_tsv(path = "all_matches.tsv") %>% # save as side effect for you to inspect if you want
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>% # get only accepted and exactly matched names
  filter(kingdom == "Plantae") %>% # remove anything that might have matched to a non-plant
  pull(usagekey) %>% # get the gbif taxonkeys
  paste(collapse = ",") %>% 
  paste('{
    "creator": "' %+% user %+% '",
    "notification_address": ["' %+% email %+% '"],
    "sendNotification": true,
    "format": "SIMPLE_CSV",
    "predicate": {
      "type": "and",
      "predicates": [
        {
          "type": "in",
          "key": "COUNTRY",
          "values": ["CL"]
        },
        {
          "type": "in",
          "key": "TAXON_KEY",
          "values": [', ., ']
        }
      ]
    }
  }', collapse = "") %>% # create the json string
  writeLines(file("request.json")) # save the json request to use in httr::POST

request.json will look like this but with many more values for TAXON_KEY:


{
  "creator": "jwaller",
  "notification_address": ["jwaller@gbif.org"],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "in",
        "key": "COUNTRY",
        "values": ["CL"]
      },
      {
        "type": "in",
        "key": "TAXON_KEY",
        "values": [2977832, 2977901, 2977966, 2977835, 2977863, 2977814, 8322626]
      }
    ]
  }
}
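Hand-built JSON is easy to break, so it can be worth validating the file before posting it. A small sketch using jsonlite (an extra package, not otherwise used in this post):

# sanity check: is request.json well-formed JSON?
jsonlite::validate(readr::read_file("request.json")) # should return TRUE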

Now run the download job.

url = "http://api.gbif.org/v1/occurrence/download/request"

library(httr)

POST(url = url, 
config = authenticate(user, pwd), 
add_headers("Content-Type: application/json"),
body = upload_file("request.json"), # path to your local file
encode = 'json') %>% 
content(as = "text")

The results should now be on your downloads user page https://www.gbif.org/user/download.

Example using Python

The same example is also available in Python here. Note that this particular code doesn't use the pygbif library, but rather the requests library and the GBIF API directly. It calls two functions available in the same folder; just download this file before running the following code:

import pandas as pd
import requests
import json
import io
from functions_query_from_species_list import *

login = ""
password = ""
URL_species_file = "https://data-blog.gbif.org/post/2019-07-11-downloading-long-species-lists-on-gbif_files/global_tree_search_trees_1_3.csv"

# Get Taxon Keys
species_list = pd.read_csv(URL_species_file, encoding='latin-1')
taxon_keys = match_species(species_list, "Taxon name")

# filter keys however you see fit
key_list = taxon_keys.loc[(taxon_keys["matchType"]=="EXACT") & (taxon_keys["status"]=="ACCEPTED")].usageKey.tolist()

# Make download query
download_query = {}
download_query["creator"] = ""
download_query["notificationAddresses"] = [""]
download_query["sendNotification"] = True
download_query["format"] = "SIMPLE_CSV"
download_query["predicate"] = {
    "type": "in",
    "key": "TAXON_KEY",
    "values": key_list
}

# Generate download
create_download_given_query(login, password, download_query)

Acknowledgements

GBIF thanks the project Tracking Invasive Alien Species (TrIAS) for funding the work required to provide these enhancements to all users of GBIF. TrIAS is a Belgian open science project that makes extensive use of GBIF services, see e.g. the methodology for their national checklist of alien species.