GBIF has an experimental feature that allows users to download data from the GBIF database in SQL format. Contact helpdesk@gbif.org to request access. https://techdocs.gbif.org/en/data-use/api-sql-downloads

If your download can be formulated using the traditional predicate downloads, it is usually going to be faster to use regular downloads.

The experimental Occurrence SQL Download API allows users to query GBIF occurrences using SQL. In contrast to the Predicate Download API, the SQL API allows selection of the columns of interest and generation of summary views of GBIF data.

SQL downloads, like predicate downloads, require you to have a GBIF user account.

SQL Download Workflow

The first step is to prepare your query. There is only one table available for querying, the occurrence table. You can check if the query is ok using query validation. You can also check what fields are available in the occurrence table using this endpoint https://api.gbif.org/v1/occurrence/download/describe/sql. There are +400 columns available.

SELECT datasetKey, countryCode, COUNT(*) FROM occurrence WHERE continent = 'EUROPE' GROUP BY datasetKey, countryCode

This query should be included in a json POST request, and saved to a file named, for example, query.json.

{
  "sendNotification": true,
  "notificationAddresses": [
    "userEmail@example.org" 
  ],
  "format": "SQL_TSV_ZIP", 
  "sql": "SELECT datasetKey, countryCode, COUNT(*) FROM occurrence WHERE continent = 'EUROPE' GROUP BY datasetKey, countryCode" 
}

The request should then be sent as a POST request to the endpoint https://api.gbif.org/v2/occurrence/download/request

curl --include --user YOUR_GBIF_USERNAME:YOUR_PASSWORD --header "Content-Type: application/json" --data @query.json https://api.gbif.org/v1/occurrence/download/request

The download will appear in your GBIF user account.

SQL examples - Multi-dimension Counts

One common query that is difficult to do with the traditional downloads interface is to get a occurrence counts of species by multiple dimensions. This query gets a table with countries and species with the most occurrences published to GBIF without having to download a large table and do the aggregation locally.

SELECT publishingcountry, specieskey, COUNT(*) as occurrence_count
FROM occurrence
WHERE publishingcountry IS NOT NULL AND specieskey IS NOT NULL
GROUP BY publishingcountry, specieskey
ORDER BY occurrence_count DESC;
publishingcountry specieskey occurrence_count
US 2495347 23467566
US 2490384 21442333
US 9510564 20975419
US 2482507 20396610
US 2482593 17855746
US 9761484 17814101

SQL examples - Time Series

Another interesting query would be to get a times series of number of species “collected/observed” over years, grouped by basis of record.

SELECT
    basisOfRecord,
    "year",
    COUNT(DISTINCT speciesKey) AS unique_species_count
FROM
    occurrence
GROUP BY
    basisOfRecord,
    "year";

Note that "year" needs to be double quoted as it is a reserve word. This is true also for other reserve words like "month", "day" etc. This graphic shows the rising influence of Human Observations in GBIF mediated occurrences data.

SQL examples - Grid Functions

Making a global map of unique species counts per grid cell is a common task, but because it requires a spatial join with the chosen spatial grid, it can be difficult to do without working with a large amount occurrence records.

For this reason GBIF’s SQL downloads provide support for a few pre-defined grid functions. These functions will return a grid cell code for each occurrence, which can then be used to aggregate or plot the data.

  • EEA Reference Grid, GBIF_EEARGCode
  • Military Grid Reference System, GBIF_MGRSCode
  • Quarter degree cell code, GBIF_QDGCCode
  • ISEA3H Grid cell code, GBIF_ISEA3HCode

Below is example of working with the Military Grid Reference System (MGRS) grid. This example will return a table with unique species counts per grid cell and an attached grid code, which in this case is mgrs.

SELECT 
  GBIF_MGRSCode(
    100000, 
    decimalLatitude,
    decimalLongitude,
    0 
  ) AS mgrs,
  COUNT(DISTINCT speciesKey) AS unique_species_count
FROM
  occurrence
GROUP BY
  mgrs

The grid code can then be used to join with a shapefile or geojson file that contains the grid cells.

This image uses shapefiles from this repository.

The sql grid functions were originally designed to be used with creating species occurrence cubes. Therefore a randomization parameter was supported COALESCE(coordinateUncertaintyInMeters, 0). This should be set to 0 if you want to use the grid functions with no randomization.

SELECT 
  GBIF_EEARGCode(
    10000, 
    decimalLatitude,
    decimalLongitude,
    0
  ) AS cellcode,
  COUNT(DISTINCT speciesKey) AS unique_species_count
FROM
  occurrence
GROUP BY
  cellcode

The EEA reference grid example can be found here.

https://sdi.eea.europa.eu/data/93315b78-089d-43a5-ac76-b3df627b2e4cf

Supported SQL

Only SELECT queries are supported, and only queries against a single table named occurrence. JOIN queries and sub-queries are not allowed. Selecting * is also not allowed. One must explicitly select the columns needed.

GROUP BY queries are supported, as are basic SQL window functions (OVER and PARTITION BY). ORDER BY is supported.

Most common SQL operators and functions are supported, such as AND, OR, NOT, IS NULL, RAND(), ROUND(…), LOWER(…), etc. Case is ignored by the GBIF SQL parser, and all column names are returned as lowercase.

When not to use

If you need only commonly used occurrence columns and simple filters, most of the time you can use the regular download interface instead of the SQL interface, and it will be faster.

Keep in mind that if you only need species counts for one dimension, then facet queries are usually going to be a much faster option (although you won’t receive a DOI). Some examples below:

http://api.gbif.org/v1/occurrence/search?facet=speciesKey&country=US
http://api.gbif.org/v1/occurrence/search?facet=speciesKey&country=US&year=1800,1900
http://api.gbif.org/v1/occurrence/search?facet=speciesKey&country=US&year=1800,1900&basisOfRecord=HUMAN_OBSERVATION
http://api.gbif.org/v1/occurrence/search?facet=country&facetLimit=200

rgbif example

SQL downloads is also supported by rgbif by using occ_download_sql(). Please see this article for the easiest way get started with SQL downloads. Remember to set up your GBIF credentials.

library(rgbif)

occ_download_sql("SELECT datasetKey, countryCode, COUNT(*) FROM occurrence WHERE continent = 'EUROPE' GROUP BY datasetKey, countryCode")

Further Reading

https://techdocs.gbif.org/en/data-use/api-sql-downloads