Using Apache Arrow and Parquet with GBIF-mediated occurrences
As written about in a previous blog post, GBIF now has database snapshots of occurrence records on AWS. This allows users to access large tables of GBIF-mediated occurrence records from Amazon S3 remote storage. This access is free of charge.
Parquet advantages
GBIF saves the snapshots it exports in a columnar data format known as Parquet. This format allows for certain types of queries to run very quickly.
With Parquet, the values in each column are physically stored in contiguous memory locations. Parquet also stores row-group-level statistics, including the minimum and maximum values of each column chunk. Queries that fetch specific columns do not need to read the entire row data, which improves performance. Additionally, Parquet files are typically smaller than the equivalent CSV file.
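As a small illustration of the last two points, you can write the same table to Parquet and CSV with the arrow package and compare the results. This is just a sketch with a toy table and illustrative file names, not GBIF data.

# a toy table: iris repeated 200 times (~30,000 rows)
library(arrow)
tbl <- do.call(rbind, replicate(200, iris, simplify = FALSE))

write_parquet(tbl, "iris.parquet")             # columnar, compressed by default
write.csv(tbl, "iris.csv", row.names = FALSE)  # plain text

file.size("iris.parquet") / file.size("iris.csv")  # typically well below 1

# reading only one column avoids touching the rest of the file
species_only <- read_parquet("iris.parquet", col_select = "Species")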
Run a big query on your laptop with R
Interfaces to Apache Arrow are also available in other languages.
The R package arrow allows large queries to run locally by only downloading the parts of the dataset necessary to perform the query.
This code will query the GBIF AWS snapshot in the gbif-open-data-eu-central-1 region from 2021-11-01. Look here to find the latest snapshot.
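If you would rather check programmatically which snapshots are available, one option is to list the contents of the bucket. This is a sketch only, and assumes your arrow installation was built with S3 support.

# list the snapshot folders in the open-data bucket; each monthly snapshot
# sits in its own folder, e.g. "occurrence/2021-11-01"
library(arrow)
bucket <- s3_bucket("gbif-open-data-eu-central-1")
bucket$ls("occurrence")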
# get occurrence counts for all mammal species in Sweden since 1990
library(arrow)
library(dplyr)

gbif_snapshot <- "s3://gbif-open-data-eu-central-1/occurrence/2021-11-01/occurrence.parquet"
df <- open_dataset(gbif_snapshot)

df %>%
  filter(
    countrycode == "SE",
    class == "Mammalia",
    year > 1990
  ) %>%
  group_by(species) %>%
  count() %>%
  collect()
Only certain dplyr verbs will work on arrow dataset objects.
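A common pattern, sketched below with the df dataset opened above, is to push the supported verbs (such as filter(), select(), group_by() and count()) down to Arrow, and run anything else on the tibble returned by collect().

# push supported verbs down to Arrow, then finish in ordinary dplyr
df %>%
  filter(countrycode == "SE", class == "Mammalia", year > 1990) %>%
  group_by(species) %>%
  count() %>%
  collect() %>%          # results arrive in R memory here
  arrange(desc(n)) %>%   # these last steps run on a local tibble
  head(10)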
Query performance
It is sometimes hard to predict which types of queries will run quickly. I have found that anything that does not aggregate to a count tends to run more slowly.
The query below takes longer to run. It returns around 23 records.
# runs relatively slowly
df %>%
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  select(species) %>%
  collect()
This aggregation query is much faster.
# runs faster
df %>%
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  group_by(species) %>%
  count() %>%
  collect()
There are a few things that can be done to make arrow queries run faster:
- Have a fast internet connection (>= 100 Mb/s).
- Try removing array-type columns first: select(-mediatype, -issue).
- Pick an AWS region near you.
- Download a local copy.
It is also possible to download a smaller local subset of the data, which I discuss below. A local copy will always run faster than a copy on AWS.
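If you do want a full local copy of a snapshot, one option is arrow::copy_files(), which can mirror an S3 path to a local directory. This is a sketch only and is not run here; the full snapshot is very large, so the filtered Parquet download described in the next section is usually the more practical way to get a local copy.

# copy the snapshot folder from S3 to a local directory (very large download)
library(arrow)
copy_files(
  from = "s3://gbif-open-data-eu-central-1/occurrence/2021-11-01/occurrence.parquet",
  to   = "occurrence.parquet"
)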
Downloading a simple Parquet from GBIF
Simple Parquet downloads are currently an undocumented feature. There is no promise that this feature will remain stable or function well.
Below you can make a simple Parquet download using rgbif. Set up your GBIF credentials first by following this short tutorial.
# install.packages("rgbif") # download latest version
library(rgbif)

# all Botswana occurrences
download_key <- occ_download(pred("country", "BW"), format = "SIMPLE_PARQUET")
occ_download_wait(download_key) # wait for download to finish
occ_download_get(download_key)
zip::unzip(paste0(download_key, ".zip")) # creates a folder "occurrence.parquet"

# rgbif::occ_download_import() # does not yet work for Parquet downloads
Wait a few minutes for the download to finish. Simple Parquet downloads tend to take up less disk space than the equivalent simple CSV download: this Parquet download of Botswana occurrences is 67 MB unzipped, while the equivalent simple CSV download is 350 MB unzipped.
Apache Arrow Parquet datasets also allow for lazy loading, so only the data returned by collect() is loaded into your R session's memory.
# This 'slow' query will run very quickly locally
library(arrow)
library(dplyr)

local_df <- open_dataset("occurrence.parquet")

local_df %>%
  select(-mediatype, -issue) %>% # for query speed
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  collect()
gbifdb package
You can also use the new R package gbifdb. The goal of gbifdb is to provide a relational database interface to GBIF-mediated data. The project is under active development.
# duckdb installation
install.packages(
  "https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz",
  repos = NULL
)
devtools::install_github("cboettig/gbifdb")

library(gbifdb)
library(dplyr)

gbif <- gbif_remote()

gbif %>%
  filter(phylum == "Chordata", year > 1990) %>%
  count(class, year)
Citation
If you end up using the results of your AWS query in a research paper, you will want a DOI.
If you made a simple Parquet download from GBIF, then you can just use the DOI associated with that download.
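For example, the DOI can be read from the download metadata. This sketch reuses the download_key object from the rgbif example above.

# the download metadata includes the DOI assigned to the download
library(rgbif)
occ_download_meta(download_key)$doi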
GBIF now has a derived dataset service for generating a citable DOI from a list of the dataset keys involved, together with occurrence counts. See the GBIF citation guidelines and previous blog post.
You can generate a citation file from the query above using the following code.
# generate a citation file
citation <- df %>%
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  group_by(datasetkey) %>%
  count() %>%
  collect()

readr::write_tsv(citation, "citation.tsv")
You can also use your citation file with the development version of rgbif to create a derived dataset and a citable DOI, although you would need to upload your exported dataset to Zenodo (or something similar) first. If you have not already done so, set up your GBIF credentials by following this short tutorial.
# install.packages("rgbif") # requires latest version of rgbif
library(rgbif)

citation_data <- readr::read_tsv("citation.tsv")

# use derived_dataset_prep() if you just want to test
derived_dataset(
  citation_data = citation_data,
  title = "Research dataset derived from GBIF snapshot on AWS",
  description = "I used AWS and arrow to filter GBIF snapshot 2021-11-01.",
  source_url = "https://zenodo.org/fake_upload"
)
Registering a derived dataset can also be done using the web interface.