As described in a previous blog post, GBIF now publishes database snapshots of occurrence records on AWS. This allows users to access large tables of GBIF-mediated occurrence records from Amazon S3 remote storage, free of charge.
With Parquet, the values in each column are stored physically adjacent on disk, and each file carries row-group-level statistics recording the minimum and maximum values of every column chunk. Queries that fetch specific columns therefore need not read entire rows, which improves performance. Additionally, Parquet files are typically smaller than the equivalent CSV file.
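As a rough local illustration of the size difference (a sketch using made-up data, not GBIF data; the file names `example.parquet` and `example.csv` are my own), the same table saved as Parquet is usually much smaller than as CSV, especially when columns contain repeated values:

```r
library(arrow)

# A made-up table with highly repetitive columns, like many occurrence tables
df <- data.frame(
  species = rep(c("Panthera leo", "Lynx lynx"), each = 50000),
  year    = sample(1990:2021, 100000, replace = TRUE)
)

write_parquet(df, "example.parquet")
write.csv(df, "example.csv", row.names = FALSE)

# Columnar compression makes the Parquet file much smaller than the CSV
file.size("example.parquet")
file.size("example.csv")
```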
Run a big query on your laptop with R
Interfaces to Apache Arrow are also available in other languages.
The R package arrow allows large queries to run locally by only downloading the parts of the dataset necessary to perform the query.
```r
# get occurrence counts for all mammal species in Sweden since 1990
library(arrow)
library(dplyr)

gbif_snapshot <- "s3://gbif-open-data-eu-central-1/occurrence/2021-11-01/occurrence.parquet"
df <- open_dataset(gbif_snapshot)

df %>%
  filter(
    countrycode == "SE",
    class == "Mammalia",
    year > 1990
  ) %>%
  group_by(species) %>%
  count() %>%
  collect()
```
Only certain dplyr verbs will work on arrow dataset objects.
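For example (a sketch on a tiny local dataset I invent here as `tiny_ds`; exact behavior depends on your arrow version), verbs like filter(), select(), group_by(), and count() are translated into Arrow queries, while verbs arrow cannot translate, such as slice(), only work after you collect() into a regular data frame:

```r
library(arrow)
library(dplyr)

# Write a tiny dataset locally so the example is self-contained
write_dataset(data.frame(species = c("a", "b", "b"), n = 1:3), "tiny_ds")
ds <- open_dataset("tiny_ds")

# Supported verbs build a lazy query that arrow executes on collect()
counts <- ds %>%
  filter(n > 1) %>%
  count(species) %>%
  collect()

# An untranslated verb like slice() needs a regular data frame first
first_row <- ds %>% collect() %>% slice(1)
```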
It is sometimes hard to predict which queries will run quickly. In my experience, anything that does not aggregate to a count tends to run more slowly.
The query below takes longer to run, even though it returns only around 23 records.
```r
# runs relatively slowly
df %>%
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  select(species) %>%
  collect()
```
This aggregation query is much faster.
```r
# runs faster
df %>%
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  group_by(species) %>%
  count() %>%
  collect()
```
There are a few things that can be done to make arrow queries run faster:
- Have a fast internet connection (>= 100 Mb/s).
- Try removing array-type columns first.
- Pick an AWS region near you.
- Download a local copy.
It is also possible to download a smaller local subset of the data, which I discuss below. A local copy will always run faster than querying AWS.
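One way to make such a local subset (an assumption on my part, not an official workflow; the output folder name `mammals_se.parquet` is my own) is to filter on AWS first and persist only the matching rows with arrow::write_dataset():

```r
library(arrow)
library(dplyr)

gbif_snapshot <- "s3://gbif-open-data-eu-central-1/occurrence/2021-11-01/occurrence.parquet"

# Filter remotely, then save only the matching rows as local Parquet
open_dataset(gbif_snapshot) %>%
  filter(countrycode == "SE", class == "Mammalia") %>%
  write_dataset("mammals_se.parquet")

# Subsequent queries run against the fast local copy
local_subset <- open_dataset("mammals_se.parquet")
```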
Downloading a simple parquet from GBIF
Simple Parquet downloads are currently an undocumented feature. There is no promise that this feature will remain stable or function well.
Below you can make a simple Parquet download using rgbif. Set up your GBIF credentials first by following this short tutorial.
```r
# install.packages("rgbif") # download latest version
library(rgbif)

# all Botswana occurrences
download_key <- occ_download(
  pred("country", "BW"),
  format = "SIMPLE_PARQUET"
)
occ_download_wait(download_key) # wait for download to finish
occ_download_get(download_key)
zip::unzip(paste0(download_key, ".zip")) # creates a folder "occurrence.parquet"
# rgbif::occ_download_import() # does not yet work for parquet downloads
```
Wait a few minutes for the download to finish. Simple Parquet downloads tend to take up less disk space than the equivalent simple CSV download: this Parquet download of Botswana is 67 MB unzipped, while the simple-CSV download is 350 MB unzipped.
Apache Arrow Parquet datasets also allow lazy loading, so only the data remaining after collect() is loaded into your R session's memory.
```r
# This 'slow' query will run very quickly locally
library(arrow)
library(dplyr)

local_df <- open_dataset("occurrence.parquet")

local_df %>%
  select(-mediatype, -issue) %>% # for query speed
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  collect()
```
You can also use the new R package gbifdb. The goal of gbifdb is to provide a relational database interface to GBIF-mediated data. The project is under active development.
```r
# duckdb installation
install.packages(
  "https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz",
  repos = NULL
)
devtools::install_github("cboettig/gbifdb")
```
```r
library(gbifdb)
library(dplyr)

gbif <- gbif_remote()

gbif %>%
  filter(phylum == "Chordata", year > 1990) %>%
  count(class, year)
```
If you end up using your query from AWS in a research paper, you will want a DOI.
If you made a simple parquet download from GBIF, then you can just use the DOI associated with that download.
You can generate a citation file from the query above using the following code.
```r
# generate a citation file
citation <- df %>%
  filter(
    countrycode == "BW",
    kingdom == "Fungi"
  ) %>%
  group_by(datasetkey) %>%
  count() %>%
  collect()

readr::write_tsv(citation, "citation.tsv")
```
You can also now use your citation file with the development version of rgbif to create a derived dataset and a citable DOI, although you would need to upload your exported dataset to Zenodo (or something similar) first. Set up your GBIF credentials first by following this short tutorial.
```r
# install.packages("rgbif") # requires latest version of rgbif
library(rgbif)

citation_data <- readr::read_tsv("citation.tsv")

# use derived_dataset_prep() if you just want to test
derived_dataset(
  citation_data = citation_data,
  title = "Research dataset derived from GBIF snapshot on AWS",
  description = "I used AWS and arrow to filter GBIF snapshot 2021-11-01.",
  source_url = "https://zenodo.org/fake_upload"
)
```
Registering a derived dataset can also be done using the web interface.