Did you know that...? - some of the lesser known functionalities around GBIF.org

During the first-ever virtual GBIF 2021 Global Nodes Meeting, GBIFS hosted a “game show”: a one-hour “battle of Nodes vs. helpdesk”. The not-so-hidden goal was to demonstrate some of the lesser known functionalities of GBIF.org through a fun, interactive session.

The following is a summary of the questions and answers from this session, plus some extras that did not fit into the time frame of the event. The summary follows the layout and sequence of the interactive hour:

Did you know…: a main question or task to be solved by a sub-group of participants, retreating to a breakout room for a few minutes before presenting the solution to the plenary - or, by presenting an incorrect or no answer, giving Marie as “GBIF helpdesk” the chance to win the point

and

Poll: a loosely related multiple-choice poll for the remaining participants in the main session room.

In this summary, the poll options are marked with (++) for the most typical solution, (+) for possible alternatives, (~) for technically correct but non-recommended options, and (-) for incorrect responses.

Question 1: Warm-up

Session Video

Did you know there is a summary of the latest GBIF.org software and infrastructure changes? — Where is it?

Suggested answer:
https://www.gbif.org/release-notes

Poll: And where would you start looking for this summary?

a) The free-text search, where else? (+)
b) The link is in the green footer section of all pages in GBIF.org (-)
c) I’ll check the main menu of course! (++)

Notes:

You find the release notes in the main menu of GBIF.org, under About » Release notes. The free text search that you find in the header of all GBIF.org pages is another option if you know what you are looking for.

Question 2: Custom metrics

Session Video

Did you know that you can generate some custom data metrics yourself? — Get an overview of the occurrence “issues & flags” for publishers in your country or area, per publisher

Suggested answer:

Step-by-step: from the occurrence search page,

filter for the “publishing country or area” in the “advanced” filter set

in the table of results, switch to the “Metrics” tab

select the “Custom” setting, and

set the two relevant dimensions: “Publisher”, and “Issues and Flags”.

Poll: What is the best starting point here?

a) the occurrence search page (++)
b) “my” country page (-)
c) the GBIF API (+)

Notes:

In the session recording (video), Marie demonstrates the solution using custom metrics generation from around [18:40].
Alternative possibility suggested during the session [17:25]: Prepare the filter as for a download, and check the download feedback on issue labels.

Similar to using the API to aggregate the same kind of information, the proposed alternative option is a bit more labor intensive for getting an overview. More to the point here though, it is specific to Issues and Flags, rather than other custom dimensions — but useful to be aware of for this particular case.
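For readers curious about the API route mentioned above, here is a minimal sketch of how such an aggregation could be requested. It assumes that the occurrence search endpoint accepts the parameters publishingCountry, publishingOrg, facet and limit, and that facet results come back under a "facets" key with "field" and "counts" entries; please verify these names against the API documentation before relying on them. The helper function names are made up for this illustration.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

API_ROOT = "https://api.gbif.org/v1"


def issue_facet_url(publishing_country: str, publisher_key: Optional[str] = None) -> str:
    """Build an occurrence search URL that asks only for issue counts.

    Parameter names (publishingCountry, publishingOrg, facet, limit) are
    assumptions based on the public API documentation.
    """
    params = [("publishingCountry", publishing_country)]
    if publisher_key:
        # One request per publisher gives the "per publisher" breakdown.
        params.append(("publishingOrg", publisher_key))
    params.append(("facet", "issue"))
    params.append(("limit", "0"))  # no records needed, only the facet counts
    return f"{API_ROOT}/occurrence/search?{urlencode(params)}"


def facet_counts(response: dict, field: str) -> Dict[str, int]:
    """Extract {name: count} for one facet from a parsed JSON response.

    The assumed response layout is:
    {"facets": [{"field": ..., "counts": [{"name": ..., "count": ...}]}]}
    """
    for facet in response.get("facets", []):
        if str(facet.get("field", "")).upper() == field.upper():
            return {c["name"]: c["count"] for c in facet.get("counts", [])}
    return {}
```

Opening the generated URL in a browser shows the raw response. As the notes say, this takes more effort than the Metrics tab, since you need one request per publisher.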

Question 3: Spatial filtering

Session Video

Did you know that you can download a species list for an area? — How would you generate a species list for the “Azores” island group?

Suggested answer:

Start from an occurrence search

In the “advanced” filter options, use the “Administrative areas (gadm.org)” filter, and select the entry for “Azores”
In this particular example, the gadm.org administrative areas filter is possibly the best filter option. But depending on the search interest and geographic area, other possibilities like polygon, bounding box, or drag-dropping a geojson shape file onto the occurrence map may be equally well or better suited.

Once the filter is applied, go to the download tab, log in, and select “species list” as the download format

If you take “species” literally, you will still need to filter the downloaded result for taxonRank=SPECIES, as it includes identifications at all taxonomic ranks (taxon list)
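The post-download filtering step can be sketched in a few lines. This assumes that the species list download is a tab-delimited text file with a header row that includes a taxonRank column; check the header of your own download, since column names and delimiter are not confirmed here. The function name is hypothetical.

```python
import csv
import io
from typing import Dict, List


def species_only(tsv_text: str, rank_column: str = "taxonRank") -> List[Dict[str, str]]:
    """Keep only the rows whose rank column equals SPECIES.

    Assumes tab-delimited text with a header row; rank_column is an
    assumption about the download layout.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if (row.get(rank_column) or "").upper() == "SPECIES"]
```

To use it on a real file, read the unzipped download into a string first, for example with open(path, encoding="utf-8").read().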

Poll: How would you go about filtering data on the “Azores”?

a) I draw a bounding box or polygon on a map (++)
b) I use the administrative areas (gadm.org) filter (++)
c) I use the country or area filter (-)
d) none of these! I have a better way! (great! we’d love to hear! does it involve geojson files?) (++)

Notes:

A search by bounding box or polygon, depending on location and definition, is likely to include a wider area, in this case: the marine area between and around islands. Using the gadm.org filter will limit a search to the land area here.
If you want to use a geojson shape file as your filter, open the “map” tab of the occurrence page, and drag-and-drop your shape file into the map.
Outside of the concrete example (Azores), all options in the poll - including the “country or area” filter - are equally possible and “correct”, but fit some queries better than others.

Question 4: User feedback

Session Video

Did you know that you can find GBIF user feedback that relates to a country or organization? — Find all feedback that is reported against a country of your choice!

Suggested answer:

The GitHub repository that stores user feedback is called /gbif/portal-feedback. This is also where you can find the entry point to the “issues”. The filter is pre-set to only show open issues.

Countries are defined as labels of the ISO country codes – to search for all issues relating to Belgium, for example, add an additional filter as “label:BE”.

Poll: Is all user feedback stored in GitHub?

a) Yes, all user feedback is stored in GitHub (-)
b) No. Sometimes it is sent to a dataset contact via email (++)
c) No. Sometimes it is forwarded to the feedback system of another organization (++)
d) No. Some users prefer to write to helpdesk@gbif.org (++)

Notes:

For a brief demo of the GitHub GBIF feedback repository, also see the recording of “Live demos: Group 2 - Tools to support nodes with data mobilization strategies” (from about [1:06:00]).
The GitHub repository does not document everything that gets reported: some people prefer the non-public helpdesk email address (option d) for communication over the publicly indexed GitHub repository; some issues are automatically redirected to other platforms' feedback systems (option c), provided a technical connection has been established; and if the feedback is on an individual record, the commenter can choose whether to log the issue with GBIF, or rather send an email directly to the dataset contacts (option b).
Bonus tidbit: For issues that relate to occurrence records, the dataset and publisher UUIDs are stored within the issue as well. To find all issues for a given publisher or dataset, add the publisher or dataset UUID as a free text search value, like this (omit the quotation marks): “is:issue is:open 50c9509d-22c7-4a22-a47d-8c48425ef4a7”. This will find all open issues reported that relate to this publisher or dataset.
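If you build such searches often, the filter strings above can be turned into a link. The sketch below assumes that the GitHub issue list accepts the search text in a "q" query parameter, as it does when you type into the search box; the function name is made up for this example.

```python
from urllib.parse import urlencode

ISSUES_URL = "https://github.com/gbif/portal-feedback/issues"


def feedback_search_url(*terms: str, open_only: bool = True) -> str:
    """Build a link to the portal-feedback issue list with search terms applied.

    Terms can be label filters such as "label:BE" or a publisher/dataset UUID
    used as free text.
    """
    query = ["is:issue"]
    if open_only:
        query.append("is:open")
    query.extend(terms)
    return f"{ISSUES_URL}?{urlencode({'q': ' '.join(query)})}"
```

For example, feedback_search_url("label:BE") gives a link to all open issues labelled for Belgium, and passing a UUID instead gives the free text search from the bonus tidbit.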

Question 5: Scientific name validation

Session Video

Did you know that you can check a list of scientific names against the GBIF Taxonomic Backbone? — Demonstrate how to do this!

Suggested answer:

Use the “Species Matching” tool. You find this from the main menu of GBIF.org, in the “Tools” menu section, under “GBIF Labs”.

If you want to try the examples that are available on the tool’s web page, be aware that dragging the example directly into the drop-circle of the user interface may not work – rather, store the file locally first.
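If your names live in a script or spreadsheet export rather than a ready-made file, a small helper can write the input file for you. This is only a sketch: it assumes the tool reads a CSV file with a scientificName column and an optional kingdom column. The guidelines at the top of the tool page are the authority on the expected layout, so adjust the header if they say otherwise.

```python
import csv
import io
from typing import Iterable, Optional


def names_to_csv(names: Iterable[str], kingdom: Optional[str] = None) -> str:
    """Return CSV text with one name per row.

    The header names (scientificName, kingdom) are assumptions about what
    the Species Matching tool expects.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["scientificName"] + (["kingdom"] if kingdom else []))
    for name in names:
        cleaned = name.strip()
        if cleaned:  # skip blank entries
            writer.writerow([cleaned] + ([kingdom] if kingdom else []))
    return buffer.getvalue()
```

Save the returned text to a local file, then drop that file into the tool.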

Poll: What else can you do through GBIF.org?

a) search for species records by a gene sequence (++)
b) find groups of occurrence records, across datasets, that seem to be somehow related (++)
c) download a detailed report for a dataset check/validation (DwC-A) (-)
d) check whether there are any known issues with the systems behind GBIF.org, right now, that could explain e.g. slow data updates (++)

Notes:

To use the Species Matching tool for checking a list of taxon names, create a file following the guidelines at the top of the tool page; consult or adapt the example files if you are not sure what the guidelines mean.
For a sequence blast (poll option a): try the Sequence ID tool, or this example. This feature is only available for selected taxon groups.
To find clusters of occurrence records (poll option b), use the advanced occurrence search facet “Is in cluster”. This is an early feature at this point, still under ongoing development - you can read more here. A cluster groups occurrence records that appear to be somehow related. Examples are: siblings of herbarium specimens deposited in other collections; derived units like eDNA data sampled from a specimen; related multimedia collections; duplicates of records published by different institutions; etc. For occurrence records that were tagged as belonging to a cluster, the record page has a second tab labeled “Cluster”, like in this example.
The GBIF.org “system health” indicator page (poll option d) is accessible from the main menu bar of all pages via the little “pulse” symbol. You find this next to the language setting, free text search and feedback symbols. If the site is not working as expected, or your newly published dataset does not become available as fast as you would expect it to, it is worth checking here whether there are any known issues.
Sorry - a detailed, downloadable report with data validation results (poll option c) is not available - yet. You can validate your data using the data validator tool, but the report is so far only available online. That said, though: the data validator tool is currently under revision, and downloadable reports are one of the features being considered for a future version.

Question 6: The API

Session Video

Did you know that GBIF.org builds on an API, and that you can use that API as well? — With your web browser as a client, use the GBIF registry API to find all technical installations of type BIOCASE_INSTALLATION. How many are there?

Suggested answer:

https://api.gbif.org/v1/installation/?type=BIOCASE_INSTALLATION. At the time of the event, the “count” value near the top of the response reported that there were 81 data publishing software installation endpoints of this type registered to GBIF

A proposed alternative solution, https://registry.gbif.org/installation/search?q=BioCASE, is the answer to a different question: it returns free text search results, i.e. entries that contain the text somewhere within the record like here, not installations of that type.
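The same request can be made from a script instead of a browser. The sketch below uses only the Python standard library; the URL and the "count" field are taken from the suggested answer, while the function names are made up for this example. The number returned today will most likely differ from the 81 reported at the time of the event.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.gbif.org/v1"


def installation_url(installation_type: str) -> str:
    """Build the registry API URL that filters installations by type."""
    return f"{API_ROOT}/installation/?{urlencode({'type': installation_type})}"


def read_count(response_text: str) -> int:
    """Pull the "count" value out of a JSON response body."""
    return int(json.loads(response_text)["count"])


def fetch_installation_count(installation_type: str) -> int:
    """Ask the live API for the number of installations of one type."""
    with urlopen(installation_url(installation_type), timeout=30) as response:
        return read_count(response.read().decode("utf-8"))
```

Calling fetch_installation_count("BIOCASE_INSTALLATION") needs network access; the other two functions work offline.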

Poll: What does “API” stand for?

a) Advanced Programming Interface (~)
b) Academic Performance Index (~)
c) Adobe Acrobat Plug-In (~)
d) Application Programming Interface (++)

Notes:

The documentation of the GBIF API is here: https://www.gbif.org/developer/summary. You can find the link to this entry page under “API” in the green footer bar on all pages within GBIF.org.
For more documentation relevant to the answer for this specific quiz question, check the “Registry” section of the documentation, sub-section “Installations”.
All functionality of the user interface is based on the API - but: the API also allows for some options that the user interface cannot support. You can, for example, compose search filters that exclude certain values by using the “not” predicate for an occurrence download. If you want to learn more about options around occurrence data, maybe start by browsing the Occurrence section of the API documentation for inspiration.
And the acronym from the poll? In this context: “Application Programming Interface” would be correct – sometimes also referred to as “Advanced” Programming Interface, though this seems to be a misnomer. Both “Adobe Acrobat Plug-In” (option c) and “Academic Performance Index” (option b) do turn up in abbreviation resolvers under “API” as well, but are not meant here.
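To illustrate the “not” predicate mentioned above, here is a sketch of what a download request body could look like. The field names (creator, notificationAddresses, sendNotification, format, predicate) and the search key used in the usage note reflect the occurrence download documentation as understood here; the function name and all values are placeholders. Please check the Occurrence section of the API documentation before sending a real request.

```python
import json


def not_equals_request(creator: str, email: str, key: str, value: str,
                       download_format: str = "SIMPLE_CSV") -> dict:
    """Build a download request that excludes records where key equals value.

    The body would be sent as JSON in an authenticated POST to the occurrence
    download request endpoint; sending is left out of this sketch.
    """
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "sendNotification": True,
        "format": download_format,
        "predicate": {
            "type": "not",
            "predicate": {"type": "equals", "key": key, "value": value},
        },
    }


def to_json(body: dict) -> str:
    """Serialise the request body for sending or inspection."""
    return json.dumps(body, indent=2)
```

For example, not_equals_request("my_user", "me@example.org", "BASIS_OF_RECORD", "FOSSIL_SPECIMEN") describes a download of everything except fossil specimens, a filter that the search interface cannot express.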

Question 7: Ingestion history

Session Video

Did you know that you can check up on the indexing process of a dataset yourself? — Since deploying our current generation of data “ingestion” infrastructure, iNaturalist has been crawled over 250 times. How do you find the logs that the infrastructure produced for crawl number 265, if you were asked to diagnose issues in that?

Suggested answer:

Step-by-step:

find the dataset in the dataset search of the GBIF registry

go to the dataset’s registry page, and

check the “Ingestion history” from the sub-menu

find “Attempt” number 265, and click on “Log”
This will bring you to the Elastic Search ingestion logs for this particular ingestion process run on that dataset
If this particular example no longer shows the Elastic Search log entries, try a more recent one to get an idea of the options provided by the logs in general

Alternative option: a registered dataset contact for the dataset in question, logged in to their GBIF.org user account using the same email address, has access to the logs from the dataset page in a segment “Because you are a trusted contact”. The logs accessed from here will be pre-filtered to the most recent ingestion run. Alternatively, the “history” option will enter the same registry page as through the pathway sketched out in the suggested answer

Poll: Which of these components can be helpful for diagnosing issues with a specific dataset?

a) a dataset download (++)
b) the IPT (+)
c) the Data Quality page at gbif.org (-)
d) the GBIF registry (++)
e) the Dataset metrics tab at gbif.org (+)

Notes:

Working with ingestion logs is an option that publishing software administrators and Nodes technical staff may be interested in exploring: it can give you more autonomy in diagnosing ingestion issues, for example, if a local data update does not appear to be picked up by GBIF. If you are a data content curator or a data user instead, this option is probably not relevant for you.
The Data Quality page (option c) is static, authored content and knows nothing about specific datasets, so this will not help to diagnose issues as such (though the page does contain some pointers to other components).
Dataset downloads (option a): yes - a download summary page also gives access to the “issues & flags” filter and the yellow “pill” labels that report content remarks and issues. This option is mostly intended for mixed content from many datasets (user downloads), rather than for downloading an individual dataset. You will also find the issues reported with each record within the downloaded data. Given the width of the download file, this is easy to overlook, but well worth being on the lookout for, especially if you still want to apply some additional post-download filtering of data to exclude records with particular issues.
The IPT (option b) does provide some feedback on formal structure (table relationships) and data content – a good first checkpoint for a new or substantially reworked dataset configuration! Keep in mind though: the IPT is a stand-alone tool, independent from the GBIF ingestion process, and would not “know” why a specific ingestion run failed in a given situation.
The registry (option d), as demonstrated in the response to the main question, gives access to the timeline and the logs of all past crawls and ingestions. A good place to be aware of if you want to know what is going on with a given dataset!
Dataset metrics tab (option e): every dataset page, like the one in our iNaturalist example, has a dataset metrics tab. Near the bottom of that page, the issues and flags that were diagnosed in the ingestion process are listed, together with corresponding occurrence record counts. If you would like to learn more about the meaning of issues and flags, we recommend the GBIF Issues & Flags post in the GBIF data blog.
Keep in mind, though, that this is only possible for records that could be ingested at all: records that did not meet certain minimum standards e.g. on mandatory fields, or that got lost to technical configuration challenges, cannot be reported here
Now, I am curious - did you miss the GBIF data validator in this list? Yes! That is certainly another option worth keeping in mind if you want to check a Darwin Core Archive (DwC-A)! We already mentioned this in Question 5, though, so we did not want to bore you…

The following questions and polls did not make it into the live session and recording - but we don’t want to keep them from you, so here they are:

Question 8: GRSciColl

Did you know that the “GBIF Registry of Scientific Collections” (GRSciColl) is connected to occurrence records in GBIF.org? — Find the GRSciColl entry for the Royal Belgian Institute of Natural Sciences, and check which collection code has most records in their data

Suggested answer:

“arachnofauna” with 291,402 records (status: end of June, 2021)

How to get there, step-by-step:

from the main menu at GBIF.org, go to Tools -> Scientific Collections. You now entered the Registry of Scientific Collections, GRSciColl

go to the tab “Institutions”

search for the institution name, and

open that page by clicking on the name in the results list: Royal Belgian Institute of Natural Sciences

open the metrics tab, and

scroll down to “occurrences per collection code” to pick the entry with the most occurrence records
bonus: clicking on the name or count here will lead you to the occurrence records filtered for this entry

Poll: Imagine you are checking this institution entry in GRSciColl, and you find that you know of more collections that are not listed here. What can you do?

a) I can register “metadata only” datasets for this institution (-)
b) I can ask GBIFS to give me editing rights, so I can add the collections to GRSciColl myself (+)
c) I can suggest additions through the user interface (++)
d) I can log a GitHub issue with suggestions (~)

Notes:

“Metadata-only datasets” (poll option a) describe datasets (as opposed to physical collections) in GBIF.org, typically without adding any occurrence records - hence the name. Reasons vary why a dataset may be registered with just a description, at least initially. Regardless, as datasets and physical collections do not necessarily match each other in scope, such dataset registrations will never be automatically translated into collection entries in GRSciColl.
In addition, independent from the above: unless you already have been authorized by the institution in question, you would not be able to register a dataset to GBIF.org under their name.

The other three poll options all apply, though to varying degree:
Editing rights (option b): we do not grant these to everybody, but if you are an institutional contact, or for example a node manager, that is indeed an option. Requesting editing rights makes most sense if you have many and frequent changes to make, or if you would like to take general curatorial responsibility within a certain domain of GRSciColl (e.g. for an institution or country). Before you can get going, we will need to give you an introduction. Check with us via scientific-collections@gbif.org if this interests you.
user interface (option c): yes! That is an option for any user now, and we would recommend this as the preferred way to log change requests or additions for more casual users. Your suggestion will not be immediately visible, as it will go through a moderation process to ensure proper procedures and discourage spam - but we will keep you updated. Check out the green “Suggest a Change” button on Institution, Collection and Contact pages within GRSciColl! To suggest a new collection under an institution, start from any existing collection, “Suggest a Change”, and then use the “More” button on the registry “Collection details” page to “Create new collection”.
GitHub issue (option d): well, yes…. technically, you could. The same applies to sending an email. We would rather you didn’t though. There are easier and more integrated ways to send your feedback through the user interface option, above.
And if you are considering working offline with a larger section of the catalogue, please start by contacting us at scientific-collections@gbif.org

Question 9: Citations

Did you know that GBIF assigns unique DOIs to downloads of occurrence data, making citing the data easy, and enabling reproducibility and credit towards data publishers? — But - what is the best way to cite GBIF occurrence data obtained by other means, where no DOI has been assigned?

Suggested answer:

Generate a “derived dataset” DOI

requirements:

a user account at GBIF.org

a list (csv file) of the source datasets (GBIF dataset keys) and respective record counts from your otherwise-produced data download

public deposition of the dataset that you wish to cite (e.g. in Zenodo)

a GBIF registration (DOI) for your derived dataset - you can generate this here

To learn more, check the “Derived datasets” post in the GBIF data blog
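The list of source datasets and record counts does not have to be compiled by hand. Here is a minimal sketch, assuming your records are available as dictionaries that each carry a datasetKey field, as records from the occurrence search API do. The function names are made up, and whether the registration form expects a header row is something to check there; this sketch writes none.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, Mapping


def dataset_counts(records: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count how many records each source dataset contributed."""
    counts = Counter(str(r["datasetKey"]) for r in records if r.get("datasetKey"))
    return dict(counts)


def counts_to_csv(counts: Mapping[str, int]) -> str:
    """Write one "datasetKey,count" line per dataset, sorted by key."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in sorted(counts):
        writer.writerow([key, counts[key]])
    return buffer.getvalue()
```

Records without a datasetKey are skipped, so compare the sum of the counts with your total record count before registering.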

Poll: Which types of GBIF data downloads do get a DOI generated automatically?

a) a species list download for an occurrence search (++)
b) occurrence records retrieved through the occurrence search API (-)
c) an occurrence download through AWS, Azure or other cloud services (-)
d) an occurrence download using the rgbif library “occ_download” function (+)

Notes:

The “derived dataset DOI” pathway was designed to help in exactly those cases, like poll options b) and c), that the integrated generation of download DOIs cannot “catch”, unlike downloads through the user interface at GBIF.org (option a).
GBIF search API (option b): be aware that this is a bit of a trick question - download DOIs are not available for records retrieved through the occurrence search API. For the occurrence download API on the other hand, they are. Check the blog post titled “Search, download, analyze and cite (repeat if necessary)” for more on data citations.
If you use the occ_download() function of the rgbif library, the underlying GBIF download API will trigger the generation of a download DOI. The rgbif library also contains a function, gbif_citation(), that will help you properly cite the data downloaded from GBIF through R (option d; also see here and here for instructions on use of the library and citation function). Just as for the GBIF API itself, be aware that this will only work properly for occurrence downloads, not for an aggregate of occurrence search results.

Question 10: Rollup metrics

Did you know that GBIF keeps track of growth of occurrence data over time? — Which time periods (years of occurrence) have the highest numbers of occurrence records in the African region?

Suggested answer:

Check the “Global data trends” page, and filter for “Africa”. The panel “Records by year of occurrence” suggests that, for the African region, most records available through GBIF.org so far are on occurrences collected or observed in more recent years (from around 2007 onwards), but there is also stronger evidence from about 1986-92.

How to get there, step-by-step:

from the main menu at GBIF.org, go to Get Data » Trends

the filter is set in the header section of the page - in this case, select “Africa”

scroll down the page to find the panel “Records by year of occurrence” (orange color, under “Time and seasonality”). To view the graphic in more detail, click on the image

this graphic shows counts of available occurrence records within GBIF.org by year of collection/observation since 1950. Four different snapshots of the data availability status within the GBIF index are combined here. The most recent one is found at the bottom; the ones above show the status of data mobilization (not: species occurrence) in a time sequence

to answer the question, you want to refer to the bottom graph, representing the most recent snapshot - presently, this shows the status of data mobilization at July 1, 2021

Poll: What other metrics can you find on GBIF.org?

a) metrics on downloads made by users from a country (++)
b) the number of registered GBIF users per country (-)
c) an activity report for a country, territory or area (++)
d) changes in popularity of data publishing tools and protocols over time (-)

Notes:

Downloads made by users from a country (option a): basic counts on data download events and data volumes, per year and month and/or registered user country, are part of the regular analytics generation process. These reports are also available as csv files. This may be of interest to you if, e.g. as a node manager, you would like to create your own metrics or graphics based on the raw data
- The analytics tables only contain aggregated counts on downloads; information on individual downloads or users is not available.
- Since this is a data area that is not of public interest, it is rather hidden and not advertised on GBIF.org.
We do not publish counts, location, or any other information on registered users (option b)
Activity report (option c): yearly generated activity reports are linked from all “country or area” pages, like in this example.
- incidentally: if you truncate the URL of such an activity report, you can also find your way to the underlying analytics tables (see option a)
While this may have some historic interest, we do not maintain metrics on publishing tools and protocols that were used for data publication over time (option d). Old data snapshots representing past stages do exist; they are not readily available for access though

Congratulations!

You reached the end of the extended Quiz! How did it go - did you learn anything new? In our live event at the 2021 Global Nodes Meeting, GBIF Nodes representatives did very well indeed, and won the race!

Whether you found something that you did not know before, or could confirm that you are fully up-to-date with GBIF systems: I hope that you had some fun, and that we could pique your curiosity to keep exploring GBIF.org and beyond!

And, before closing:
Special thanks for their contributions to the live event go to:

Marie Grosjean, who defended the GBIFS helpdesk valiantly in the “battle” in spite of her game-rules inflicted handicap,
Andrew Rodrigues, guiding the session with aplomb in his role as Official Battle Adjudicator and game show host,

and, last but not least:

the opposing team of combatants from our wonderful, engaged, and skillful community of GBIF Nodes! Thank you all for playing!