[Edit 2021-09-16]

Important: To find guidance on how to publish Sequence-based data on GBIF, please consult the following guide:

Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Svenningsen C & Schigel D (2020) Publishing DNA-derived data through biodiversity data platforms. v1.0 Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22.

[End edit 2021-09-16]

GBIF is trying to make it easier to share sequence-based data. In fact, this past year alone, we worked with UNITE to integrate species hypotheses for fungi and with EMBL-EBI to publish 295 metagenomics datasets.

Unfortunately, documentation has been slower to follow. Although we now have an FAQ on the topic, I thought that a blog post with some advice and examples could be useful.

Note that this blog post is not intended to be documentation. The information here is subject to change. Feel free to leave comments if you have any questions or suggestions.

Structure Of DNA - Public Domain

Types of sequence-based data on GBIF

The term sequence-based data or molecular-based data refers to any type of data associated with some genetic material (DNA or RNA sequence, genotype, etc.). These data can be sorted into one of the following two categories:

  • the genetic material comes from an observable specimen (whether it is macro or microscopic).
  • the genetic material is the only evidence of a given organism or community, see for example metagenomics samples.

These two categories have different implications in terms of quality control, but this will be the topic of another post.

Examples

The examples I will use in this post are the following:

Choosing a dataset class

Data providers must organize their data into datasets in order to share them on GBIF. Four different dataset classes are supported: resources metadata, checklists, occurrence datasets and sampling-event datasets (see this page for more details). For the rest of this post, I assume that you are familiar with the structure of Darwin Core Archives (please consult this page for more information).

When choosing which dataset class would suit your data best, keep in mind that extensions can only describe the core file. For example:

  • If you would like to share sequences for each occurrence, consider organizing your data as an occurrence dataset: the extension containing the sequences will be attached to occurrences (see the BIOWIDE eDNA Fungi dataset).
  • Sampling-event datasets allow you to put a greater emphasis on the sampling protocol. This is the class adopted by MGnify to map its metagenomics data.
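The constraint that extensions can only describe the core file can be pictured as a simple join: every extension row carries the identifier of the core row it describes. Here is a minimal Python sketch, assuming an occurrence core keyed by occurrenceID. The `sequence` column, the identifiers and the values are hypothetical placeholders, not the layout of any particular extension.

```python
import csv
import io

# Hypothetical occurrence core: one row per occurrence.
CORE = """occurrenceID,scientificName
occ-1,Fungi
occ-2,Bacteria
"""

# Hypothetical extension: each row points back to a core row via occurrenceID.
EXTENSION = """occurrenceID,sequence
occ-1,ACGTACGT
occ-1,TTGACCGA
occ-2,GGCATTAC
"""

def attach_extension(core_text, ext_text, key="occurrenceID"):
    """Group extension rows under the core row they describe."""
    records = {row[key]: {**row, "extension": []}
               for row in csv.DictReader(io.StringIO(core_text))}
    for row in csv.DictReader(io.StringIO(ext_text)):
        core_id = row.pop(key)
        if core_id not in records:
            # An extension row cannot exist without a core row to describe.
            raise ValueError(f"extension row refers to unknown core id {core_id!r}")
        records[core_id]["extension"].append(row)
    return records

records = attach_extension(CORE, EXTENSION)
```

With an occurrence core, sequences hang off occurrences. With a sampling-event core, the same extension rows would describe events instead, which is why the choice of core matters.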

Hopefully, the rest of this blog post will give you a better idea of what is possible.

Linnaean classification and non-Linnaean classification

Portrait of Carl von Linné (Carolus Linnaeus) - Public domain

As mentioned in our FAQ:

A basic requirement [on GBIF] is that organisms are identified–either using Linnean classification or another accepted taxonomy (e.g. DNA-based Species Hypotheses (SH) for fungi). If a specific taxonomy is not available, organisms may be identified to the nearest possible taxon (e.g. Bacteria).

In other words, checklists, occurrence and sampling-event datasets must contain scientific names (our system doesn’t handle OTUs). If you would like to check what can be matched to the GBIF taxonomy, please try out our name matching tool: https://www.gbif.org/tools/species-lookup.
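For more than a handful of names, the same check can be scripted. The sketch below assumes the species match endpoint of the public GBIF API (`https://api.gbif.org/v1/species/match`, with a `name` query parameter). The function names are my own, and the fields of the response should be checked against the API documentation rather than taken from this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the GBIF species match API.
MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"

def build_match_url(name):
    """Return the request URL for matching one scientific name."""
    return MATCH_ENDPOINT + "?" + urllib.parse.urlencode({"name": name})

def match_name(name, timeout=10):
    """Query the API and return the parsed JSON response (needs network access)."""
    with urllib.request.urlopen(build_match_url(name), timeout=timeout) as response:
        return json.load(response)
```

Calling `match_name("Bacteria")` on a machine with network access returns a dictionary describing how the name was matched, which you can inspect before publishing.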

The BIOWIDE eDNA Fungi dataset is a good example of a dataset with some DNA-based Species Hypotheses (SH) for fungi. If no scientific name is available, you can always publish metadata-only datasets. See for example this dataset about Microbial Fungi in soils on different Sub-Antarctic islands.

Protocols and analysis pipelines

The sampling protocol, sequencing technology, quality control and sequence analysis pipelines make a huge difference in what can be detected at the microscopic level. For metagenomics datasets, the set of methods employed is probably the most valuable piece of information for a data user. There are two ways to convey this type of information in a tabular form: structuring it in an extension, or using Darwin Core terms.

[Edit: 2023-11-09: this paragraph was updated] In the first case, the information can be structured using defined terms, which makes it easier for users to compare methods once the data is aggregated with other datasets. This is also a good way to make the information compatible across platforms. For example, the International Barcode of Life project (iBOL) uses the DNA derived data extension. However, GBIF won’t index these extensions, as it indexes only the Core and Multimedia ones, which means that the information isn’t displayed on the portal. In fact, it is only available to users who download the source Darwin Core archive from the dataset pages (click on the download tab and choose Source archive).

The alternative is to use some of the Darwin Core terms. For example, MGnify organized its datasets as follows:

These fields are displayed on the occurrence and event pages but contain mostly free text. The lack of structure can be an issue when comparing data from different data sources.

Note that the methods can also be described in the dataset metadata. For example, the SCAR - Microbial Antarctic Resource System metadata-only datasets contain sampling and laboratory protocols (see for example Microbial Fungi in soils on different Sub-Antarctic islands).

A third option could be to do both: structure the information in extensions and make it available in Darwin Core terms. This approach would require more work but would ensure that the information is structured and accessible to more users.

Extensions available

A few extensions are currently available for structuring laboratory protocols. As mentioned previously, the BIOWIDE eDNA Fungi dataset uses GGBN extensions. The MIxS sample extension is another option.

Keep in mind that GBIF doesn’t maintain extensions (it is up to the community), so some of the available extensions may become deprecated.

Sequences

You cannot make your sequences available directly on GBIF, but you can reference them via the dwc:associatedSequences field or share them via an extension.

The dwc:associatedSequences field should contain a reference to a sequence (for example to EBI or EMBL), not the sequence itself. See for example the Centre for Biodiversity Genomics - Canadian Specimens.
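A quick sanity check before publishing can catch raw sequences pasted into this field by mistake. The following is only a heuristic sketch: the 30-character threshold and the function names are arbitrary assumptions of mine, and the example URL is a placeholder. It relies on the Darwin Core recommendation to separate multiple values with a vertical bar.

```python
import re

# A long run made only of nucleotide letters (IUPAC codes) suggests a raw sequence.
# The minimum length of 30 is an arbitrary assumption.
RAW_SEQUENCE = re.compile(r"[ACGTUNRYKMSWBDHV]{30,}", re.IGNORECASE)

def is_reference(value):
    """Heuristic: True if value looks like a link or identifier, not a sequence."""
    value = value.strip()
    return bool(value) and RAW_SEQUENCE.fullmatch(value) is None

def check_associated_sequences(field, separator="|"):
    """Split a multi-valued field and return the entries that look like raw sequences."""
    return [part.strip() for part in field.split(separator)
            if part.strip() and not is_reference(part)]
```

An empty list means that every entry looks like a reference. Anything returned by `check_associated_sequences` is worth a manual look.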

The GGBN extensions allow you to share sequences, but they are not indexed by the system and are therefore available only in the raw Darwin Core archive (see the BIOWIDE eDNA Fungi dataset).

Environmental data

Your samples or observations might be associated with measurements such as the water salinity or the soil pH. You can structure this type of information by using the MeasurementOrFact extension. However, as I mentioned in the two paragraphs above, GBIF doesn’t index extensions, so these measurements would only be available in the Darwin Core Archive.

An alternative would be to use the dwc:dynamicProperties field. This solution is not ideal but the properties will be displayed on the portal (see this occurrence from the Inter-Comparison of Marine Plankton Metagenome Analysis Methods).

Perhaps the best solution for now would be to use both MeasurementOrFact and dwc:dynamicProperties to ensure that the information is accessible.
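Using both does not have to mean entering everything twice. Here is a minimal sketch that derives both forms from a single table of measurements, assuming that dwc:dynamicProperties holds a JSON string of key-value pairs and that the extension rows use the Darwin Core terms measurementType, measurementValue and measurementUnit. The `id` column name, the function names and the sample values are hypothetical.

```python
import json

def to_dynamic_properties(measurements):
    """Serialise measurements as a JSON string for dwc:dynamicProperties."""
    return json.dumps(
        {name: f"{value} {unit}".strip() for name, (value, unit) in measurements.items()},
        sort_keys=True,
    )

def to_measurement_rows(core_id, measurements):
    """Build one MeasurementOrFact row per measurement, linked to the core record."""
    return [
        {"id": core_id, "measurementType": name,
         "measurementValue": str(value), "measurementUnit": unit}
        for name, (value, unit) in measurements.items()
    ]

# Hypothetical measurements for one sample: name -> (value, unit).
sample = {"salinity": (35.1, "PSU"), "soil pH": (6.4, "")}

dynamic_properties = to_dynamic_properties(sample)
measurement_rows = to_measurement_rows("occ-1", sample)
```

The JSON string goes into the core file, where it is displayed on the portal, while the rows go into the extension file, where they stay structured for users of the source archive.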

In summary

Publishing molecular data on GBIF is like publishing any other type of data: it depends on each particular case. We are still figuring things out and would appreciate any thoughts on the topic. Don’t hesitate to leave a comment or email helpdesk@gbif.org if you have any questions.

References