Does Biodiversity Informatics 💘 Wikidata?

Open online APIs are fantastic! You can use someone else’s infrastructure to create workflows, do research and create products without giving anything in return, except acknowledgement. But wait a minute! Why is everyone not using them? Why do we create our own data sources and suck up the costs in time and money? Not to mention the duplication of effort.

One of the reasons is that we can’t fix someone else’s data. If we’re lucky we can email them a list of the errors and hope for a response, but this is not a viable solution for a process that we would hope to be automated. So what if we all go local? Then we don’t benefit from the latest data and the curation of experts in their domain and we all have to repeat the same work everywhere that the data are needed, particularly for people names, locality data, taxonomic names and concepts.

Is there a third way?

What about Wikidata? It’s open, readable and writable to machine and human, curated by an avid and active group of people globally and, most importantly if you find errors you can fix them.

By Andrawaag - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=87171699

In February 2020 a group of Wikidata enthusiasts got together in Warsaw Poland to discuss how biodiversity informatics could make the most of the opportunities of Wikidata. We’ve seen how Scholia has pushed the boundaries of scientometrics using Wikidata, and GeneWiki is opening up information on human genomics, and now biodiversity informaticians want a piece of the action.

The group divided into teams to tackle three live issues with Wikidata. A team worked on the primary issue of taxonomy, another on people, and a third on the “Big Picture”.

The conclusion of the Taxonomy team was that taxonomy is hard.

Wikidata has somewhat lost its way with taxonomy and it can be seen from the data that users do not understand the intricacies of taxonomic names versus taxonomic concepts. Unfortunately, the underlying data model is not fit for purpose and although alternatives are possible they might not be suitable for the average Wikipedian to understand. Nevertheless, undaunted by this conclusion the team members have a plan of how to model taxonomy in Wikibase and will test it out as a first step.

Taxonomic concepts are hard to model, because they are based on opinion and there is no global consensus. The Taxonomic Names and Concepts Interest Group of the Biodiversity Information Standards are also looking into ways to galvanize and empower the taxonomic community around sharing, publication, visualization, and community management of taxon concepts. Furthermore, DiSSCo and CoL are investigating how to enable the taxonomic experts to collectively curate these data.

The People Team reported back much more enthusiastically. People are a natural bridge that connects the body of their works, whether that is scientific literature, specimens, institutions and their collaborations. People are well modelled in Wikidata and apart from the addition of a few properties the People Team could see a great future for Wikidata’s people data.

The Big Picture team were tasked with foreseeing the future of how we might engage with Wikidata and where the priorities lie. They also saw opportunities in geography, literature, events, organizations etc, but they also saw people-based data as a high impact, lower-effort input area for exploitation. This team’s next step is presenting their ideas at the SPNHC & ICOM NATHIST 2020 meeting in June. They will use the opportunity to reach out to the collections community to engage with Wikidata. This will not be the only Wikimedia event being held there. There are already workshops on people and collection data.

Wikidata will continue to grow and evolve, whether or not biodiversity science engages with it. There are clearly opportunities and challenges ahead. The teams have homework to do, but the meeting has already spawned a large number of side projects and wherever we go from here it is likely to get interesting!

Acknowledgements

This workshop was supported by COST (European Cooperation in Science and Technology) as part of the Mobilise Action CA17106 on Mobilising Data, Experts and Policies in Scientific Collections. Deborah Paul acknowledges funding from iDigBio, funded by grants from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program [DBI-1115210 (2011-2018) and DBI-1547229 (2016-2021)]. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Quentin Groom acknowledges SYNTHESYS+ a Research and Innovation action in the H2020-EU.1.4.1.2. Programme. Donat Agosti acknowledges funding from the Arcadia Fund.