Country centroids are a known data quality issue within the GBIF network.

Sometimes data publishers do not know the exact lat-lon location of a record and enter the lat-lon center of the country instead. This is a data issue because users might be unaware that an observation is pinned to a country center and assume it is a precise location.

Below I plot the country centroids with the most GBIF occurrences within 1 km. Only the top unique point from each centroid is plotted, so there could be more occurrences near the centroids.

A few countries have very many points falling on their country centroids. For example, Brazil and Mexico each have over 25 k occurrence records within 1 km of their country centroid.
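As a rough illustration of what "within 1 km of a centroid" means in practice, here is a minimal base R sketch that flags points by great-circle distance. The function name `haversine_km`, the centroid, and the occurrence coordinates are all hypothetical and chosen for illustration; this is not necessarily how the counts in this post were produced.

```r
# Great-circle (haversine) distance in km between points in decimal degrees.
# Assumes a spherical Earth with a mean radius of 6371 km.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical centroid and occurrence coordinates, for illustration only.
centroid <- c(lat = -10.0, lon = -53.0)
occ <- data.frame(
  decimalLatitude  = c(-10.000, -10.005, -12.500),
  decimalLongitude = c(-53.000, -53.005, -50.000)
)

# Distance of every occurrence to the centroid, then a 1 km buffer test.
occ$dist_km <- haversine_km(occ$decimalLatitude, occ$decimalLongitude,
                            centroid[["lat"]], centroid[["lon"]])
occ$near_centroid <- occ$dist_km <= 1

occ
```

The first two made-up points fall inside the 1 km buffer and the third does not.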

Other regional or province centroids

Publishers might not only pin occurrences to the center of a country but might also give the lat-lon centers of smaller administrative zones. For example, the United States does not have an obvious country centroid, but Hawaii has 704 occurrences near the state centroid.

Some countries like Monaco are too small to filter

Some countries and provinces are small enough that it does not make sense to filter their records even if they happen to be close to known centroids. For example, if GBIF were to flag all records within 2 km of the Monaco centroid, it would incorrectly flag more than 27k records, some of which might even be in France.

All provinces and countries less than 1000 sqkm

Also, even if the records are sitting at true centroids, they might be in a province or country so small that it falls below the resolution needed for a study. Some province centroids, such as those near Manila, even overlap with just a 2 km buffer.
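The two ideas in this section (dropping small regions, and noticing when buffers collide) could be sketched in base R as follows. The table, its column names, and every value in it are hypothetical; the only rules taken from the text are the 1000 sq km area threshold and the 2 km buffer.

```r
# Hypothetical centroid reference table; all names and numbers are made up.
# nearest_centroid_km is the assumed distance to the closest other centroid.
centroids <- data.frame(
  region              = c("large country", "microstate",
                          "small province 1", "small province 2"),
  area_km2            = c(8500000, 2, 40, 25),
  nearest_centroid_km = c(900, 12, 3.1, 3.1),
  stringsAsFactors    = FALSE
)

min_area_km2 <- 1000  # regions smaller than this are not tested
buffer_km    <- 2     # radius of the circle drawn around each centroid

# Two circles of equal radius overlap when their centers are
# less than two radii apart.
centroids$buffers_overlap <- centroids$nearest_centroid_km < 2 * buffer_km

# Keep only regions large enough for a centroid test to be meaningful.
kept <- centroids[centroids$area_km2 >= min_area_km2, ]

kept$region
```

In this toy table only the large country survives the area filter, and the two small provinces are the ones whose 2 km buffers overlap.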

Less than 0.1% of GBIF records are centroids

After excluding regions smaller than 1000 sqkm there are…

  • 250 k records within 1km of a country centroid
  • 490 k records within 1km of a province or country centroid
  • 304 k records within 2km of a country centroid
  • 1.3 M records within 2km of a province or country centroid

So even with a large buffer (2 km), fewer than 0.1% of GBIF occurrence records sit at or near a known centroid. This is about half as many records as have a zero-coordinate issue, but it is probably still worth flagging on gbif.org.

Herbaria and museums have the most country centroids

Below are the top 15 datasets with records within 1 km of a known country centroid. I did not include province centroids in these totals.

  1. Natural History Museum Collection Specimens 77 282 centroid records
  2. The Vascular Plant Collection at the Botanische Staatssammlung München 34 827 centroid records
  3. Geneva Herbarium – General Collection (G) 21 720 centroid records
  4. Geographically tagged INSDC sequences 15 469 centroid records
  5. MAL 14 921 centroid records
  6. Geneva Herbarium – De Candolle’s Prodromus 12 767 centroid records
  7. A global database for the distributions of crop wild relatives 7 057 centroid records
  8. Field Museum of Natural History Botany Seed Plant Collection 5 877 centroid records
  9. Botanical Museum, Copenhagen, the Phycology Herbarium 4 233 centroid records
  10. Field Museum of Natural History (Botany) Lichen Collection 4 216 centroid records
  11. National Museum of Natural History Luxembourg 4 012 centroid records
  12. Norwegian Biodiversity Information Centre - Other datasets 3 603 centroid records
  13. United States National Plant Germplasm System Collection 2 995 centroid records
  14. The Fungal Collection at the Senckenberg Museum für Naturkunde Görlitz 2 770 centroid records
  15. Field Museum of Natural History (Botany) Bryophyte Collection 2 281 centroid records

Very few of these datasets fill in the coordinateUncertaintyInMeters or footprintWKT fields for the centroid points.
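A check like this could be sketched in base R by counting how many centroid records have either field filled. The data frame below is entirely made up; only the column names `coordinateUncertaintyInMeters` and `footprintWKT` are real Darwin Core terms.

```r
# Hypothetical centroid records; the values are invented for illustration.
centroid_records <- data.frame(
  gbifID = 1:5,
  coordinateUncertaintyInMeters = c(NA, NA, 50000, NA, NA),
  footprintWKT = c(NA, NA, NA, "POLYGON ((0 0, 1 0, 1 1, 0 0))", NA),
  stringsAsFactors = FALSE
)

# A field counts as filled when it is neither missing nor an empty string.
has_uncertainty <- !is.na(centroid_records$coordinateUncertaintyInMeters)
has_footprint   <- !is.na(centroid_records$footprintWKT) &
  nzchar(centroid_records$footprintWKT)

# Share of records that document their spatial uncertainty in some way.
share_documented <- mean(has_uncertainty | has_footprint)
share_documented
```

A low share on real data would mean that users have no way to tell from the record itself that the point is a centroid rather than a precise location.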

Plants make up >50% of the country centroid records

Flowering plants (Magnoliopsida and Liliopsida) make up more than 50% of the country centroid records on GBIF. Other groups have fewer country centroids. Mammals, reptiles, and amphibians each have fewer than 10k country-centroid occurrence records. Since most of the top datasets with centroids are herbaria, this is not surprising.

Filter out country centroids with CoordinateCleaner

Currently gbif.org does not flag or filter records that fall on centroids. It is still easy to do this yourself in R, and the best option is the R package CoordinateCleaner.

library(CoordinateCleaner)
library(rgbif)

# datasetkey of the Natural History Museum (London) Collection Specimens
# this dataset has around 77k centroid records of around 1M records without other geospatial issues
key <- "7e380070-f762-11e1-a439-00145eb45e9a" 

# fetch 10 000 records from NHM London Data using rgbif
NHM <- occ_data(
  datasetKey = key,
  hasGeospatialIssue = FALSE, # drop records with other geospatial issues
  hasCoordinate = TRUE,       # keep only records with coordinates
  limit = 10000
)$data

# use CoordinateCleaner to remove records near centroids
NHM_clean <- cc_cen(
  NHM,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  buffer = 2000,   # radius around each centroid (meters, i.e. 2 km)
  value = "clean", # return the data with centroid records removed
  test = "both"    # test both country and province centroids
)

nrow(NHM_clean) # should be around 20 records fewer than 10 000