Scripts to build suitability data cubes and assess the impact of sampling bias when measuring ecological niche Milestone MS20: Code development for predictive habitat suitability modelling
- Virtual Suitability Cube building: code and tutorial
- Sampling bias measured on the ecological niche completeness: code and tutorial
Rocìo Beatriz Cortès Lobos (Unibo), Matilde Martini (Unibo), Michele Di Musciano (UnivAQ), Duccio Rocchini (Unibo)
Species suitability refers to how favorable an environment is for a species to survive, reproduce, and grow in a specific area and time. It takes into account factors like climate, landscape, and resource availability.
Species Distribution Models (SDMs), also known as Environmental Niche Models (ENMs), are tools that use environmental and species occurrence data to study and predict the distribution of species across time and space. SDMs help identify suitable habitats, forecast the movements of invasive species, and illustrate how species distributions might change due to factors like climate change. They are essential for conservation, allowing us to study how species interact with their environment and make informed decisions to protect biodiversity.
Studying species suitability under different environmental conditions is important for understanding population dynamics, planning conservation actions, and monitoring the effects of climate change and human activities on ecosystems. With this knowledge, we can make better decisions on how to protect habitats and species sustainably.
To facilitate the observation of suitability for multiple species over time and space, we developed a framework that uses Data Cubes, multidimensional arrays that organize data in a structured way. In this tutorial, we outline the steps to create a stars object, which includes three dimensions: time, space (represented as grid cells), and species, with suitability as the main attribute. Stars objects can be sliced, aggregated along one of the dimensions, and analyzed, making them ideal for studying species suitability.
For demonstrating the structure of the data cube, we use virtual species, which are artificially generated species with known suitability maps based on climate data. The main steps include combining climate data to calculate the suitability of two different species over time and in the same area, then merging these species into a single stars object.
Starting with a time series of climate variables, we combine them to create the suitability for two different virtual species, whose trends over time we want to observe.
The two data cubes of species suitability are treated as separate entities, which we can then combine by aggregating them over polygons, creating a vector data cube.
This approach makes it easy to visualize and analyze species suitability across time and space for multiple species at once.
# load packages
library(sf)
library(raster)
library(viridis)
library(virtualspecies)
library(ggplot2)
library(tidyverse)
library(terra)
library(stars)
library(geodata)
library(purrr)
library(reshape2)
To begin, we will download climate data from WorldClim. Climate data are among the datasets from which the suitability of a species can be calculated.
In this simple example, we will download only 3 variables: minimum temperature, maximum temperature, and precipitation. The study area is Luxembourg.
## Download climatic data
# worldclim_country function.
# The res paramater is the resolution, with valid values as 10,5,2.5 and 0.5
# The var parameter is the variable name, with valid values as tmin, tmax, tavg, prec, wind, vapr and bio.
# The worldclim_country function will return 12 layers of raster (in a SpatRaster), for each variable.
tmin <- worldclim_country("Lux", "tmin", path=tempdir(), res = 0.5, version = "2.1")
tmax <- worldclim_country("Lux", "tmax", path=tempdir(), res = 0.5, version = "2.1")
prec <- worldclim_country("Lux", "prec", path=tempdir(), res = 0.5, version = "2.1")
# 12 months of precipitation in Luxembourg. Each layer represents the average value for a month.
plot(tmin)
Even if there are 12 rasters, each corresponding to a month, the temporal dimension is not expressed in the RasterStack and needs to be added by time
function
# here, we will simply use numbers from 1 to 12 to represent # the months.
time(tmin) <- time(tmax) <- time(prec) <- 1:12
climate_vars <- c("tmin", "tmax", "prec")
print(tmin)
# class : SpatRaster
# dimensions : 180, 180, 12 (nrow, ncol, nlyr)
# resolution : 0.008333333, 0.008333333 (x, y)
# extent : 5.5, 7, 49, 50.5 (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (EPSG:4326)
# source : LUX_wc2.1_30s_tmin.tif
# names : LUX_w~min_1, LUX_w~min_2, LUX_w~min_3, LUX_w~min_4, LUX_w~min_5, LUX_w~min_6, ...
# min values : -3.1, -3.2, -0.8, 1.2, 5.6, 8.2, ...
# max values : 0.0, -0.1, 2.6, 4.8, 9.0, 12.1, ...
# time (raw) : 1 to 12
The stars R package provides an infrastructure for managing data cubes. Data cubes are multi-dimensional arrays that enable the organization and analysis of large datasets across multiple dimensions, such as time, space, and various environmental variables.
We will create a stars object that contains the climatic variables as attributes and x and y as dimensions. This way, we have everything we need to proceed into a single object.
# first, we create a list that contains all the stacks, to which we apply st_as_stars.
stars_clima <- list(tmin, tmax, prec) %>%
lapply(st_as_stars) %>%
do.call("c", .) %>%
setNames(climate_vars)
print(stars_clima)
# stars object with 3 dimensions and 3 attributes
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# tmin -3.2 0.4 4.4 4.917869 9.4 14.1
# tmax 0.9 5.9 12.5 12.636610 18.9 25.1
# prec 46.0 67.0 75.0 76.737418 84.0 143.0
# dimension(s):
# from to offset delta refsys point x/y
# x 1 180 5.5 0.008333 WGS 84 FALSE [x]
# y 1 180 50.5 -0.008333 WGS 84 FALSE [y]
# time 1 12 1 1 NA NA
The purpose of this brief tutorial is to construct a data structure that allows for the comparison of suitability between two species occupying the same area over time.
As an example, we will consider only two species and two months to observe changes in suitability. The suitability of a real species can be obtained by applying Species Distribution Models (SDMs). But in our case, since we are interested in creating the structure, we will use virtualspecies, which are artificial species randomly generated.
During the creation of these random virtual species, the first step involves generating suitability based on the initial environmental data: in our case, the climatic data.
We will calculate the suitability for each species over the first two months, january and february.
# Let's choose to work with the first two months. For both months, we will create a stack containing the 3 climatic variables.
# number of months
num_months <- 2
# list of RasterStacks for each month
raster_stacks <- map(1:num_months, function(month) {
# extraction and convertion of the climatic variables
layers <- map(climate_vars, function(var) {
as(stars_clima[var,,,month], "Raster")
# convertion into rasters
})
# layer stack of the current month
stack_month <- stack(layers)
# names
names(stack_month) <- climate_vars
return(stack_month)
})
january <- raster_stacks[[1]]
february <- raster_stacks[[2]]
print(january)
# class : RasterStack
# dimensions : 180, 180, 32400, 3 (nrow, ncol, ncell, nlayers)
# resolution : 0.008333333, 0.008333333 (x, y)
# extent : 5.5, 7, 49, 50.5 (xmin, xmax, ymin, ymax)
# crs : +proj=longlat +datum=WGS84 +no_defs
# names : tmin, tmax, prec
# min values : -3.1, 0.9, 53.0
# max values : 0.0, 4.9, 135.0
plot(january, col = plasma(500, alpha = 1, begin = 0, end = 1, direction = 1))
The following function uses the generateRandomSp
function from the virtualspecies
, which creates random species based on the stack of climatic variables and allows us to set some parameters.
Each time the function is executed, a new random species is generated. Therefore, to ensure that we obtain the same species, we need to set a fixed seed using set.seed()
generate_suitability <- function(climate_stack, seed_value) {
set.seed(seed_value) # Assicura che la specie virtuale sia riproducibile
random_sp <- generateRandomSp(raster.stack = climate_stack,
convert.to.PA = FALSE,
species.type = "multiplicative",
approach = "response",
relations = "gaussian",
realistic.sp = TRUE,
plot = FALSE)
return(random_sp$suitab.raster)
}
# generate the suitability for species 1 in January and February and create a single object for species 1 that contains the suitability for both months
suit_sp1_jan <- generate_suitability(january, seed_value = 121)
suit_sp1_feb <- generate_suitability(february, seed_value = 121)
suit_sp1 <- c(suit_sp1_feb, suit_sp1_jan)
# same for species 2
suit_sp2_jan <- generate_suitability(january, seed_value = 456)
suit_sp2_feb <- generate_suitability(february, seed_value = 456)
suit_sp2 <- c(suit_sp2_feb, suit_sp2_jan)
# add time dimension
time(suit_sp2) <- time(suit_sp1) <- 1:2
# suitability for Species 1
plot(suit_sp1, col = viridis(500, alpha = 1, begin = 0, end = 1, direction = 1))
title("Suitability of Species 1 in Jan-Feb", outer=TRUE, line=-0.9)
# suitability for Species 2
plot(suit_sp2, col = viridis(500, alpha = 1, begin = 0, end = 1, direction = 1))
title("Suitability of Species 2 in Jan-Feb", outer=TRUE, line=-0.9)
For each species, we will create a data cube: the dimensions remain x, y, and time, but this time the attribute is the suitability only, with values ranging from 0 to 1, which characterizes each species.
# stars cube for Species 1
suit_sp1 <- list(suit_sp1)
suit_cube_sp1 <- do.call("c", lapply(suit_sp1, stars::st_as_stars)) %>% setNames(., c("suit"))
# stars cube for Species 2
suit_sp2 <- list(suit_sp2)
suit_cube_sp2 <- do.call("c", lapply(suit_sp2, stars::st_as_stars)) %>% setNames(., c("suit"))
print(suit_cube_sp1)
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# suit 0 0.1493433 0.5079458 0.4706163 0.7610167 1
# dimension(s):
# from to offset delta refsys x/y
# x 1 180 5.5 0.008333 WGS 84 [x]
# y 1 180 50.5 -0.008333 WGS 84 [y]
# time 1 2 1 1 NA
print(suit_cube_sp2)
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# suit 0 0.009498931 0.06620133 0.1808423 0.3206807 1
# dimension(s):
# from to offset delta refsys x/y
# x 1 180 5.5 0.008333 WGS 84 [x]
# y 1 180 50.5 -0.008333 WGS 84 [y]
# time 1 2 1 1 NA
We want to incorporate both species into the same object.
To combine the two species into a single object, we need to collapse the x and y dimensions into one, creating a grid that represents individual cells within the study area. This grid allows us to reduce the dimensions of the stars object and facilitates the transition to a vector data cube.
In this process, pixels in the raster are grouped based on their spatial intersection with a set of vector geometries. Each group is then reduced to a single value using an aggregation function, such as the mean or maximum. The result is a one-dimensional sequence of feature geometries defined in space.
By creating the grid, it is possible to reduce two dimensions (x and y) into a single one (cell): this way, we can observe the suitability over time for both species within the defined polygons. As a result, "species" becomes an additional dimension in our data structure.
# bounding box
bb_lux <- st_bbox(tmin)
# from bounding box to polygon
sf_lux <- st_as_sfc(bb_lux) %>%
st_sf()
# grid
lux_grid <- st_make_grid(sf_lux, cellsize = .1, n = c(100, 100), what = "polygons", square = FALSE, offset = st_bbox(sf_lux)[c("xmin", "ymin")]) %>%
st_as_sf() %>%
mutate(id = 1:nrow(.))
# plot raster with grid
# Convert SpatRaster to dataframe
tmin_df <- as.data.frame(tmin$LUX_wc2.1_30s_tmin_1, xy = TRUE, na.rm = TRUE)
ggplot() +
# Add raster layer
geom_raster(data = tmin_df, aes(x = x, y = y, fill = LUX_wc2.1_30s_tmin_1)) +
scale_fill_viridis_c(alpha = 1, begin = 0, end = 1, option = "viridis") +
# Add grid
geom_sf(data = lux_grid, color = "black", size = 0.5, fill = NA) +
theme_void() +
theme(
legend.position = "none",
panel.border = element_blank(),
plot.margin = margin(0, 0, 0, 0)
)
# aggregate by cells, calculating the average of the suitability values within each cell
agg_sp1 <- aggregate(suit_cube_sp1, lux_grid, mean, as_points = TRUE, na.action = na.omit)
agg_sp2 <- aggregate(suit_cube_sp2, lux_grid, mean, as_points = TRUE, na.action = na.omit)
print(agg_sp1)
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# suit 0.0009594404 0.1817593 0.5260862 0.4765746 0.7482857 0.9380438 36
# dimension(s):
# from to offset delta refsys point values
# x 1 297 NA NA WGS 84 FALSE POLYGON ((5.45 49.02887, ...
# time 1 2 1 1 NA NA NULL
print(agg_sp2)
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# suit 2.707365e-05 0.01614575 0.07359824 0.1784444 0.3089694 0.9104424 36
# dimension(s):
# from to offset delta refsys point values
# x 1 297 NA NA WGS 84 FALSE POLYGON ((5.45 49.02887, ...
# time 1 2 1 1 NA NA NULL
Finally, we merge the two cubes. Now the dimensions are 3 again: cell, time and species. The attribute represents the species' suitability over time and space.
cube_sp1sp2 <- c(agg_sp1, agg_sp2) %>%
st_redimension() %>%
st_set_dimensions(., which = "new_dim", values = c("specie1","specie2"), names = "species") %>%
st_set_dimensions(., which = "time", values = c("jan","feb"), names = "month") %>%
setNames(.,"suitability")
print(cube_sp1sp2)
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# suitability 2.707365e-05 0.0440898 0.2516058 0.3275095 0.5842543 0.9380438 72
# dimension(s):
# from to refsys point values
# x 1 297 WGS 84 FALSE POLYGON ((5.45 49.02887, ...,...,POLYGON ((7.05 50.41451, ...
# month 1 2 NA NA jan, feb
# species 1 2 NA NA specie1, specie2
Let's suppose we have a specific point in space: we want to investigate the suitability of the two species at that location
With pull
function you can extract the attributes in a specific point of space and time.
# first, we need to identify the corresponding cell for that location.
which_cell <- st_sf(geometry = st_sfc(st_point(c(5.5, 49.0)), crs = 4326)) %>% st_join(., lux_grid)
print(which_cell$id)
# [1] 10
# this object provides, for each species (,,1) and (,,2), the suitability in January and February in cell 10
# in the first case, it decreased; in the second, it increased.
pull(cube_sp1sp2[,10,,], "suitability")
# , , 1
# [,1] [,2]
# [1,] 0.9036776 0.8826196
#
# , , 2
# [,1] [,2]
# [1,] 0.001298286 0.007460382
## we can better visualize this information this way:
values_suit <- pull(cube_sp1sp2[,10,,], "suitability")
# transform values_suit into a long format without rewriting it
df_long <- melt(values_suit)
colnames(df_long) <- c("cell", "time", "species", "suitability")
# convert time and species into factors
df_long$time <- factor(df_long$time, labels = c("Jan", "Feb"))
df_long$species <- factor(df_long$species, labels = c("Species 1", "Species 2"))
# plot
ggplot(df_long, aes(x = species, y = suitability, color = species)) +
geom_point(size = 3) +
facet_wrap(~time, scales = "free_x", ncol = 2, strip.position = "bottom") +
labs(title = "Suitability for species in January and February",
x = "Month",
y = "Suitability") +
scale_color_manual(values = c("Species 1" = "purple2", "Species 2" = "orange")) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.spacing = unit(2, "lines"),
axis.line = element_line(color = "black", size = 0.8),
strip.placement = "outside",
legend.position = "right",
panel.border = element_rect(color = "black", fill = NA, size = 0.))
The following is an example of suitability over the course of a whole year (12 months) for 5 different species in two distinct locations in the Luxembourg area.
We can imagine the structure of the cube in this way, where the temporal layers are 12
- virtualspecies, an R package to generate virtual species distributions (2018), Leroy B. et al., DOI
- Pebesma E, Bivand R (2023). Spatial Data Science: With applications in R. Chapman and Hall/CRC, London. DOI:10.1201/9780429459016, https://r-spatial.org/book/
- Meyer H, Milà C, Ludwig M, Linnenbrink J, Schumacher F (2024). CAST: 'caret' Applications for Spatial-Temporal Models. R package version 1.0.2, https://hannameyer.github.io/CAST/
- Gilardi A, Lovelace R (2024). osmextract: Download and Import Open Street Map Data Extracts. R package version 0.5.1.900, https://github.com/ropensci/osmextract.
- Predicting species distribution: offering more than simple habitat models, Guisan A., Thuiller W., 2005, DOI
- A practical overview of transferability in species distribution modeling, Werkowska W., Marquez A., Real R., Acevedo P., 2017, DOI
- Effect of roadside bias on the accuracy of predictive maps produced by bioclimatic models, Kadmon R., Farber O., Danin A., 2004, DOI
- The virtual ecologist approach: simulating data and observers, Zurell et al., 2010, DOI
- The n-dimensional hypervolume, Blonder et al., 2014, DOI
- The cumulative niche approach: a framework to assess the performance of ecological niche model projections, Arlè et al., 2024, DOI