Chapter.rmd

---
title: "The ecospat R package: a collection of pre-, core- and post-modeling tools
  to investigate species niches and distributions"
author: 
- "Olivier Broennimann$^1$$^,$$^2$$^,$$^*$ (olivier.broennimann@unil.ch)" 
- "Flavien Collart$^1$ (flaviencollart@hotmail.com)" 
- "Antoine Guisan$^1$$^,$$^2$ (antoine.guisan@unil.ch)"
date: $^1$Department of Ecology and Evolution, University of Lausanne, Switzerland
  $^2$Institute of Earth Surface Dynamics, University of Lausanne, Switzerland
  $^*$Corresponding author
output:
  bookdown::word_document2:
    toc: yes
    toc_depth: '4'
  bookdown::pdf_document2:
    toc: yes
    toc_depth: 4
bibliography: references.bib
csl: "global-change-biology"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Abstract

The R package *ecospat* is a collection of R functions and data sets designed to support spatial ecology analyses, with a focus on pre-, core- and post-modeling analyses of species distributions (SDMs). Pre-modeling analyses include data extractions, designing sampling strategies, sub-sampling data, assessing spatial autocorrelation, reducing collinearity among predictors, measuring environmental similarity and analog/non-analog situations, assessing niche stability within a species or niche difference among species, and measuring species co-occurrences to investigate community assembly. Core-modeling analyses cover modeling fitting, spatial predictions and projections, and model evaluation. It includes in particular the fitting of SDMs with the 'Ensemble of Small Models' approach (ESMs), that applies especially to species with small sample sizes (e.g. rare and endangered species). The post-modelling analyses comprise variance partitioning, determination of occupied patches and range size using the IUCN criteria, SDM prediction map binarization, and predictions of community properties from stacking individual species, including also functional and phylogenetic diversity. In this book section, we first illustrate two study cases where the ecospat package can be used: (a) modeling rare species distribution and quantifying species range size and (b) anticipating biological invasions via niche comparison. We then discuss other applications of the functions in the ecospat package and provide some perspectives about the future development of the package.

# Keywords

species distribution model, niche theory, Ensemble of small models, rare species, model evaluation, response curves, variable contribution, range size, Area of occupancy, Extent of occurrence, niche dynamics, niche quantification, map binarization.

# Introduction

Species distribution modeling (SDMs; @guisan2017; @peterson2011; @franklin2010) is a field of ecology and biogeography that has grown significantly in the last three decades [@araújo2019], with important potential conservation applications [@guisan2013]. SDMs work by relating statistically (or sometime in an expert-based way, @difebbraro2018; @fourcade2013) species observations (simple occurrences, presence-absences, or abundances) to a set of environmental attributes (e.g. climate, topography, substrate, landscape, human influences) characterizing the sampling sites. By doing this through a set of locations in the studied region, SDMs implicitly quantify the suitability of the environment for the species (i.e. habitat or environmental suitability), and if done across the whole species' environmental range, it then quantifies the species' realized environmental niche [@austin1990; @pulliam2000; @soberón2007; @araújo2006], which can be used to make geographic projections, e.g. in the future, provided some assumptions are met (e.g. equilibrium, niche stability, no niche truncation, etc; see @guisan2017; @zurell2020). Individual SDMs can then be used for many applications, ranging from climate change assessments (e.g. @broennimann2006; @patiño2023), anticipating biological invasions (e.g. @broennimann2007; @barbet-massin2018), determining suitable areas for translocation [@ferrarini2016], or rare species management [@guisan2006; @jeliazkov2022]. Individual species predictions can also be stacked across a whole taxonomic group [@dubuis2011; @kass2022] to predict community patterns, like richness, composition, or functional or phylogenetic diversity [@ferrier2006; @guisan2011; @damen2015; @shen2023], and these multiple species predictions can be then be used to support spatial conservation planning [@moilanen2022]. All these analyses and applications can now be performed thanks to numerous software packages, most notably in the R framework [@sillero2023].

One R package to perfrom SDMs is *ecospat* (@dicola2017; @broennimann2023, currently version 4.0.0), a collection of R functions and data sets designed to support spatial ecology analyses, with a focus on pre-, core- and post-modeling analyses of species distributions (SDMs) (Table \@ref(tab:table1)). Functions and tools within ecospat can be classified according to the three main group of analyses that are typically followed when building SDMs [@dicola2017]: (i) pre-modeling analyses, consisting of data preparation and exploratory analyses; (ii) core-modeling analyses, corresponding to model fitting and evaluation, and (iii) post-modeling analyses, which consist of using the models to reach study goals such as combining single species predictions to reconstruct communities (stacked SDMs; @guisan2011, @damen2018b, @zurell2020).

In ecospat, pre-modeling analyses include data extractions, designing sampling strategies, sub-sampling data, assessing spatial autocorrelation, reducing collinearity among predictors, measuring environmental similarity and analog/non-analog situations (mess), assessing niche stability within a species or niche difference among species (using a large set of functions, a key development in ecospat), and measuring species co-occurrences to investigate community assembly. Core-modeling analyses typically cover modeling fitting, spatial predictions and projections, and model evaluation. In ecospat, it includes especially fitting SDMs with the 'Ensemble of Small Models' approach (ESMs), that applies especially to species with small sample sizes (e.g. rare and endangered species) but also performs well for other species, and applying the different steps of the SESAM framework [@guisan2011] for modeling the spatial distribution of species assemblages and communities. It also includes evaluations of individual species models through the computation of cross-validations and evaluation metrics (such as the Boyce Index). The post-modelling analyses comprise variance partitioning, determination of occupied patches and range size using the IUCN criteria, SDM prediction map binarization, and predictions of community properties from stacking individual species (e.g. species richness; @dubuis2011), including also functional and phylogenetic diversity [@damen2018a]. 

In this book section, we first illustrate two study cases where the ecospat package can be used: (a) modeling rare species distribution and quantifying species range size and (b) anticipating biological invasions via niche comparison. We then discuss other applications of the functions in the ecospat package and provide some perspectives about the future development of the package.

```{r table1, message=FALSE, warning=FALSE, include=TRUE, paged.print=TRUE, echo=FALSE}
table1<-read.table("Table1.txt",header=TRUE,sep="\t")
knitr::kable(table1[-20,], row.names = FALSE, caption = "List of the functions in the ecospat package (adapted from Di Cola et al. 2017)")
```

# Case studies

## Modeling rare species

In this section, we will use the ecospat package to employ the method called "ensemble of small models (ESM)", which was developed by @lomba2010; @breiner2015; @breiner2018 and is particularly suitable for species with small sample size, like rare and endangered species (yet also works particularly well for species with larger sample sizes but at higher computational cost than more traditional approaches; @breiner2015) . One of the major problems to model rare species is that the number of occurrences is usually scarce. Although some studies reported that species could be accurately modeled with very low sample size (e.g. 3 occurrences in @vanproosdij2015), sample size is problematic for the modeling procedure, with the risk of overfitting models when the number of occurrences is low compared to the number of predictors [@vandepol2016] and problem of evalutation in the case of test data with small sample sizes [@collart2021, @collart2023]. In general, authors are limiting the number of predictors that are put in a model using the rule of thumb that not more than one predictor term should be used per 10 occurrences. To avoid this limitation, ESMs compute bivariate models and then combine all possible bivariate models into a weighted ensemble. By averaging simple small models to an ensemble, with their respective cross-validated performance as weights, ESMs avoid overfitting without losing explanatory power through reducing the number of predictor variables, and were shown to perform significantly better than standard SDMs with species having a low number of occurrences [@breiner2015]. For this section, we will be focusing on modeling the ecological niche of the alpine plant *Veronica alpina* in the Western Swiss Alps.

### Pre-Modeling

The ESM functions of the *ecospat* package rely on the model fitting functions in the *biomod2* package [@thuiller2013]. We thus need to first format our data by using the *BIOMOD_FormatingData* function, where species occurrences and associated coordinates, the environmental conditions and the name of the species of interest are given. In this example we use presence-absence data for 300 plots, where *Veronica alpina* is present in 12 locations, and maps of 5 environmental predictors. The latter variables are: growing degree days above 0°C, moisture index over the growing season (average values for June to August in mm day-1), annual sum of radiation (in kJ m-2 year-1), slope (in degrees), and topographic position.

```{r load-libraries, results='hide', message=F, warning=F}
# Load the packages
library(ecospat) 
library(biomod2) 
library(terra) 
library(viridis)
```

```{r load-coord, message=F, warning=F, results='asis'}
set.seed(123)
data("ecospat.testData")

# coordinates of the plots
xy <- ecospat.testData[,2:3]
# species presences and absences
sp_occ <- ecospat.testData$Veronica_alpina
sum(sp_occ) ## Number of occurrences
```

```{r load-env, message=F, warning=F, results='asis'}
# environmental data
env <- rast(system.file("extdata/ecospat.testEnv.tif", package="ecospat"))
```

```{r biomod-formating, results='hide',message=F, warning=F}
# Formatting the data with the BIOMOD_FormatingData() function from the package biomod2
library(biomod2)
myBiomodData <- BIOMOD_FormatingData(resp.var = as.numeric(sp_occ),
                                     expl.var = env,
                                     resp.xy = xy,
                                     resp.name = "Veronica.Alpina",
                                     filter.raster = TRUE)
```

### Core-Modeling

The function *ecospat.ESM.Modeling* is used to model the ecological niche of the species by generating bivariate models. The argument *data* is for the formatted dataset object generated by BIOMOD_FormatingData. The desired algorithms can be provided in the argument *models*. Model parameters can be adapted via the argument *models.options* by giving the object from the function bm_ModellingOptions() of the biomod2 package. As in the package biomod2, ESM can fit 12 different algorithms: Generalized Linear Model ('GLM'), Gradient Boosted Machine ('GBM'), eXtreme Gradient Boosting Training (XGBOOST), Generalized Additive Models ('GAM'), 'CTA', Artificial Neural Network ('ANN'), 'SRE', 'FDA', 'MARS', 'RF', Maximum entropy ('MAXENT', using the java software or 'MAXNET' from the maxnet package). Tuning to obtain the optimal parameters for the model can be realized with the argument *tune*. *Prevalence* can be set to build a "weighted response". If NULL, each observation (presence or absence) will have the same weight. You can also give a specific weight to observations via the argument *Yweights*. To evaluate the models, the function performs a repeated split-sampling cross-validation using the arguments *DataSplit* and *NbRunEval*. *DataSplit* corresponds to the percentage of observations used to calibrate the models. *NbRunEval* indicates the number of times the split-sampling procedure is replicated. The function also allows user-defined cross-validations by giving a logical matrix in the argument *DataSplitTable*, where each row corrends to an observation and each column corresponds to a run. A value TRUE means that an observation will be used for model calibration while a FALSE is for model evaluation. *weighting.score* corresponds to the evaluation metric that will be used to weight single bivariate models in the final ensemble model. The available evaluation metrics are: 'AUC', 'SomersD' (2xAUC-1), 'Kappa', 'TSS' or 'Boyce'. *which.biva* allows to split the bivariate model procedure in several parts. For example, if which.biva is 1:3, only the three first variable combinations will be modeled. This allows to run different bivariate splits on different computers. However, it is better not to use this option if all models are run on a single computer. If you do so, make sure to give each of your modeling subset a unique *modeling.id.* and avoid space characters. Parallel computing can be enabled with the argument *parallel.*

```{r esm-mod, message=FALSE, warning=FALSE, results='hide'}
my.ESM <-ecospat.ESM.Modeling(data = myBiomodData,
                              NbRunEval = 3, 
                              DataSplit = 70, 
                              DataSplitTable = NULL, 
                              Prevalence = 0.5,
                              weighting.score = "SomersD", 
                              models = "GLM", 
                              tune = FALSE,
                              modeling.id = as.character(format(Sys.time(), "%s")), 
                              models.options = NULL, 
                              which.biva = NULL, 
                              parallel = FALSE, 
                              cleanup = FALSE,
                              Yweights = NULL)
```

The following step is to combine all the bivariate models into an ensemble. To so, we can use the function *ecospat.ESM.EnsembleModeling* which will need the object returned by *ecospat.ESM.Modeling*, the evaluation metric used to weight the bivariate models (*weighting.score*) and a *threshold* to remove poor performing models. The argument *models* allows to select one or several algorithms to realize the ensemble.

```{r esm-ensemble, message=FALSE, warning=FALSE}
my.ESM.EF <- ecospat.ESM.EnsembleModeling(ESM.modeling.output = my.ESM, 
                                          weighting.score = "SomersD", 
                                          threshold = 0, 
                                          models = NULL)
```

ESM performances resulting from the cross-validations can be observed in the object returned by *ecospat.ESM.EnsembleModeling* (Table \@ref(tab:esm-table2)).

```{r esm-table2, message=FALSE, warning=FALSE, results='asis', echo=FALSE}
df<-t(my.ESM.EF$ESM.evaluations[,-c(1,15:16)])
colnames(df)<-c("RUN1","RUN2","RUN3")
knitr::kable(df, caption = "ESM performances based on a mean or standard deviations across bivariate model performances of a same run", digits = 3)
```

However, because a minimum sample size is needed to evaluate models (see @jiménez-valverde2020), it is recommended to evaluate ESMs using the recently proposed pooling evaluation approach [@collart2023]. The function *ecospat.ESM.EnsembleEvaluation* uses this approach, which consists of pooling the suitability values predicted with the hold-out data (evaluation dataset) across replicates to measure an overall evaluation score. As the same observation (presence or absence or background point) is presumably sampled in several replicates, the suitability values for each data point are consequently averaged across replicates where they were sampled. This procedure generates a series of independent suitability values with a size approximately equal to that of the number of observations (the number of suitability values might be slightly lower than the number of original observations as some data points may not be sampled by chance in any of the n replicates), which agreement to the actual observations can then be measured using sufficient sample size. This function can compute several metrics, which can be selected with the argument *metrics*. If needed, *EachSmallModels* allows evaluating each bivariate model via the pooling evaluation (Table \@ref(tab:esm-table2-2)).

```{r esm-eval,warning=FALSE,message=FALSE}
my.ESM.EF.eval <- ecospat.ESM.EnsembleEvaluation(
                    ESM.modeling.output = my.ESM,
                    ESM.EnsembleModeling.output = 
                    my.ESM.EF,metrics = c("SomersD","AUC","MaxTSS","Boyce"), 
                    EachSmallModels = FALSE)

## Evaluation dataset obtained by the pooling evaluation
pred.test <- as.data.frame(my.ESM.EF.eval$ESM.fit)

## Evaluation scores of the ESM based on the pooling evaluation
eval<-my.ESM.EF.eval$ESM.evaluations
```

```{r esm-table2-2, results='asis', echo=FALSE}
knitr::kable(eval, caption = "ESM performances based on the pooling evaluation", digits = 3)
```

The *ecospat* package also has numerous functions to compute model performances based on model predictions and species observation. For example, the Boyce index that only requires species presences for evaluating predictions [@hirzel2006] can be calculated with the function *ecospat.boyce*. The argument *obs* should contain the model prediction for the presences while *fit* should contain the predictions of background points. Correlation measurement can be changed via the argument *method*

```{r esm-boyce,results='asis'}
boyce.index <- ecospat.boyce(fit = pred.test$Fit_GLM, 
                             obs = pred.test$Fit_GLM[pred.test$resp==1], 
                             PEplot = FALSE,
                             method = "spearman")
boyce.index$cor
```

MaxTSS and MaxKappa can be calculated via the functions *ecospat.max.tss* and *ecospat.max.kappa* and the variations of TSS and Kappa metric on a threshold can be done with *ecospat.plot.tss* (Figure \@ref(fig:tss-plot)) and *ecospat.plot.kappa* (Figure \@ref(fig:kappa-plot)).

```{r tss-plot,results='asis',fig.cap="TSS value as a function of the probability threshold. The maxTSS of 0.61 corresponds to a probability threshold of 0.34"}
MaxTSS <- ecospat.max.tss(Pred = pred.test$Fit_GLM, Sp.occ = pred.test$resp)
MaxTSS$max.TSS

ecospat.plot.tss(Pred = pred.test$Fit_GLM, Sp.occ = pred.test$resp)
```

```{r kappa-plot,results='asis',fig.cap="TSS value as a function of the probability threshold. The maxKappa of 0.24 corresponds to a probability threshold of 0.76"}
MaxKappa <- ecospat.max.kappa(Pred = pred.test$Fit_GLM, Sp.occ = pred.test$resp)
MaxKappa$max.Kappa

ecospat.plot.kappa(Pred = pred.test$Fit_GLM, Sp.occ = pred.test$resp)
```

Model adequacy can also be checked by observing species response curves to each environmental predictors and assess whether they make ecological sense. To do so, The function *ecospat.ESM.responsePlot* can be used. This function is an adaptation of the Evaluation Strip method proposed by @elith2005 and needs the objects returned by *ecospat.ESM.Modeling* and *ecospat.ESM.EnsembleModeling*. The statistic used generate the response curve for a single predictor while keeping all others constant can be changed via the argument *fixed.var.metric.* When looking at the curves produced with this approach, we can see that two variables seem important for our studied species (Figure \@ref(fig:esm-resp)). To go deeper into this, we can also compute their variable importance.

```{r esm-resp,  message=F, warning=F,fig.cap="Response curves to each environmental predictors"}
response.plots <- ecospat.ESM.responsePlot(
                                ESM.EnsembleModeling.output = my.ESM.EF,
                                ESM.modeling.output = my.ESM,
                                fixed.var.metric = 'median')
```

To measure the contribution of each variable, the function *ecospat.ESM.VarContrib* can be employed. This function computes the ratio between the sum of weights of bivariate models where a focal variable was used and the sum of weights of bivariate models where the focal variable was not used. The ratio is corrected for the number of models with or without the focal variable. This ratio gives an indication on the proportional contribution of the variable in the final ensemble model [@breiner2015]. A value of higher than 1 indicates that the focal variable has a higher contribution than average. For the ensemble model, a weighted mean is applied among model algorithms. In this example, we can see that only two variables seem useful to quantify the ecological niche of *Veronica alpina* (Table \@ref(tab:esm-table4)).

```{r esm-varcontrib, results='asis',warning=F,message=F}
var.contrib <- ecospat.ESM.VarContrib(ESM.modeling.output = my.ESM,
                                      ESM_EF.output = my.ESM.EF)
```

```{r esm-table4, results='asis',warning=F,message=F,echo=FALSE}
knitr::kable(var.contrib, caption = "Variable contributions to ESMs", digits = 3)
```

After looking at the model performances and response curves, models can be projected using two functions: *ecospat.ESM.Projection* that projects each bivariate model, and *ecospat.ESM.EnsembleProjection* that generates the ensemble of these bivariate models. The *new.env* argument allows the use of a *SpatRaster* object to project models onto a new area or time period. The argument *name.env* gives a name to the projection. Parallel computing can be enabled with the argument *parallel.* The generated maps correspond to the rounded values of habitat suitability times 1000 (Figure \@ref(fig:esm-projplot)).

```{r esm-proj, results='hide', message=FALSE,warning=FALSE}

### Projection of simple bivariate models into new space 
my.ESM.proj.current <- ecospat.ESM.Projection(ESM.modeling.output = my.ESM,
                                              new.env = env,
                                              name.env = "currentTime",
                                              parallel = FALSE,
                                              cleanup = FALSE)

### Projection of calibrated ESMs into new space 
my.ESM.EF.proj.current <- ecospat.ESM.EnsembleProjection(
                              ESM.prediction.output = my.ESM.proj.current,
                              ESM.EnsembleModeling.output = my.ESM.EF,
                              chosen.models = "all")
```

```{r esm-projplot,fig.cap="projected ESM of Veronica alpina at present time"}
plot(my.ESM.EF.proj.current, col = rev(viridis::viridis(50)))
points(xy[sp_occ==1,],col = "black")
```

### Post-modeling

ESM projections can be afterwards binarized (i.e. transforming the continous suitability index into presence-absence). To binarize these maps, diverse thresholds can be computed via the function *ecospat.ESM.threshold* (Table \@ref(tab:esm-table5)). This function also provides evaluation scores for the full model (thus, evaluating the fit of the model but not its predictive power and transferability).

```{r esm-thresh, results='asis', warning=FALSE, message=FALSE}

Thr <- ecospat.ESM.threshold(ESM.EnsembleModeling.output = my.ESM.EF, PEplot = FALSE)
```

```{r esm-table5, results='asis', warning=FALSE, message=FALSE,echo=FALSE}
obj <- t(Thr[,-1])
colnames(obj) = Thr[1,1]
title<-"Various threshold and fit performances of ESM"
knitr::kable(obj, caption = title, digits = 3, longtable = TRUE)
```

Model projections can be afterwards binarized with the function *ecospat.binary.model* which need in the arguments *Pred*, a spatial grid (Figure \@ref(fig:esm-bin)) and *Threshold*, the value of the threshold.

```{r esm-bin,fig.cap="Binarization of the projected ESM based on the threshold maximizing the TSS"}
my.ESM.EF.proj.current.bin <- ecospat.binary.model(Pred = my.ESM.EF.proj.current,
                                                   Threshold = (Thr$TSS.th)*1000)
plot(my.ESM.EF.proj.current.bin)
```

After binarizing the maps, we could quantify the species range size or the occupied patches from ESM maps and IUCN criteria. In the ecospat package, the function *ecospat.rangesize* and *ecospat.occupied.patch* are made for these purposes.

More precisely, *ecospat.rangesize* allows quantifying the Area of Occupancy (AOO) and the Extent of Occurrence (EOO) criteria (@iucn2012; Figure \@ref(fig:esm-iucn)). Numerous parameters are available and described in the help file.

```{r esm-iucn, warning=FALSE,fig.cap="IUCN criteria. AOO corresponds to areas in green. EOO corresponds to the convex polygon in red"}
rangesize <- ecospat.rangesize(my.ESM.EF.proj.current.bin,
                               xy = xy[sp_occ==1,],
                               AOO.circles = TRUE,
                               lonlat = FALSE)
A00<-rangesize$RangeSize$AOO
A00/1000^2 #area in km2
EOO<-rangesize$RangeSize$EOO 
EOO/1000^2 #area in km2

plot(my.ESM.EF.proj.current.bin,legend = FALSE)
plot(rangesize$RangeObjects$AOO,add = TRUE, col = "red",legend = FALSE)
plot(rangesize$RangeObjects$EOO@polygons,add = TRUE, border = "red", lwd = 2)
```

*ecospat.occupied.patch* quantified the number of patches where species occupied based on species distribution predictions, species occurrences and a buffer value (in meter) around species occurrences (Figure \@ref(fig:esm-occpatch)).

```{r esm-occpatch, fig.cap="Occupied patches. Each patch corresponds to a different color"}
ocp  <- ecospat.occupied.patch(my.ESM.EF.proj.current.bin,
                               xy[sp_occ==1,],
                               buffer = 500)
plot(ocp,col = viridis::viridis(50))
points(xy[sp_occ==1,],col = "red",cex = 0.5, pch = 19)
```

## Niche dynamics of invasive species

In this section, we use the *ecospat* package to investigate the environmental niche dynamics of an invasive species across its native and invaded ranges. We will develop the example of the spotted knapweed (*Centaurea stoebe)*, an Asteraceae native to Central and Eastern Europe. In Europe, both diploid and tetraploid cytotypes can be found, but only the tetraploids have spread to North America, where it is considered an invasive species, most presumably because of its ability to produce perennial polycarpic plants [@treier2009; @mráz2011]. This species served as a study case upon which most of the methodological development on niche dynamics presented in this section have been based [@broennimann2007 ; @broennimann2008; @broennimann2012 ; @petitpierre2012 ; @broennimann2014; @guisan2014]. Note however that, for practical reasons, the analyses presented here use occurrence and environmental data at a lower resolution than used in the original analyses, which can lead to slightly different results. The main idea of the niche dynamics analyses is to quantify realized environmental niches (sensu @hutchinson1957, see @araújo2006; @broennimann2007) by investigating the occurrence density of species in a gridded climatic space allowing for direct pixel quantification and permutation tests [@broennimann2012].

### Data preparation

First we need to load the R packages employed for the analyses. In addition to the package *ecospat*, the package *terra* is required for handling geo-referenced datasets, the package *geodata* to download the climatic dataset, the package *dplyr* to query, subset and summarize data, and the package *ade4* to perform the ordination analyses.

```{r load2, results='hide', message=F, warning=F}
# Load the needed packages
library(ecospat)
library(terra)
library(tidyterra)
library(geodata)
library(dplyr)
library(ade4)
library(viridis)
```

Then we download the climatic data, here the bioclim variables of the WorldClim v2.1 dataset at 10' resolution using *worldclim_global* from the *geodata* package. For computational efficiency, we aggregate the climatic data at 30' resolution (\~50 km) using *aggregate* from *terra.* To delimit the study area for the analyses we also download (and unzip) the geo-referenced polygons of the Terrestrial Ecoregions of the World distributed by the WWF.

```{r clim, results='hide', message=F, warning=F}
## seed set for reproducible results
set.seed(123)
## create a randomly attributed temporary local folder to download the data
tempdir<-tempdir()

## download worlclim 2.1 climate data at 10' resolution and biomes data

clim.world<-geodata::worldclim_global("bio",res=10,version="2.1",path=tempdir)
clim.world<-terra::aggregate(clim.world,3,fun=mean,na.rm=TRUE)
names(clim.world)<-sub("wc2.1_10m_","",names(clim.world))
names(clim.world)<-sub("_","",names(clim.world))

## download Terrestrial Ecoregions of the World dataset

teow.URL<-"https://storage.googleapis.com/teow2016/Ecoregions2017.zip"
# official website for manual download: 
# https://files.worldwildlife.org/wwfcmsprod/files/Publication/file
# /6kcchn7e3u_official_teow.zip
download.file(teow.URL,destfile = paste0(tempdir,"/TEOW.zip"))
unzip(paste0(tempdir,"/TEOW.zip"),exdir = tempdir)
biomes<-terra::vect(paste0(tempdir,"/Ecoregions2017.dbf"),crs="+proj=longlat")

```

We now load and prepare the data for the two regions of analysis, namely Europe, where the species is native, and North America, where the species is invasive. The occurrence data are retrieved from the extdata folder included in the *ecospat* package and converted to a convenient vector format with *vect*. We thus use only occurrences of tetraploid cytotypes to ensure a fair comparison between ranges, as diploids are absent from the North American continent . We restrict the regions of analysis (background) to the biomes where the species is present in each realm using a pipe of *dyplr* functions. The biomes selected are Temperate Broadleaf Forests (4), Temperate coniferous forests (5), Boreal Forests (6), Temperate Grasslands (8) Mediterranean Forests and woodlands (12) and Deserts (13). We further restrict the background to areas distant of less than 500 km from existing occurrences (this avoids including large regions of the Paleoarctic realm where the species does not occur in the native range). The background vector can then be used to *crop* and *mask* the climatic data. Note that the choice of the background area has important consequences on niche quantification [@rödder2011; @broennimann2012], especially in the similarity tests presented below. The general advice in selecting the background is to include areas that are, or have been accessible to the species through dispersal during its evolutionary life [@barve2011]. We recommend performing a sensitivity analysis on the parameters chosen for the background selection to ensure the stability and reliability of results. The plots show the background area and overlaid occurrences to visualize the final data to be analyzed (Figures \@ref(fig:dataEU) and \@ref(fig:dataNA).

```{r dataEU, fig.cap="European background and occurrences. Occurrences are shown with open circles. The color gradient correponds to mean annual temperatures (bio1) over the background area"}
## data for Europe
# occurrences

occ.EU.df<-read.delim(system.file("extdata/Csto4xEU.txt", package="ecospat"))
occ.EU<-vect(occ.EU.df,geom=c("x", "y"),crs="+proj=longlat")

# biomes

buf<-buffer(occ.EU,1500000) %>% aggregate() # 1500km buffer around occurrences
bkg<-biomes  %>% filter(BIOME_NUM %in% c(4,5,6,8,12,13)) %>% group_by(REALM) %>% 
     summarize() %>% crop(buf) # biomes in the realm where Centaurea is present
clim.EU<-crop(clim.world,bkg)
clim.EU<-mask(clim.EU,bkg)
plot(clim.EU[[1]], col=viridis(100,option="turbo")) 
    #plot annual temperature (bio1)
plot(occ.EU,cex=0.5,pch=21,alpha=0.4,add=TRUE)
```

```{r dataNA, fig.cap="North American background and occurrences. Occurrences are shown with open circles. The color gradient correponds to mean annual temperatures (bio1) over the background area"}
## data for North America
# occurrences

occ.NAM.df<-read.delim(system.file("extdata/Csto4xNAM.txt", package="ecospat"))
occ.NAM<-vect(occ.NAM.df,geom=c("x", "y"),crs="+proj=longlat")

# biomes

buf<-buffer(occ.NAM,1500000) %>% aggregate()
bkg<-biomes %>% filter(BIOME_NUM %in% c(2,4,5,6,8,12,13)) %>% group_by(REALM) %>%  
     summarize() %>% crop(buf)
clim.NAM<-crop(clim.world,bkg)
clim.NAM<-mask(clim.NAM,bkg)
plot(clim.NAM[[1]], col=viridis(100,option="turbo"))
plot(occ.NAM,cex=0.5,pch=21,alpha=0.4,add=TRUE)
```

### Niche quantification

The niche quantification will be performed in two-dimensional space represented by the two first components (PC1 and PC2) of a principal component analysis (PCA). We chose to use two niche axes that maximize the variation contained in the original 19 bioclim variables from the WorldCim dataset, but the user could as well choose two specific variables relevant for the ecology of the species based on expert assessment. Niche quantification in one dimension is presented in a following section. To calibrate the PCA on the full extent of climatic values present in both background areas, we first join the European and North American datasets using *merge* from *terra.* We then extract the climatic values for each pixels of the global background (*clim.glob*), for each continental background (*clim.glob.EU*, *clim.glob.NAM*), and for each set of continental occurrences (*clim.occ.EU*, *clim.occ.NAM*) using *extract* from *terra*. The PCA is calibrated on the values of the global background using *dudi.pca* from *ade4*. The function ecospat.plot.contrib allows to plot the correlation circle (Figure \@ref(fig:pca)) indicating the correlation of original variables with the principal component axes with arrows. Here PC1 is mostly (but not exclusively) correlated with precipitation variables (bio12 to bio 19) while PC2 is more correlated with temperature variables (bio1 to bio11). The inertia of each principal component (similar to the variance explained in a regression model context) is also indicated. Here the combined contribution of PC1 and PC2 explains almost 80% of the climatic variation in the original dataset. The calculation of PCA scores (*li* vectors in *dudi.pca* objects) for the pixels of the backgrounds (*scores.glob, scores.glob.EU, scores.glob.NAM*) and for the occurrences (*scores.occ.EU, scores.occ.NAM*) using *suprow* from *ade4* will be useful for the next step.

```{r pca, results='hide',message=F, warning=F, fig.cap="Correlation circle showing the importance of original variables in the PCA"}

# extraction of climate values

clim<-terra::merge(clim.EU,clim.NAM)
clim.glob<-na.omit(terra::values(clim))
clim.glob.EU<-na.omit(terra::values(clim.EU))
clim.glob.NAM<-na.omit(terra::values(clim.NAM))
clim.occ.EU<-terra::extract(clim.EU,occ.EU,ID=FALSE)
clim.occ.NAM<-terra::extract(clim.NAM,occ.NAM,ID=FALSE)

## PCA calibration
pca.env<-dudi.pca(clim.glob,scannf=FALSE,nf=2)
ecospat.plot.contrib(contrib=pca.env$co,eigen=pca.env$eig)

## PCA scores

scores.glob<-pca.env$li
scores.glob.EU<-suprow(pca.env,clim.glob.EU)$li
scores.glob.NAM<-suprow(pca.env,clim.glob.NAM)$li
scores.occ.EU<-suprow(pca.env,clim.occ.EU)$li
scores.occ.NAM<-suprow(pca.env,clim.occ.NAM)$li
```

The following step is the core of the niche quantification analysis. The function *ecospat.grid.clim.dyn* first creates a gridded two-dimensional climatic space of RxR pixels bounded by the minimum and maximum values of the PCA scores of the background points (*score.glob*). It then uses kernel density estimation functions to calculate the density of occurrence in each pixel. The result is a grid of occurrence densities in a two-dimensional climatic space, with high values indicating the core of species niche, low values indicating the margin of the niche, and zero values indicating conditions outside of the niche. For comparison between ranges, occurrences densities are standardized between zero and one (object z.uncor). An alternative version of occurrences densities is also calculated by dividing in each pixel the occurrence density by the density of the corresponding climatic condition in the background (object z.cor return by *ecospat.grid.clim.dyn;* defined as occurrence occupancy in @broennimann2012). The function *ecospat.plot.niche* allows to plot the calculated niche as a gradient of occurrence densities along the two first PCA axes (Figure \@ref(fig:grid-clim1)). Alternatively, the niches can be directly plotted with the *plot* function as *rast* objects (Figure \@ref(fig:grid-clim2)).

```{r grid-clim1, fig.cap="Occurence densities in a) Europe, and b) North America. Shades of gray pixels indicate occurrence densities along the two first PCA axes. Solid lines show the contours of all climates, and dashed line show the 50% most common climates"}
z.EU  <- ecospat.grid.clim.dyn(glob=scores.glob, 
                               glob1 = scores.glob.EU, 
                               sp= scores.occ.EU, 
                               R=100, th.sp=0,th.env=0,kernel.method = "ks")
z.NAM <- ecospat.grid.clim.dyn(glob=scores.glob, 
                               glob1 = scores.glob.NAM, 
                               sp= scores.occ.NAM, 
                               R=100, th.sp=0,th.env=0,kernel.method = "ks")
par(mfrow=c(1,2))
ecospat.plot.niche(z.EU,title="a)")
ecospat.plot.niche(z.NAM,title="b)")
```

```{r grid-clim2, fig.cap="Occurence densities in a) Europe, and b) North America plotted as rasters"}
par(mfrow=c(1,2))
plot(z.EU$z.uncor,col=viridis(100,option="magma",direction = -1),main="a)")
plot(z.NAM$z.uncor,col=viridis(100,option="magma",direction = -1),main="b)")
```

We can analyze the results of the niches quantification more in depth. First we want to know how much the two niches overlap. This can be achieved with the *ecospat.niche.overlap.* which calculates the Schoener's D and Hellinger I overlap indices. The parameter *cor=TRUE* allows to assess the overlap when correcting for the density of environmental conditions in the background. In our case, using cor=FALSE, we detect an overlap D of 0.36.

```{r}

ecospat.niche.overlap(z1=z.EU,z2=z.NAM,cor=FALSE)
```

We can ask the question whether the overlap measured is higher than random. This can be achieved with *ecospat.niche.equivalency.test* and *ecospat.niche.similarity.test* functions, based on [@warren2008]. The test of niche equivalency asks whether the overlap observed between the two niches is significantly higher than the overlap that one would get if the occurrences between the two niches would be randomly reallocated. On the other hand, the test of similarity asks whether the overlap observed between the two niches is significantly higher than the overlap one would get if the two niches would be randomly moved in the background space. Both tests have been implemented in *ecospat* for correspondence with @warren2008, though here in environmental space (while in geographical space for @warren2008; see @broennimann2012), but the niche equivalency test almost always provides significant results even when the overlap is very low and is thus in general not informative [@collart2021; @peterson2011]. We thus only run the niche similarity test, with parameters *rand.type* = 1 to randomize both niches in the background, *rep*=100 to set the number of replications, *intersection*=NA to perform the randomization in global background without removing marginal environments, and using *overlap.alternative* = "higher" as the alternative hypothesis for the test (i.e. a p-value \< 0.05 indicate that the observed overlap is significantly higher than the simulated overlap from randomized niches). Here, with a p-value of 0.25, the niche similarity test indicates that the niche of the spotted knapweed in Europe and in North America do not significantly depart from similarity.

```{r}
sim.test<-ecospat.niche.similarity.test(z1=z.EU,z2=z.NAM,rand.type=1,rep=100,
                                intersection=NA, overlap.alternative="higher")

sim.test$obs$D
```

We can further investigate whether some parts of the native niche are not occupied in the invaded range, and conversely, if the species occupies new conditions in the invaded range. This can be done with the *expansion* , *stability* and *unfilling* indices returned by the *ecospat.niche.dyn.index* function (see COUE framework in @guisan2014). *Stability* and *expansion* correspond to the percentages of pixels of the invasive niche which conditions present, and not present in the native range, respectively. *Unfilling* corresponds to the percentage of the native niche which is not present in the invaded range. The results indicate that the invasive niche of the spotted knapweed expands largely beyond conditions present in the native niche, while only a small fraction of the native niche is not occupied in the invaded ranges. It is possible to plot the *expansion*, *stability* and *unfilling* indices in the environmental space using the function *ecospat.plot.niche.dyn* (Figure \@ref(fig:USEplot)).

```{r USE}
dyn<-ecospat.niche.dyn.index(z.EU,z.NAM)
dyn$dynamic.index.w
```

```{r USEplot, fig.cap="niche dynamics. The plot shows the niche stability in blue, niche expansion in orange, and niche unfilling in yellow. The solid green and red contour lines indicate the extent of environmental conditions that exist in the native and invaded ranges, respectively. The dotted contour lines indicate the quantile of the most abundant environment in both ranges (here 50% most abundant, with quant=05). The densities of occurrences in the native range (argument interest=1) are displayed using gray shading."}
# colorblind friendly palette, Wong, B. (2011). Nature Methods
pal<-c("#D55E00","#E69F00","#F0E442","#CC79A7","#009E73","#0072B2","#56B4E9")

ecospat.plot.niche.dyn(z.EU,z.NAM,quant=0.5,name.axis1 ="PC1",
            name.axis2="PC2",interest=1,col.unf="#F0E442", col.exp="#E69F00", 
            col.stab="#0072B2", colZ1="#009E73", colZ2="#D55E00",transparency = 40)
```

Finally, we might wonder where are the areas predicted with a high density of occurrence by the niche quantification models. Indeed, every pixel in the geography corresponds to a predicted density of occurrence in the environmental space. The function *ecospat.niche.zProjGeo* allows doing that. It is possible to project the density of occurrence of one niche (*z* object) across its own range of calibration (e.g. for Europe using *zproj*=NULL and *env*=clim.EU; Figure \@ref(fig:EUniche)) but also to project the density of occurrence to another range, here for example to project the European niche of the spotted knapweed in North America (Figure \@ref(fig:EUnicheinNA)), which interestingly becomes then a form of SDM.

```{r EUniche, fig.cap="European niche projected in Europe. Dark colors indicate predicted high density of occurrence. Open circles indicate know occurrences of the species."}
geo.z.EU<-ecospat.niche.zProjGeo(z=z.EU,zproj=NULL,env=clim.EU) 
plot(geo.z.EU,col=viridis(100,option="magma",direction =-1)) 
plot(occ.EU,add=TRUE,cex=0.5,pch=21,alpha=0.4)  
```

```{r EUnicheinNA, fig.cap="European niche projected in North America. Dark colors indicate predicted high density of occurrence. Open circles indicate know occurrences of the species."}
geo.z.EUtoNAM<-ecospat.niche.zProjGeo(z=z.EU,zproj=z.NAM,env=clim.NAM) 
plot(geo.z.EUtoNAM,col=viridis(100,option="magma",direction =-1)) 
plot(occ.NAM,add=TRUE,cex=0.5,pch=21,alpha=0.4)  
```

Similarly, we can also project niche dynamic indices in geography using the function *ecospat.niche.dynIndexProjGeo* (Figure \@ref(fig:EUnichedyninNAM)). The argument *proj* controls the range of projection (i.e. proj=2 for the North American range), and *env* sets the environmental *rast* object used for the projection. The results here show that a large part of the North East and Midwest, as well as smaller fragmented areas of the Rocky Mountains in the West have climatic conditions corresponding to the native niche of spotted knapweed, but that large areas, especially in the South correspond to an expansion of the niche.

```{r EUnichedyninNAM, fig.cap="Niche dynamics of the European niche projected in North America. Colors indicate the type of niche dynamics in regard to the European niche."}
geo.dyn<-ecospat.niche.dynIndexProjGeo(z.EU,z.NAM,proj=2,env=clim.NAM)

plot(geo.dyn,col = c("grey", "#F0E442", "#E69F00", "#0072B2"),
     plg=list(legend=c("outside both niches", "niche unfilling", 
                       "niche expansion", "niche stability"), x="bottomleft"))
plot(occ.NAM,add=TRUE,cex=0.5,pch=21,alpha=0.4)
```

# Discussion and perspectives

We have illustrated the use of the ecospat R package to model a rare alpine plant species with ensemble of small models (ESMs; @lomba2010; @breiner2015) and to quantify and compare niches in the native and invaded ranges of an invasive plant species with the COUE framework [@broennimann2012; @guisan2014]. These two examples only used part of the functions available in ecospat. Other examples could have been developed for instance along the steps of the spatially-explicit species assemblage modelling (SESAM) frameworks [@guisan2011], including pre-modelling functions to analyse co-occurrences, biotic interactions and assembly of species to form assemblages [@damen2018b; @scherrer2019a], post-modelling of assemblages and communities from individual species predictions [@damen2015], or the pre- and post- analyses and modelling of spatial patterns of phylogenetic diversity, at the level of individual species [@pio2014; @damen2018a] or communities [@ndiribe2013]. For an example of community modelling using these functions, we send interested readers to @dicola2017 (example 2) and to the related published references (some cited above or in @damen2015). By containing some unique features (COUE niche framework, ESMs, SESAM framework), the ecospat package is complementary to other R packages for SDMs and spatial analyses in ecology, biogeography and conservation (see e.g. @joo2019; @meynard2019; @sillero2023, Kass et al. in review). 

Furthermore, ecospat is complemented by three other packages initially developed in our ecospat group: migclim [@engler2012], covsel (@adde2023a) and n-sdm (@adde2023b). The migclim package allows integration of dispersal constraints into projections of species distribution models (see e.g. @engler2009), while covsel allows for efficient selection of predictors in SDMs out of large sets of candidate covariates. The n-sdm package is very complementary to ecospat in the sense that it allows modelling large number of species on high-power computing platforms (HPCs; i.e. clusters) using three wishable properties: (i) a nested hierarchical approach (the 'n' in n.sdm) to avoid niche-truncation (when the study area is too small to capture all environmental requirements of a species; @chevalier2021; @chevalier2022), (ii) the covsel approach for an efficient selection of predictors [@adde2023a], allowing to consider also predictors that describe the environmental characteristics in the neighborhood of observations (i.e. focal windows analyses; as in @scherrer2019b), not only in the pixel contain the observation; and (iii) the ESM approach described above to also include rare and infrequent species when modelling very large numbers of species, e.g. to assess how biodiversity might evolve under various environmental change scenarios at a national scale (Adde et al. in prep.). 

Finally, ecospat is continuously evolving by increasing and adapting the number of functions to help researchers develop robust and efficient spatial analyses and models to predict the distribution of species and communities. For instance, the ESM functions are currently being optimized to run more efficiently and with less dependencies (Collart et al. in prep.), and the niche functions were expanded to include a 'niche margin index' (NMI; @broennimann2021; Pearman, Broennimann et al. in press) allowing to measure not only niche innerness (i.e. what SDM predictions do) but also niche outerness. Last, as the migclim package currently only runs under ancient versions of R, we foresee to integrate it into a next version of ecospat.

# References