detail Prepare/Model/Combine sections

marinebon · Nov 23, 2023 · b13f652 · b13f652
1 parent ed174c0
commit b13f652
Showing 19 changed files with 324 additions and 12 deletions.
diff --git a/_quarto.yml b/_quarto.yml
@@ -25,15 +25,26 @@ book:
       Built with <a href="https://quarto.org/" target="_blank">Quarto</a>
   chapters:
     - index.qmd
-    - part: create.qmd
+    - part: "Prepare"
+      chapters:
+        - prep.qmd
+        - occ.qmd
+        - abs.qmd
+        - env.qmd
+    - part: "Model"
       chapters:
-        - prep-data.qmd
         - model.qmd
-    - part: combine.qmd
+        - split.qmd
+        - fit.qmd
+        - calibrate.qmd
+        - predict.qmd
+        - evaluate.qmd
+    - part: "Combine"
       chapters:
+        - combine.qmd
         - ensemble.qmd
         - mosaic.qmd
-        - group.qmd
+        - taxa.qmd
         - indicators.qmd
     - software.qmd
     - organize.qmd

diff --git a/abs.qmd b/abs.qmd
@@ -0,0 +1,21 @@
+---
+title: "Pseudo-absences"
+subtitle: "Generate pseudo-absence or background environmental values to compare with occurrence environment"
+---
+
+Describe various strategies for generating pseudo-absences.
+
+-   [Pseudo-absences • biomod2](https://biomodhub.github.io/biomod2/articles/vignette_pseudoAbsences.html)
+    -   [@barbet-massin2012]
+
+## All background
+
+A common strategy with Maxent is to supply all background points and then use the resulting distribution as a null model. This is the default in Maxent [@phillips2017; @phillips2006; @phillips2008].
+
+## Mask by FAO areas
+
+The FAO areas applicable to each species are included in `aquamapsdata`, presumably derived from OBIS observations and the literature.
+
+## Use occurrences from same Family, different species
+
+By drawing on occurrences of other species in the same family, the pseudo-absences are likely to be ecologically similar to the species of interest.
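A minimal sketch of the same-family strategy described in `abs.qmd`, written in Python with the standard library only. The record layout (`species`, `family`, `lon`, `lat`), the function name `family_pseudo_absences`, and the sample coordinates are all assumptions made for illustration; they are not the `robis` schema or part of this commit.

```python
import random

def family_pseudo_absences(records, target_species, n, seed=42):
    """Sample up to n records from the target's family, excluding the target species."""
    # Families in which the target species has been recorded (normally just one).
    families = {r["family"] for r in records if r["species"] == target_species}
    candidates = [
        r for r in records
        if r["family"] in families and r["species"] != target_species
    ]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(candidates, min(n, len(candidates)))

# Made-up records for illustration only.
records = [
    {"species": "Thunnus albacares", "family": "Scombridae", "lon": -80.1, "lat": 25.3},
    {"species": "Thunnus obesus", "family": "Scombridae", "lon": -79.4, "lat": 24.8},
    {"species": "Scomber scombrus", "family": "Scombridae", "lon": -70.2, "lat": 41.0},
    {"species": "Gadus morhua", "family": "Gadidae", "lon": -68.5, "lat": 43.1},
]

pseudo_absences = family_pseudo_absences(records, "Thunnus albacares", n=2)
print([r["species"] for r in pseudo_absences])
```

Sampling without replacement keeps each pseudo-absence record unique; a real workflow would probably also apply the FAO-area mask before sampling.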
diff --git a/calibrate.qmd b/calibrate.qmd
@@ -0,0 +1,24 @@
+# Calibrate
+
+The process of refining the model to only the most relevant environmental predictor terms is commonly called "Model Selection." One of the most cited scientific papers of all time [@akaike1974] takes a parsimonious approach to this process -- the so-called Akaike Information Criterion (AIC).
+
+It is important to avoid using environmental predictors that are correlated with each other, since the fitted effect of one predictor on the response can then be the inverse of its ecological effect, merely explaining variance in the residuals of the other correlated predictor.
+
+## Predict
+
+The prediction step applies the environmental relationships from the fitted model to a new set of data, typically the seascape of interest, and perhaps with some sort of temporal snapshot (e.g., climatic annual or monthly average).
+
+## Evaluate
+
+Model evaluation uses the test data set aside during the earlier splitting to evaluate how well the model predicts the response of presence or absence. Since the test response data is binary (0 or 1) and the prediction from the model is continuous (0 to 1), a threshold needs to be applied to convert the continuous response to binary. This is often performed through a Receiver Operating Characteristic (**ROC**) curve (@fig-rocr), which evaluates the **confusion matrix** (@tbl-confusion-matrix) at each threshold.
+
+|          |              |               |                |
+|----------|--------------|---------------|----------------|
+|          |              | Predicted     |                |
+|          |              | 0 (absence)   | 1 (presence)   |
+| Observed | 0 (absence)  | True absence  | False presence |
+|          | 1 (presence) | False absence | True presence  |
+
+: Confusion matrix to understand predicted versus observed. {#tbl-confusion-matrix}
+
+![ROC curve showing the true positive rate against the false positive rate as the threshold value changes (rainbow colors). Source: [ROCR: visualizing classifier performance in R](https://cran.rstudio.com/web/packages/ROCR/vignettes/ROCR.html)](figures/rocr.png){#fig-rocr}
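To make the model selection idea from the Calibrate text in `calibrate.qmd` concrete, here is a small Python sketch of comparing candidate models by AIC, using the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and ln(L) is the maximized log-likelihood. The candidate names, log-likelihoods and parameter counts below are invented for illustration and are not results from this project.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "depth": (-120.4, 2),
    "depth + temp": (-101.7, 3),
    "depth + temp + salin": (-101.2, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("selected:", best)
```

In this made-up case the third predictor improves the log-likelihood only slightly, so the penalty of 2 per extra parameter leaves the two-predictor model with the lowest AIC.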
diff --git a/combine.qmd b/combine.qmd
@@ -1,5 +1,6 @@
 ---
-title: "Combine SDMs"
+title: "Combine"
+subtitle: "Combine SDMs from the same or multiple species"
 ---
 
 We look at combining SDMs to calculate biodiversity based on addressing questions of interest and relevance.
diff --git a/create.qmd b/create.qmd
@@ -5,3 +5,48 @@
 %%| fig-cap: "Diagram of SDM data preparation and model fitting."
 %%| file: diagrams/sdm-process.mmd
 ```
+
+# Prepare Data
+
+```{mermaid}
+%%| label: fig-prep
+%%| fig-cap: "Diagram of SDM data preparation for model fitting."
+%%| file: diagrams/sdm-prep.mmd
+```
+
+-   **obs**\
+    observations: occurrences from OBIS; masked by FAO regions defined by AquaMaps [@aquamapsdata]
+    -   **presence**\
+        OBIS: species occurrence
+    -   **absence**\
+        OBIS: other species in the same family
+-   **env**\
+    environment
+-   **tbl**\
+    table of observations (presence and absence) with environmental values
+
+## Environmental Predictors
+
+### Physiographic
+
+-   `depth`\
+    Bathymetric Depth
+
+-   `d2coast`\
+    Distance to Coast
+
+-   `d2shelf`\
+    Distance to Shelf
+
+### Time Varying
+
+-   `vgpm`\
+    Vertically Generalized Production Model (net primary productivity)
+
+### Depth & Time Varying
+
+-   `temp`\
+    Temperature, either sea-surface temperature (SST) or a modeled product from HYCOM, ROMS or Copernicus
+
+-   `salin`\
+    Salinity
diff --git a/env.qmd b/env.qmd
@@ -0,0 +1,30 @@
+---
+title: "Environment"
+subtitle: "Extract environmental predictors (static and/or dynamic) from various sources for observations (presence and pseudo-absence)"
+---
+
+These data are also used at the prediction step.
+
+### Physiographic
+
+-   `depth`\
+    Bathymetric Depth
+
+-   `d2coast`\
+    Distance to Coast
+
+-   `d2shelf`\
+    Distance to Shelf
+
+### Time Varying
+
+-   `vgpm`\
+    Vertically Generalized Production Model (net primary productivity)
+
+### Depth & Time Varying
+
+-   `temp`\
+    Temperature, either sea-surface temperature (SST) or a modeled product from HYCOM, ROMS or Copernicus
+
+-   `salin`\
+    Salinity
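The extraction step named in the `env.qmd` subtitle can be sketched as a lookup of the grid cell that contains each observation. This Python example assumes a regular longitude/latitude grid stored as nested lists, with row 0 at the southern edge; the function name `extract_value`, the grid layout and the depth values are hypothetical, and a real workflow would more likely use a raster library.

```python
def extract_value(grid, lon0, lat0, cell, lon, lat):
    """Return the value of the grid cell containing (lon, lat), or None if outside.

    grid[row][col] holds values; row 0 starts at lat0 (southern edge) and
    column 0 starts at lon0 (western edge); cell is the cell size in degrees.
    """
    col = int((lon - lon0) // cell)
    row = int((lat - lat0) // cell)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None

# Made-up 2 x 3 depth grid (meters, negative below sea level), 1-degree cells.
depth = [
    [-10, -50, -200],  # latitudes 24 to 25
    [-5, -30, -120],   # latitudes 25 to 26
]

observations = [(-80.5, 24.2), (-81.9, 25.7), (-79.0, 24.5)]
for lon, lat in observations:
    print((lon, lat), extract_value(depth, -82.0, 24.0, 1.0, lon, lat))
```

The same lookup would serve both the observation table and the prediction grid, which is why the chapter notes that these data are reused at the prediction step.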
diff --git a/evaluate.qmd b/evaluate.qmd
@@ -0,0 +1,14 @@
+# Evaluate
+
+Model evaluation uses the test data set aside during the earlier splitting to evaluate how well the model predicts the response of presence or absence. Since the test response data is binary (0 or 1) and the prediction from the model is continuous (0 to 1), a threshold needs to be applied to convert the continuous response to binary. This is often performed through a Receiver Operating Characteristic (**ROC**) curve (@fig-rocr), which evaluates the **confusion matrix** (@tbl-confusion-matrix) at each threshold.
+
+|          |              |               |                |
+|----------|--------------|---------------|----------------|
+|          |              | Predicted     |                |
+|          |              | 0 (absence)   | 1 (presence)   |
+| Observed | 0 (absence)  | True absence  | False presence |
+|          | 1 (presence) | False absence | True presence  |
+
+: Confusion matrix to understand predicted versus observed. {#tbl-confusion-matrix}
+
+![ROC curve showing the true positive rate against the false positive rate as the threshold value changes (rainbow colors). Source: [ROCR: visualizing classifier performance in R](https://cran.rstudio.com/web/packages/ROCR/vignettes/ROCR.html)](figures/rocr.png){#fig-rocr}
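The thresholding described in `evaluate.qmd` can be sketched in a few lines of Python. The example below builds the confusion matrix at a given threshold and derives the two rates that an ROC curve plots. The observed and predicted values are invented, and the function names are assumptions for illustration rather than the ROCR API. In the table's terms, TP is a true presence, FP a false presence, TN a true absence and FN a false absence.

```python
def confusion(observed, predicted, threshold):
    """Count confusion-matrix cells after thresholding continuous predictions."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for obs, prob in zip(observed, predicted):
        pred = 1 if prob >= threshold else 0
        if obs == 1 and pred == 1:
            counts["TP"] += 1
        elif obs == 0 and pred == 1:
            counts["FP"] += 1
        elif obs == 0 and pred == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

def roc_points(observed, predicted, thresholds):
    """Return (threshold, false positive rate, true positive rate) tuples.

    Assumes the test data contain at least one presence and one absence.
    """
    points = []
    for t in thresholds:
        m = confusion(observed, predicted, t)
        tpr = m["TP"] / (m["TP"] + m["FN"])
        fpr = m["FP"] / (m["FP"] + m["TN"])
        points.append((t, fpr, tpr))
    return points

# Made-up test data: observed presence (1) or absence (0) and model output.
observed = [1, 1, 1, 0, 0, 0]
predicted = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

for t, fpr, tpr in roc_points(observed, predicted, [0.3, 0.5, 0.8]):
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Lowering the threshold raises the true positive rate at the cost of more false presences, which is the trade-off the ROC curve traces.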
diff --git a/explorations/sdm-1_predicts.qmd b/explorations/sdm-1_predicts.qmd
@@ -6,6 +6,7 @@ url-code: "https://github.com/marinebon/sdm-explore/blob/main/sdm_1.qmd"
 categories: 
    - "data: OBIS"
    - "tech: R"
+   - "model: Maxent"
 editor: source   
 ---
 

diff --git a/figures/rocr.png b/figures/rocr.png
diff --git a/fit.qmd b/fit.qmd
@@ -0,0 +1,30 @@
+# Fit
+
+Model fitting is quite complex in theory but quite simple in practice: the prepared data are fed into the modeling function.
+
+However, there are many modeling techniques from which to choose. For instance, see the 238 entries in [6 Available Models | The caret Package](https://topepo.github.io/caret/available-models.html).
+
+## Calibrate
+
+The process of refining the model to only the most relevant environmental predictor terms is commonly called "Model Selection." One of the most cited scientific papers of all time [@akaike1974] takes a parsimonious approach to this process -- the so-called Akaike Information Criterion (AIC).
+
+It is important to avoid using environmental predictors that are correlated with each other, since the fitted effect of one predictor on the response can then be the inverse of its ecological effect, merely explaining variance in the residuals of the other correlated predictor.
+
+## Predict
+
+The prediction step applies the environmental relationships from the fitted model to a new set of data, typically the seascape of interest, and perhaps with some sort of temporal snapshot (e.g., climatic annual or monthly average).
+
+## Evaluate
+
+Model evaluation uses the test data set aside during the earlier splitting to evaluate how well the model predicts the response of presence or absence. Since the test response data is binary (0 or 1) and the prediction from the model is continuous (0 to 1), a threshold needs to be applied to convert the continuous response to binary. This is often performed through a Receiver Operating Characteristic (**ROC**) curve (@fig-rocr), which evaluates the **confusion matrix** (@tbl-confusion-matrix) at each threshold.
+
+|          |              |               |                |
+|----------|--------------|---------------|----------------|
+|          |              | Predicted     |                |
+|          |              | 0 (absence)   | 1 (presence)   |
+| Observed | 0 (absence)  | True absence  | False presence |
+|          | 1 (presence) | False absence | True presence  |
+
+: Confusion matrix to understand predicted versus observed. {#tbl-confusion-matrix}
+
+![ROC curve showing the true positive rate against the false positive rate as the threshold value changes (rainbow colors). Source: [ROCR: visualizing classifier performance in R](https://cran.rstudio.com/web/packages/ROCR/vignettes/ROCR.html)](figures/rocr.png){#fig-rocr}
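The Calibrate text in `fit.qmd` warns against correlated predictors. One common way to screen for them, assumed here rather than stated in the commit, is to compute pairwise Pearson correlations and flag pairs whose absolute correlation exceeds a cutoff. The sketch below uses only the Python standard library; the cutoff of 0.7, the function names and the predictor values are illustrative assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlated_pairs(predictors, cutoff=0.7):
    """List predictor pairs whose absolute correlation exceeds the cutoff."""
    flagged = []
    for a, b in combinations(sorted(predictors), 2):
        r = pearson(predictors[a], predictors[b])
        if abs(r) > cutoff:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Made-up predictor values at five observation points.
predictors = {
    "depth": [10, 50, 200, 1000, 3000],
    "d2shelf": [5, 20, 80, 400, 1200],
    "temp": [22, 25, 18, 24, 20],
}

flagged = correlated_pairs(predictors)
print(flagged)
```

With these invented values only `depth` and `d2shelf` are flagged, so one of the two would be dropped before model selection.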
diff --git a/index.qmd b/index.qmd
@@ -62,6 +62,14 @@ By definition `r glossary("MBON")` is a network, so this is inclusive of and mea
 
 -   The world is quickly moving towards a future trying to conserve 30% of the oceans by 2030, so-called "[**30 by 30**](https://en.wikipedia.org/wiki/30_by_30)". In the U.S., this is the [America the Beautiful](https://www.noaa.gov/america-the-beautiful) initiative. We need biodiversity indicators to track progress. This push for conservation is driven by increasing impacts of **climate change**, as evidenced by marine heatwaves and shifts in population distributions.
 
+## Process
+
+```{mermaid}
+%%| label: fig-process
+%%| fig-cap: "Diagram of SDM data preparation and model fitting."
+%%| file: diagrams/sdm-process.mmd
+```
+
 ## Contribute
 
 We very much welcome your feedback, contributions and collaboration. Here are a few ways from least to most involved:
@@ -84,15 +92,18 @@ We very much welcome your feedback, contributions and collaboration. Here are a
 
 4.  If you are a regular contributor, you can be added to the collaborators of this repository to push changes directly (without needing a pull request).
 
-## Features of this Book
+## Features
+
+This Quarto book has a few cool features:
 
 -   Multiple formats\
     From the single set of source Quarto documents (\*.qmd), several output formats are rendered: html, pdf, docx. This is particularly helpful when suggesting changes. It also lends itself well to being carved into manuscripts.
 
 -   Self-rendering\
     GitHub hosts the web pages (\*.html), which get rendered from the source code (\*.qmd) using a GitHub Action. So edits can be made simply through the web interface and all outputs get updated (html, pdf, docx). It also ensures the reproducibility of the document with a common setup environment.
 
--   Mermaid diagrams
+-   Mermaid diagrams\
+    e.g., @fig-process, @fig-prep, @fig-model
 
 -   Quarto document listings
 

diff --git a/indicators.qmd b/indicators.qmd
@@ -1,5 +1,6 @@
 ---
 title: "Indicators"
+subtitle: "Calculate indicators of ecological or management interest beyond taxonomic groupings"
 ---
 
 ## Diversity
@@ -13,7 +14,7 @@ Here are the classic diversity indices from the R package `vegan`:
 > D_2 &= \frac{1}{\sum_{i=1}^S p_i^2}    &\text{inverse Simpson}
 > \end{aligned}
 > $$
-> 
+>
 > where $p_i$ is the proportion of species $i$, and $S$ is the number of species so that $\sum_{i=1}^S p_i = 1$, and $b$ is the base of the logarithm.
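As a worked illustration of the definitions above, the following Python sketch computes two of the indices from raw species counts: the Shannon index, H = -sum(p_i log_b p_i), and the inverse Simpson index, D_2 = 1 / sum(p_i^2). The function names and counts are made up for illustration; this is not the `vegan` implementation.

```python
import math

def proportions(counts):
    """Convert raw counts to proportions p_i, dropping species with zero count."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def shannon(counts, base=math.e):
    """Shannon index H = -sum(p_i * log_b(p_i))."""
    return -sum(p * math.log(p, base) for p in proportions(counts))

def inverse_simpson(counts):
    """Inverse Simpson index D_2 = 1 / sum(p_i^2)."""
    return 1 / sum(p * p for p in proportions(counts))

even = [10, 10, 10, 10]   # four equally abundant species
uneven = [37, 1, 1, 1]    # one dominant species

print(shannon(even), inverse_simpson(even))
print(shannon(uneven), inverse_simpson(uneven))
```

For a perfectly even community of S species, the Shannon index equals log_b(S) and the inverse Simpson index equals S; both fall as one species comes to dominate.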
 
 ## Endemism

diff --git a/model.qmd b/model.qmd
@@ -1,8 +1,10 @@
-# Model
+---
+title: "Model"
+subtitle: "Model the distribution of a species"
+---
 
 ```{mermaid}
 %%| label: fig-model
 %%| fig-cap: "Diagram of SDM Modeling processes."
 %%| file: diagrams/sdm-model.mmd
 ```
-
diff --git a/occ.qmd b/occ.qmd
@@ -0,0 +1,21 @@
+---
+title: "Occurrences"
+subtitle: "Fetch presence observations and filter for quality control"
+---
+
+To describe:
+
+-   `robis`
+
+-   Filter based on quality flags
+
+-   Remove outliers
+
+    -   [`eks`](https://cran.r-project.org/web/packages/eks/vignettes/tidysf_kde.html)\
+        *Tidy and Geospatial Kernel Smoothing for spatially filtering outlier observations*
+
+        ![Source: Kernel density estimates for tidy and geospatial data in the eks package](figures/software/eks.png){#fig-eks}
+
+## Fetch OBIS
+
+## Filter occurrences
diff --git a/predict.qmd b/predict.qmd
@@ -0,0 +1,18 @@
+# Predict
+
+The prediction step applies the environmental relationships from the fitted model to a new set of data, typically the seascape of interest, and perhaps with some sort of temporal snapshot (e.g., climatic annual or monthly average).
+
+## Evaluate
+
+Model evaluation uses the test data set aside during the earlier splitting to evaluate how well the model predicts the response of presence or absence. Since the test response data is binary (0 or 1) and the prediction from the model is continuous (0 to 1), a threshold needs to be applied to convert the continuous response to binary. This is often performed through a Receiver Operating Characteristic (**ROC**) curve (@fig-rocr), which evaluates the **confusion matrix** (@tbl-confusion-matrix) at each threshold.
+
+|          |              |               |                |
+|----------|--------------|---------------|----------------|
+|          |              | Predicted     |                |
+|          |              | 0 (absence)   | 1 (presence)   |
+| Observed | 0 (absence)  | True absence  | False presence |
+|          | 1 (presence) | False absence | True presence  |
+
+: Confusion matrix to understand predicted versus observed. {#tbl-confusion-matrix}
+
+![ROC curve showing the true positive rate against the false positive rate as the threshold value changes (rainbow colors). Source: [ROCR: visualizing classifier performance in R](https://cran.rstudio.com/web/packages/ROCR/vignettes/ROCR.html)](figures/rocr.png){#fig-rocr}
diff --git a/prep-data.qmd → prep.qmd b/prep-data.qmd → prep.qmd
@@ -1,4 +1,9 @@
-# Prepare Data
+---
+title: "Prepare"
+subtitle: "Prepare observations and environmental data for modeling"
+---
+
+# Prepare
 
 ```{mermaid}
 %%| label: fig-prep