33-Spatially-Continuous-Data-II.Rmd

# Spatially Continuous Data II

*NOTE*: The source files for this book are available with companion package [{isdas}](https://paezha.github.io/isdas/). The source files are in Rmarkdown format and packed as templates. These files allow you execute code within the notebook, so that you can work interactively with the notes. 

Previously, you learned about the analysis of area data. Starting with this practice, you will be introduced to another type of spatial data: continuous data, also called fields. 

## Learning objectives

In the previous practice you were introduced to the concept of fields/spatially continuous data. Three different approaches were discussed that can be used to convert a set of observations of a field at discrete locations into a surface, namely tile-based approaches, inverse distance weighting (IDW), and $k$-point means. In this practice, you will learn:

1. About intervals of confidence for predictions.
2. Using trend surface analysis as an interpolation tool.
3. The difference between accuracy and precision in interpolation.

## Suggested readings

- Bailey TC and Gatrell AC [-@Bailey1995] Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
- Bivand RS, Pebesma E, and Gomez-Rubio V [-@Bivand2008] Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
- Brunsdon C and Comber L [-@Brunsdon2015R] An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
- Isaaks EH and Srivastava RM  [-@Isaaks1989applied] An Introduction to Applied Geostatistics, Chapter 4. Oxford University Press: Oxford.
- O'Sullivan D and Unwin D [-@Osullivan2010] Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

## Preliminaries

As usual, it is good practice to begin with a new session of `R` or at least to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in `R` to clear the workspace is `rm` (for "remove"), followed by a list of items to be removed. To clear the workspace from _all_ objects, do the following:
```{r}
rm(list = ls())
```

Note that `ls()` lists all objects currently on the workspace.

Load the libraries you will use in this activity:
```{r message = FALSE, warning=FALSE}
library(isdas)
library(plotly)
library(spatstat)
library(spdep)
library(tidyverse)
```

Begin by loading the data file that we will use in this chapter:
```{r}
data("Walker_Lake")
```

You can verify the contents of the dataframe:
```{r}
summary(Walker_Lake)
```

We have already met this data set before: it contains a set of coordinates `X` and `Y` (units are meters; the origin is false), as well as two quantitative variables `V` and `U` (notice that there are missing observations in `U`), and a factor `T`.

## Uncertainty in the predictions

A common task in the analysis of spatially continuous data is to estimate the value of a variable at a location where it was not measured - or in other words, to spatially interpolate the variable. In Chapter \@ref(spatially-continuous-data-i), we introduced three methods for spatial interpolation based on a sample of observations.

The three algorithms that was saw before (i.e., Voronoi polygons, IDW, and $k$-point means) accomplish the task of providing spatial estimates. The values that we obtain with these methods are called _point estimates_. What is a point estimate? Recall the definition of a field that is the outcome of a purely spatial process:
$$
z_i = f(u_i, v_i) + \epsilon_i
$$

Accordingly, the prediction of the field at a new location is defined as a function of the estimated process and some random residual as follows:
$$
\hat{z}_p = \hat{f}(u_p, v_p) + \hat{\epsilon}_p
$$
The first part of the prediction ($\hat{f}(u_p, v_p)$) is the point estimate of the prediction, whereas the second part ($\hat{\epsilon}_p$) is the random part of the estimate.

The methods we saw in Chapter \@ref(spatially-continuous-data-i) can be used to estimate point estimates of the process. Unfortunately, they do not provide an estimate for the random element, so it is not possible to assess the uncertainty of the estimated values directly, since this depends on the distribution of the random term.

There are different ways in which at least some crude assessment of uncertainty can be attached to point estimates obtained from Voronoi polygons, IDW, or $k$-point means. For example, a very simple approach could be to use the sample variance to calculate intervals of confidence. This could be done as follows.

We know that the sample variance describes the inherent variability in the distribution of values of a variable in a sample. Consider for instance the distribution of the variable in the Walker Lake dataset:
```{r}
ggplot(data = Walker_Lake, aes(V)) + 
  geom_histogram(binwidth = 60)
```

Clearly, there are no values of the variable less than zero, and values in excess of 1,000 are rare.

The standard deviation of the sample is:
```{r}
sd(Walker_Lake$V)
```

The standard deviation is the average deviation from the mean. We could use this value to say that typical deviations from our point estimates are a function of this standard deviation (to what extent, it depends on the underlying distribution).

A problem with using this approach is that the distribution of the variable is not normal, and the distribution of $\hat{\epsilon}_p$ is unknown; the standard deviation is centered on the mean (meaning that it is a poor estimate for observations away from the mean); and in any case the standard deviation of the sample is too large for local point estimates if there is spatial pattern (since we know that the local mean will vary systematically).

There are other approaches to deal with non-normal variables, for instance Wilcox's test, but some of the other limitations remain.
```{r}
wilcox.test(Walker_Lake$V, conf.int = TRUE, conf.level = 0.95)
```

As an alternative, the _local_ standard deviation could be used.

Consider the case of $k$-point means. The point estimate is based on the values of the $k$-nearest neighbors:
$$
\hat{z}_p = \frac{\sum_{i=1}^n{w_{pi}z_i}}{\sum_{i=1}^n{w_{pi}}}
$$

With:
$$
w_{pi} = \bigg\{\begin{array}{ll}
1 & \text{if } i \text{ is one of } kth \text{ nearest neighbors of } p \text{ for a given }k \\
0 & otherwise \\
\end{array}
$$

The standard deviation could be calculated also based in the values of the $k$-nearest neighbors, meaning that it would be based on the local mean. Here, we will interpolate the field using the Walker Lake data. First create a target grid for interpolation, and extract the coordinates of observations: 
```{r}
# Create a prediction grid and convert to simple features:
target_xy = expand.grid(x = seq(0.5, 259.5, 2.2), 
                        y = seq(0.5, 299.5, 2.2)) |>
  st_as_sf(coords = c("x", "y"))

# Convert the `Walker_Lake` dataframe to a simple features object using as follows:
Walker_Lake.sf <- Walker_Lake |> 
  st_as_sf(coords = c("X", "Y"))
```

Interpolation using $k=5$ neighbors:
```{r cache=FALSE}
kpoint.5 <- kpointmean(source_xy = Walker_Lake.sf, 
                       target_xy = target_xy, 
                       z = V,  
                       k = 5) |>
  rename(V = z)
```

We can plot the interpolated field now. These are the interpolated values:
```{r}
ggplot() +
  geom_sf(data = kpoint.5, 
          aes(color = V)) +
  scale_color_distiller(palette = "OrRd", 
                       direction = 1)
```

In addition, we can plot the _local_ standard deviation:
```{r}
ggplot() +
  geom_sf(data = kpoint.5, 
          aes(color = sd)) +
  scale_color_distiller(palette = "OrRd", 
                       direction = 1)
```

The local standard deviation indicates the typical deviation from the local mean. As expected, the local values of the standard deviation are usually lower than the standard deviation of the sample, and it tends to be larger for the tails, that is the locations where the values are rare - we have less information, hence greater uncertainty.

The local standard deviation is a crude estimator of the uncertainty because we do not know the underlying distribution. Other approaches based on bootstrapping (randomly sampling from the observed values of the variable) could be implemented, but they are beyond the scope of the present discussion.

The issue of assessing the level of uncertainty in the predictions with Voronoi polygons, IDW, and $k$-point means reflects the fact that these methods were not designed to deal explicitly with the random nature of predicting fields. Other methods deal with this issue more naturally. We will revisit two estimation methods that we covered before, and see how they can be applied to spatial interpolation.

## Trend surface analysis

Trend surface analysis is a form of multivariate regression that uses the coordinates of the observations to fit a surface to the data.

We can illustrate this technique by means of a simulated example. We will begin by simulating a set of observations, beginning with the coordinates in the square unit region:
```{r}
# `n` is the number of observations to simulate
n <- 180

# Here we create a dataframe with these values: `u` and `v` will be the coordinates of our process
df <- data.frame(u = runif(n = n, min = 0, max = 1), 
                 v = runif(n = n, min = 0, max = 1))
```

Once we have simulated the coordinates for the example, we can plot their locations:
```{r}
ggplot(data = df, aes(x = u, y = v)) + 
  geom_point() + 
  coord_equal()
```

We can now proceed to simulate a spatial process as follows:
```{r}
# Use `mutate()` to create a new stochastic variable `z` that is a function of the coordinates and a random normal variable that we create with `rnorm()`; this random variable has a mean of zero and a standard deviation of 0.1.
df <- mutate(df, z = 0.5 + 0.3 * u + 0.7 * v + rnorm(n = n, mean = 0, sd = 0.1))
```

A 3D scatterplot can be useful to explore the data:
```{r}
# Create a 3D scatterplot with the function `plot_ly()`. Notice that the way this function works is similar to `ggplot2`: the arguments are a dataframe, what should be plotted on the x-axis, the y-axis, the z-axis, and other aesthetics (aspects) of the plot. Here the color will be proportional to the values of `z`. The function `add_markers()` is similar to the family of `geom_` functions in `ggplot2`, but more general, since it will try to guess what you are trying to plot based on the inputs (in this case points). The function `layout()` is used to control other parts of the plot: here the `aspectratio` is selected so that the scale is identical for all three axes.
plot_ly(data = df, x = ~u, y = ~v, z = ~z, color = ~z) |> 
  add_markers() |> 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x=1, y=1, z=1)))
```

We can fit a trend surface to the data as follows. This is a regression model that uses the coordinates of the observations as covariates. In this case, the trend is linear:
```{r}
trend.l <- lm(formula = z ~ u + v, data = df)
summary(trend.l)
```

Given a trend surface model, we can estimate the value of the variable $z$ at locations where it was not measured. Typically this is done by interpolating on a fine grid that can be used for visualization or further analysis, as shown next.

We will begin by creating a grid for interpolation. We will call the coordinates `x.p` and `y.p`. We generate these by creating a sequence of values in the domain of the data, for instance in the [0,1] interval:
```{r}
u.p <- seq(from = 0.0, to = 1.0, by = 0.05)
v.p <- seq(from = 0.0, to = 1.0, by = 0.05)
```

For prediction, we want all combinations of `x.p` and `y.p`, so we expand these two vectors into a grid, by means of the function `expand.grid()`:
```{r}
# The function `expand.grid()` creates a grid with all the combination of values of the inputs.
df.p <- expand.grid(u = u.p, v = v.p)
```

Notice that while `u.p` and `v.p` are vectors of size 21, the dataframe `df.p` contains `{r}21 * 21` observations, that is, all the combinations of `u.p` and `v.p`.

Once we have the coordinates for interpolation, the `predict()` function can be used in conjunction with the results of the estimation. When invoking the function, we indicate that we wish to obtain as well the standard errors of the fitted values (`se.fit = TRUE`), as well as the interval of the predictions at a 95% level of confidence:
```{r}
preds <- predict(trend.l, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)
```

The interval of confidence _of the predictions_ at the 95% level of confidence is given in the form of the lower (`lwr`) and upper (`upr`) bounds:
```{r}
summary(preds$fit)
```

These values indicate that the predictions of $z_p$ are, with 95% of confidence, in the following interval:
$$
CI_{z_p} = [z_{p_{lwr}}, z_{p_{upr}}].
$$

A convenient way to visualize the results of the analysis above, that is, to inspect the trend surface and the interval of confidence of the predictions, is by means of a 3D plot as follows.

First create matrices with the point estimates of the trend surface (`z.p`), and the lower and upper bounds (`z.p_l`, `z.p_u`):
```{r}
z.p <- matrix(data = preds$fit[,1], nrow = 21, ncol = 21, byrow = TRUE)
z.p_l <- matrix(data = preds$fit[,2], nrow = 21, ncol = 21, byrow = TRUE)
z.p_u <- matrix(data = preds$fit[,3], nrow = 21, ncol = 21, byrow = TRUE)
```

The plot is created using the coordinates used for interpolation (`x.p` and `y.p`) and the matrices with the point estimates `z.p` and the upper and lower bounds. The type of plot in the package `plotly` is a _surface_:
```{r}
trend.plot <- plot_ly(x = ~u.p, y = ~v.p, z = ~z.p, 
        type = "surface", colors = "YlOrRd") |> 
  add_surface(x = ~u.p, y = ~v.p, z = ~z.p_l, 
              opacity = 0.5, showscale = FALSE) |>
  add_surface(x = ~u.p, y = ~v.p, z = ~z.p_u, 
              opacity = 0.5, showscale = FALSE) |> 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))

trend.plot
```

In this way, we have not only an estimate of the underlying field, but also a measure of uncertainty for our predictions, since our estimated values are bound, with 95% confidence, between the lower and upper surfaces.

It is important to note that, although the confidence interval provides a measure of uncertainty, it does not provide an estimate of the prediction error $\hat{\epsilon}_p$. This quantity cannot be calculated directly, because we _do not know the true value of the field at location $p$_. We will revisit this point later.

For the time being, we will apply trend surface analysis to the Walker Lake dataset.

We will first calculate the polynomial terms of the coordinates, for instance to the 3rd degree (this can be done to any arbitrary degree, however keeping in mind the caveats discussed previously with respect to trend surface analysis):
```{r}
Walker_Lake <- mutate(Walker_Lake,
                        X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
                        XY = X * Y,
                        Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)
```

We can proceed to estimate the following models.

Linear trend surface model:
```{r}
WL.trend1 <- lm(formula = V ~ X + Y, data = Walker_Lake)
summary(WL.trend1)
```

Quadratic trend surface model:
```{r}
WL.trend2 <- lm(formula = V ~ X2 + X + XY + Y + Y2, data = Walker_Lake)
summary(WL.trend2)
```

Cubic trend surface model:
```{r}
WL.trend3 <- lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
                data = Walker_Lake)
summary(WL.trend3)
```

Inspection of the results of the three models above suggests that the cubic trend surface provides the best fit, with the highest adjusted coefficient of determination, even if the value is relatively low at approximately 0.16. Also, the cubic trend yields the smallest standard error, which implies that the intervals of confidence are tighter, and hence the degree of uncertainty is smaller.

We will compare two of these models to see how well they fit the data.

First, we create an interpolation grid. Summarize the information to ascertain the domain of the data:
```{r}
summary(Walker_Lake[,2:3])
```

We can see that the spatial domain is in the range of [8,251] in X, and [8,291] in Y. Based on this information, we will generate the following sequence that then is expanded into a grid for prediction:
```{r}
X.p <- seq(from = 0.0, to = 255.0, by = 2.5)
Y.p <- seq(from = 0.0, to = 295.0, by = 2.5)
df.p <- expand.grid(X = X.p, Y = Y.p)
```

To this dataframe we add the polynomial terms:
```{r}
df.p <- mutate(df.p, X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
               XY = X * Y, 
               Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)
```

The interpolated quadratic surface is then obtained as:
```{r}
WL.preds2 <- predict(WL.trend2, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)
```

Whereas the interpolated cubic surface is obtained as:
```{r}
WL.preds3 <- predict(WL.trend3, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)
```

The predictions are transformed into matrices for plotting.

Quadratic trend surface and lower and upper bounds of the predictions:
```{r}
z.p2 <- matrix(data = WL.preds2$fit[,1], nrow = 119, ncol = 103, byrow = TRUE)
z.p2_l <- matrix(data = WL.preds2$fit[,2], nrow = 119, ncol = 103, byrow = TRUE)
z.p2_u <- matrix(data = WL.preds2$fit[,3], nrow = 119, ncol = 103, byrow = TRUE)
```

Cubic trend surface and lower and upper bounds of the predictions:
```{r}
z.p3 <- matrix(data = WL.preds3$fit[,1], nrow = 119, ncol = 103, byrow = TRUE)
z.p3_l <- matrix(data = WL.preds3$fit[,2], nrow = 119, ncol = 103, byrow = TRUE)
z.p3_u <- matrix(data = WL.preds3$fit[,3], nrow = 119, ncol = 103, byrow = TRUE)
```

This is the quadratic trend surface with its confidence interval of predictions:
```{r}
WL.plot2 <- plot_ly(x = ~X.p, y = ~Y.p, z = ~z.p2, 
        type = "surface", colors = "YlOrRd") |> 
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p2_l, 
              opacity = 0.5, showscale = FALSE) |>
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p2_u, 
              opacity = 0.5, showscale = FALSE) |> 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
WL.plot2
```

And, this is the cubic trend surface with its confidence interval of predictions:
```{r}
WL.plot3 <- plot_ly(x = ~X.p, y = ~Y.p, z = ~z.p3, 
        type = "surface", colors = "YlOrRd") |> 
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p3_l, 
              opacity = 0.5, showscale = FALSE) |>
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p3_u, 
              opacity = 0.5, showscale = FALSE) |> 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
WL.plot3
```

Alas, these models are not very reliable estimates of the underlying field. As can be seen from the plots, the confidence intervals are extremely wide, and in both cases include negative numbers in the lower bound. The uncertainty associated with these predictions is quite substantial.

Another question, however, is whether the point estimates are accurate. To get a sense of whether this is the case we can add the observations to the plot:
```{r}
WL.plot3 |>
  add_markers(data = Walker_Lake, x = ~X, y = ~Y, z = ~V, 
              color = ~V, opacity = 0.7, showlegend = FALSE)
```

Alas, the trend surface does a mediocre job with the point estimates as well.

A possible reason for this is that the model failed to capture all or even most of the systematic spatial variability of this field. To explore this, we will plot the residuals of the model, after labeling them as "positive" or "negative":
```{r}
Walker_Lake$residual3 <- ifelse(WL.trend3$residuals > 0, "Positive", "Negative")
```

Plot the residuals:
```{r}
ggplot(data = Walker_Lake, 
       aes(x = X, y = Y, color = residual3)) +
  geom_point() +
  coord_equal()
```

Visual inspection of the distribution of the residuals strongly suggests that they are not random. We can check this by means of Moran's $I$ coefficient, if we create a list of spatial weights as follows:
```{r}
# Create a set of spatial weights with the 5 nearest neighbors.
WL.listw <- Walker_Lake[,2:3] |> 
  as.matrix() |>
  knearneigh(k = 5) |>
  knn2nb() |>
  nb2listw()
```

The results of the autocorrelation analysis of the residuals are:
```{r}
moran.test(x = WL.trend3$residuals, listw = WL.listw)
```

Given the low $p$-value, we fail to reject the null hypothesis, and conclude, with a high level of confidence, that the residuals are not independent. This has important implications for spatial interpolation, as we will discuss in the following chapter.

## Accuracy and precision

Before concluding this chapter, it is worthwhile to make the following distinction between accuracy and precision of the estimates.

Accuracy refers to how close the predicted values $\hat{z}_p$ are to the true values of the field. Precision refers to how much uncertainty is associated with such predictions. Narrow intervals of confidence imply greater precision, whereas the opposite is true when the intervals of confidence are wide.

An example of these two properties is as shown in Figure \@ref{fig:accuracy-precision}.

```{r accuracy-and-precision, fig.cap= "\\label{fig:accuracy-precision}Accuracy and precision", echo=FALSE}
knitr::include_graphics(rep("figures/33-Figure-1.jpg"))
```

Panel a) in the figure represents a set of accurate points, since they are on average close to the mark. However, they are imprecise, given their variability. This is akin to a good point estimate that has wide confidence intervals.

Panel b) is a set of inaccurate and imprecise points.

Panel c) is a set of precise but inaccurate points.

Finally, Panel d) is a set of accurate and precise points.

Accuracy and precision are important criteria when assessing the quality of a predictive model.

This concludes the chapter.