---
title: "Fitting logseries and negative binomial SADs with random forests"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library('vegan')
library('randomForest')
# install *mobsim* and *pika* on first use (installing *pika* from GitHub
# assumes the devtools package is available)
hasMobsim <- require('mobsim')
hasPika <- require('pika')

if(!hasMobsim) {
    install.packages('mobsim')
    library('mobsim')
}

if(!hasPika) {
    devtools::install_github('ajrominger/pika')
    library('pika')
}
```
## Inferring the scale of a truncated negative binomial
The zero-truncated negative binomial can be parameterized to represent nearly all log-normal distributions (given finite data), and in the limit $k \to 0$ it is exactly Fisher's log-series. So it's a good distribution for testing whether we can properly parameterize it with random forests.
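A quick numerical check of that limit (a sketch using base R's `dnbinom`; the values `k0` and `p0` here are just illustrative): the zero-truncated negative binomial pmf should line up with the log-series pmf once the size parameter gets small.
```{r}
# pmf of the zero-truncated negative binomial with size (k) pushed toward 0,
# compared with the log-series pmf; dnbinom's `prob` is 1 - p here
k0 <- 1e-4
p0 <- 0.9
x0 <- 1:8
tnb <- dnbinom(x0, size = k0, prob = 1 - p0) /
    (1 - dnbinom(0, size = k0, prob = 1 - p0))
logser <- -p0^x0 / (x0 * log(1 - p0))
round(rbind(tnb, logser), 5)
```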
### Species abundances simulated with the *pika* package
```{r}
# parameters for the negative binomial
mu <- 10
k <- seq(1, 6, length.out = 6)
# parameter for the log-series that maintains the same mean: solve
# -p / ((1 - p) * log(1 - p)) = mu for p
p <- uniroot(function(p) -1/log(1 - p) * p / (1 - p) - mu,
             c(.Machine$double.eps, 1 - .Machine$double.eps))$root
b <- -log(p)
# simulation parameters
nsim <- 100
nspp <- 500
# simulate negative binomial SADs
negbSAD <- rtnegb(nsim * nspp * length(k), mu = mu, k = rep(k, each = nsim * nspp))
negbSAD <- split(negbSAD, rep(1:(nsim * length(k)), each = nspp))
param <- rep(k, each = nsim)
# add to that Fisher log-series SADs
fishSAD <- rfish(nsim * nspp, beta = b)
fishSAD <- split(fishSAD, rep(1:nsim, each = nspp))
allSAD <- c(negbSAD, fishSAD)
allParam <- c(param, rep(0, nsim))
```
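Just as a sanity check on the bookkeeping:
```{r}
# 6 negative binomial k values plus the logseries (k = 0), each with
# nsim = 100 SADs of nspp = 500 species
length(allSAD)
table(allParam)
```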
### Random forest fitting
First, let's make sure the RF won't just be picking up something simple, like a spurious difference in means between the two distributions.
```{r, fig.width=4, fig.height=4, fig.align='center'}
plot(density(sapply(negbSAD, mean)), xlim = range(sapply(allSAD, mean)),
     xlab = 'mean of sim data', main = '')
lines(density(sapply(fishSAD, mean)), col = 'red')
legend('topright', legend = c('n binom', 'logseries'),
       col = c('black', 'red'), lty = 1, bty = 'n')
```
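The same comparison in numbers:
```{r}
# mean and sd of the per-simulation means for each distribution
c(negb = mean(sapply(negbSAD, mean)), fish = mean(sapply(fishSAD, mean)))
c(negb = sd(sapply(negbSAD, mean)), fish = sd(sapply(fishSAD, mean)))
```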
The mean of the means is similar for the two distributions, but the logseries has more variability in means across simulations. I think that's ok.
Now on to the fitting with RF, using Renyi entropies as the summary statistics. The Renyi entropy of order $\alpha$ for relative abundances $p_i$ is $H_\alpha = \frac{1}{1 - \alpha} \log \sum_i p_i^\alpha$; small orders weight rare species more heavily, large orders the dominants.
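To get oriented, here is the summary statistic for a single simulated SAD; by default `renyi` evaluates the scales $0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, \infty$.
```{r}
# Renyi entropies at vegan's default scales for one simulated SAD
renyi(allSAD[[1]])
```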
```{r, fig.width=4, fig.height=4, fig.align='center'}
# summary statistics: Renyi entropies at vegan's default scales
sumStat <- t(sapply(allSAD, renyi))
# run the random forest (regression of the true parameter on the entropies)
rf <- randomForest(sumStat, allParam)
# see how well prediction works; rf$predicted holds the out-of-bag predictions
plot(allParam, rf$predicted, xlab = 'True parameter', ylab = 'Predicted parameter')
abline(0, 1, col = 'red')
```
The prediction looks pretty good!
I'm curious to see if using different weights in the Renyi entropy makes a difference: for example, paying more attention to rare species by concentrating the scales at small orders.
```{r, fig.width=4, fig.height=4, fig.align='center'}
# Renyi entropies with scales concentrated at small orders, emphasizing rare species
sumStat <- t(sapply(allSAD, renyi, scales = c(0, 2^(-6:2), Inf)))
# run the random forest
rfRare <- randomForest(sumStat, allParam)
# see how well prediction works, offsetting x values slightly to avoid overplotting
plot(allParam - 0.05, rf$predicted, xlab = 'True parameter', ylab = 'Predicted parameter',
     col = 'gray')
points(allParam + 0.05, rfRare$predicted)
abline(0, 1, col = 'red')
legend('topleft', legend = c('rare Renyi', 'default'), col = c('black', 'gray'),
       pch = 1, bty = 'n')
```
We can see that focusing on rare species (or rather, not focusing so much on the dominants) helps pin down both the logseries (true parameter $= 0$) and the value of the true parameter when $k$ is small in the negative binomial. But the results are effectively equivalent for larger values of $k$.
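To put numbers on that, here's the out-of-bag RMSE by true parameter value (`rmseByK` is just a little helper defined here):
```{r}
# out-of-bag RMSE by true parameter value for the two summary statistics
rmseByK <- function(pred) sqrt(tapply((pred - allParam)^2, allParam, mean))
round(rbind(default = rmseByK(rf$predicted),
            rareRenyi = rmseByK(rfRare$predicted)), 3)
```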
For the negative binomial we can also see whether the RF does any better or worse than maximum likelihood:
```{r, fig.width=4, fig.height=4, fig.align='center', cache=TRUE}
# maximum likelihood estimation with *pika*; the second element of the
# fitted MLE is the k parameter
mle <- sapply(allSAD, function(x) sad(x, 'tnegb')$MLE[2])

# plot MLE and RF estimates side by side, offsetting x values slightly
# to avoid overplotting
plot(allParam - 0.05, rfRare$predicted, xlab = 'True parameter',
     ylab = 'Predicted parameter', col = 'gray',
     ylim = range(mle, rfRare$predicted))
points(allParam + 0.05, mle)
abline(0, 1, col = 'red')
legend('topleft', legend = c('MLE', 'rare Renyi'), col = c('black', 'gray'),
       pch = 1, bty = 'n')
```
So my take-home is that RF on rare Renyi is better for the negative binomial when $k$ is small, but it is a more biased estimator (i.e. the mean of the inferred parameter can be off) than ML. That makes sense: ML is asymptotically unbiased and efficient as $n$ approaches infinity. But maybe with more Renyi entropies the RF would do just as well.
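A quick way to see that bias (again with a little helper, `biasByK`): the mean error by true parameter value for each estimator.
```{r}
# mean error (bias) by true parameter value: rare-Renyi RF versus MLE
biasByK <- function(est) tapply(est - allParam, allParam, mean)
round(rbind(rareRenyi = biasByK(rfRare$predicted),
            mle = biasByK(mle)), 3)
```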