-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathbeyond_normality.qmd
372 lines (302 loc) · 11.5 KB
/
beyond_normality.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
---
title: "Beyond Normality"
share:
permalink: "https://book.martinez.fyi/beyond_normality.html"
description: "Business Data Science: What Does it Mean to Be Data-Driven?"
linkedin: true
email: true
mastodon: true
---
In the preceding chapters, we basked in the warm glow of normality, a
simplifying assumption that allowed us to model outcomes as if they arose from
the familiar bell curve. Yet, the real world often serves up data generated by
far more diverse processes. Outcomes might be binary choices (to adopt or
reject), rankings on an ordinal scale (like satisfaction surveys), or simple
counts (such as the number of subscribers). In this chapter, we embark on an
expedition beyond the comfortable confines of normality, charting the terrain of
these alternative data types and the specialized tools we need to analyze them.
## Binary Outcomes: The Coin Flips of Data
Binary outcomes are the coin flips of the data world – two sides, two
possibilities. Think success/failure, yes/no, or the ever-important adopt/reject
decision. To model these, we turn to the trusty logistic regression. While
linear probability models are also used, logistic regression has the distinct
advantage of keeping our predictions bounded between the sensible limits of 0
and 1.
This workhorse of a model is a type of generalized linear model, employing a
logit link function:
$$
\text{BernoulliLogit}(y|\theta) = \text{Bernoulli}(y|\text{logit}^{-1}(\theta))
$$
where
$$
\text{logit}^{-1}(\theta) = \frac{1}{1 + \exp(-\theta)}
$$
```{r inverse_logit, fig.align = 'center', message=FALSE, warning=FALSE}
library(ggplot2)
library(tibble)
library(dplyr)
inv_logit <- function(theta) {
return(1 / (1 + exp(-theta)))
}
ggplot2::ggplot() +
ggplot2::stat_function(fun = inv_logit) +
ggplot2::theme_minimal() +
ggplot2::xlim(-5, 5) +
ggplot2::xlab(expression(theta))
```
To illustrate, let's cook up some fake data:
```{r fake_data}
# Fake data
N <- 2000
K <- 2
set.seed(1982)
fake_data <- tibble::tibble(
x1 = rnorm(N, mean = 0, sd = 1),
x2 = rnorm(N, mean = 0, sd = 1),
treat = sample(
x = c(TRUE, FALSE), size = N, replace = TRUE,
prob = c(0.5, 0.5)
),
r = runif(n = N, min = 0, max = 1)
) %>%
dplyr::mutate(
p0 = inv_logit(theta = -3 + 0.1 * x1 + 0.25 * x2),
p1 = inv_logit(theta = -3 + 0.1 * x1 + 0.25 * x2 + 0.2),
y0 = dplyr::case_when(p0 > r ~ 1, TRUE ~ 0),
y1 = dplyr::case_when(p1 > r ~ 1, TRUE ~ 0),
y = dplyr::case_when(
treat ~ as.logical(y1),
TRUE ~ as.logical(y0)
)
)
dplyr::glimpse(fake_data)
mean_y0 <- mean(fake_data$y0)
mean_y1 <- mean(fake_data$y1)
impact <- round((mean(fake_data$y1) - mean(fake_data$y0)) * 100, 2)
```
In this fabricated dataset, we've engineered a scenario where, without
intervention, only `r scales::percent(mean_y0)` would have adopted the feature.
Yet, with the intervention applied universally, adoption would have jumped to
`r scales::percent(mean_y1)`. The true impact of this intervention is a hefty
`r impact` percentage points.
### Prior predictive checking
As Bayesians, we're not just number crunchers; we're storytellers. We weave
narratives about data, and our priors are the opening chapters. So, before we
unleash our model, let's ponder what our priors imply.
Our model takes this form:
$$
\begin{aligned}
\theta &= \alpha + X \beta + \tau T \\
\alpha &\sim N(\mu_{\alpha}, sd_{\alpha}) \\
\beta_j &\sim N(\mu_{\beta_j}, sd_{\beta_j}) \\
\tau &\sim N(\mu_{\tau}, sd_{\tau})
\end{aligned}
$$
```{r priors_check, fig.align = 'center', message=FALSE, results = "hide"}
logit <- im::logit$new(
data = fake_data,
y = "y", # this will not be used
treatment = "treat",
x = c("x1", "x2"),
mean_alpha = -3,
sd_alpha = 2,
mean_beta = c(0, 0),
sd_beta = c(1, 1),
tau_mean = 0.05,
tau_sd = 0.5,
fit = FALSE # we will not be fitting the model
)
logit$plotPrior()
```
### Fitting the Model: Where Theory Meets Data
Satisfied with our priors, we're ready to fit the model to the data:
```{r fit, fig.align = 'center', message=FALSE, results = "hide"}
logit <- im::logit$new(
data = fake_data,
y = "y",
treatment = "treat",
x = c("x1", "x2"),
mean_alpha = -3,
sd_alpha = 2,
mean_beta = c(0, 0),
sd_beta = c(1, 1),
tau_mean = 0.05,
tau_sd = 0.5,
fit = TRUE
)
```
Let's glance at the trace plot of tau to ensure our chains mixed well and converged:
```{r tracePlot, fig.align = 'center'}
logit$tracePlot()
```
To sum up our findings, we have a few handy methods at our disposal:
```{r findings, fig.align = 'center'}
logit$pointEstimate()
logit$credibleInterval(width = 0.95)
logit$calcProb(a = 0)
```
We can also use the prediction function to predict new data and compare the
differences between groups. The `predict` function takes the `new_data` and
`name` argument to name the group. For example, here we will predict the data as
if all units are treated, then make another prediction as if all units are not
treated and summarize the two groups.
```{r predSummary}
fake_treated_data <- fake_data %>% mutate(treat = TRUE)
fake_control_data <- fake_data %>% mutate(treat = FALSE)
logit$predict(
new_data = fake_treated_data,
name = "y1"
)
logit$predict(
new_data = fake_control_data,
name = "y0"
)
logit$predSummary(name = "y1", width = 0.95, a = 0)
logit$predSummary(name = "y0", width = 0.95, a = 0)
```
We can also compare the differences between two groups of predictions.
```{r predCompare}
logit$predCompare(name1 = "y1", name2 = "y0", width = 0.95, a = 0)
```
We can also summarize and compare the predictions conditioning on subgroups.
```{r subgroup}
logit$predSummary(
name = "y1",
subgroup = (fake_data$x1 > 0),
width = 0.95, a = 0
)
logit$predCompare(
name1 = "y1",
name2 = "y0",
subgroup1 = (fake_treated_data$x1 > 0),
subgroup2 = (fake_control_data$x1 > 0),
width = 0.95, a = 0
)
```
Finally, we can get the posterior predictive draws for advanced analysis.
```{r getPred}
pred <- logit$getPred(name = "y1")
```
## Ordinal Outcomes
Ordinal outcomes are a rather curious beast in the realm of causal inference.
They have a natural order - think of a survey respondent rating their
satisfaction on a scale from "very dissatisfied" to "very satisfied" - but the
intervals between the values don't necessarily hold equal weight. The distance
between "dissatisfied" and "neutral" may not be the same as that between
"satisfied" and "very satisfied." This lack of equal spacing is a challenge we
must address head-on when analyzing the impact of an intervention or treatment.
This nuance of ordinal outcomes is also present in other domains, such as user
experience ratings (e.g., poor, fair, good, excellent), or educational
attainment levels (e.g., less than high school, high school diploma, some
college, college degree). In each case, there's a clear order, but the spacing
between levels is not uniform.
This lack of uniform spacing can complicate our analysis, particularly when
applying traditional regression models designed for continuous outcomes. We need
a tailored approach that respects the ordinal nature of the data, while still
allowing us to draw meaningful causal inferences.
TODO
## Count Outcomes {#sec-negative-binomial}
In this section, we're talking about events we can tally—website visits,
customer complaints, product sales, you name it. These outcomes are inherently
non-negative integers, reflecting the discrete nature of the events we're
counting.
The go-to tool for analyzing count outcomes is often Poisson regression.
However, real-world data frequently throws us a curveball in the form of
overdispersion, where the variance of the count data outstrips its mean. This is
where negative binomial regression swoops in to save the day. It's a souped-up
version of Poisson regression that accounts for overdispersion by adding an
extra parameter to the model.
Count data pops up in all sorts of scenarios. We might be interested in the
effect of a new marketing campaign on app downloads or the impact of a software
update on user logins. With count outcomes, the possibilities are as endless as
the events we can count.
### An Example with Fake Data: Video Views Galore
Let's say we're interested in the number of views a video receives in a given
period. This is a perfect opportunity to use the negative binomial distribution
to model the underlying data-generating process. Here's how we can express this:
$$
\begin{aligned}
y_i & \sim \text{NB}(\mu_i, \phi) \\
log(\mu_i) & = \alpha + X\beta
\end{aligned}
$$
In this model, $\mu$ represents the mean (which must be positive, hence the log
link function), and the inverse of $\phi$ controls the overdispersion. A small
$\phi$ means the negative binomial distribution significantly deviates from a
Poisson distribution, while a large $\phi$ brings it closer to a Poisson. This
becomes clear when we look at the variance:
$$
Var(y_i) \sim \mu_i + \frac{\mu_i^2}{\phi}
$$
Let's whip up some fake data to illustrate:
```{r, message=FALSE, warning=FALSE}
set.seed(9782)
library(dplyr)
library(ggplot2)
N <- 1000
fake_data <-
tibble::tibble(x1 = runif(N, 2, 9), x2 = rnorm(N, 0, 1)) %>%
dplyr::mutate(
mu = exp(0.5 + 1.7 * x1 + 4.2 * x2),
y0 = rnbinom(N, size = 2, mu = mu), # here size is phi
y1 = y0 * 1.05,
t = sample(c(TRUE, FALSE), size = n(), replace = TRUE, prob = c(0.5, 0.5)),
y = case_when(t ~ as.integer(y1),
.default = as.integer(y0)
)
) %>%
filter(y > 0)
# Plotting the histogram using ggplot2
ggplot(fake_data, aes(x = y)) +
geom_histogram(
bins = 100,
color = "black",
alpha = 0.7
) +
facet_wrap(~t) +
labs(title = "Histogram of y", x = "y", y = "Frequency") +
xlim(0, 1e4) +
ylim(0, 20)
```
### OLS: A Common, But Flawed, Approach
A typical way to estimate the lift of an intervention in this scenario is to run
ordinary least squares (OLS) on the natural logarithm of the outcome. Then,
folks often look at the point estimate and 95% confidence interval for the
treatment effect. Doing this with our fake data might lead you to conclude that
the intervention is ineffective, as the 95% confidence interval includes zero.
```{r}
lm(data = fake_data, log(y) ~ x1 + x2 + t) %>% broom::tidy(conf.int = TRUE)
```
### Taming Overdispersion: The Bayesian Negative Binomial Advantage
Enter the Bayesian negative binomial model, easily implemented using the {im}
package in R. This approach has two key advantages. First, it better captures
the true data-generating process. Second, being Bayesian, it lets us incorporate
prior information and express our findings in a way that's more directly
relevant to business decisions.
```{r NB, message=FALSE, results = "hide"}
library(im)
nb <- negativeBinomial$new(
data = fake_data, y = "y", x = c("x1", "x2"),
treatment = "t", tau_mean = 0.0, tau_sd = 0.025
)
```
A quick check of the trace plot is always a good idea:
```{r tp}
nb$tracePlot()
```
Now for the payoff. We can readily obtain a point estimate for the impact using
`nb$pointEstimate()` (`r scales::percent(nb$pointEstimate())`) and a 95%
credible interval using `nb$credibleInterval(width = 0.95, round = 2)`
(`r nb$credibleInterval(width
= 0.95, round = 2)`). But what if we want to know
the probability that the impact is at least 1% (or any other threshold)? Easy
peasy! We use `nb$posteriorProb(threshold = 0.01)`
(`r nb$posteriorProb(threshold = 0.01)`). Finally, to visualize the evolution of
our understanding from prior to posterior, we employ `nb$vizdraws()`.
::: {.content-visible when-format="html"}
```{r vz, message=FALSE}
#| eval: !expr knitr::is_html_output()
nb$vizdraws(display_mode_name = TRUE, breaks = 0.01,
break_names = c("< 0.01", "> 0.01"))
```
:::