---
title: "beta_gradient_boosting"
author: "Brayden Tang"
date: "06/01/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lightgbm)
```
### Beta Gradient Boosting
Suppose we have a response variable $y$ that can take on any real value in the open interval $(0, 1)$. Examples include proportions, or rates/frequencies for which the underlying exposures or counts are not known.
If we choose to arbitrarily fit such a response with an objective function such as least squares (equivalent to assuming the conditional distribution of $Y | X$ is normally distributed), then there is no guarantee that the predicted response lies in $(0, 1)$. A common workaround is to transform the response to the interval $(-\infty, \infty)$ using the logit link function, letting $Y_{transformed} = ln(\frac{Y}{1-Y})$, and then minimizing least squares on this transformed variable. However, this approach has shortcomings. For prediction purposes, the main problem is that setting $E[ln(\frac{Y}{1-Y})| X] = f(x)$ is not the same thing as setting $ln(\frac{E[Y | X]}{1 - E[Y | X]}) = f(x)$. In particular, the first approach will almost always underestimate $E[Y | X]$ after transforming back to the unit interval, as a consequence of Jensen's inequality.
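To make this concrete, here is a minimal simulation sketch. It assumes, purely for illustration, a Beta(2, 8) response: the inverse logit of the mean of the logit-transformed values falls below the true mean.
```{r jensen-illustration}
set.seed(42)
y <- rbeta(1e6, shape1 = 2, shape2 = 8)

true_mean <- mean(y)                        # E[Y], approximately 0.2
back_transformed <- plogis(mean(qlogis(y))) # inverse logit of E[ln(Y / (1 - Y))]

# The back-transformed value is noticeably smaller than the true mean,
# illustrating the underestimation caused by Jensen's inequality.
c(true_mean = true_mean, back_transformed = back_transformed)
```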
Unfortunately, none of the popular gradient boosting packages currently offers an objective function that directly handles this scenario, even though beta regression has existed conceptually for quite some time.
### Maximum Likelihood Estimation
Gradient boosting frameworks like LightGBM and XGBoost allow a wide range of loss functions to be chosen for different response variables, including the gamma, Tweedie, Poisson, and L2/Gaussian losses, amongst many others. We focus on these loss functions because they can be derived by maximum likelihood, in a fashion similar to generalized linear models.
XGBoost and LightGBM both determine their splits (and the optimal weight of each terminal node) through closed-form expressions involving the gradient and Hessian of a loss function $L$, as a consequence of approximating the true loss function with a second-order Taylor approximation. Thus, to support a new distribution we only need the log-likelihood of an observation $y_i$, together with its first and second derivatives with respect to the model output $f(X_i)$.
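Concretely, in the notation of the XGBoost paper, write $g_{i}$ and $h_{i}$ for the first and second derivatives of the loss with respect to the current prediction. The loss at each boosting round is approximated by
$$\sum_{i}\left[g_{i} f_{t}(X_{i}) + \frac{1}{2} h_{i} f_{t}(X_{i})^2\right] + \Omega(f_{t}),$$
and, for a fixed tree structure, the optimal weight of a leaf $j$ containing the instance set $I_{j}$ (with L2 regularization strength $\lambda$) is
$$w_{j}^{*} = -\frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i} + \lambda}.$$
This is why only the gradient and Hessian of the log-likelihood are needed in what follows.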
Let $y_i \sim \text{Gamma}(\alpha, \theta_{i})$, with shape $\alpha$ and scale $\theta_{i}$. We let $\alpha$ be the same for all observations, for reasons discussed below. Then the density function of $y_i$ is:
$$f(y_i) = \frac{y_{i}^{\alpha-1} e^{\frac{-y_{i}}{\theta_{i}}}}{\Gamma(\alpha) \times \theta_{i}^{\alpha}}.$$
Now, $E[y_{i}] = \alpha \theta_{i}.$ If we let $\mu_{i} = \alpha \theta_{i}$, then rewrite $f(y_{i})$ as:
$$f(y_{i}) = \frac{y_{i}^{\alpha-1} e^{\frac{-y_{i} \alpha}{\mu_{i}}}}{\Gamma(\alpha) \times (\frac{\mu_{i}}{\alpha})^{\alpha}}.$$
Typically, we also replace the shape parameter $\alpha$ with the dispersion parameter defined by $\phi = \frac{Var[Y | X]}{V(\mu)}$, which here gives $\phi = \frac{1}{\alpha}$. However, our predictions will just be $\mu$, and for pure prediction purposes we have no need to estimate $Var[Y | X]$, so we can let $\alpha$ be any arbitrary constant. Let $\alpha = 1$ for convenience. Then:
$$f(y_{i}) = \frac{e^{\frac{-y_{i}}{\mu_{i}}}}{\mu_{i}}.$$ As an aside, this is just an exponential distribution, since a gamma with $\alpha = 1$ is an exponential.
Notice that this expression does not yet involve the model output $f(X_{i})$. If we use the log link function, which is common for the gamma since it respects the support of the gamma distribution,
$$\mu = e^{f(X_{i})},$$
then
$$ln(f(y_{i})) = \frac{-y_{i}}{\mu_{i}} - ln(\mu_{i}) = \frac{-y_{i}}{e^{f(X_{i})}} - f(X_i).$$
Hence,
$$-\nabla ln(f(y_{i})) = 1 - \frac{y_{i}}{e^{f(X_{i})}}$$
and
$$-\nabla^2 ln(f(y_{i})) = \frac{y_{i}}{e^{f(X_{i})}}.$$
These expressions are exactly what the package LightGBM uses when we specify `objective = "gamma"`. All of the other distributions can be derived in a similar fashion.
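For instance, a custom R objective mirroring these gamma expressions would look like the sketch below. LightGBM's built-in `objective = "gamma"` makes this unnecessary in practice; it is shown only to illustrate the pattern we reuse for the beta case.
```{r gamma-loss-sketch}
gamma_loss <- function(preds, dtrain) {
  # preds are the raw scores f(X_i); dtrain is an lgb.Dataset holding y_i
  labels <- getinfo(dtrain, "label")
  grad <- 1 - labels / exp(preds)  # -d/df ln(f(y_i)), as derived above
  hess <- labels / exp(preds)      # -d^2/df^2 ln(f(y_i))
  return(list(grad = grad, hess = hess))
}
```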
#### Derivation for Beta Distribution
First, we write the beta density in terms of the parameters $\mu$ and $\phi$ in a similar fashion to what we did above.
Let
$$f(y_{i}) = \frac{\Gamma (p + q)}{\Gamma (p) \Gamma (q)} y_{i}^{p - 1}(1-y_{i})^{q-1},$$
where $0 < y_{i} < 1.$
Since $\mu = \frac{p}{p + q}$ and $\phi = p + q,$
then:
$$f(y_{i}) = \frac{\Gamma (\phi)}{\Gamma (\mu_{i} \phi) \Gamma ((1-\mu_{i})\phi)} y_{i}^{\mu_{i} \phi - 1}(1-y_{i})^{(1-\mu_{i})\phi-1}.$$
As before, we will let the dispersion parameter be some arbitrary constant since it is not needed if we just care about $\mu = E[Y | X].$ For convenience, set $\phi = 1$. Then we have
$$f(y_{i}) = \frac{y_{i}^{\mu_{i} - 1}(1-y_{i})^{-\mu_{i}}}{\Gamma (\mu_{i}) \Gamma (1-\mu_{i})}.$$
Thus,
$$ln(f(y_{i})) = (\mu_{i} - 1)ln(y_{i}) - \mu_{i} ln(1- y_{i}) - ln(\Gamma (\mu_{i})) - ln(\Gamma(1-\mu_{i})).$$
Using the link function $\mu_{i} = \frac{e^{f(X_{i})}}{1 + e^{f(X_{i})}}$, we end up with the following expression:
$$ln(f(y_{i})) = \left(\frac{-1}{1 + e^{f(X_{i})}}\right)ln(y_{i}) - \left(\frac{e^{f(X_{i})}}{1 + e^{f(X_{i})}}\right) ln(1- y_{i}) - ln\left(\Gamma \left(\frac{e^{f(X_{i})}}{1 + e^{f(X_{i})}}\right)\right) - ln\left(\Gamma\left(\frac{1}{1 + e^{f(X_{i})}}\right)\right).$$
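As a quick sanity check, this expression can be compared numerically against R's built-in `dbeta()`, since with $\phi = 1$ the shape parameters are simply $\mu_{i}$ and $1 - \mu_{i}$. The values of $y_{i}$ and the raw score below are arbitrary, purely for illustration.
```{r beta-density-check}
y <- 0.3
f <- 0.7  # an arbitrary raw score f(X_i)
mu <- exp(f) / (1 + exp(f))

loglik_manual <- (-1 / (1 + exp(f))) * log(y) -
  (exp(f) / (1 + exp(f))) * log(1 - y) -
  lgamma(exp(f) / (1 + exp(f))) -
  lgamma(1 / (1 + exp(f)))

loglik_dbeta <- dbeta(y, shape1 = mu, shape2 = 1 - mu, log = TRUE)

all.equal(loglik_manual, loglik_dbeta)  # should be TRUE
```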
First, we will derive some needed quantities.
$$\frac{d}{df(X_{i})} \frac{e^{f(X_i)}}{1 + e^{f(X_i)}} = \frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2}, $$
$$\frac{d}{df(X_{i})} \frac{1}{1 + e^{f(X_i)}} = -\frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2}, $$
$$\frac{d}{df(X_{i})} \frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2} = \frac{e^{f({X_{i}})}(1 - e^{f({X_{i}})})}{(1 + e^{f({X_{i}})})^3}.$$
We also need the derivative of $ln(\Gamma(x))$, denoted $\psi^{(0)}(x)$ and often referred to as the digamma function. For the Hessian, we will also need the second derivative of $ln(\Gamma(x))$, known as the trigamma function, which we denote $\psi^{(1)}(x)$.
We now calculate the gradient and Hessian as required.
$$\nabla ln(f(y_{i})) = \frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2} ln(y_{i}) - \frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2} ln(1 - y_{i}) - \psi^{(0)}\left(\frac{e^{f(X_i)}}{1 + e^{f(X_i)}}\right) \frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2} + \psi^{(0)}\left(\frac{1}{1 + e^{f(X_i)}}\right)\frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2} = $$
$$\frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2}\left(\psi^{(0)}\left(\frac{1}{1 + e^{f(X_i)}}\right) - \psi^{(0)}\left(\frac{e^{f(X_i)}}{1 + e^{f(X_i)}}\right) + ln\left(\frac{y_{i}}{1-y_{i}}\right)\right).$$
Hence,
$$-\nabla ln(f(y_{i})) = -\frac{e^{f(X_{i})}}{(1 + e^{f(X_{i})})^2}\left(\psi^{(0)}\left(\frac{1}{1 + e^{f(X_i)}}\right) - \psi^{(0)}\left(\frac{e^{f(X_i)}}{1 + e^{f(X_i)}}\right) + ln\left(\frac{y_{i}}{1-y_{i}}\right)\right).$$
For the Hessian,
$$-\nabla^2 ln(f(y_{i})) = \frac{e^{2f(X_{i})}}{(1 + e^{f(X_{i})})^4}\left(\psi^{(1)}\left(\frac{1}{1 + e^{f(X_i)}}\right) + \psi^{(1)}\left(\frac{e^{f(X_i)}}{1 + e^{f(X_i)}}\right)\right) -\frac{e^{f({X_{i}})}(1 - e^{f({X_{i}})})}{(1 + e^{f({X_{i}})})^3}\left(\psi^{(0)}\left(\frac{1}{1 + e^{f(X_i)}}\right) - \psi^{(0)}\left(\frac{e^{f(X_i)}}{1 + e^{f(X_i)}}\right) + ln\left(\frac{y_{i}}{1-y_{i}}\right)\right).$$
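These hand-derived expressions can be spot-checked with numerical differentiation. The sketch below assumes the `numDeriv` package is available and uses an arbitrary observation and raw score; it is not evaluated when knitting.
```{r derivative-check, eval=FALSE}
y <- 0.3
f <- 0.7  # an arbitrary raw score f(X_i)

# Negative log-likelihood as a function of the raw score, with phi = 1
nll <- function(f) {
  mu <- exp(f) / (1 + exp(f))
  -((mu - 1) * log(y) - mu * log(1 - y) - lgamma(mu) - lgamma(1 - mu))
}

# Analytic expressions from above
s <- exp(f) / (1 + exp(f))^2
A <- digamma(1 / (1 + exp(f))) - digamma(exp(f) / (1 + exp(f))) +
  log(y / (1 - y))
grad_analytic <- -s * A
hess_analytic <- exp(2 * f) / (1 + exp(f))^4 *
  (trigamma(1 / (1 + exp(f))) + trigamma(exp(f) / (1 + exp(f)))) -
  exp(f) * (1 - exp(f)) / (1 + exp(f))^3 * A

c(numerical = numDeriv::grad(nll, f), analytic = grad_analytic)
c(numerical = numDeriv::hessian(nll, f), analytic = hess_analytic)
```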
### Implementation in R/Python
We write functions that take two arguments: a vector of raw scores preds $= f(X_{i})$ and a LightGBM `Dataset` object holding the training data, from which the labels are extracted.
```{r beta-loss-r}
beta_loss <- function(preds, dtrain) {
  # preds are the raw scores f(X_i); dtrain is an lgb.Dataset holding y_i
  labels <- getinfo(dtrain, "label")

  # Gradient of the log-likelihood with respect to f(X_i)
  grad <- (exp(preds) / (1 + exp(preds))^2) * (
    digamma(1 / (1 + exp(preds))) -
      digamma(exp(preds) / (1 + exp(preds))) +
      log(labels / (1 - labels))
  )

  # Second derivative of the log-likelihood with respect to f(X_i)
  hess <- (exp(preds) * (1 - exp(preds)) / (1 + exp(preds))^3) * (
    digamma(1 / (1 + exp(preds))) -
      digamma(exp(preds) / (1 + exp(preds))) +
      log(labels / (1 - labels))
  ) -
    (exp(2 * preds) / (1 + exp(preds))^4) * (
      trigamma(1 / (1 + exp(preds))) +
        trigamma(exp(preds) / (1 + exp(preds)))
    )

  # LightGBM minimizes the loss, so return the gradient and Hessian of the
  # negative log-likelihood
  return(list(grad = -grad, hess = -hess))
}
```
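As a usage sketch (not evaluated), `beta_loss` can be passed to `lgb.train()` through its `obj` argument. The data objects below are hypothetical placeholders, and because the model outputs raw scores $f(X)$, predictions must be mapped back to $(0, 1)$ with the inverse logit.
```{r beta-loss-usage, eval=FALSE}
# Hypothetical training/test matrices and a response in (0, 1)
dtrain <- lgb.Dataset(data = as.matrix(X_train), label = y_train)

fit <- lgb.train(
  params = list(learning_rate = 0.05, num_leaves = 31),
  data = dtrain,
  nrounds = 200,
  obj = beta_loss
)

# Raw scores, back-transformed to the estimated mean of the beta response
mu_hat <- plogis(predict(fit, as.matrix(X_test)))
```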
In Python:
```{python beta-loss-python}
import numpy as np
from scipy.special import polygamma


def beta_loss(preds, dtrain):
    # preds are the raw scores f(X_i); dtrain is a lightgbm.Dataset holding y_i
    labels = dtrain.get_label()

    # Gradient of the log-likelihood with respect to f(X_i)
    grad = (np.exp(preds) / (1 + np.exp(preds))**2) * (
        polygamma(0, 1 / (1 + np.exp(preds))) -
        polygamma(0, np.exp(preds) / (1 + np.exp(preds))) +
        np.log(labels / (1 - labels))
    )

    # Second derivative of the log-likelihood with respect to f(X_i)
    hess = (np.exp(preds) * (1 - np.exp(preds)) / (1 + np.exp(preds))**3) * (
        polygamma(0, 1 / (1 + np.exp(preds))) -
        polygamma(0, np.exp(preds) / (1 + np.exp(preds))) +
        np.log(labels / (1 - labels))
    ) - (np.exp(2 * preds) / (1 + np.exp(preds))**4) * (
        polygamma(1, 1 / (1 + np.exp(preds))) +
        polygamma(1, np.exp(preds) / (1 + np.exp(preds)))
    )

    # LightGBM minimizes the loss, so return the gradient and Hessian of the
    # negative log-likelihood
    return -grad, -hess
```
### Testing on Simulated Data