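Evidence lower bound (ELBO) maximization[JGJS1999] is a general family of algorithms that minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence between the target distribution $\pi$ and a variational approximation $q_{\lambda}$. More generally, they aim to solve the following problem:
\[\mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right),\]
where $\mathcal{Q}$ is some family of distributions, often called the variational family. Since the target distribution $\pi$ is intractable in general, the KL divergence is also intractable. Instead, the ELBO maximization strategy maximizes a surrogate objective, the ELBO:
\[\mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{z \sim q} \log \pi\left(z\right) + \mathbb{H}\left(q\right),\]
where $\mathbb{H}\left(q\right)$ is the differential entropy of $q$ and $\pi\left(z\right)$ needs to be known only up to a normalizing constant. The ELBO is a lower bound on the log of that constant (the log evidence) and differs from the negative KL divergence only by it, so maximizing the ELBO is equivalent to minimizing the KL divergence. The ELBO and its gradient can be readily estimated through various strategies. Overall, ELBO maximization algorithms aim to solve the problem:
\[\mathrm{maximize}_{q \in \mathcal{Q}}\quad \mathrm{ELBO}\left(q\right).\]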
Currently, AdvancedVI only provides the approach known as black-box variational inference (also known as Monte Carlo VI or stochastic gradient VI), which was introduced independently by two groups in 2014[RGB2014][TL2014]. In particular, AdvancedVI focuses on the reparameterization gradient estimator[TL2014][RMW2014][KW2014], which generally yields lower-variance gradients than alternative strategies[XQKS2019] and is discussed in the following section.
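As a rough illustration of the reparameterization gradient, the following sketch fits a one-dimensional Gaussian $q = \mathcal{N}(\mu, \sigma^2)$ by stochastic gradient ascent on the ELBO. It does not use AdvancedVI: the target, the names `logtarget`, `dlogtarget`, `repgrad_elbo`, and `fit_gaussian`, and the step size are all assumptions made for this example, and the gradients are written out by hand instead of being obtained through automatic differentiation.

```julia
using Random, Statistics

# Hypothetical target for illustration: N(3, 2^2), known only up to a constant.
logtarget(z)  = -0.5 * ((z - 3.0) / 2.0)^2   # unnormalized log density
dlogtarget(z) = -(z - 3.0) / 4.0             # its derivative with respect to z

# One stochastic estimate of the ELBO and of its gradient with respect to
# (μ, σ) for q = N(μ, σ^2), using the reparameterization z = μ + σ * ε.
function repgrad_elbo(rng, μ, σ, n_samples)
    ε = randn(rng, n_samples)
    z = μ .+ σ .* ε
    entropy = 0.5 * log(2π) + 0.5 + log(σ)   # closed-form Gaussian entropy
    elbo = mean(logtarget.(z)) + entropy
    g = dlogtarget.(z)
    grad_μ = mean(g)                 # dz/dμ = 1
    grad_σ = mean(g .* ε) + 1 / σ    # dz/dσ = ε, plus d(entropy)/dσ
    return elbo, grad_μ, grad_σ
end

# Plain stochastic gradient ascent; σ is stored as exp(logσ) to stay positive.
function fit_gaussian(rng; n_samples = 10, n_steps = 2000, stepsize = 0.05)
    μ, logσ = 0.0, 0.0
    for _ in 1:n_steps
        σ = exp(logσ)
        _, grad_μ, grad_σ = repgrad_elbo(rng, μ, σ, n_samples)
        μ    += stepsize * grad_μ
        logσ += stepsize * grad_σ * σ   # chain rule: dσ/dlogσ = σ
    end
    return μ, exp(logσ)
end

μ_fit, σ_fit = fit_gaussian(Xoshiro(1))
```

Under these assumptions, `μ_fit` and `σ_fit` should end up close to the target's mean 3 and standard deviation 2, up to stochastic gradient noise.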
[JGJS1999] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183-233.
[TL2014] Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
[RMW2014] Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
[KW2014] Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
[XQKS2019] Xu, M., Quiroz, M., Kohn, R., & Sisson, S. A. (2019). Variance reduction properties of the reparameterization trick. In International Conference on Artificial Intelligence and Statistics.
[RGB2014] Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
This document was generated with Documenter.jl version 0.27.25 on Friday 7 June 2024. Using Julia version 1.10.4.
As mentioned in the docstring, the RepGradELBO objective assumes that the variational approximation $q_{\lambda}$ and the target distribution $\pi$ have the same support for all $\lambda \in \Lambda$.
However, in general, it is most convenient to use variational families that have the whole Euclidean space $\mathbb{R}^d$ as their support. This is the case for the location-scale distributions provided by AdvancedVI. For target distributions whose support is not the full $\mathbb{R}^d$, we can apply some transformation $b$ to $q_{\lambda}$ to match its support such that
\[z \sim q_{b,\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} b^{-1}\left(\eta\right);\quad \eta \sim q_{\lambda},\]
where $b$ is often called a bijector, since it is often chosen among bijective transformations. This idea is known as automatic differentiation VI (ADVI)[KTRGB2017] and has subsequently been improved by TensorFlow Probability[DLTBV2017]. In Julia, Bijectors.jl[FXTYG2020] provides a comprehensive collection of bijections.
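One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied. That is, the objective is now
\[\mathrm{ADVI}\left(\lambda\right) \triangleq \mathbb{E}_{\eta \sim q_{\lambda}}\left[\log \pi\left(b^{-1}\left(\eta\right)\right) + \log \left|J_{b^{-1}}\left(\eta\right)\right|\right] + \mathbb{H}\left(q_{\lambda}\right),\]
where $\left|J_{b^{-1}}\left(\eta\right)\right|$ is the absolute value of the determinant of the Jacobian of $b^{-1}$ at $\eta$ and $\mathbb{H}\left(q_{\lambda}\right)$ is the differential entropy of $q_{\lambda}$.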
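The effect of the Jacobian adjustment can be checked numerically in one dimension. The sketch below is a hand-rolled example that does not use Bijectors.jl: it assumes a hypothetical target with support $(0, \infty)$, namely the Exponential(1) density, and the bijector $b = \log$, so that $b^{-1} = \exp$ and $\log \left|J_{b^{-1}}(\eta)\right| = \eta$. All function names are made up for this illustration.

```julia
# Hypothetical target with support (0, ∞): the Exponential(1) log density.
logtarget(z) = -z

# With b = log, the inverse is exp and log|J_{b⁻¹}(η)| = η.
binv(η) = exp(η)
logabsdetjac_binv(η) = η

# Log density over the unconstrained variable η, with and without the adjustment.
logtarget_adjusted(η)   = logtarget(binv(η)) + logabsdetjac_binv(η)
logtarget_unadjusted(η) = logtarget(binv(η))

# Riemann sum over a wide grid in η.
integrate(f; lo = -30.0, hi = 10.0, h = 1e-3) = sum(f, lo:h:hi) * h

mass_adjusted   = integrate(η -> exp(logtarget_adjusted(η)))
mass_unadjusted = integrate(η -> exp(logtarget_unadjusted(η)))
```

With the adjustment, the density over $\eta$ should integrate to approximately 1, as a proper density must. Without it, the integrand tends to 1 as $\eta \to -\infty$, so the computed mass keeps growing as the lower integration limit is lowered and the unadjusted function is not a valid density over $\mathbb{R}$.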
For the gradient of the entropy term, we provide three choices with varying requirements. The user can select the entropy estimator by passing it as a keyword argument when constructing the RepGradELBO objective.
| Estimator | entropy(q) | logpdf(q) | Type |
|---|---|---|---|
| ClosedFormEntropy | required | | Deterministic |
| MonteCarloEntropy | | required | Monte Carlo |
| StickingTheLandingEntropy | | required | Monte Carlo with control variate |
The requirements mean that either Distributions.entropy or Distributions.logpdf needs to be implemented for the chosen variational family. In general, ClosedFormEntropy is recommended whenever possible. If entropy is not available, then StickingTheLandingEntropy is recommended. See the following section for more details.
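To make the difference between the deterministic and Monte Carlo entropy estimates concrete, the sketch below computes both for a mean-field Gaussian written with plain arrays. It is not AdvancedVI's implementation: the parameter values and the names `logq`, `closed_form_entropy`, and `monte_carlo_entropy` are assumptions made for this example.

```julia
using Random, Statistics

# Hypothetical mean-field Gaussian q = N(μ, diag(σ.^2)).
μ = [0.5, -1.0, 2.0]
σ = [1.0, 0.5, 2.0]

# Log density of q at a point z.
logq(z) = sum(-0.5 .* ((z .- μ) ./ σ) .^ 2 .- log.(σ) .- 0.5 * log(2π))

# Closed-form entropy: deterministic, but needs a known formula for the family.
closed_form_entropy() = sum(log.(σ)) + 0.5 * length(μ) * (1 + log(2π))

# Monte Carlo entropy: only needs samples from q and its log density.
function monte_carlo_entropy(rng, n_samples)
    samples = [μ .+ σ .* randn(rng, length(μ)) for _ in 1:n_samples]
    return -mean(logq, samples)
end

h_exact = closed_form_entropy()
h_mc    = monte_carlo_entropy(Xoshiro(1), 100_000)
```

The two values should agree up to Monte Carlo error, which shrinks as `n_samples` grows. The closed-form value carries no sampling noise, which is one reason to prefer it when the formula is available.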
The "sticking the landing" (STL) estimator[RWD2017], provided as StickingTheLandingEntropy, occasionally results in lower variance when $\pi \approx q_{\lambda}$, and higher variance when $\pi \not\approx q_{\lambda}$. The conditions under which the STL estimator results in lower variance are still an active subject of research.
The main downside of the STL estimator is that it needs to evaluate and differentiate the log density of $q_{\lambda}$, logpdf(q), in every iteration. Depending on the variational family, this might be computationally inefficient or even numerically unstable. For example, if $q_{\lambda}$ is a Gaussian with a full-rank covariance, a back-substitution must be performed at every step, making the per-iteration complexity $\mathcal{O}(d^3)$ and reducing numerical stability.
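The variance behaviour of STL near $\pi \approx q_{\lambda}$ can be seen in a one-dimensional sketch. The example below assumes a hypothetical setup where the target is a standard Gaussian and $q_{\lambda}$ matches it exactly, and it compares per-sample gradients with respect to the location $\mu$ only. The derivatives are written by hand, and the names are made up for this illustration; this is not how AdvancedVI computes the estimator internally.

```julia
using Random, Statistics

# Hypothetical setup: the target is N(0, 1) and q = N(μ, σ^2) matches it exactly.
μ, σ = 0.0, 1.0
dlogtarget(z) = -z                # derivative of the target log density
dlogq(z) = -(z - μ) / σ^2         # derivative of log q in z, with (μ, σ) held fixed

rng = Xoshiro(1)
ε = randn(rng, 1000)
z = μ .+ σ .* ε                   # reparameterized samples

# Per-sample gradients with respect to μ (dz/dμ = 1).
# Closed-form entropy: the entropy of q does not depend on μ, so only the
# target term contributes.
grad_cfe = dlogtarget.(z)
# STL: differentiate log target(z) - log q(z) through z only.
grad_stl = dlogtarget.(z) .- dlogq.(z)
```

Because the two log densities coincide here, every entry of `grad_stl` vanishes, while `grad_cfe` still fluctuates from sample to sample even though its mean is zero. When $q_{\lambda}$ is far from the target, the extra `dlogq` term no longer cancels and can add variance instead.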
The STL control variate can be used by passing StickingTheLandingEntropy as the entropy estimator when constructing the RepGradELBO objective.
Let us come back to the example in Examples, where a LogDensityProblem is given as model. In this example, the true posterior is contained within the variational family, a setting known as "perfect variational family specification." In this case, the RepGradELBO estimator with StickingTheLandingEntropy is the only estimator known to converge exponentially fast ("linear convergence") to the true solution[KMG2024].
Recall that the original ADVI objective with a closed-form entropy (CFE) is given as follows:
We can see that the noise of the RepGradELBO estimator with STL becomes smaller as VI converges. However, the speed of convergence may not always be significantly different. Also, due to noise, just looking at the ELBO may not be sufficient to judge which algorithm is better. This can be made apparent if we measure convergence through the distance to the optimum:
We can see that STL kicks in at later stages of optimization. Therefore, when STL "works", it yields a higher-accuracy solution even with large step sizes. However, whether STL works or not depends heavily on the problem[KMG2024]. Furthermore, in many cases, a low-accuracy solution may be sufficient.
There are two major ways to customize the behavior of RepGradELBO:
- Customize the Distributions functions: rand(q), entropy(q), logpdf(q).
- Customize AdvancedVI.reparam_with_entropy.
It is generally recommended to customize rand(q), entropy(q), and logpdf(q), since these compose easily with the other functionality provided by AdvancedVI.
The most advanced way is to customize AdvancedVI.reparam_with_entropy. In particular, reparam_with_entropy is the function that invokes rand(q), entropy(q), logpdf(q). Thus, it is the most general way to override the behavior of RepGradELBO.
To illustrate how we can customize the rand(q) function, we will implement quasi-Monte Carlo variational inference[BWM2018]. Consider the case where we use the MeanFieldGaussian variational family. In this case, it suffices to override its rand specialization as follows:
using QuasiMonteCarlo
using StatsFuns

qmcrng = SobolSample(R = OwenScramble(base = 2, pad = 32))

# [...] (lines elided in the source)
    std_samples = norminvcdf.(unif_samples)
    scale_diag .* std_samples .+ location
end
(Note that this is a quick-and-dirty example, and there are more sophisticated ways to implement this.)
Plotting the ELBO shows the effect of quasi-Monte Carlo: it results in much lower variance than naive Monte Carlo. However, similarly to the STL example, just looking at the ELBO is often insufficient to really judge performance. Instead, let's look at the distance to the global optimum:
QMC yields an additional order of magnitude in accuracy. Also, unlike STL, it ever so slightly accelerates convergence. This is because quasi-Monte Carlo uniformly reduces variance, unlike STL, which reduces variance only near the optimum.
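The uniform variance reduction can be illustrated without any VI machinery. The sketch below uses only the standard library and therefore swaps in different ingredients from the example above: a hand-written two-dimensional Halton sequence with a random shift instead of a scrambled Sobol sequence, and the Box-Muller transform instead of `norminvcdf`. The function names are assumptions made for this illustration. It compares how much the sample mean of standard normal draws fluctuates across repeated runs.

```julia
using Random, Statistics

# i-th element of the van der Corput sequence in the given base.
function radical_inverse(i::Integer, base::Integer)
    x, f = 0.0, 1.0 / base
    while i > 0
        x += f * (i % base)
        i ÷= base
        f /= base
    end
    return x
end

# Box-Muller transform: two uniforms in [0, 1) to one standard normal draw.
box_muller(u1, u2) = sqrt(-2 * log(1 - u1)) * cos(2π * u2)

# Plain Monte Carlo: independent uniforms.
mc_normals(rng, n) = box_muller.(rand(rng, n), rand(rng, n))

# Randomized QMC: 2D Halton points (bases 2 and 3) with a random shift modulo 1.
function qmc_normals(rng, n)
    s1, s2 = rand(rng), rand(rng)
    u1 = [mod(radical_inverse(i, 2) + s1, 1.0) for i in 0:(n - 1)]
    u2 = [mod(radical_inverse(i, 3) + s2, 1.0) for i in 0:(n - 1)]
    return box_muller.(u1, u2)
end

# Spread of the sample mean (the true value is 0) over repeated runs.
rng = Xoshiro(1)
std_mc  = std([mean(mc_normals(rng, 256))  for _ in 1:200])
std_qmc = std([mean(qmc_normals(rng, 256)) for _ in 1:200])
```

In this setting `std_qmc` should typically come out smaller than `std_mc`, and the reduction does not depend on how close $q_{\lambda}$ is to the target, which matches the contrast with STL drawn above.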
[TL2014] Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
[RMW2014] Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
[KW2014] Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
[KTRGB2017] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
[DLTBV2017] Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). TensorFlow distributions. arXiv.
[FXTYG2020] Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In Symposium on Advances in Approximate Bayesian Inference.
[RWD2017] Roeder, G., Wu, Y., & Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. Advances in Neural Information Processing Systems, 30.
[KMG2024] Kim, K., Ma, Y., & Gardner, J. (2024). Linear convergence of black-box variational inference: Should we stick the landing? In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
[BWM2018] Buchholz, A., Wenzel, F., & Mandt, S. (2018). Quasi-Monte Carlo variational inference. In International Conference on Machine Learning.