diff --git a/dev/bbvi_example_elbo.svg b/dev/bbvi_example_elbo.svg
index cfad19ba..95d11e11 100644
--- a/dev/bbvi_example_elbo.svg
+++ b/dev/bbvi_example_elbo.svg
@@ -1,46 +1,44 @@
(regenerated SVG plot; markup not recoverable)
diff --git a/dev/elbo/advi_qmc_dist.svg b/dev/elbo/advi_qmc_dist.svg
index 75087fcf..aedb7be2 100644
--- a/dev/elbo/advi_qmc_dist.svg
+++ b/dev/elbo/advi_qmc_dist.svg
@@ -1,42 +1,42 @@
(regenerated SVG plot; markup not recoverable)
diff --git a/dev/elbo/advi_qmc_elbo.svg b/dev/elbo/advi_qmc_elbo.svg
index 631dc2a4..eb7979a0 100644
--- a/dev/elbo/advi_qmc_elbo.svg
+++ b/dev/elbo/advi_qmc_elbo.svg
@@ -1,48 +1,48 @@
(regenerated SVG plot; markup not recoverable)
diff --git a/dev/elbo/advi_stl_dist.svg b/dev/elbo/advi_stl_dist.svg
index e7bbe44d..d3312711 100644
--- a/dev/elbo/advi_stl_dist.svg
+++ b/dev/elbo/advi_stl_dist.svg
@@ -1,40 +1,40 @@
(regenerated SVG plot; markup not recoverable)
diff --git a/dev/elbo/advi_stl_elbo.svg b/dev/elbo/advi_stl_elbo.svg
index a51fbe57..e967cdeb 100644
--- a/dev/elbo/advi_stl_elbo.svg
+++ b/dev/elbo/advi_stl_elbo.svg
@@ -1,48 +1,48 @@
(regenerated SVG plot; markup not recoverable)
diff --git a/dev/elbo/overview/index.html b/dev/elbo/overview/index.html
index 717d9dbe..e2e95f6f 100644
--- a/dev/elbo/overview/index.html
+++ b/dev/elbo/overview/index.html
@@ -1,2 +1,2 @@
-Overview · AdvancedVI.jl

Evidence Lower Bound Maximization

Introduction

Evidence lower bound (ELBO) maximization[JGJS1999] is a general family of algorithms that minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence between the target distribution $\pi$ and a variational approximation $q_{\lambda}$. More generally, they aim to solve the following problem:

\[ \mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right),\]

where $\mathcal{Q}$ is some family of distributions, often called the variational family. Since the target distribution $\pi$ is intractable in general, the KL divergence is also intractable. Instead, the ELBO maximization strategy maximizes a surrogate objective, the ELBO:

\[ \mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),\]

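which lower-bounds the log normalizing constant of $\pi$ (the evidence) and equals the negative KL divergence up to an additive constant that does not depend on $q$. The ELBO and its gradient can be readily estimated through various strategies. Overall, ELBO maximization algorithms aim to solve the problem: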

\[ \mathrm{maximize}_{q \in \mathcal{Q}}\quad \mathrm{ELBO}\left(q\right).\]

Multiple ways to solve this problem exist, each leading to a different variational inference algorithm.

Algorithms

Currently, AdvancedVI only provides the approach known as black-box variational inference (also known as Monte Carlo VI or stochastic gradient VI), which was introduced independently by two groups in 2014[RGB2014][TL2014]. In particular, AdvancedVI focuses on the reparameterization gradient estimator[TL2014][RMW2014][KW2014], which is generally superior to alternative strategies[XQKS2019] and is discussed in the following section.

  • JGJS1999Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • XQKS2019Xu, M., Quiroz, M., Kohn, R., & Sisson, S. A. (2019). Variance reduction properties of the reparameterization trick. In The International Conference on Artificial Intelligence and Statistics.
  • RGB2014Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
+Overview · AdvancedVI.jl

Evidence Lower Bound Maximization

Introduction

Evidence lower bound (ELBO) maximization[JGJS1999] is a general family of algorithms that minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence between the target distribution $\pi$ and a variational approximation $q_{\lambda}$. More generally, they aim to solve the following problem:

\[ \mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right),\]

where $\mathcal{Q}$ is some family of distributions, often called the variational family. Since the target distribution $\pi$ is intractable in general, the KL divergence is also intractable. Instead, the ELBO maximization strategy maximizes a surrogate objective, the ELBO:

\[ \mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),\]

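which lower-bounds the log normalizing constant of $\pi$ (the evidence) and equals the negative KL divergence up to an additive constant that does not depend on $q$. The ELBO and its gradient can be readily estimated through various strategies. Overall, ELBO maximization algorithms aim to solve the problem: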

\[ \mathrm{maximize}_{q \in \mathcal{Q}}\quad \mathrm{ELBO}\left(q\right).\]

Multiple ways to solve this problem exist, each leading to a different variational inference algorithm.
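To see why maximizing the ELBO addresses the KL minimization problem stated earlier, the following short derivation may help. It assumes, as notation that does not appear in the original text, that the target density is available only up to an unknown normalizing constant $Z = \int \pi\left(\theta\right) \mathrm{d}\theta$, so that the normalized target is $\pi / Z$:

\[ \mathrm{KL}\left(q, \pi / Z\right) = \mathbb{E}_{\theta \sim q} \log q\left(\theta\right) - \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \log Z = \log Z - \mathrm{ELBO}\left(q\right).\]

Since the KL divergence is non-negative, this gives $\mathrm{ELBO}\left(q\right) \leq \log Z$. Since $\log Z$ does not depend on $q$, maximizing the ELBO over $\mathcal{Q}$ is equivalent to minimizing the KL divergence over $\mathcal{Q}$.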

Algorithms

Currently, AdvancedVI only provides the approach known as black-box variational inference (also known as Monte Carlo VI or stochastic gradient VI), which was introduced independently by two groups in 2014[RGB2014][TL2014]. In particular, AdvancedVI focuses on the reparameterization gradient estimator[TL2014][RMW2014][KW2014], which is generally superior to alternative strategies[XQKS2019] and is discussed in the following section.
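As a rough illustration of what a reparameterized Monte Carlo estimate of the ELBO looks like, here is a minimal sketch in plain Julia. It does not use the AdvancedVI API: the names `logtarget` and `elbo_estimate` are hypothetical, the variational family is assumed to be a mean-field Gaussian, and the toy target is an unnormalized standard Gaussian chosen only so that the answer is known.

```julia
using Random, Statistics

# Hypothetical toy target: unnormalized log-density of a standard Gaussian.
logtarget(z) = -sum(abs2, z) / 2

# Reparameterized Monte Carlo ELBO estimate for a mean-field Gaussian
# q = N(μ, diag(σ.^2)), using the closed-form Gaussian entropy.
function elbo_estimate(rng, logtarget, μ, σ, n_samples)
    d = length(μ)
    energy = mean(1:n_samples) do _
        ϵ = randn(rng, d)   # base noise, independent of the variational parameters
        z = μ .+ σ .* ϵ     # reparameterization: z is a deterministic function of (μ, σ, ϵ)
        logtarget(z)
    end
    entropy = sum(log, σ) + d * (1 + log(2π)) / 2
    return energy + entropy
end

rng = Random.MersenneTwister(1)
est = elbo_estimate(rng, logtarget, zeros(2), ones(2), 10^5)
```

In this toy setting $q$ coincides with the normalized target, so the estimate should be close to the log normalizing constant $\log 2\pi$. Because each sample is a differentiable function of $\mu$ and $\sigma$, an AD backend can differentiate through the same computation to obtain the gradient estimate.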

  • JGJS1999Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • XQKS2019Xu, M., Quiroz, M., Kohn, R., & Sisson, S. A. (2019). Variance reduction properties of the reparameterization trick. In The International Conference on Artificial Intelligence and Statistics.
  • RGB2014Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
diff --git a/dev/elbo/repgradelbo/index.html b/dev/elbo/repgradelbo/index.html index cc2eb815..c46cf838 100644 --- a/dev/elbo/repgradelbo/index.html +++ b/dev/elbo/repgradelbo/index.html @@ -7,7 +7,7 @@ \log \pi\left(z\right) \right] + \mathbb{H}\left(q_{\lambda}\right), -\end{aligned}\]

Arguments

  • n_samples::Int: Number of Monte Carlo samples used to estimate the ELBO.

Keyword Arguments

  • entropy: The estimator for the entropy term. (Type <: AbstractEntropyEstimator; Default: ClosedFormEntropy())

Requirements

  • The variational approximation $q_{\lambda}$ implements rand.
  • The target distribution and the variational approximation have the same support.
  • The target logdensity(prob, x) must be differentiable with respect to x by the selected AD backend.

Depending on the options, additional requirements on $q_{\lambda}$ may apply.

source

Handling Constraints with Bijectors

As mentioned in the docstring, the RepGradELBO objective assumes that the variational approximation $q_{\lambda}$ and the target distribution $\pi$ have the same support for all $\lambda \in \Lambda$.

However, in general, it is most convenient to use variational families that have the whole Euclidean space $\mathbb{R}^d$ as their support. This is the case for the location-scale distributions provided by AdvancedVI. For target distributions whose support is not the full $\mathbb{R}^d$, we can apply some transformation $b$ to $q_{\lambda}$ so that its support matches that of the target, such that

\[z \sim q_{b,\lambda} \qquad\Leftrightarrow\qquad +\end{aligned}\]

Arguments

  • n_samples::Int: Number of Monte Carlo samples used to estimate the ELBO.

Keyword Arguments

  • entropy: The estimator for the entropy term. (Type <: AbstractEntropyEstimator; Default: ClosedFormEntropy())

Requirements

  • The variational approximation $q_{\lambda}$ implements rand.
  • The target distribution and the variational approximation have the same support.
  • The target logdensity(prob, x) must be differentiable with respect to x by the selected AD backend.

Depending on the options, additional requirements on $q_{\lambda}$ may apply.

source

Handling Constraints with Bijectors

As mentioned in the docstring, the RepGradELBO objective assumes that the variational approximation $q_{\lambda}$ and the target distribution $\pi$ have the same support for all $\lambda \in \Lambda$.

However, in general, it is most convenient to use variational families that have the whole Euclidean space $\mathbb{R}^d$ as their support. This is the case for the location-scale distributions provided by AdvancedVI. For target distributions whose support is not the full $\mathbb{R}^d$, we can apply some transformation $b$ to $q_{\lambda}$ so that its support matches that of the target, such that

\[z \sim q_{b,\lambda} \qquad\Leftrightarrow\qquad z \stackrel{d}{=} b^{-1}\left(\eta\right);\quad \eta \sim q_{\lambda},\]

where $b$ is often called a bijector, since it is often chosen among bijective transformations. This idea is known as automatic differentiation VI (ADVI)[KTRGB2017] and has subsequently been improved by TensorFlow Probability[DLTBV2017]. In Julia, Bijectors.jl[FXTYG2020] provides a comprehensive collection of bijections.

One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied. That is, the objective is now

\[\mathrm{ADVI}\left(\lambda\right) \triangleq \mathbb{E}_{\eta \sim q_{\lambda}}\left[ @@ -18,7 +18,7 @@ q = MeanFieldGaussian(μ, L) b = Bijectors.bijector(dist) binv = inverse(b) -q_transformed = Bijectors.TransformedDistribution(q, binv)

By passing q_transformed to optimize, the Jacobian adjustment for the bijector b is automatically applied. (See Examples for a fully working example.)

Entropy Estimators

For the gradient of the entropy term, we provide three choices with varying requirements. The user can select the entropy estimator by passing it as a keyword argument when constructing the RepGradELBO objective.

| Estimator | entropy(q) | logpdf(q) | Type |
|---|---|---|---|
| ClosedFormEntropy | required | | Deterministic |
| MonteCarloEntropy | | required | Monte Carlo |
| StickingTheLandingEntropy | | required | Monte Carlo with control variate |

The requirements mean that either Distributions.entropy or Distributions.logpdf needs to be implemented for the chosen variational family. In general, the use of ClosedFormEntropy is recommended whenever possible. If entropy is not available, then StickingTheLandingEntropy is recommended. See the following section for more details.

The StickingTheLandingEntropy Estimator

The StickingTheLandingEntropy, or STL estimator, is a control variate approach [RWD2017].

AdvancedVI.StickingTheLandingEntropyType
StickingTheLandingEntropy()

The "sticking the landing" entropy estimator[RWD2017].

Requirements

  • The variational approximation q implements logpdf.
  • logpdf(q, η) must be differentiable by the selected AD framework.
source

The STL estimator occasionally results in lower variance when $\pi \approx q_{\lambda}$, and higher variance when $\pi \not\approx q_{\lambda}$. The conditions under which the STL estimator results in lower variance are still an active subject of research.

The main downside of the STL estimator is that it needs to evaluate and differentiate the log density of $q_{\lambda}$, logpdf(q), in every iteration. Depending on the variational family, this might be computationally inefficient or even numerically unstable. For example, if $q_{\lambda}$ is a Gaussian with a full-rank covariance, a back-substitution must be performed at every step, making the per-iteration complexity $\mathcal{O}(d^3)$ and reducing numerical stability.

The STL control variate can be used by changing the entropy estimator using the following object:

Let us come back to the example in Examples, where a LogDensityProblem is given as model. In this example, the true posterior is contained within the variational family. This setting is known as "perfect variational family specification." In this case, the RepGradELBO estimator with StickingTheLandingEntropy is the only estimator known to converge exponentially fast ("linear convergence") to the true solution.

Recall that the original ADVI objective with a closed-form entropy (CFE) is given as follows:

n_montecarlo = 16;
+q_transformed = Bijectors.TransformedDistribution(q, binv)

By passing q_transformed to optimize, the Jacobian adjustment for the bijector b is automatically applied. (See Examples for a fully working example.)
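To make the Jacobian adjustment concrete, the following is a minimal sketch in plain Julia for a single positive-constrained variable. It deliberately does not call Bijectors.jl: the choice $b = \log$ and the helper names `normal_logpdf`, `logabsdetjac_binv`, `transformed_logpdf`, and `advi_integrand` are assumptions made for illustration only.

```julia
# Log-density of a univariate Gaussian N(μ, σ^2).
normal_logpdf(η, μ, σ) = -log(σ) - log(2π) / 2 - ((η - μ) / σ)^2 / 2

b(z)    = log(z)   # assumed bijector: maps (0, ∞) to the real line
binv(η) = exp(η)   # its inverse: maps the real line to (0, ∞)

# Hypothetical helper: log |d binv(η) / dη| = log(exp(η)) = η.
logabsdetjac_binv(η) = η

# Log-density of z = binv(η) with η ~ N(μ, σ^2), by the change-of-variables formula.
transformed_logpdf(z, μ, σ) = normal_logpdf(b(z), μ, σ) - logabsdetjac_binv(b(z))

# Integrand of the Jacobian-adjusted objective, evaluated in unconstrained space.
advi_integrand(logtarget, η) = logtarget(binv(η)) + logabsdetjac_binv(η)

# Example: unnormalized Exponential(1) target on (0, ∞), evaluated at η = 0.
val = advi_integrand(z -> -z, 0.0)
```

Here `transformed_logpdf` reproduces the log-normal density, which is what one obtains by pushing a Gaussian through `exp`. The extra `logabsdetjac_binv(η)` term in `advi_integrand` is the adjustment that, per the text, is applied automatically when a transformed distribution is passed to optimize.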

Entropy Estimators

For the gradient of the entropy term, we provide three choices with varying requirements. The user can select the entropy estimator by passing it as a keyword argument when constructing the RepGradELBO objective.

| Estimator | entropy(q) | logpdf(q) | Type |
|---|---|---|---|
| ClosedFormEntropy | required | | Deterministic |
| MonteCarloEntropy | | required | Monte Carlo |
| StickingTheLandingEntropy | | required | Monte Carlo with control variate |

The requirements mean that either Distributions.entropy or Distributions.logpdf needs to be implemented for the chosen variational family. In general, the use of ClosedFormEntropy is recommended whenever possible. If entropy is not available, then StickingTheLandingEntropy is recommended. See the following section for more details.
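The difference between a deterministic and a Monte Carlo entropy estimate can be sketched in a few lines of plain Julia. This is not the AdvancedVI implementation: it assumes a univariate Gaussian $q$, and the function names `closed_form_entropy` and `monte_carlo_entropy` are hypothetical.

```julia
using Random, Statistics

# Log-density of a univariate Gaussian N(μ, σ^2).
normal_logpdf(z, μ, σ) = -log(σ) - log(2π) / 2 - ((z - μ) / σ)^2 / 2

# Closed-form differential entropy of N(μ, σ^2): deterministic, needs the entropy formula.
closed_form_entropy(σ) = log(σ) + (1 + log(2π)) / 2

# Monte Carlo entropy estimate -E_q[log q(z)]: noisy, needs only samples and logpdf.
function monte_carlo_entropy(rng, μ, σ, n_samples)
    z = μ .+ σ .* randn(rng, n_samples)
    return -mean(normal_logpdf.(z, μ, σ))
end

exact  = closed_form_entropy(2.0)
approx = monte_carlo_entropy(Random.MersenneTwister(1), 0.0, 2.0, 10^5)
```

The two values agree up to Monte Carlo error, which is why the closed form is preferable whenever the variational family provides it.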

The StickingTheLandingEntropy Estimator

The StickingTheLandingEntropy, or STL estimator, is a control variate approach [RWD2017].

AdvancedVI.StickingTheLandingEntropyType
StickingTheLandingEntropy()

The "sticking the landing" entropy estimator[RWD2017].

Requirements

  • The variational approximation q implements logpdf.
  • logpdf(q, η) must be differentiable by the selected AD framework.
source

The STL estimator occasionally results in lower variance when $\pi \approx q_{\lambda}$, and higher variance when $\pi \not\approx q_{\lambda}$. The conditions under which the STL estimator results in lower variance are still an active subject of research.
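A one-dimensional toy case, worked by hand rather than taken from AdvancedVI, may make this variance behavior more tangible. The sketch assumes a target $\pi = \mathcal{N}(0, 1)$, a variational approximation $q = \mathcal{N}(\mu, \sigma^2)$ with $z = \mu + \sigma\epsilon$, and looks only at the gradient with respect to $\mu$; the names `grad_cfe` and `grad_stl` are hypothetical.

```julia
using Random, Statistics

# Per-sample gradient estimates of the ELBO with respect to the location μ.
# Closed-form entropy: only the target term depends on μ, giving d/dz log π(z) = -z.
grad_cfe(ϵ, μ, σ) = -(μ + σ * ϵ)
# STL: d/dz log π(z) - d/dz log q(z), with the parameters inside log q held fixed.
grad_stl(ϵ, μ, σ) = -(μ + σ * ϵ) + ϵ / σ

ϵ = randn(Random.MersenneTwister(1), 10_000)

# At the optimum (μ = 0, σ = 1), q equals π and the STL noise cancels exactly.
var_cfe_opt = var(grad_cfe.(ϵ, 0.0, 1.0))
var_stl_opt = var(grad_stl.(ϵ, 0.0, 1.0))

# Away from the optimum (σ = 0.5), STL is noisier in this toy case.
var_cfe_far = var(grad_cfe.(ϵ, 0.0, 0.5))
var_stl_far = var(grad_stl.(ϵ, 0.0, 0.5))
```

Both estimators have the same expectation, $-\mu$, so they differ only in variance. This is a single hand-picked example and should not be read as a general statement about when STL helps.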

The main downside of the STL estimator is that it needs to evaluate and differentiate the log density of $q_{\lambda}$, logpdf(q), in every iteration. Depending on the variational family, this might be computationally inefficient or even numerically unstable. For example, if $q_{\lambda}$ is a Gaussian with a full-rank covariance, a back-substitution must be performed at every step, making the per-iteration complexity $\mathcal{O}(d^3)$ and reducing numerical stability.

The STL control variate can be used by changing the entropy estimator using the following object:

Let us come back to the example in Examples, where a LogDensityProblem is given as model. In this example, the true posterior is contained within the variational family. This setting is known as "perfect variational family specification." In this case, the RepGradELBO estimator with StickingTheLandingEntropy is the only estimator known to converge exponentially fast ("linear convergence") to the true solution.

Recall that the original ADVI objective with a closed-form entropy (CFE) is given as follows:

n_montecarlo = 16;
 b            = Bijectors.bijector(model);
 binv         = inverse(b)
 
@@ -26,7 +26,7 @@
 
 cfe = AdvancedVI.RepGradELBO(n_montecarlo)
 nothing

The repgradelbo estimator can instead be created as follows:

repgradelbo = AdvancedVI.RepGradELBO(n_montecarlo; entropy = AdvancedVI.StickingTheLandingEntropy());
-nothing

We can see that the noise of the repgradelbo estimator becomes smaller as VI converges. However, the speed of convergence may not always be significantly different. Also, due to noise, just looking at the ELBO may not be sufficient to judge which algorithm is better. This can be made apparent if we measure convergence through the distance to the optimum:

We can see that STL kicks in at later stages of optimization. Therefore, when STL "works", it yields a higher-accuracy solution even with large stepsizes. However, whether STL works or not highly depends on the problem[KMG2024]. Furthermore, in many cases, a low-accuracy solution may be sufficient.

Advanced Usage

There are two major ways to customize the behavior of RepGradELBO:

  • Customize the Distributions functions: rand(q), entropy(q), logpdf(q).
  • Customize AdvancedVI.reparam_with_entropy.

It is generally recommended to customize rand(q), entropy(q), and logpdf(q), since this composes easily with the other functionality provided by AdvancedVI.

The most advanced way is to customize AdvancedVI.reparam_with_entropy. In particular, reparam_with_entropy is the function that invokes rand(q), entropy(q), logpdf(q). Thus, it is the most general way to override the behavior of RepGradELBO.

AdvancedVI.reparam_with_entropyFunction
reparam_with_entropy(rng, q, n_samples, ent_est)

Draw n_samples from q and compute its entropy.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • q: Variational approximation.
  • n_samples::Int: Number of Monte Carlo samples
  • ent_est: The entropy estimation strategy. (See estimate_entropy.)

Returns

  • samples: Monte Carlo samples generated through reparameterization. Their support matches that of the target distribution.
  • entropy: An estimate (or exact value) of the differential entropy of q.
source

To illustrate how we can customize the rand(q) function, we will implement quasi-Monte-Carlo variational inference[BWM2018]. Consider the case where we use the MeanFieldGaussian variational family. In this case, it suffices to override its rand specialization as follows:

using QuasiMonteCarlo
+nothing

We can see that the noise of the repgradelbo estimator becomes smaller as VI converges. However, the speed of convergence may not always be significantly different. Also, due to noise, just looking at the ELBO may not be sufficient to judge which algorithm is better. This can be made apparent if we measure convergence through the distance to the optimum:

We can see that STL kicks in at later stages of optimization. Therefore, when STL "works", it yields a higher-accuracy solution even with large stepsizes. However, whether STL works or not highly depends on the problem[KMG2024]. Furthermore, in many cases, a low-accuracy solution may be sufficient.

Advanced Usage

There are two major ways to customize the behavior of RepGradELBO:

  • Customize the Distributions functions: rand(q), entropy(q), logpdf(q).
  • Customize AdvancedVI.reparam_with_entropy.

It is generally recommended to customize rand(q), entropy(q), and logpdf(q), since this composes easily with the other functionality provided by AdvancedVI.

The most advanced way is to customize AdvancedVI.reparam_with_entropy. In particular, reparam_with_entropy is the function that invokes rand(q), entropy(q), logpdf(q). Thus, it is the most general way to override the behavior of RepGradELBO.

AdvancedVI.reparam_with_entropyFunction
reparam_with_entropy(rng, q, n_samples, ent_est)

Draw n_samples from q and compute its entropy.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • q: Variational approximation.
  • n_samples::Int: Number of Monte Carlo samples
  • ent_est: The entropy estimation strategy. (See estimate_entropy.)

Returns

  • samples: Monte Carlo samples generated through reparameterization. Their support matches that of the target distribution.
  • entropy: An estimate (or exact value) of the differential entropy of q.
source

To illustrate how we can customize the rand(q) function, we will implement quasi-Monte-Carlo variational inference[BWM2018]. Consider the case where we use the MeanFieldGaussian variational family. In this case, it suffices to override its rand specialization as follows:

using QuasiMonteCarlo
 using StatsFuns
 
 qmcrng = SobolSample(R = OwenScramble(base = 2, pad = 32))
@@ -41,4 +41,4 @@
     std_samples  = norminvcdf.(unif_samples)
     scale_diag.*std_samples .+ location
 end
-nothing

(Note that this is a quick-and-dirty example, and there are more sophisticated ways to implement this.)

By plotting the ELBO, we can see the effect of quasi-Monte Carlo. We can see that quasi-Monte Carlo results in much lower variance than naive Monte Carlo. However, similarly to the STL example, just looking at the ELBO is often insufficient to really judge performance. Instead, let's look at the distance to the global optimum:

QMC yields an additional order of magnitude in accuracy. Also, unlike STL, it ever so slightly accelerates convergence. This is because quasi-Monte Carlo reduces variance uniformly, whereas STL reduces variance only near the optimum.

  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • KTRGB2017Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
  • DLTBV2017Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
  • FXTYG2020Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In Symposium on Advances in Approximate Bayesian Inference.
  • RWD2017Roeder, G., Wu, Y., & Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. Advances in Neural Information Processing Systems, 30.
  • KMG2024Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing? In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
  • BWM2018Buchholz, A., Wenzel, F., & Mandt, S. (2018). Quasi-monte carlo variational inference. In International Conference on Machine Learning.
+nothing

(Note that this is a quick-and-dirty example, and there are more sophisticated ways to implement this.)

By plotting the ELBO, we can see the effect of quasi-Monte Carlo. We can see that quasi-Monte Carlo results in much lower variance than naive Monte Carlo. However, similarly to the STL example, just looking at the ELBO is often insufficient to really judge performance. Instead, let's look at the distance to the global optimum:

QMC yields an additional order of magnitude in accuracy. Also, unlike STL, it ever so slightly accelerates convergence. This is because quasi-Monte Carlo reduces variance uniformly, whereas STL reduces variance only near the optimum.
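The reason low-discrepancy points help can be seen on a much simpler problem than VI. The sketch below, in plain Julia without QuasiMonteCarlo.jl, compares a base-2 van der Corput sequence against pseudo-random points for estimating $\int_0^1 u^2 \,\mathrm{d}u = 1/3$. The function name `van_der_corput` and the test integrand are assumptions made for illustration, and this unscrambled sequence is not the scrambled Sobol sampler used in the example.

```julia
using Random, Statistics

# Radical-inverse (van der Corput) sequence in base 2: a 1-D low-discrepancy sequence.
function van_der_corput(i::Integer)
    x, f = 0.0, 0.5
    while i > 0
        x += f * (i & 1)   # append the lowest bit of i after the binary point
        i >>= 1
        f /= 2
    end
    return x
end

integrand(u) = u^2   # test integrand with known integral 1/3 over [0, 1]
n = 256

qmc_est = mean(integrand, van_der_corput.(1:n))
mc_est  = mean(integrand, rand(Random.MersenneTwister(1), n))
```

With the same number of points, the quasi-Monte Carlo estimate typically lands much closer to $1/3$ than the pseudo-random one, because the points fill the unit interval evenly instead of clustering by chance.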

  • TL2014Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
  • RMW2014Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
  • KW2014Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
  • KTRGB2017Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.
  • DLTBV2017Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
  • FXTYG2020Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In Symposium on Advances in Approximate Bayesian Inference.
  • RWD2017Roeder, G., Wu, Y., & Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. Advances in Neural Information Processing Systems, 30.
  • KMG2024Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing? In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
  • BWM2018Buchholz, A., Wenzel, F., & Mandt, S. (2018). Quasi-monte carlo variational inference. In International Conference on Machine Learning.
diff --git a/dev/examples/index.html b/dev/examples/index.html index f75c07f1..ccd6922c 100644 --- a/dev/examples/index.html +++ b/dev/examples/index.html @@ -66,4 +66,4 @@ y = [stat.elbo for stat ∈ stats] plot(t, y, label="BBVI", xlabel="Iteration", ylabel="ELBO") savefig("bbvi_example_elbo.svg") -nothing

Further information can be gathered by defining your own callback!.

The final ELBO can be estimated by calling the objective directly with a different number of Monte Carlo samples as follows:

estimate_objective(objective, q_trans, model; n_samples=10^4)
-0.025986522302263282
+nothing

Further information can be gathered by defining your own callback!.

The final ELBO can be estimated by calling the objective directly with a different number of Monte Carlo samples as follows:

estimate_objective(objective, q_trans, model; n_samples=10^4)
0.020088261982820654
diff --git a/dev/general/index.html b/dev/general/index.html index cdb711cd..10393c9e 100644 --- a/dev/general/index.html +++ b/dev/general/index.html @@ -1,2 +1,2 @@ -General Usage · AdvancedVI.jl

General Usage

Each VI algorithm provides the following:

  1. Variational families supported by each VI algorithm.
  2. A variational objective corresponding to the VI algorithm.

Note that each variational family is subject to its own constraints. Thus, please refer to the documentation of the variational inference algorithm of interest.

Optimizing a Variational Objective

After constructing a variational objective objective and initializing a variational approximation, one can optimize objective by calling optimize:

AdvancedVI.optimizeFunction
optimize(problem, objective, q_init, max_iter, objargs...; kwargs...)

Optimize the variational objective objective targeting the problem problem by estimating (stochastic) gradients.

The trainable parameters in the variational approximation are expected to be extractable through Optimisers.destructure. This requires the variational approximation to be marked as a functor through Functors.@functor.

Arguments

  • objective::AbstractVariationalObjective: Variational Objective.
  • q_init: Initial variational distribution. The variational parameters must be extractable through Optimisers.destructure.
  • max_iter::Int: Maximum number of iterations.
  • objargs...: Arguments to be passed to objective.

Keyword Arguments

  • adtype::ADtypes.AbstractADType: Automatic differentiation backend.
  • optimizer::Optimisers.AbstractRule: Optimizer used for inference. (Default: Adam.)
  • rng::AbstractRNG: Random number generator. (Default: Random.default_rng().)
  • show_progress::Bool: Whether to show the progress bar. (Default: true.)
  • callback: Callback function called after every iteration. See further information below. (Default: nothing.)
  • prog: Progress bar configuration. (Default: ProgressMeter.Progress(n_max_iter; desc="Optimizing", barlen=31, showspeed=true, enabled=prog).)
  • state::NamedTuple: Initial value for the internal state of optimization. Used to warm-start from the state of a previous run. (See the returned values below.)

Returns

  • params: Variational parameters optimizing the variational objective.
  • stats: Statistics gathered during optimization.
  • state: Collection of the final internal states of optimization. This can be used later to warm-start from the last iteration of the corresponding run.

Callback

The callback function callback has a signature of

callback(; stat, state, params, restructure, gradient)

The arguments are as follows:

  • stat: Statistics gathered during the current iteration. The content will vary depending on objective.
  • state: Collection of the internal states used for optimization.
  • params: Variational parameters.
  • restructure: Function that restructures the variational approximation from the variational parameters. Calling restructure(param) reconstructs the variational approximation.
  • gradient: The estimated (possibly stochastic) gradient.

callback can return a NamedTuple containing additional information computed within callback. This will be appended to the statistics of the current iteration. Otherwise, just return nothing.

source
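As an illustration of the callback signature documented above, here is a minimal sketch of a callback that records the norm of the gradient estimate at every iteration. The field name `grad_norm` is a hypothetical choice; only the keyword names come from the documented signature.

```julia
using LinearAlgebra

# Sketch of a callback: receives the documented keyword arguments and returns a
# NamedTuple, which would be appended to the statistics of the current iteration.
function callback(; stat, state, params, restructure, gradient)
    return (grad_norm = norm(gradient),)
end

# Stand-alone call with dummy values, only to show the shape of the return value.
extra = callback(;
    stat = (elbo = -1.0,),
    state = nothing,
    params = [0.0, 0.0],
    restructure = identity,
    gradient = [3.0, 4.0],
)
```

Assuming the keyword described in the argument list, such a function would be passed as `callback = callback` when calling optimize, and the extra field would then appear alongside the per-iteration statistics.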

Estimating the Objective

In some cases, it is useful to directly estimate the objective value. This can be done with the following function:

AdvancedVI.estimate_objectiveFunction
estimate_objective([rng,] obj, q, prob; kwargs...)

Estimate the variational objective obj targeting prob with respect to the variational approximation q.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • q: Variational approximation.

Keyword Arguments

Depending on the objective, additional keyword arguments may apply. Please refer to the respective documentation of each variational objective for more info.

Returns

  • obj_est: Estimate of the objective value.
source
Info

Note that estimate_objective is not expected to be differentiated through, and may not result in optimal statistical performance.

Advanced Usage

Each variational objective is a subtype of the following abstract type:

AdvancedVI.AbstractVariationalObjectiveType
AbstractVariationalObjective

Abstract type for the VI algorithms supported by AdvancedVI.

Implementations

To be supported by AdvancedVI, a VI algorithm must implement AbstractVariationalObjective and estimate_objective. Also, it should provide gradients by implementing the function estimate_gradient!. If the estimator is stateful, it can implement init to initialize the state.

source

Furthermore, AdvancedVI only interacts with each variational objective by querying gradient estimates. Therefore, to create a new custom objective to be optimized through AdvancedVI, it suffices to implement the following function:

AdvancedVI.estimate_gradient!Function
estimate_gradient!(rng, obj, adtype, out, prob, λ, restructure, obj_state)

Estimate (possibly stochastic) gradients of the variational objective obj targeting prob with respect to the variational parameters λ.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • out::DiffResults.MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • λ: Variational parameters to evaluate the gradient on.
  • restructure: Function that reconstructs the variational approximation from λ.
  • obj_state: Previous state of the objective.

Returns

  • out::MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • obj_state: The updated state of the objective.
  • stat::NamedTuple: Statistics and logs generated during estimation.
source

If an objective needs to be stateful, one can implement the following function to initialize the state.

AdvancedVI.initFunction
init(rng, obj, prob, params, restructure)

Initialize a state of the variational objective obj given the initial variational parameters params. This function needs to be implemented only if obj is stateful.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • params: Initial variational parameters.
  • restructure: Function that reconstructs the variational approximation from params.
source
+General Usage · AdvancedVI.jl

General Usage

Each VI algorithm provides the following:

  1. Variational families supported by each VI algorithm.
  2. A variational objective corresponding to the VI algorithm.

Note that each variational family is subject to its own constraints. Thus, please refer to the documentation of the variational inference algorithm of interest.

Optimizing a Variational Objective

After constructing a variational objective objective and initializing a variational approximation, one can optimize objective by calling optimize:

AdvancedVI.optimizeFunction
optimize(problem, objective, q_init, max_iter, objargs...; kwargs...)

Optimize the variational objective objective targeting the problem problem by estimating (stochastic) gradients.

The trainable parameters in the variational approximation are expected to be extractable through Optimisers.destructure. This requires the variational approximation to be marked as a functor through Functors.@functor.

Arguments

  • objective::AbstractVariationalObjective: Variational Objective.
  • q_init: Initial variational distribution. The variational parameters must be extractable through Optimisers.destructure.
  • max_iter::Int: Maximum number of iterations.
  • objargs...: Arguments to be passed to objective.

Keyword Arguments

  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • optimizer::Optimisers.AbstractRule: Optimizer used for inference. (Default: Adam.)
  • rng::AbstractRNG: Random number generator. (Default: Random.default_rng().)
  • show_progress::Bool: Whether to show the progress bar. (Default: true.)
  • callback: Callback function called after every iteration. See further information below. (Default: nothing.)
  • prog: Progress bar configuration. (Default: ProgressMeter.Progress(n_max_iter; desc="Optimizing", barlen=31, showspeed=true, enabled=prog).)
  • state::NamedTuple: Initial value for the internal state of optimization. Used to warm-start from the state of a previous run. (See the returned values below.)

Returns

  • params: Variational parameters optimizing the variational objective.
  • stats: Statistics gathered during optimization.
  • state: Collection of the final internal states of optimization. This can be used later to warm-start from the last iteration of the corresponding run.

Callback

The callback function callback has a signature of

callback(; stat, state, params, restructure, gradient)

The arguments are as follows:

  • stat: Statistics gathered during the current iteration. The content will vary depending on objective.
  • state: Collection of the internal states used for optimization.
  • params: Variational parameters.
  • restructure: Function that restructures the variational approximation from the variational parameters. Calling restructure(params) reconstructs the variational approximation.
  • gradient: The estimated (possibly stochastic) gradient.

callback can return a NamedTuple containing additional information computed within callback. This will be appended to the statistics of the corresponding iteration. Otherwise, it should return nothing.

source
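A minimal sketch of a callback, assuming only the keyword signature documented above. The name record_norms and the returned field names are hypothetical, and the dummy call at the end merely stands in for the values that optimize would pass at each iteration.

```julia
using LinearAlgebra

# Hypothetical callback: reports the norms of the current gradient estimate
# and of the variational parameters as extra per-iteration statistics.
function record_norms(; stat, state, params, restructure, gradient)
    return (gradient_norm = norm(gradient), params_norm = norm(params))
end

# Stand-alone call with dummy inputs, imitating a single iteration.
extra = record_norms(;
    stat = (iteration = 1,),
    state = nothing,
    params = [3.0, 4.0],
    restructure = identity,
    gradient = [0.0, 1.0],
)
```

Such a function would be passed as the callback keyword argument of optimize; a callback that only logs to the console can simply return nothing.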

Estimating the Objective

In some cases, it is useful to directly estimate the objective value. This can be done by the following function:

AdvancedVI.estimate_objectiveFunction
estimate_objective([rng,] obj, q, prob; kwargs...)

Estimate the variational objective obj targeting prob with respect to the variational approximation q.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • q: Variational approximation.

Keyword Arguments

Depending on the objective, additional keyword arguments may apply. Please refer to the respective documentation of each variational objective for more info.

Returns

  • obj_est: Estimate of the objective value.
source
Info

Note that estimate_objective is not expected to be differentiated through, and may not result in optimal statistical performance.

Advanced Usage

Each variational objective is a subtype of the following abstract type:

AdvancedVI.AbstractVariationalObjectiveType
AbstractVariationalObjective

Abstract type for the VI algorithms supported by AdvancedVI.

Implementations

To be supported by AdvancedVI, a VI algorithm must implement AbstractVariationalObjective and estimate_objective. Also, it should provide gradients by implementing the function estimate_gradient!. If the estimator is stateful, it can implement init to initialize the state.

source

Furthermore, AdvancedVI only interacts with each variational objective by querying gradient estimates. Therefore, to create a new custom objective to be optimized through AdvancedVI, it suffices to implement the following function:

AdvancedVI.estimate_gradient!Function
estimate_gradient!(rng, obj, adtype, out, prob, λ, restructure, obj_state)

Estimate (possibly stochastic) gradients of the variational objective obj targeting prob with respect to the variational parameters λ.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • adtype::ADTypes.AbstractADType: Automatic differentiation backend.
  • out::DiffResults.MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • λ: Variational parameters to evaluate the gradient on.
  • restructure: Function that reconstructs the variational approximation from λ.
  • obj_state: Previous state of the objective.

Returns

  • out::MutableDiffResult: Buffer containing the objective value and gradient estimates.
  • obj_state: The updated state of the objective.
  • stat::NamedTuple: Statistics and logs generated during estimation.
source
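As a rough sketch of what such an implementation might look like, assuming the estimate_gradient! signature documented above for this development version: the type QuadraticObjective is hypothetical and stands for the toy objective f(λ) = ‖λ‖²/2, whose gradient is λ itself. It ignores rng, adtype, prob, and restructure, which a real objective would normally use.

```julia
using AdvancedVI, DiffResults, Random

# Hypothetical toy objective f(λ) = ‖λ‖² / 2 (not part of AdvancedVI).
struct QuadraticObjective <: AdvancedVI.AbstractVariationalObjective end

function AdvancedVI.estimate_gradient!(
    rng::Random.AbstractRNG,
    obj::QuadraticObjective,
    adtype,
    out::DiffResults.MutableDiffResult,
    prob,
    λ,
    restructure,
    obj_state,
)
    # Write the objective value and its gradient into the buffer.
    DiffResults.value!(out, sum(abs2, λ) / 2)
    DiffResults.gradient!(out, λ)  # the gradient of ‖λ‖² / 2 is λ
    stat = (objective = DiffResults.value(out),)
    # This toy objective is stateless, so the state is passed through unchanged.
    return out, obj_state, stat
end

# Direct call with a gradient buffer, outside of optimize.
out = DiffResults.GradientResult(zeros(2))
out, obj_state, stat = AdvancedVI.estimate_gradient!(
    Random.default_rng(), QuadraticObjective(), nothing, out, nothing,
    [1.0, 2.0], identity, nothing,
)
```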

If an objective needs to be stateful, one can implement the following function to initialize the state.

AdvancedVI.initFunction
init(rng, obj, prob, params, restructure)

Initialize a state of the variational objective obj given the initial variational parameters params. This function needs to be implemented only if obj is stateful.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • obj::AbstractVariationalObjective: Variational objective.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.
  • params: Initial variational parameters.
  • restructure: Function that reconstructs the variational approximation from params.
source
diff --git a/dev/index.html b/dev/index.html index 5ff09bad..8c8f1648 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -AdvancedVI · AdvancedVI.jl
+AdvancedVI · AdvancedVI.jl
diff --git a/dev/locscale/index.html b/dev/locscale/index.html index 3b67d07e..80ba6e21 100644 --- a/dev/locscale/index.html +++ b/dev/locscale/index.html @@ -2,7 +2,7 @@ Location-Scale Variational Family · AdvancedVI.jl

Location-Scale Variational Family

Introduction

The location-scale variational family is a family of probability distributions whose sampling process can be represented as

\[z \sim q_{\lambda} \qquad\Leftrightarrow\qquad z \stackrel{d}{=} C u + m;\quad u \sim \varphi\]

where $C$ is the scale, $m$ is the location, and $\varphi$ is the base distribution. $m$ and $C$ form the variational parameters $\lambda = (m, C)$ of $q_{\lambda}$. The location-scale family encompasses many practical variational families, which can be instantiated by setting the base distribution of $u$ and the structure of $C$.
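The sampling process can be sketched in plain Julia without AdvancedVI. The values of m and C below are arbitrary illustrations, and a standard normal is assumed as the base distribution $\varphi$.

```julia
using LinearAlgebra, Random

rng = Random.MersenneTwister(1)

m = [1.0, -1.0]                           # location
C = LowerTriangular([2.0 0.0; 0.5 1.0])   # scale (dense, lower-triangular)

u = randn(rng, 2)   # u ~ φ, here a standard normal
z = C * u + m       # z ~ q_λ

# The map is invertible: the base sample can be recovered from z.
u_recovered = C \ (z - m)
```

Replacing C with a Diagonal matrix gives a mean-field structure, while drawing u from a different base distribution changes the family without changing the parameterization.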

The probability density is given by

\[ q_{\lambda}(z) = {|C|}^{-1} \varphi(C^{-1}(z - m))\]

and the entropy is given as

\[ \mathbb{H}(q_{\lambda}) = \mathbb{H}(\varphi) + \log |C|,\]

where $\mathbb{H}(\varphi)$ is the entropy of the base distribution. Notice that $\mathbb{H}(\varphi)$ does not depend on $\lambda = (m, C)$. The derivative of the entropy with respect to $\lambda$ is thus independent of the base distribution.
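
As a sketch of this last point, assuming $|C|$ denotes the absolute value of the determinant of an invertible $C$, the gradients of the entropy follow directly from the expression above:

\[ \nabla_{m} \mathbb{H}(q_{\lambda}) = 0, \qquad \nabla_{C} \mathbb{H}(q_{\lambda}) = \nabla_{C} \log |C| = C^{-\top}.\]

Neither expression involves $\varphi$. For a triangular $C$ with a positive diagonal, $\log |C| = \sum_{i} \log C_{ii}$, so only the diagonal entries receive a nonzero gradient, namely $1 / C_{ii}$.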

Constructors

Note

For stable convergence, the initial scale needs to be sufficiently large and well-conditioned. Initializing scale to have small eigenvalues will often result in initial divergences and numerical instabilities.

AdvancedVI.MvLocationScaleType
MvLocationScale(location, scale, dist) <: ContinuousMultivariateDistribution

The location-scale variational family broadly represents various variational families using location and scale variational parameters.

It generally represents any distribution for which the sampling path can be represented as follows:

  d = length(location)
   u = rand(dist, d)
-  z = scale*u + location
source
AdvancedVI.FullRankGaussianFunction
FullRankGaussian(location, scale; check_args = true)

Construct a Gaussian variational approximation with a dense covariance matrix.

Arguments

  • location::AbstractVector{T}: Mean of the Gaussian.
  • scale::LinearAlgebra.AbstractTriangular{T}: Cholesky factor of the covariance of the Gaussian.

Keyword Arguments

  • check_args: Check the conditioning of the initial scale (default: true).
source
AdvancedVI.MeanFieldGaussianFunction
MeanFieldGaussian(location, scale; check_args = true)

Construct a Gaussian variational approximation with a diagonal covariance matrix.

Arguments

  • location::AbstractVector{T}: Mean of the Gaussian.
  • scale::Diagonal{T}: Diagonal Cholesky factor of the covariance of the Gaussian.

Keyword Arguments

  • check_args: Check the conditioning of the initial scale (default: true).
source

Gaussian Variational Families

using AdvancedVI, LinearAlgebra, Distributions;
+  z = scale*u + location
source
AdvancedVI.FullRankGaussianFunction
FullRankGaussian(location, scale; check_args = true)

Construct a Gaussian variational approximation with a dense covariance matrix.

Arguments

  • location::AbstractVector{T}: Mean of the Gaussian.
  • scale::LinearAlgebra.AbstractTriangular{T}: Cholesky factor of the covariance of the Gaussian.

Keyword Arguments

  • check_args: Check the conditioning of the initial scale (default: true).
source
AdvancedVI.MeanFieldGaussianFunction
MeanFieldGaussian(location, scale; check_args = true)

Construct a Gaussian variational approximation with a diagonal covariance matrix.

Arguments

  • location::AbstractVector{T}: Mean of the Gaussian.
  • scale::Diagonal{T}: Diagonal Cholesky factor of the covariance of the Gaussian.

Keyword Arguments

  • check_args: Check the conditioning of the initial scale (default: true).
source

Gaussian Variational Families

using AdvancedVI, LinearAlgebra, Distributions;
 μ = zeros(2);
 
 L = diagm(ones(2)) |> LowerTriangular;
@@ -28,4 +28,4 @@
 
 # Mean-Field
 L = ones(2) |> Diagonal;
-q = MvLocationScale(μ, L, Laplace())
+q = MvLocationScale(μ, L, Laplace())
diff --git a/dev/search/index.html b/dev/search/index.html index e3b7d2cf..9208c8be 100644 --- a/dev/search/index.html +++ b/dev/search/index.html @@ -1,2 +1,2 @@ -Search · AdvancedVI.jl

Loading search...

    +Search · AdvancedVI.jl

    Loading search...