From 3dde5ee7274e57b80f083c21f8cfe326bb2f7caa Mon Sep 17 00:00:00 2001
From: kkew3
Date: Tue, 27 Aug 2024 23:22:29 +0800
Subject: [PATCH] Added conditioning of cvae variational posterior blog.

---
 ...ditioning-of-cvae-variational-posterior.md | 117 ++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 _posts/2024-08-27-conditioning-of-cvae-variational-posterior.md

diff --git a/_posts/2024-08-27-conditioning-of-cvae-variational-posterior.md b/_posts/2024-08-27-conditioning-of-cvae-variational-posterior.md
new file mode 100644
index 000000000..fee373b39
--- /dev/null
+++ b/_posts/2024-08-27-conditioning-of-cvae-variational-posterior.md
@@ -0,0 +1,117 @@
+---
+layout: post
+title: "Conditioning of the variational posterior in CVAE (Sohn, 2015)"
+date: 2024-08-27 19:03:07 +0800
+tags: math--prob ml--bayes
+---
+
+## Abstract
+
+This post explores alternative ways of conditioning the variational posterior $q$ in CVAE ([Sohn et al. 2015][sohn15]), and concludes that conditioning $q$ on $y$ is important for predictive inference.

[sohn15]: https://papers.nips.cc/paper_files/paper/2015/hash/8d55a249e6baa5c06772297520da2051-Abstract.html

## Introduction

CVAE models the likelihood $p(y \mid x)$ as a continuous mixture over the latent variable $z$:

$$
p(y \mid x) = \int p_\theta(z \mid x) p_\theta(y \mid x,z)\,\mathrm dz\,. \tag{1}
$$

Since (1) is intractable, Sohn et al. instead optimize its evidence lower bound (ELBO):

$$
\mathcal L_{\text{CVAE}}(x,y;\theta,\phi) = \mathbb E_q[\log p_\theta(y \mid x,z) + \log p_\theta(z \mid x) - \log q_\phi(z \mid x,y)]\,. \tag{2}
$$

Here, the variational posterior $q$ conditions on both $x$ and $y$.
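
To make (2) concrete, here is a minimal NumPy sketch of its Monte Carlo estimate on a toy model where the prior $p_\theta(z \mid x)$, the recognition model $q_\phi(z \mid x,y)$, and the decoder $p_\theta(y \mid x,z)$ are univariate Gaussians whose parameters come from hand-picked linear maps. All the functions below are hypothetical stand-ins for trained networks, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

# Hypothetical stand-ins for the networks: fixed linear maps and noise scales.
def prior_params(x):         # parameters of p_theta(z|x)
    return 0.5 * x, 1.0

def posterior_params(x, y):  # parameters of q_phi(z|x,y)
    return 0.5 * x + 0.3 * y, 0.8

def decoder_params(x, z):    # parameters of p_theta(y|x,z)
    return x + z, 0.5

def elbo_estimate(x, y, num_samples=1000):
    """Monte Carlo estimate of Eq. (2)."""
    q_mean, q_std = posterior_params(x, y)
    z = q_mean + q_std * rng.standard_normal(num_samples)  # z ~ q_phi(z|x,y)
    p_mean, p_std = prior_params(x)
    d_mean, d_std = decoder_params(x, z)
    return np.mean(log_normal(y, d_mean, d_std)      # log p_theta(y|x,z)
                   + log_normal(z, p_mean, p_std)    # log p_theta(z|x)
                   - log_normal(z, q_mean, q_std))   # log q_phi(z|x,y)

print(elbo_estimate(x=1.0, y=0.7))
```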

At test time, the authors propose to estimate $p(y \mid x)$ by importance sampling, using the trained variational posterior as the proposal:

$$
p(y \mid x) \approx \frac{1}{S} \sum_{s=1}^S \frac{p_\theta(y \mid x,z_s) p_\theta(z_s \mid x)}{q_\phi(z_s \mid x,y)}\,, \tag{3}
$$

where $z_s \sim q_\phi(z \mid x,y)$.
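
As a sanity check of what (3) computes, here is a small NumPy sketch on the same kind of toy linear-Gaussian model as above; again, all the densities are hypothetical stand-ins for trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def estimate_p_y_given_x(x, y, num_samples=100_000):
    """Importance sampling estimate of p(y|x) as in Eq. (3)."""
    q_mean, q_std = 0.5 * x + 0.3 * y, 0.8                 # q_phi(z|x,y), the proposal
    z = q_mean + q_std * rng.standard_normal(num_samples)  # z_s ~ q_phi(z|x,y)
    log_w = (log_normal(y, x + z, 0.5)                     # log p_theta(y|x,z_s)
             + log_normal(z, 0.5 * x, 1.0)                 # log p_theta(z_s|x)
             - log_normal(z, q_mean, q_std))               # log q_phi(z_s|x,y)
    return np.mean(np.exp(log_w))

print(estimate_p_y_given_x(x=1.0, y=0.7))
```

For this particular toy model, $p(y \mid x)$ is also available in closed form ($y \mid x$ is Gaussian with mean $1.5x$ and variance $1.25$), which makes the estimate easy to check.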

What if $q$ conditions on $x$ only?
This post explores that possibility, and reaches the conclusion that without conditioning on $y$, the optimal $q$ never attains the true posterior $p(z \mid x,y)$, and therefore cannot be better at reducing the variance of the importance sampling estimator in (3).

## Warm up: proving the efficacy of importance sampling

We assume that infinite data is available for learning, and that $q$ comes from a sufficiently flexible family of distributions.
The data are drawn from the joint data distribution $p_D(x,y)$, where the subscript $D$ stresses that this is the data distribution.
We assume that $x$ is continuous and $y$ is discrete.
The goal is to maximize the ELBO in expectation over $p_D(x,y)$.
However, we assume that $p_\theta(y \mid x,z)$ never approaches $p_D(y \mid x)$, no matter what value $\theta$ takes.
We will drop $\theta$ and $\phi$ below for brevity.

We may pose this setup as a constrained maximization problem:
$\max \mathbb E[\log p(y,z \mid x) - \log q(z \mid x,y)]$ subject to $q$ being a probability distribution, where the expectation is taken with respect to $p_D(x,y) q(z \mid x,y)$.

The Lagrangian is:

$$
\int \sum_y p_D(x,y) \int q(z \mid x,y) \log \frac{p(y,z \mid x)}{q(z \mid x,y)}\,\mathrm dz\,\mathrm dx + \int \sum_y \mu(x,y) \left(\int q(z \mid x,y)\,\mathrm dz - 1\right)\,\mathrm dx\,, \tag{4}
$$

where $\mu(x,y)$ is the Lagrange multiplier.
Now take the [Gateaux derivative][gateaux] with respect to $q$ and set it to zero:

$$
0 = p_D(x,y) \left(\log p(y,z \mid x) - (1 + \log q(z \mid x,y))\right) + \mu(x,y)\,.
$$

Absorbing $p_D(x,y) > 0$ and the constant 1 into $\mu(x,y)$ yields:

$$
\log q(z \mid x,y) = \mu(x,y) + \log p(y,z \mid x)\,,
$$

where normalization forces $\mu(x,y) = -\log \int p(y,z \mid x)\,\mathrm dz = -\log p(y \mid x)$.
It thus follows that, at optimum, $q(z \mid x,y) = p(z \mid x,y)$.
Hence, at optimum, every summand on the right hand side of Equation (3) equals the left hand side exactly, so the estimator has zero variance.

[gateaux]: https://www.youtube.com/watch?v=6VvmMkAx5Jc&t=982s
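
This zero-variance property is easy to verify numerically on a fully discrete toy model (the probability tables below are made up purely for illustration): when $q$ equals the true posterior, every importance weight in Equation (3) collapses to the same constant $p(y \mid x)$.

```python
import numpy as np

# Made-up discrete toy model for a single, fixed x: three latent values, two labels.
p_z_given_x = np.array([0.2, 0.5, 0.3])                # p(z|x)
p_y_given_xz = np.array([[0.9, 0.1],                   # p(y|x,z), rows indexed by z
                         [0.4, 0.6],
                         [0.2, 0.8]])

y = 1                                                   # the label we want to score
p_y_given_x = np.sum(p_z_given_x * p_y_given_xz[:, y])  # exact p(y|x)
true_posterior = p_z_given_x * p_y_given_xz[:, y] / p_y_given_x  # p(z|x,y)

# Use the optimal q(z|x,y), i.e. the true posterior.
q = true_posterior
# Importance weights of Eq. (3), evaluated at every possible z.
weights = p_y_given_xz[:, y] * p_z_given_x / q

print(weights)        # every entry equals p(y|x): zero variance
print(p_y_given_x)
```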

## Conditioning only on x gives worse approximation

Following the same setup as in the previous section, we start from the Lagrangian (4).
Note that we now assume $q \triangleq q(z \mid x)$, and that the Lagrange multiplier is $\mu(x)$ instead of $\mu(x,y)$.
Rearranging the terms:

$$
\begin{multline}
  \int p_D(x) \int q(z \mid x) \left(\sum_y p_D(y \mid x) \log p(y,z \mid x) - \log q(z \mid x)\right)\,\mathrm dz\,\mathrm dx \\
  + \int \mu(x) \left(\int q(z \mid x)\,\mathrm dz - 1\right)\,\mathrm dx\,.
\end{multline}
$$

Setting its Gateaux derivative with respect to $q$ to zero:

$$
0 = p_D(x) \left(\sum_y p_D(y \mid x) \log p(y,z \mid x) - (1 + \log q(z \mid x))\right) + \mu(x)\,.
$$

Absorbing $p_D(x) > 0$, the constant 1, and the term $\sum_y p_D(y \mid x) \log p(y \mid x)$ (which does not depend on $z$) into $\mu(x)$ yields:

$$
\log q(z \mid x) = \mu(x) + \sum_y p_D(y \mid x) \log p(z \mid x,y)\,.
$$

We see immediately that:

$$
q(z \mid x) \propto \exp(\mathbb E_{p_D(y \mid x)}[\log p(z \mid x,y)])\,. \tag{5}
$$

That is, the optimal $q(z \mid x)$ is the normalized geometric mixture of the true posteriors $p(z \mid x,y)$ over $y$.
This means that without conditioning on $y$, $q(z \mid x)$ can never reach the true posterior $p(z \mid x,y)$ unless $p_D(y \mid x)$ is degenerate, i.e., its entropy $\mathbb H(p_D(y \mid x)) = -\sum_y p_D(y \mid x) \log p_D(y \mid x)$ is zero, which is unlikely to occur in practice.
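
A small discrete example (again with made-up probability tables, for illustration only) makes this concrete: the $q(z \mid x)$ given by Equation (5) maximizes the per-$x$ objective of this section, yet in this example it matches neither of the true posteriors $p(z \mid x,y)$, since $p_D(y \mid x)$ has nonzero entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up discrete toy model for a single, fixed x.
p_z_given_x = np.array([0.2, 0.5, 0.3])                 # p(z|x)
p_y_given_xz = np.array([[0.9, 0.1],                    # p(y|x,z), rows indexed by z
                         [0.4, 0.6],
                         [0.2, 0.8]])
p_d_y_given_x = np.array([0.3, 0.7])                    # data distribution p_D(y|x)

p_yz_given_x = p_z_given_x[:, None] * p_y_given_xz      # model joint p(y,z|x)
posteriors = p_yz_given_x / p_yz_given_x.sum(axis=0)    # p(z|x,y), one column per y

# Equation (5): normalized geometric mixture of the true posteriors.
q_star = np.exp(np.log(posteriors) @ p_d_y_given_x)
q_star /= q_star.sum()

def objective(q):
    """E over p_D(y|x) q(z|x) of [log p(y,z|x) - log q(z|x)], for this fixed x."""
    per_y = (q[:, None] * (np.log(p_yz_given_x) - np.log(q)[:, None])).sum(axis=0)
    return np.sum(p_d_y_given_x * per_y)

# q_star should beat any other distribution over z, e.g. random ones.
others = rng.dirichlet(np.ones(3), size=1000)
print(objective(q_star) >= max(objective(q) for q in others))  # True
print(q_star)        # differs from both columns of `posteriors`
print(posteriors)
```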