
Add LKJ distribution for Cholesky factor of correlation matrix #1336

Closed
sethaxen opened this issue Jun 1, 2021 · 14 comments

@sethaxen
Contributor

sethaxen commented Jun 1, 2021

Distributions already has an LKJ(d, η) whose support is the set of d × d correlation matrices. More useful for probabilistic programming is the equivalent LKJ distribution whose support is the corresponding Cholesky factor L (or, equivalently, U), because the Cholesky factor can then be sampled and used directly, e.g. in the multivariate normal density, without needing to compute a Cholesky decomposition.

There are d(d-1)/2 unique elements in the strict lower triangle tril₋(R) of a correlation matrix R = L L^T. Likewise, there are d(d-1)/2 unique elements in the strict lower triangle tril₋(L) of its Cholesky factor L. The map ϕ: tril₋(R) → tril₋(L) is bijective, and the log-determinant of its Jacobian is logdet(J) = \sum_{i=2}^d (i - d) \log(L_{ii}). The resulting logkernel is adjusted only by -logdet(J) and works out to logkernel(L) = \sum_{i=2}^d (d - i + 2η - 2) \log(L_{ii}); logc0 is unchanged. As a result, we can trivially adapt the code from LKJ to define this distribution.
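For concreteness, a minimal sketch of that adjusted log kernel, assuming the factor is passed as a LowerTriangular matrix (the function name is illustrative, not a proposed API):

```julia
using LinearAlgebra

# logkernel(L) = ∑_{i=2}^d (d - i + 2η - 2) log(L_ii), per the derivation above
function lkj_cholesky_logkernel(L::LowerTriangular, η::Real)
    d = size(L, 1)
    return sum(i -> (d - i + 2η - 2) * log(L[i, i]), 2:d)
end
```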

It would also be useful to have a convenient constructor for MvNormal that could take a mean vector, the Cholesky factor of a correlation matrix, and a vector of standard deviations, or, alternatively, a mean vector and the Cholesky factor of the covariance matrix.

Related work

AltDistributions.jl implements this distribution as LKJL (however, it pre-dates LKJ in this package). Likewise, this distribution is defined in several probabilistic programming languages, notably Stan (see documentation and implementation).

See also discussions in TuringLang/Turing.jl#1629 (comment) and TuringLang/Bijectors.jl#134 (comment).

@devmotion
Member

> It would also be useful to have a convenient constructor for MvNormal that could take a mean vector, the Cholesky factor of a correlation matrix, and a vector of standard deviations, or, alternatively, a mean vector and the Cholesky factor of the covariance matrix.

Isn't this already possible? MvNormal supports arbitrary AbstractPDMats as covariance matrix and even converts matrices to AbstractPDMat. And PDMats seems to support constructing a PDMat from a Cholesky decomposition: https://github.com/JuliaStats/PDMats.jl/blob/84bdd885a4e9f22a1832e63f54894760c632b312/src/pdmat.jl#L21

@sethaxen
Contributor Author

sethaxen commented Jun 1, 2021

> Isn't this already possible? MvNormal supports arbitrary AbstractPDMats as covariance matrix and even converts matrices to AbstractPDMat. And PDMats seems to support constructing a PDMat from a Cholesky decomposition: https://github.com/JuliaStats/PDMats.jl/blob/84bdd885a4e9f22a1832e63f54894760c632b312/src/pdmat.jl#L21

Yes, you're right. But right now, if a user has μ::Vector, σ::Vector, and L::LowerTriangular, they need to do something like

```julia
using PDMats
MvNormal(μ, PDMat(Cholesky(σ .* L.data, 'L', 0)))
```

This isn't ideal for several reasons. First, it requires the user to spell out the low-level arguments of the Cholesky constructor (on Julia 1.7, they can do MvNormal(μ, PDMat(Cholesky(σ .* L))), which is a little better). Second, it requires them to know how the Cholesky factor of the correlation matrix is related to the covariance matrix, which isn't terrible, but it does make it harder to use. Third, depending on the application, it may be more efficient not to multiply σ by L at all.
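To make that concrete, here is the workaround end to end with made-up values; σ .* L.data scales row i of L by σ[i], which is exactly the lower Cholesky factor of the covariance matrix Diagonal(σ) * L * L' * Diagonal(σ):

```julia
using Distributions, LinearAlgebra, PDMats

μ = zeros(3)
σ = [1.0, 2.0, 0.5]
# a valid correlation-matrix Cholesky factor: each row has unit norm
L = LowerTriangular([1.0  0.0  0.0;
                     0.3  sqrt(1 - 0.3^2)  0.0;
                     0.2  0.1  sqrt(1 - 0.2^2 - 0.1^2)])

# σ .* L.data scales row i by σ[i], giving the covariance Cholesky factor
d = MvNormal(μ, PDMat(Cholesky(σ .* L.data, 'L', 0)))
rand(d)
```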

A new MvNormal constructor could make this simpler, but perhaps a better alternative is a new AbstractPDMat type.

@devmotion
Member

> perhaps a better alternative is a new AbstractPDMat type.

Yes, I think this would be the better approach since it is more general and hence could also be used for other AbstractMvNormal distributions. Additionally, it would not increase the complexity of the MvNormal constructors any further.

One also has to think about the VariateForm for such a distribution (maybe just CholeskyVariate?), since its samples would not be of type Real or Array, unlike the samples of the distributions with an ArrayLikeVariate VariateForm (e.g., Univariate, Multivariate, and Matrixvariate) currently implemented in Distributions.
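For illustration, such a variate form could be as simple as the following (names are hypothetical, nothing here is a settled API):

```julia
using Distributions

# a new variate form for samples that are Cholesky objects rather than
# Reals or Arrays
struct CholeskyVariate <: Distributions.VariateForm end

# a distribution over Cholesky factors of correlation matrices could then
# be declared along the lines of
#     LKJCholesky{T<:Real} <: Distribution{CholeskyVariate,Continuous}
```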

@sethaxen
Contributor Author

sethaxen commented Jun 1, 2021

> Yes, I think this would be the better approach since it is more general and hence could also be used for other AbstractMvNormal distributions. Additionally, it would not increase the complexity of the MvNormal constructors any further.

Yes, these are good points.

> One also has to think about the VariateForm for such a distribution (maybe just CholeskyVariate?), since its samples would not be of type Real or Array, unlike the samples of the distributions with an ArrayLikeVariate VariateForm (e.g., Univariate, Multivariate, and Matrixvariate) currently implemented in Distributions.

My idea was just to add a distribution whose VariateForm is Matrixvariate, to avoid adding a more complicated VariateForm here. So random samples would be Matrixes (or perhaps LowerTriangular, and also UpperTriangular if we have an uplo field). It could be useful if random samples were Cholesky types, but I'm also thinking of how this would be used in PPLs, and that could cause confusion.

@devmotion
Member

> So random samples would be Matrixes (or perhaps LowerTriangular, and also UpperTriangular if we have an uplo field).

IIRC ArrayLikeVariate is supposed to cover only samples of type Real and Array, so Matrix would be covered but the special triangular matrix types would not be. IMO it would be a bit unfortunate to use Matrix and not exploit the special matrix structure.

> that could cause confusion.

What problems do you have in mind?

@johnczito
Member

At some point I sketched out some types I called CholeskyDecomposedMatrixDistribution, SVDecomposedMatrixDistribution, etc. They were just wrappers for the MatrixDistribution types that made sure logpdf handled the jacobian and rand returned instances of SVD, Cholesky, whatever. In general rand was totally naive, but it could be overloaded on a distribution-by-distribution basis if you had a way to sample the decomposition directly.

Would that be helpful here? Or is the best solution a completely separate type?
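For reference, a bare-bones sketch of the wrapper idea described above (names and details illustrative, not an actual Distributions.jl type):

```julia
using Distributions, LinearAlgebra

# wraps a matrix distribution; rand returns a Cholesky instead of a Matrix
struct CholeskyDecomposedMatrixDistribution{D<:MatrixDistribution}
    dist::D
end

# totally naive fallback: sample the matrix, then factorize; a wrapped
# distribution that can sample the factor directly would overload this
Base.rand(d::CholeskyDecomposedMatrixDistribution) =
    cholesky(Symmetric(rand(d.dist)))

# e.g. rand(CholeskyDecomposedMatrixDistribution(LKJ(4, 2.0))) isa Cholesky
```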

@sethaxen
Contributor Author

sethaxen commented Jun 2, 2021

> IIRC ArrayLikeVariate is supposed to cover only samples of type Real and Array, so Matrix would be covered but the special triangular matrix types would not be.

Oh, that is unfortunate. We should definitely confirm this though, because it would be nice if we could use ArrayLikeVariate.

> IMO it would be a bit unfortunate to use Matrix and not exploit the special matrix structure.

Yes, but LowerTriangular uses dense storage anyway, so even if we returned a lower-triangular Matrix, a user could wrap it in LowerTriangular when doing something like a triangular solve.
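For instance, the wrapper is free and only changes dispatch:

```julia
using LinearAlgebra

M = [1.0 0.0; 0.5 2.0]   # lower-triangular data stored in a dense Matrix
L = LowerTriangular(M)   # zero-copy wrapper around the same storage
L \ [1.0, 1.0]           # dispatches to forward substitution, no factorization
```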

> What problems do you have in mind?

I'm thinking of a user who comes from, e.g., Stan, PyMC3, or one of the other popular PPLs and uses an LKJCholesky distribution assuming it behaves like Stan's equivalent (i.e., is supported on lower-triangular Cholesky factors). They multiply a sample by a vector, but if the sample is a Cholesky, IIRC it multiplies like the correlation matrix, not its Cholesky factor. This could also be jarring because it would be the only distribution whose support is neither a scalar nor an array. One would then need a bijector that maps from a Cholesky to, ideally, a vector, which is even more complicated than mapping from a matrix to a vector, which is not (yet) supported. These are not major problems, but they are downsides.

And in fact, while writing this, I'm liking more the idea of an LKJCholesky whose samples are Cholesky objects.

> At some point I sketched out some types I called CholeskyDecomposedMatrixDistribution, SVDecomposedMatrixDistribution, etc. They were just wrappers for the MatrixDistribution types that made sure logpdf handled the jacobian and rand returned instances of SVD, Cholesky, whatever. In general rand was totally naive, but it could be overloaded on a distribution-by-distribution basis if you had a way to sample the decomposition directly.

> Would that be helpful here? Or is the best solution a completely separate type?

Good question. I lean towards a completely separate type for simplicity, unless there are ample use cases for these special wrappers. E.g., the only three uses I know of for CholeskyDecomposedMatrixDistribution would be Wishart, InverseWishart, and LKJ, and I'm not certain how high the demand would be for Cholesky parameterizations of Wishart/InverseWishart.

@johnczito
Member

> I'm liking more the idea of an LKJCholesky whose samples are Cholesky objects.

Second this.

> unless there are ample use cases for these special wrappers. E.g., the only three uses I know of...would be Wishart, InverseWishart, and LKJ...

We also have MatrixFDist and MatrixBeta, which are random posdef matrices. Not that they necessarily rise to the level of "use cases" because I doubt they're much used at this point.

> I lean towards a completely separate type for simplicity

Right. I can see how the wrapper construction might just get in the way in the PPL context. Not that these things are mutually exclusive.

Anyhow, I read but did not really understand the Stan source code for sampling LKJ-Cholesky directly (here and here). The comments say "The implementation is Ben Goodrich's Cholesky factor-based approach to the C-vine method of [LKJ (JMA 2009)]." Is there a write-up of this anyplace, or would we just have to port the Stan code? Does that pose a licensing issue?

@sethaxen
Contributor Author

sethaxen commented Jun 2, 2021

> > I'm liking more the idea of an LKJCholesky whose samples are Cholesky objects.
>
> Second this.

If we go this route, we should definitely have something like a CholeskyVariateForm.

> We also have MatrixFDist and MatrixBeta, which are random posdef matrices. Not that they necessarily rise to the level of "use cases" because I doubt they're much used at this point.
>
> > I lean towards a completely separate type for simplicity
>
> Right. I can see how the wrapper construction might just get in the way in the PPL context. Not that these things are mutually exclusive.

Yeah, I guess my real priority is that there be a simple constructor. But even defining LKJCholesky as an alias for something like CholeskyDistribution{<:LKJ} with a convenient constructor gets us that, so I don't have a major preference here.

> Anyhow, I read but did not really understand the Stan source code for sampling LKJ-Cholesky directly (here and here). The comments say "The implementation is Ben Goodrich's Cholesky factor-based approach to the C-vine method of [LKJ (JMA 2009)]." Is there a write-up of this anyplace, or would we just have to port the Stan code? Does that pose a licensing issue?

I think it's just using part of the algorithm in the paper to generate random canonical partial correlations, and then, instead of pushing those through a map to a correlation matrix, it pushes them through a map directly to the Cholesky factor. I think that's what their read_corr_L function is doing. I don't know of a write-up anywhere, but I'll look for one. I'd like to understand how it works, so I may implement it from scratch, but I think we could also port it. Stan math uses a BSD-3 license, which is just slightly more restrictive than MIT (see https://opensource.stackexchange.com/a/582), so I think we would just need to put that code under a BSD-3 license.
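Here is a rough sketch of my reading of that approach, based on the paper and the Stan docs rather than the Stan source, so the details should be double-checked:

```julia
using Distributions, LinearAlgebra

# Draw a lower Cholesky factor of a d × d LKJ(η) correlation matrix by
# sampling canonical partial correlations (C-vine) and mapping them
# directly onto the rows of L, which keeps each row at unit norm.
function rand_lkj_cholesky(d::Int, η::Real)
    L = zeros(d, d)
    L[1, 1] = 1
    for i in 2:d
        rem = 1.0  # squared norm not yet allocated in row i
        for j in 1:(i - 1)
            # partial correlation in vine tree j: 2 * Beta(β, β) - 1 on (-1, 1)
            β = η + (d - 1 - j) / 2
            z = 2 * rand(Beta(β, β)) - 1
            L[i, j] = z * sqrt(rem)
            rem -= L[i, j]^2
        end
        L[i, i] = sqrt(rem)
    end
    return LowerTriangular(L)
end
```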

@devmotion
Member

Can't it just be based on the paper and the documentation to avoid any licensing issues?

@johnczito
Member

> But even defining LKJCholesky as an alias for something like CholeskyDistribution{<:LKJ} with a convenient constructor gets us that, so I don't have a major preference here.

Good to know.

> ...it pushes them through a map directly to the Cholesky factor. I think that's what their read_corr_L function is doing.

Yeah, that's the bit I don't understand yet. And note that Distributions does not include any of that partial correlation, vine stuff. It currently uses the onion sampler.

@devmotion
Member

The bijective mapping in Bijectors is based on the partial correlations: https://github.com/TuringLang/Bijectors.jl/blob/master/src/bijectors/corr.jl

It's documented in the docstrings and based on https://mc-stan.org/docs/2_26/reference-manual/correlation-matrix-transform-section.html (IIRC one formula in the Stan docs was incorrect though).
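For completeness, a sketch of the forward transform that section describes (unconstrained vector to canonical partial correlations via tanh, then to the Cholesky factor via the same row recursion as in the sampler sketch above; illustrative, not Bijectors.jl's actual code):

```julia
using LinearAlgebra

function corr_cholesky_from_unconstrained(y::AbstractVector, d::Int)
    @assert length(y) == d * (d - 1) ÷ 2
    L = zeros(eltype(y), d, d)
    L[1, 1] = 1
    k = 0
    for i in 2:d
        rem = one(eltype(y))
        for j in 1:(i - 1)
            z = tanh(y[k += 1])  # canonical partial correlation in (-1, 1)
            L[i, j] = z * sqrt(rem)
            rem -= L[i, j]^2
        end
        L[i, i] = sqrt(rem)
    end
    return LowerTriangular(L)
end
```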

@johnczito
Member

> It's documented in the docstrings and based on https://mc-stan.org/docs/2_26/reference-manual/correlation-matrix-transform-section.html (IIRC one formula in the Stan docs was incorrect though).

Just what the doctor ordered. Thanks!

@sethaxen
Contributor Author

sethaxen commented Jun 3, 2021

Great! I'll try to put together a PR implementing this next week. Also, after conversations with @cscherrer in JuliaMath/MeasureTheory.jl#101 (comment), I think it's worthwhile to include an uplo parameter specifying whether the Cholesky factor is stored in the upper or lower part of the Cholesky object, so that a power user can avoid an unnecessary copy. (cholesky always uses uplo='U', but functions that use Cholesky objects allow both.) I suspect uplo='L' is more useful for most users.
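A quick illustration of the copy in question, if I'm reading LinearAlgebra's behavior right:

```julia
using LinearAlgebra

C = cholesky([4.0 2.0; 2.0 3.0])
C.uplo  # 'U': cholesky stores the factor in the upper triangle
C.U     # no copy: wraps the stored factors directly
C.L     # materializes the transposed factor, i.e. an extra copy
```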
