Feature request: Add LKJCholesky #1629
Comments
There is already a related open issue in Bijectors: TuringLang/Bijectors.jl#134 (comment)
I think this is somewhat more involved than the suggestion in TuringLang/Bijectors.jl#134 (comment). I do not think it can be resolved only by changing/adding a bijector. It seems we are missing 3 things: (1) an `LKJCholesky` distribution in Distributions, (2) a corresponding bijector in Bijectors, and (3) support for Cholesky factors in the downstream distributions (e.g. `MvNormal`). This would enable a user to do something like

```julia
using Turing, PDMats

@model function gdemo(y, N = length(y))
    L ~ LKJCholesky(N, 2)
    σ ~ filldist(truncated(Normal(), 0, Inf), N)
    y ~ MvNormal(PDCholMat(L, σ))
end
```
Sure, it requires the LKJCholesky distribution and the bijector, and support for Cholesky factors in whatever distribution you want to use the samples with. The quick comment was merely to point out that this has been discussed before (also in some of the linked issues, IIRC) and would also require a specific bijector. It would probably also require a more general variable structure.
Okay, well in the meantime I'll try to push things along on the Distributions side, so that when the DynamicPPL/Bijectors work is finished, it's ready to use.
Yep, I'm working on this as we speak. Also, just to set the record straight: we don't require that the input and output are the same, but strange things can happen if you start breaking this assumption in compositions, batches, etc.
It might be a good idea to check MeasureTheory.jl about this. Since it's optimized to work specifically with PPLs, it might have some version of this already in place.
It's mentioned in the OP, together with a link to JuliaMath/MeasureTheory.jl#100.
In JuliaStats/Distributions.jl#1336 we're discussing possibly defining an `LKJCholesky` distribution.
No, it's not natively supported, but I would prefer samples of type `Cholesky`.
I created JuliaStats/PDMats.jl#132 to discuss modifications to PDMats for easy creation of a positive definite matrix from a Cholesky factor.
JuliaStats/Distributions.jl#1339 was just merged.
Oh, I missed that this was not the issue in Distributions. I'll reopen 😄
LKJCholesky is available in Distributions 0.25.20. However, as discussed above, support for it in Turing requires (probably major) changes in Bijectors and DynamicPPL, so I assume it will take a while until it is supported.
Really nice updates! @devmotion, I'm slightly surprised that we need changes on the Turing side to support new features in Distributions. Do you know what changes are needed in DynamicPPL/Bijectors to support this?
@mgmverburg was able to get something working in https://github.com/mgmverburg/Turing_examples/blob/f639302c5b28ecc30f6b05e90b9f95adf97be027/Utilities/lkjcholesky.jl#L277-L385. Is that too hackish to adapt for this specific distribution? I agree that ideally we'd have a general solution for distributions whose support is not an array (or is an array with different dimensionality than the distribution), but we've gotten by without that so far, and support for this distribution in Turing seems to be in high demand.
This distribution operates with and samples `Cholesky` objects rather than plain arrays.
I assume @torfjelde's rewrite of Bijectors will address and solve the problems with Bijectors. And I guess a probably more generally useful approach in VarInfo would be to perform linearization and mapping to unconstrained spaces only when explicitly requested and needed, and hence decouple it a bit more from Bijectors and from the assumption that every sample is an array.
@sethaxen It is definitely possible to ignore the current assumptions in the Bijectors API and just implement a bijection by reusing the existing code for the LKJ bijector.
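For reference, here is a minimal sketch of such a transform, independent of the Bijectors.jl API and assuming Stan's canonical-partial-correlation construction; `vec_to_corr_cholesky` is a hypothetical name, not an existing function:

```julia
using LinearAlgebra

# Sketch: map an unconstrained vector z of length K*(K-1)÷2 to the lower
# Cholesky factor L of a K×K correlation matrix, following Stan's
# canonical-partial-correlation construction.
function vec_to_corr_cholesky(z::AbstractVector, K::Int)
    w = tanh.(z)                     # partial correlations in (-1, 1)
    L = zeros(eltype(w), K, K)
    L[1, 1] = 1
    idx = 1
    for i in 2:K
        remainder = one(eltype(w))   # squared norm still available in row i
        for j in 1:(i - 1)
            L[i, j] = w[idx] * sqrt(remainder)
            remainder -= L[i, j]^2
            idx += 1
        end
        L[i, i] = sqrt(remainder)    # rows have unit norm, so L*L' has unit diagonal
    end
    return LowerTriangular(L)
end
```

A complete bijector would additionally supply the inverse map and the log-determinant of the Jacobian.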
Yes, @phipsgabler and I are currently discussing this in the AbstractPPL repo.
Perhaps for the short term, Turing could include non-exported functions for (1) defining the number of unconstrained variables in a distribution and (2) mapping a vector of unconstrained variables to a sample; (2) is more or less handled by Bijectors already. This way, a user who wants speed could manually create the vector of unconstrained variables, drawn from a flat prior, and transform it themselves (see the sketch below).
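A minimal sketch of that workaround, reusing the hypothetical `vec_to_corr_cholesky` helper above; `Flat` is Turing's improper flat prior, and the log-Jacobian of the transform is deliberately omitted:

```julia
using Turing, LinearAlgebra

# Workaround sketch: draw unconstrained variables from a flat prior and
# transform them manually; vec_to_corr_cholesky is the hypothetical helper
# sketched above.
@model function manual_lkj(y, K = size(y, 1))
    z ~ filldist(Turing.Flat(), K * (K - 1) ÷ 2)   # unconstrained variables
    L = vec_to_corr_cholesky(z, K)
    # Unnormalized LKJCholesky(K, η = 2) log kernel; a complete version would
    # also add the log-Jacobian of the z -> L transform.
    Turing.@addlogprob! sum((K - i + 2) * log(L[i, i]) for i in 2:K)
    for j in axes(y, 2)
        y[:, j] ~ MvNormal(zeros(K), Symmetric(L * L'))
    end
end
```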
@sethaxen Is this now fixed?
No. Distributions now has `LKJCholesky`, but Turing does not yet support it.
@devmotion any updates on this?
Duplicate of #1870
Issues
Hello, this issue is related to several other reported issues.
I have spent many hours tweaking more than 10 variants of the model, reading many of Stan's and PyMC3's docs, and posting the issue on Julia's Discourse. I believe the main issue that I, and many other people (see the issues linked above), face is that the `LKJ` distribution is numerically unstable. As per 24.1 LKJ Correlation Distribution | Stan Functions Reference, with reference to the `LKJ` distribution (without Cholesky factorisation), it is much better computationally to work with the Cholesky factor directly.

Many people in Turing, myself included, have been using `LKJ` directly without the Cholesky decomposition, resulting in those issues. I followed some of the workarounds in those issues, such as using `Symmetric`, but they don't solve the problem for more complex models.

We need a Cholesky-factored version of `LKJ`, something called `LKJCholesky` or a similar name. It's implemented in Stan, PyMC3, and also Soss.jl:

- `lkj_corr_cholesky`: 24.2 Cholesky LKJ Correlation Distribution | Stan Functions Reference
- `pymc3.LKJCholeskyCov`: LKJ Cholesky Covariance Priors for Multivariate Normal Models — PyMC3 3.10.0 documentation
- `LKJL` (implemented in MeasureTheory.jl): cscherrer/MeasureTheory.jl
LKJ Cholesky
For example, Stan's implementation of `lkj_corr_cholesky` is described in:

- 24 Correlation Matrix Distributions | Stan Functions Reference
- 24.1 LKJ Correlation Distribution | Stan Functions Reference, with reference to the `LKJ` distribution (without Cholesky factorisation)
- 24.2 Cholesky LKJ Correlation Distribution | Stan Functions Reference
So instead of the "raw version", a slightly better version uses the code by Seth Axen, which relies on the `quad_form_diag` function implemented in Stan. Better still is Stan/PyMC3's version with LKJ Cholesky and reparameterisation (LKJ Cholesky can also be used without reparameterisation); see, for example, Stan's implementation. A sketch of the `quad_form_diag` variant follows.
If Turing implements `LKJCholesky`, it should look something like the sketch below, or like a Soss.jl implementation (a full example was coded up by Seth Axen using Soss.jl).
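A hedged sketch of what that could look like, assuming `LKJCholesky` samples a `Cholesky` object as in Distributions; the way the covariance factor is assembled here is an assumption, not an agreed-upon API:

```julia
using Turing, LinearAlgebra, PDMats

# Sketch only: scale the correlation Cholesky factor by σ to get the Cholesky
# factor of the covariance, then pass it to MvNormal via PDMats.
@model function gdemo_lkjchol(y, K = size(y, 1))
    F ~ LKJCholesky(K, 2)                          # F isa Cholesky
    σ ~ filldist(truncated(Normal(), 0, Inf), K)
    Σ_L = LowerTriangular(Diagonal(σ) * F.L)       # Cholesky factor of the covariance
    Σ = PDMat(Cholesky(Σ_L))                       # no re-factorization needed
    for j in axes(y, 2)
        y[:, j] ~ MvNormal(zeros(K), Σ)
    end
end
```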
Performance
Comparing Stan's implementation with the current use of `LKJ` in Turing (parts of my issues are posted here): the Stan model, sampling 4 chains, finished in about 6 seconds per chain (total = warm-up + sampling). Summarized elapsed time:
Stan model
Stan code saved in “nonCentered.stan”
Sampling Stan model
Turing's reference model
To implement it in Turing, I found a similar model here: StatisticalRethinkingJulia/TuringModels.jl - non-centered chimpanzees.md. The model I implemented to replicate Stan's model above followed the same non-centered pattern (a generic sketch follows).
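Not the actual chimpanzees model, but a generic hedged sketch of the non-centered pattern, with all names illustrative:

```julia
using Turing, LinearAlgebra

# Non-centered parameterisation: draw standard-normal z-scores and build
# correlated effects β from the Cholesky factor of the correlation matrix.
@model function noncentered(y, K = size(y, 1), J = size(y, 2))
    Ω ~ LKJ(K, 2)
    σ ~ filldist(truncated(Normal(), 0, Inf), K)
    z ~ filldist(Normal(), K, J)                   # standard-normal z-scores
    # This cholesky call is exactly where PosDefException tends to strike.
    L = cholesky(Symmetric(Ω)).L
    β = Diagonal(σ) * (L * z)                      # columns β[:, j] ~ MvNormal(0, Σ)
    for j in 1:J
        y[:, j] ~ MvNormal(β[:, j], Diagonal(ones(K)))  # unit observation noise
    end
end
```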
To implement the Stan model, I also referred to @sethaxen's implementation above and the model from TuringModels.jl. Again, I coded up more than 10 variants, but hit the same issues.

I am looking into defining a custom distribution in Turing as per Advanced Usage - How to Define a Customized Distribution. With reference to Stan's implementation of `lkj_corr_cholesky`, I found these:

- stan-dev/math - lkj_corr_cholesky_log.hpp
- stan-dev/math - lkj_corr_cholesky_rng.hpp
- stan-dev/math - lkj_corr_cholesky_lpdf.hpp
But everything is written in C/C++, which I have not learned at all. It seems like I am stuck.
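The lpdf itself boils down to a short formula, though. Here is a hedged Julia port of the unnormalized log density, based on my reading of the Stan reference for `lkj_corr_cholesky` rather than Stan's C++ code (normalizing constant omitted):

```julia
using LinearAlgebra

# Unnormalized log density of the LKJ-Cholesky distribution with shape η for a
# K×K lower-triangular Cholesky factor L of a correlation matrix:
#   log p(L | η) = Σ_{k=2}^{K} (K - k + 2η - 2) * log L[k, k] + const.
function lkj_corr_cholesky_logkernel(L::LowerTriangular, η::Real)
    K = size(L, 1)
    return sum((K - k + 2η - 2) * log(L[k, k]) for k in 2:K)
end
```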
An implementation of `LKJCholesky` in Turing would help solve many of these issues. With that implementation, I can help write a tutorial for Turing at Tutorials. It's one of the most commonly used distributions for modeling covariance matrices in multivariate Gaussians, and it seems like many Turing users are using `LKJ` directly. It's also good practice to let other Turing and Julia PPL users know that `LKJCholesky` is a much better version, as noted by Stan with reference to the `LKJ` distribution (without Cholesky factorisation).