adapt a diagonal mass matrix by default #35
Here is some discussion indicating this may not be the case. I'd really like the ability to specify or update the mass matrix, along the lines of this suggestion from 5 years ago; AFAIK the Stan folks haven't yet gotten to it. :)
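For concreteness, a minimal sketch of what specifying M⁻¹ by hand might look like, using only the `GaussianKE(Minv::AbstractMatrix)` constructor quoted in the next comment; `pilot_draws` is a hypothetical matrix of draws from a pilot run, and whether `GaussianKE` is reachable without qualification is an assumption. How the resulting kinetic energy would be threaded into `NUTS_init_tune_mcmc` is exactly the open question here.

```julia
using DynamicHMC, LinearAlgebra, Statistics

# `pilot_draws` is a hypothetical n×dim matrix of positions from a pilot run.
pilot_draws = randn(200, 94)
# Diagonal M⁻¹ from the per-coordinate variances, stored densely so it fits the
# GaussianKE(Minv::AbstractMatrix) constructor quoted in the next comment.
Minv = Symmetric(Matrix(Diagonal(vec(var(pilot_draws; dims = 1)))))
κ = GaussianKE(Minv)
# Passing κ (or an updated Minv) into NUTS_init_tune_mcmc is the missing piece
# this issue asks for.
```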
A little off topic, but I think the mass matrix could be made a little more efficient.

```julia
GaussianKE(Minv::AbstractMatrix) = GaussianKE(Minv, cholesky(inv(Minv)).L)
rand(rng::AbstractRNG, κ::GaussianKE, q = nothing) = κ.W * randn(rng, size(κ.W, 1))
```

this could be:

```julia
GaussianKE(Minv::AbstractMatrix) = GaussianKE(Minv, inv(cholesky(Minv).U))
rand(rng::AbstractRNG, κ::GaussianKE, q = nothing) = κ.W * randn(rng, size(κ.W, 1))
```

The former `inv` calculates the Cholesky decomposition and then `potri!`, which essentially calculates the triangular inverse while multiplying the result by its own transpose. Doing just the inverse of a triangular matrix will generally be faster. Of course, it only gets called once, so the impact on performance is going to be negligible.
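For what it's worth, a quick sanity check (mine, not code from the package) that the two constructions give an equivalent factor, i.e. a W with W * W' == M = inv(M⁻¹), so the momentum draws have the same distribution either way:

```julia
using LinearAlgebra

Minv = let A = randn(5, 5)
    Symmetric(A'A + I)            # an arbitrary SPD "inverse mass matrix"
end
W1 = cholesky(inv(Minv)).L        # current: invert, then factor
W2 = inv(cholesky(Minv).U)        # proposed: factor, then triangular inverse
@assert W1 * W1' ≈ inv(Minv)      # both satisfy W * W' == M
@assert W2 * W2' ≈ inv(Minv)
```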
@chriselrod: good point, I will address this as part of #30. Yes, it is only called maybe 5-8 times (during the adaptation), but there is no reason we should not do this elegantly.
What would it take to implement this? Is it just using ...?

I'm struggling with a model that converges fine in Stan, and realized I can reproduce Julia's failure to converge in Stan by specifying `metric = dense_e`.

```r
sampler_params <- get_sampler_params(itp_res, inc_warmup = FALSE)
(mean_accept_stat_by_chain <- sapply(sampler_params, function(x) mean(x[, "accept_stat__"])))
# [1] 0.9973673 0.9930246
(mean_stepsize_stat_by_chain <- sapply(sampler_params, function(x) mean(x[, "stepsize__"])))
# [1] 0.006663567 0.011519709
(max_treedepth_by_chain <- sapply(sampler_params, function(x) max(x[, "treedepth__"])))
# [1] 10 9
```

and I'm trying to find out how to get it to converge in Julia:

```julia
julia> NUTS_statistics(mcmc_chain2)
Hamiltonian Monte Carlo sample of length 500
acceptance rate mean: 0.82, min/25%/median/75%/max: 0.29 0.74 0.86 0.94 1.0
termination: MaxDepth => 100%
depth: 15 => 100%
julia> tuned_sampler
NUTS sampler in 94 dimensions
stepsize (ϵ) ≈ 4.44e-5
maximum depth = 15
Gaussian kinetic energy, √diag(M⁻¹): [1.895813313637011e-5, 0.0001115812362165227, 9.157446164012355e-5, 4.46418426218662e-5, 1.8307063191189693e-5, 3.200949304693278e-5, 2.1938155774762758e-5, 5.9087940848096736e-5, 3.73472302349929e-5, 5.274846520513566e-5, 5.367501293396515e-5, 3.1397511915424646e-5, 6.08221582019119e-5, 5.4267538547114185e-5, 5.29947670788417e-5, 0.00011075177985935453, 7.609173377511799e-5, 0.00010532538379063398, 8.51499318202174e-5, 2.478363462668975e-5, 3.658987692014494e-5, 2.007888754739028e-5, 0.00011127385432189406, 1.3223131609749478e-5, 0.00016837339824462473, 1.1839460215858603e-5, 6.585270064600971e-5, 4.19335185808455e-5, 5.544623760676068e-5, 9.784890108486913e-5, 3.311545937725982e-5, 2.471979711349106e-5, 3.112412949389102e-5, 9.570724240006883e-5, 3.49521924945658e-5, 6.0333831183082444e-5, 4.840991138763101e-5, 2.783856460714573e-5, 4.831540994951892e-5, 4.439850328987186e-5, 6.071328215069777e-5, 7.347164675861388e-5, 4.974277394209337e-5, 9.255844061671983e-5, 2.957893020284297e-5, 3.0813197939751596e-5, 0.00011275072683006586, 0.00016486450332804875, 2.436513209762428e-5, 7.000045840614247e-5, 4.893351242935738e-5, 1.9587454417079087e-5, 3.789299805871494e-5, 8.862310005896023e-5, 6.167247733285533e-5, 7.58130680565668e-5, 4.376895615887048e-5, 5.861454040079281e-5, 0.0001342476133258934, 3.45328862714302e-5, 7.746663993962745e-5, 6.656366056200958e-5, 1.4143934443648073e-5, 8.328992141907225e-6, 2.274360501496452e-5, 5.523437355505674e-5, 2.6838929735689244e-5, 2.070253693705773e-5, 7.425841165122632e-5, 3.64062126857918e-5, 8.442014624480622e-5, 5.496281774664461e-5, 3.87778371731205e-5, 9.127710121098522e-5, 1.3767075754649895e-5, 8.13280725468518e-5, 9.41620226522111e-5, 2.3713593495284312e-5, 7.345259532704505e-5, 2.5847927065962024e-5, 3.4417125925637836e-5, 4.685867281541214e-5, 4.993549553072007e-5, 1.181268068574348e-5, 0.0001727602477423264, 3.904524248961542e-5, 0.0001584036741565338, 5.5143612458872284e-5, 9.972190958460448e-5, 7.388235798203804e-5, 1.5004410870109371e-5, 2.0118912968850808e-5, 5.401727489332943e-5, 2.3884781934174396e-5]
```

I get similar tiny step sizes and convergence failure in Stan with the dense metric. Admittedly, this was based on a single simulation, but I have started 8 more to confirm that pattern; I'll report later. There are of course other differences, eg ... Hopefully just a diagonal matrix will help. I'm struck by this difference (and even though Stan failed with a dense matrix, I'm still combing through my Julia code, hunting for bugs). FWIW, the log likelihood and gradient evaluation in Julia is so much faster that, even pegging out the maximum tree depth, it still took less time per iteration. Of course, less time per sample is meaningless if you have approximately zero effective samples per sample.
Time. 😉 I am currently working on #30 and related changes, and have postponed all of these. This may not be ideal though, so perhaps I should address this now and organize a nice API for it later. Will look into this soon because I am also running into these problems. I would be grateful if you could contribute the aforementioned problem as a test; a gist is fine.
Well, embarrassingly, the problem was a bug in my gradients! The code I was running has about a dozen unregistered dependencies, so I wrote an example using only common libraries to share. That one converged. Stan does struggle when specifying a dense energy matrix. If you'd like it as another test anyway:

```julia
using ForwardDiff, TransformVariables, LogDensityProblems, DynamicHMC, Distributions, Parameters, Random, LinearAlgebra
struct LKJ{T}
η::T
end
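# Log density (up to an additive constant) of the LKJ(η) prior, evaluated at the
# Cholesky factor L of a correlation matrix.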
function Distributions.logpdf(lkj::LKJ, L::AbstractMatrix{T}) where {T}
out = zero(T)
K = size(L,1)
η = lkj.η
for k ∈ 2:K
out += (K + 2η - k - 2) * log(L[k,k])
end
out
end
@with_kw struct ITPData
Y₁::Array{Float64,3}
Y₂::Array{Float64,3}
time::Vector{Float64}
δtime::Vector{Float64}
domains::Matrix{Float64}
end
# Could use for loops (more efficient in general), but don't want
# autodiff to track individual values because of downstream
# matrix arithmetic.
function create_domain_matrix(domains)
domains_matrix = zeros(sum(domains), length(domains))
ind = 0
for (i,domain) ∈ enumerate(domains)
for d ∈ 1:domain
ind += 1
domains_matrix[ind,i] = 1.0
end
end
domains_matrix
end
function ITPData(Y₁, Y₂, time, domains::Union{Vector{Int},NTuple{N,Int} where N})
ITPData(Y₁, Y₂, time, diff(time), create_domain_matrix(domains))
end
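# Calling the data object on a NamedTuple of parameters evaluates the log posterior
# (up to additive constants); it returns -Inf on numerical failure.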
(data::ITPData)(t::NamedTuple) = data(; t...)
function (data::ITPData)(; ρ, logκ, θ, U, μₕ₁, μₕ₂, μᵣ₁, μᵣ₂, βᵣ₁, βᵣ₂, σₕ, σᵦ, logσ)
@unpack Y₁, Y₂, time, δtime, domains = data
K, N₁, T = size(Y₁)
K, N₂, T = size(Y₂)
N = N₁ + N₂
# Priors
target = logpdf(Beta(2,2), ρ)
κ = exp.(logκ)
target += sum(logκ) # add log jacobian for ℝ → ℝ₊ transform
target += sum(κᵢ -> logpdf(Gamma(0.1, 10.0), κᵢ), κ)
target -= 0.005sum(abs2, θ) # Normal(0, 10)
target += logpdf(LKJ(2), U)
σ = exp.(logσ)
target += sum(logσ) # add log jacobian for ℝ → ℝ₊ transform
target += sum(σᵢ -> logpdf(Gamma(0.1, 10), σᵢ), σ)
target -= 0.005σₕ^2 # (Half-)Normal(0, 10)
target -= 0.005σᵦ^2 # (Half-)Normal(0, 10)
target -= 0.02μₕ₁^2 # Normal(0, 5)
target -= 0.02μₕ₂^2 # Normal(0, 5)
target -= 0.5sum(abs2, μᵣ₁) # Normal(0, 1)
target -= 0.5sum(abs2, μᵣ₂) # Normal(0, 1)
target -= 0.5sum(abs2, βᵣ₁) # Normal(0, 1)
target -= 0.5sum(abs2, βᵣ₂) # Normal(0, 1)
# Center the uncentered parameters
μᵦ₁ = domains * (μᵣ₁ .* σₕ .+ μₕ₁)
μᵦ₂ = domains * (μᵣ₂ .* σₕ .+ μₕ₂)
β₁ = @. μᵦ₁ + σᵦ * βᵣ₁
β₂ = @. μᵦ₂ + σᵦ * βᵣ₂
# AR1 matrix diagonal and negative offdiagonal (dropping first element of diagonal, which equals 1)
if ρ > 0.5
ρᵗ = (2ρ-1) .^ δtime
else
ρᵗ = -1 .* (1-2ρ) .^ δtime
end
ARdiag = @. 1 / sqrt(1 - ρᵗ * ρᵗ)
nARoffdiag = @. ARdiag * ρᵗ
# Expected values
μ₁ = @. θ + β₁ * (1 - exp(-κ * time'))
μ₂ = @. θ + β₂ * (1 - exp(-κ * time'))
L = σ .* U'
# Matrix Normal log determinants
# ARdiag is the diagonal of the Cholesky factor of the inverse AR(1) correlation matrix
target += N * K * sum(log, ARdiag)
target -= N * T * sum(log, diag(L))
# Matrix Normal Kernel
# L⁻¹ = inv(U')
δY₁ = Y₁ .- reshape(μ₁, (K,1,T))
δY₂ = Y₂ .- reshape(μ₂, (K,1,T))
local L⁻¹δY₁, L⁻¹δY₂
try
L⁻¹δY₁ = reshape(L \ reshape(δY₁, (size(Y₁,1), size(Y₁,2) * size(Y₁,3))), size(Y₁))
L⁻¹δY₂ = reshape(L \ reshape(δY₂, (size(Y₂,1), size(Y₂,2) * size(Y₂,3))), size(Y₂))
catch err # There's probably a better way.
if isa(err, SingularException)
return -Inf
else
rethrow(err)
end
end
target -= 0.5sum(abs2, L⁻¹δY₁[:,:,1])
for t ∈ 2:T
target -= 0.5sum( (@. (L⁻¹δY₁[:,:,t] * ARdiag[t-1] - L⁻¹δY₁[:,:,t-1] * nARoffdiag[t-1] ) ^ 2 ) )
end
target -= 0.5sum(abs2, L⁻¹δY₂[:,:,1])
for t ∈ 2:T
target -= 0.5sum( (@. (L⁻¹δY₂[:,:,t] * ARdiag[t-1] - L⁻¹δY₂[:,:,t-1] * nARoffdiag[t-1] ) ^ 2 ) )
end
isfinite(target) ? target : -Inf
end
function ITP_transform(D, K, T)
as(
(ρ = as𝕀, logκ = as(Array, K), θ = as(Array, K), U = CorrCholeskyFactor(K),
μₕ₁ = asℝ, μₕ₂ = asℝ, μᵣ₁ = as(Array, D), μᵣ₂ = as(Array, D),
βᵣ₁ = as(Array, K), βᵣ₂ = as(Array, K), σₕ = asℝ₊, σᵦ = asℝ₊, logσ = as(Array, K))
)
end
ITP_D4_K9_T24 = ITP_transform(4, 9, 24);
domain_set = (2,2,2,3)
T = 24; K = sum(domain_set); D = length(domain_set);
domain_mat = create_domain_matrix(domain_set);
ρ = 0.7;
κ = 0.03125 .* ( +([randexp(K) for i ∈ 1:8]...) ); # ~ Gamma(8, 1/32)
σd = 0.0625 * sum(randexp(4));
θ = 2.0 .* randn(K);
S = randn(K,4K) |> x -> x * x';
L = cholesky(S).L ./ 4;
μₕ₁, μₕ₂ = -3.0, 9.0;
μᵦ₁ = domain_mat * (μₕ₁ .+ randn(D))
μᵦ₂ = domain_mat * (μₕ₂ .+ randn(D))
β₁ = μᵦ₁ .+ σd .* randn(K);
β₂ = μᵦ₂ .+ σd .* randn(K);
δtime = 0.06125 .* ( +([randexp(T-1) for i ∈ 1:8]...) ); # ~ Gamma(8, 1/16)
time = vcat(0.0, cumsum(δtime));
μ₁ = @. θ + β₁ * (1 - exp(-κ * time'))
μ₂ = @. θ + β₂ * (1 - exp(-κ * time'))
ρᵗ = ρ .^ δtime;
ARdiag = @. 1 / (1 - ρᵗ * ρᵗ);
ARoffdiag = @. - ARdiag * ρᵗ;
AR1_chol_inverse = Bidiagonal(vcat(1.0,ARdiag), ARoffdiag, :U);
AR1_chol = inv(AR1_chol_inverse);
N₁, N₂ = 55, 55;
Y₁ = reshape(reshape(L * randn(K, N₁ * T), (K * N₁, T)) * AR1_chol, (K, N₁, T)) .+ reshape(μ₁, (K, 1, T));
Y₂ = reshape(reshape(L * randn(K, N₂ * T), (K * N₂, T)) * AR1_chol, (K, N₂, T)) .+ reshape(μ₂, (K, 1, T));
itpdata = ITPData(Y₁, Y₂, time, δtime, domain_mat);
TLD = TransformedLogDensity(ITP_D4_K9_T24, itpdata);
# aditp_reverse = ADgradient(Val(:ReverseDiff), TLD); # slow
aditp_forward = ADgradient(Val(:ForwardDiff), TLD);
# aditp_flux = ADgradient(Val(:Flux), TLD); # error; doesn't work with TransformVariables
x = randn(94);
logdensity(LogDensityProblems.ValueGradient, aditp_forward, x)
@time mcmc_chain, tuned_sampler = NUTS_init_tune_mcmc(aditp_forward, 1000);
NUTS_statistics(mcmc_chain)
tuned_sampler
using MCMCDiagnostics
chain_matrix = get_position_matrix(mcmc_chain);
[effective_sample_size(chain_matrix[:,i]) for i in 1:10]'
```

It takes a couple of hours to run. The optimized version runs in about half a minute. The basic idea I have been working on is providing reasonably optimized probability functions that also return their analytical gradients. Other functions / transformations can be given optimized Jacobians, often with their own types, so that they can be much more efficient than dense matrix multiplication. The simplest (and perhaps most common) example is a diagonal matrix -- although that (or the equivalent) is something I think all the AD libraries are already doing (eg, with broadcasting).
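To illustrate that last point, a toy sketch (not the actual optimized code; the name and signature are made up for this example) of a log-density term that returns its analytical gradient alongside its value, so AD never has to trace through it:

```julia
# Toy "value plus analytical gradient" building block for an iid Normal(0, σ) term.
function iid_normal_logpdf_and_grad(x::AbstractVector, σ::Real)
    lp = -0.5 * sum(abs2, x) / σ^2 - length(x) * log(σ)   # up to an additive constant
    ∇x = -x ./ σ^2                                        # analytical gradient wrt x
    lp, ∇x
end

lp, ∇x = iid_normal_logpdf_and_grad(randn(5), 10.0)
```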
@chriselrod: just to make sure I get this right:
Do you think that some utility function for testing gradient calculations (eg against finite differences or ForwardDiff) in LogDensityProblems would help with issues like this?
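For concreteness, the kind of utility being proposed might look like this; a minimal sketch where `check_gradient` and its signature are hypothetical, not an existing LogDensityProblems API:

```julia
using ForwardDiff

# Hypothetical helper: compare a hand-coded gradient against ForwardDiff.
function check_gradient(ℓ, ∇ℓ, x; atol = 1e-8, rtol = 1e-6)
    isapprox(∇ℓ(x), ForwardDiff.gradient(ℓ, x); atol = atol, rtol = rtol)
end

ℓ(x) = -0.5 * sum(abs2, x)
∇ℓ(x) = -x
check_gradient(ℓ, ∇ℓ, randn(3))   # true if the hand-coded gradient matches
```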
I opened tpapp/LogDensityProblems.jl#42 for AD testing.
In practice, with a short warmup sample, even a regularized covariance matrix is just estimating noise in the off-diagonal elements. Either use a diagonal matrix always, or allow customization but make diagonal the default.
AFAIK this is what Stan does, too.
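A sketch of that default, estimating only the diagonal of M⁻¹ from the warmup positions and shrinking it toward a small multiple of the identity; the function name is made up, and the 5.0 and 1e-3 constants mirror Stan's regularization as far as I recall rather than anything in DynamicHMC:

```julia
using LinearAlgebra, Statistics

# Diagonal M⁻¹ from per-coordinate variances of the warmup draws, shrunk toward
# 1e-3·I so that short adaptation windows do not produce wild scales.
function regularized_diagonal_Minv(warmup_positions::AbstractMatrix)  # n × dim
    n = size(warmup_positions, 1)
    v = vec(var(warmup_positions; dims = 1))
    w = n / (n + 5.0)
    Diagonal(w .* v .+ (1 - w) .* 1e-3)
end
```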