adapt a diagonal mass matrix by default #35
Here is some discussion indicating this may not be the case. I'd really like the ability to specify or update the mass matrix, along the lines of this suggestion from 5 years ago; AFAIK the Stan folks haven't yet gotten to it. :)
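For concreteness, a minimal sketch of what specifying M⁻¹ by hand might look like, using only the `GaussianKE(Minv::AbstractMatrix)` constructor quoted in the next comment; `pilot_draws` is a hypothetical matrix of draws from a pilot run, and whether `GaussianKE` is reachable without qualification is an assumption. How the resulting kinetic energy would be threaded into `NUTS_init_tune_mcmc` is exactly the open question here.

```julia
using DynamicHMC, LinearAlgebra, Statistics

# `pilot_draws` is a hypothetical n×dim matrix of positions from a pilot run.
pilot_draws = randn(200, 94)
# Diagonal M⁻¹ from the per-coordinate variances, stored densely so it fits the
# GaussianKE(Minv::AbstractMatrix) constructor quoted in the next comment.
Minv = Symmetric(Matrix(Diagonal(vec(var(pilot_draws; dims = 1)))))
κ = GaussianKE(Minv)
# Passing κ (or an updated Minv) into NUTS_init_tune_mcmc is the missing piece
# this issue asks for.
```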
A little off topic, but I think the mass matrix could be made a little more efficient.

```julia
GaussianKE(Minv::AbstractMatrix) = GaussianKE(Minv, cholesky(inv(Minv)).L)
rand(rng::AbstractRNG, κ::GaussianKE, q = nothing) = κ.W * randn(rng, size(κ.W, 1))
```

this could be:

```julia
GaussianKE(Minv::AbstractMatrix) = GaussianKE(Minv, inv(cholesky(Minv).U))
rand(rng::AbstractRNG, κ::GaussianKE, q = nothing) = κ.W * randn(rng, size(κ.W, 1))
```

The former `inv` calculates the Cholesky decomposition and then `potri!`, which essentially calculates the triangular inverse while multiplying the result by its own transpose. Doing just the inverse of a triangular matrix will generally be faster. Of course, it only gets called once, so the impact on performance is going to be negligible.
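For what it's worth, a quick sanity check (mine, not code from the package) that the two constructions give an equivalent factor, i.e. a W with W * W' == M = inv(M⁻¹), so the momentum draws have the same distribution either way:

```julia
using LinearAlgebra

Minv = let A = randn(5, 5)
    Symmetric(A'A + I)            # an arbitrary SPD "inverse mass matrix"
end
W1 = cholesky(inv(Minv)).L        # current: invert, then factor
W2 = inv(cholesky(Minv).U)        # proposed: factor, then triangular inverse
@assert W1 * W1' ≈ inv(Minv)      # both satisfy W * W' == M
@assert W2 * W2' ≈ inv(Minv)
```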
@chriselrod: good point, I will address this as part of #30. Yes, it is only called maybe 5-8 times (during the adaptation), but there is no reason we should not do this elegantly.
What would it take to implement this? Is it just using ...?

I'm struggling with a model that converges fine in Stan, and realized I can reproduce Julia's failure to converge in Stan by specifying `metric = dense_e`.

```r
sampler_params <- get_sampler_params(itp_res, inc_warmup = FALSE)
(mean_accept_stat_by_chain <- sapply(sampler_params, function(x) mean(x[, "accept_stat__"])))
# [1] 0.9973673 0.9930246
(mean_stepsize_stat_by_chain <- sapply(sampler_params, function(x) mean(x[, "stepsize__"])))
# [1] 0.006663567 0.011519709
(max_treedepth_by_chain <- sapply(sampler_params, function(x) max(x[, "treedepth__"])))
# [1] 10 9
```

and I'm trying to find out how to get it to converge in Julia:

```julia
julia> NUTS_statistics(mcmc_chain2)
Hamiltonian Monte Carlo sample of length 500
acceptance rate mean: 0.82, min/25%/median/75%/max: 0.29 0.74 0.86 0.94 1.0
termination: MaxDepth => 100%
depth: 15 => 100%
julia> tuned_sampler
NUTS sampler in 94 dimensions
stepsize (ϵ) ≈ 4.44e-5
maximum depth = 15
Gaussian kinetic energy, √diag(M⁻¹): [1.895813313637011e-5, 0.0001115812362165227, 9.157446164012355e-5, 4.46418426218662e-5, 1.8307063191189693e-5, 3.200949304693278e-5, 2.1938155774762758e-5, 5.9087940848096736e-5, 3.73472302349929e-5, 5.274846520513566e-5, 5.367501293396515e-5, 3.1397511915424646e-5, 6.08221582019119e-5, 5.4267538547114185e-5, 5.29947670788417e-5, 0.00011075177985935453, 7.609173377511799e-5, 0.00010532538379063398, 8.51499318202174e-5, 2.478363462668975e-5, 3.658987692014494e-5, 2.007888754739028e-5, 0.00011127385432189406, 1.3223131609749478e-5, 0.00016837339824462473, 1.1839460215858603e-5, 6.585270064600971e-5, 4.19335185808455e-5, 5.544623760676068e-5, 9.784890108486913e-5, 3.311545937725982e-5, 2.471979711349106e-5, 3.112412949389102e-5, 9.570724240006883e-5, 3.49521924945658e-5, 6.0333831183082444e-5, 4.840991138763101e-5, 2.783856460714573e-5, 4.831540994951892e-5, 4.439850328987186e-5, 6.071328215069777e-5, 7.347164675861388e-5, 4.974277394209337e-5, 9.255844061671983e-5, 2.957893020284297e-5, 3.0813197939751596e-5, 0.00011275072683006586, 0.00016486450332804875, 2.436513209762428e-5, 7.000045840614247e-5, 4.893351242935738e-5, 1.9587454417079087e-5, 3.789299805871494e-5, 8.862310005896023e-5, 6.167247733285533e-5, 7.58130680565668e-5, 4.376895615887048e-5, 5.861454040079281e-5, 0.0001342476133258934, 3.45328862714302e-5, 7.746663993962745e-5, 6.656366056200958e-5, 1.4143934443648073e-5, 8.328992141907225e-6, 2.274360501496452e-5, 5.523437355505674e-5, 2.6838929735689244e-5, 2.070253693705773e-5, 7.425841165122632e-5, 3.64062126857918e-5, 8.442014624480622e-5, 5.496281774664461e-5, 3.87778371731205e-5, 9.127710121098522e-5, 1.3767075754649895e-5, 8.13280725468518e-5, 9.41620226522111e-5, 2.3713593495284312e-5, 7.345259532704505e-5, 2.5847927065962024e-5, 3.4417125925637836e-5, 4.685867281541214e-5, 4.993549553072007e-5, 1.181268068574348e-5, 0.0001727602477423264, 3.904524248961542e-5, 0.0001584036741565338, 5.5143612458872284e-5, 9.972190958460448e-5, 7.388235798203804e-5, 1.5004410870109371e-5, 2.0118912968850808e-5, 5.401727489332943e-5, 2.3884781934174396e-5]
```

I get similar tiny step sizes and convergence failure in Stan with the dense metric. Admittedly, this was based on a single simulation, but I have started 8 more to confirm that pattern; I'll report later. There are of course other differences, eg ... Hopefully just a diagonal matrix will help. I'm struck by this difference (and even though Stan failed with a dense matrix, I'm still combing through my Julia code, hunting for bugs). FWIW, the log likelihood and gradient evaluation in Julia is so much faster that, even pegging out the maximum tree depth, it still took less time per iteration. Of course, less time per sample is meaningless if you have approximately zero effective samples per sample.
Time. 😉 I am currently working on #30 and related changes, and have postponed all of these. This may not be ideal though, so perhaps I should address this now and organize a nice API for it later. Will look into this soon because I am also running into these problems. I would be grateful if you could contribute the aforementioned problem as a test; a gist is fine.
Well, embarrassingly, the problem was a bug in my gradients! The code I was running has about a dozen unregistered dependencies, so I wrote an example using only common libraries to share. That one converged. Stan does struggle when specifying a dense energy matrix. If you'd like it as another test anyway:

```julia
using ForwardDiff, TransformVariables, LogDensityProblems, DynamicHMC, Distributions, Parameters, Random, LinearAlgebra
struct LKJ{T}
η::T
end
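# Log density (up to an additive constant) of the LKJ(η) prior, evaluated at the
# Cholesky factor L of a correlation matrix.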
function Distributions.logpdf(lkj::LKJ, L::AbstractMatrix{T}) where {T}
out = zero(T)
K = size(L,1)
η = lkj.η
for k ∈ 2:K
out += (K + 2η - k - 2) * log(L[k,k])
end
out
end
@with_kw struct ITPData
Y₁::Array{Float64,3}
Y₂::Array{Float64,3}
time::Vector{Float64}
δtime::Vector{Float64}
domains::Matrix{Float64}
end
# Could use for loops (more efficient in general), but don't want
# autodiff to track individual values because of downstream
# matrix arithmetic.
function create_domain_matrix(domains)
domains_matrix = zeros(sum(domains), length(domains))
ind = 0
for (i,domain) ∈ enumerate(domains)
for d ∈ 1:domain
ind += 1
domains_matrix[ind,i] = 1.0
end
end
domains_matrix
end
function ITPData(Y₁, Y₂, time, domains::Union{Vector{Int},NTuple{N,Int} where N})
ITPData(Y₁, Y₂, time, diff(time), create_domain_matrix(domains))
end
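# Calling the data object on a NamedTuple of parameters evaluates the log posterior
# (up to additive constants); it returns -Inf on numerical failure.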
(data::ITPData)(t::NamedTuple) = data(; t...)
function (data::ITPData)(; ρ, logκ, θ, U, μₕ₁, μₕ₂, μᵣ₁, μᵣ₂, βᵣ₁, βᵣ₂, σₕ, σᵦ, logσ)
@unpack Y₁, Y₂, time, δtime, domains = data
K, N₁, T = size(Y₁)
K, N₂, T = size(Y₂)
N = N₁ + N₂
# Priors
target = logpdf(Beta(2,2), ρ)
κ = exp.(logκ)
target += sum(logκ) # add log jacobian for ℝ → ℝ₊ transform
target += sum(κᵢ -> logpdf(Gamma(0.1, 10.0), κᵢ), κ)
target -= 0.005sum(abs2, θ) # Normal(0, 10)
target += logpdf(LKJ(2), U)
σ = exp.(logσ)
target += sum(logσ) # add log jacobian for ℝ → ℝ₊ transform
target += sum(σᵢ -> logpdf(Gamma(0.1, 10), σᵢ), σ)
target -= 0.005σₕ^2 # (Half-)Normal(0, 10)
target -= 0.005σᵦ^2 # (Half-)Normal(0, 10)
target -= 0.02μₕ₁^2 # Normal(0, 5)
target -= 0.02μₕ₂^2 # Normal(0, 5)
target -= 0.5sum(abs2, μᵣ₁) # Normal(0, 1)
target -= 0.5sum(abs2, μᵣ₂) # Normal(0, 1)
target -= 0.5sum(abs2, βᵣ₁) # Normal(0, 1)
target -= 0.5sum(abs2, βᵣ₂) # Normal(0, 1)
# Center the uncentered parameters
μᵦ₁ = domains * (μᵣ₁ .* σₕ .+ μₕ₁)
μᵦ₂ = domains * (μᵣ₂ .* σₕ .+ μₕ₂)
β₁ = @. μᵦ₁ + σᵦ * βᵣ₁
β₂ = @. μᵦ₂ + σᵦ * βᵣ₂
# AR1 matrix diagonal and negative offdiagonal (dropping first element of diagonal, which equals 1)
if ρ > 0.5
ρᵗ = (2ρ-1) .^ δtime
else
ρᵗ = -1 .* (1-2ρ) .^ δtime
end
ARdiag = @. 1 / sqrt(1 - ρᵗ * ρᵗ)
nARoffdiag = @. ARdiag * ρᵗ
# Expected values
μ₁ = @. θ + β₁ * (1 - exp(-κ * time'))
μ₂ = @. θ + β₂ * (1 - exp(-κ * time'))
L = σ .* U'
# Matrix Normal log determinants
# ARdiag is the diagonal of the Cholesky factor of the inverse AR(1) correlation matrix
target += N * K * sum(log, ARdiag)
target -= N * T * sum(log, diag(L))
# Matrix Normal Kernel
# L⁻¹ = inv(U')
δY₁ = Y₁ .- reshape(μ₁, (K,1,T))
δY₂ = Y₂ .- reshape(μ₂, (K,1,T))
local L⁻¹δY₁, L⁻¹δY₂
try
L⁻¹δY₁ = reshape(L \ reshape(δY₁, (size(Y₁,1), size(Y₁,2) * size(Y₁,3))), size(Y₁))
L⁻¹δY₂ = reshape(L \ reshape(δY₂, (size(Y₂,1), size(Y₂,2) * size(Y₂,3))), size(Y₂))
catch err # There's probably a better way.
if isa(err, SingularException)
return -Inf
else
rethrow(err)
end
end
target -= 0.5sum(abs2, L⁻¹δY₁[:,:,1])
for t ∈ 2:T
target -= 0.5sum( (@. (L⁻¹δY₁[:,:,t] * ARdiag[t-1] - L⁻¹δY₁[:,:,t-1] * nARoffdiag[t-1] ) ^ 2 ) )
end
target -= 0.5sum(abs2, L⁻¹δY₂[:,:,1])
for t ∈ 2:T
target -= 0.5sum( (@. (L⁻¹δY₂[:,:,t] * ARdiag[t-1] - L⁻¹δY₂[:,:,t-1] * nARoffdiag[t-1] ) ^ 2 ) )
end
isfinite(target) ? target : -Inf
end
function ITP_transform(D, K, T)
as(
(ρ = as𝕀, logκ = as(Array, K), θ = as(Array, K), U = CorrCholeskyFactor(K),
μₕ₁ = asℝ, μₕ₂ = asℝ, μᵣ₁ = as(Array, D), μᵣ₂ = as(Array, D),
βᵣ₁ = as(Array, K), βᵣ₂ = as(Array, K), σₕ = asℝ₊, σᵦ = asℝ₊, logσ = as(Array, K))
)
end
ITP_D4_K9_T24 = ITP_transform(4, 9, 24);
domain_set = (2,2,2,3)
T = 24; K = sum(domain_set); D = length(domain_set);
domain_mat = create_domain_matrix(domain_set);
ρ = 0.7;
κ = 0.03125 .* ( +([randexp(K) for i ∈ 1:8]...) ); # ~ Gamma(8, 1/32)
σd = 0.0625 * sum(randexp(4));
θ = 2.0 .* randn(K);
S = randn(K,4K) |> x -> x * x';
L = cholesky(S).L ./ 4;
μₕ₁, μₕ₂ = -3.0, 9.0;
μᵦ₁ = domain_mat * (μₕ₁ .+ randn(D))
μᵦ₂ = domain_mat * (μₕ₂ .+ randn(D))
β₁ = μᵦ₁ .+ σd .* randn(K);
β₂ = μᵦ₂ .+ σd .* randn(K);
δtime = 0.06125 .* ( +([randexp(T-1) for i ∈ 1:8]...) ); # ~ Gamma(8, 1/16)
time = vcat(0.0, cumsum(δtime));
μ₁ = @. θ + β₁ * (1 - exp(-κ * time'))
μ₂ = @. θ + β₂ * (1 - exp(-κ * time'))
ρᵗ = ρ .^ δtime;
ARdiag = @. 1 / (1 - ρᵗ * ρᵗ);
ARoffdiag = @. - ARdiag * ρᵗ;
AR1_chol_inverse = Bidiagonal(vcat(1.0,ARdiag), ARoffdiag, :U);
AR1_chol = inv(AR1_chol_inverse);
N₁, N₂ = 55, 55;
Y₁ = reshape(reshape(L * randn(K, N₁ * T), (K * N₁, T)) * AR1_chol, (K, N₁, T)) .+ reshape(μ₁, (K, 1, T));
Y₂ = reshape(reshape(L * randn(K, N₂ * T), (K * N₂, T)) * AR1_chol, (K, N₂, T)) .+ reshape(μ₂, (K, 1, T));
itpdata = ITPData(Y₁, Y₂, time, δtime, domain_mat);
TLD = TransformedLogDensity(ITP_D4_K9_T24, itpdata);
# aditp_reverse = ADgradient(Val(:ReverseDiff), TLD); # slow
aditp_forward = ADgradient(Val(:ForwardDiff), TLD);
# aditp_flux = ADgradient(Val(:Flux), TLD); # error; doesn't work with TransformVariables
x = randn(94);
logdensity(LogDensityProblems.ValueGradient, aditp_forward, x)
@time mcmc_chain, tuned_sampler = NUTS_init_tune_mcmc(aditp_forward, 1000);
NUTS_statistics(mcmc_chain)
tuned_sampler
using MCMCDiagnostics
chain_matrix = get_position_matrix(mcmc_chain);
[effective_sample_size(chain_matrix[:,i]) for i in 1:10]'
```

It takes a couple of hours to run. The optimized version runs in about half a minute. The basic idea I have been working on is providing reasonably optimized probability functions that also return their analytical gradients. Other functions / transformations can be given optimized Jacobians, often with their own types, so that they can be much more efficient than dense matrix multiplication. The simplest (and perhaps most common) example is a diagonal matrix -- although that (or the equivalent) is something I think all the AD libraries are already doing (eg, with broadcasting).
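To illustrate that last point, a toy sketch (not the actual optimized code; the name and signature are made up for this example) of a log-density term that returns its analytical gradient alongside its value, so AD never has to trace through it:

```julia
# Toy "value plus analytical gradient" building block for an iid Normal(0, σ) term.
function iid_normal_logpdf_and_grad(x::AbstractVector, σ::Real)
    lp = -0.5 * sum(abs2, x) / σ^2 - length(x) * log(σ)   # up to an additive constant
    ∇x = -x ./ σ^2                                        # analytical gradient wrt x
    lp, ∇x
end

lp, ∇x = iid_normal_logpdf_and_grad(randn(5), 10.0)
```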
@chriselrod: just to make sure I get this right:
Do you think that some utility function for testing gradient calculations (eg against finite differences or ForwardDiff) in LogDensityProblems would help with issues like this?
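For concreteness, the kind of utility being proposed might look like this; a minimal sketch where `check_gradient` and its signature are hypothetical, not an existing LogDensityProblems API:

```julia
using ForwardDiff

# Hypothetical helper: compare a hand-coded gradient against ForwardDiff.
function check_gradient(ℓ, ∇ℓ, x; atol = 1e-8, rtol = 1e-6)
    isapprox(∇ℓ(x), ForwardDiff.gradient(ℓ, x); atol = atol, rtol = rtol)
end

ℓ(x) = -0.5 * sum(abs2, x)
∇ℓ(x) = -x
check_gradient(ℓ, ∇ℓ, randn(3))   # true if the hand-coded gradient matches
```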
I opened tpapp/LogDensityProblems.jl#42 for AD testing.
In practice, with a short warmup sample, even a regularized covariance matrix is just estimating noise in the off-diagonal elements. Either use a diagonal matrix always, or allow customization but make diagonal the default.
AFAIK this is what Stan does, too.
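A sketch of that default, estimating only the diagonal of M⁻¹ from the warmup positions and shrinking it toward a small multiple of the identity; the function name is made up, and the 5.0 and 1e-3 constants mirror Stan's regularization as far as I recall rather than anything in DynamicHMC:

```julia
using LinearAlgebra, Statistics

# Diagonal M⁻¹ from per-coordinate variances of the warmup draws, shrunk toward
# 1e-3·I so that short adaptation windows do not produce wild scales.
function regularized_diagonal_Minv(warmup_positions::AbstractMatrix)  # n × dim
    n = size(warmup_positions, 1)
    v = vec(var(warmup_positions; dims = 1))
    w = n / (n + 5.0)
    Diagonal(w .* v .+ (1 - w) .* 1e-3)
end
```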