-
I neglected to mention the most emphatic lesson from the English Lexicon Project: the PIRLS algorithm should be refined, especially the starting estimates for each PIRLS iteration. The verbose output shows that the initial steps are far from the conditional estimates and sometimes lead to …
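One way to act on that observation (my own sketch, not necessarily the refinement intended here) would be to warm-start each PIRLS run from the conditional estimates that converged at the previous objective evaluation rather than restarting from zero. A hypothetical sketch, with `pirls!` standing in for the real solver:

```julia
# Hypothetical sketch: cache the converged spherical random effects u and
# reuse them as the starting estimates for the next PIRLS run.  `pirls!`
# is a stand-in for the real solver, not the MixedModels.jl API.
mutable struct PIRLSCache{T<:AbstractFloat}
    u::Vector{T}    # converged u from the previous objective evaluation
end

function warmstarted_objective!(cache::PIRLSCache, θ, pirls!)
    # pirls! is assumed to iterate to convergence from the supplied start,
    # returning (objective, u_converged)
    obj, u = pirls!(θ, cache.u)
    copyto!(cache.u, u)    # keep the converged estimates for the next call
    obj
end
```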
-
So I've got good news and bad news. The good news is that it seems to be faster to do the rank-k update for the upper triangular blocked Cholesky factor than for the lower triangular factor. I created an interface, `syrkd!`, to the MKL rank-k update routine. The strange thing is that the MKL code is quite a bit slower than the not-highly-optimized code we already have in MixedModels.jl for doing the update. (We only have code for the lower triangular form at present.) The benchmarks are shown below.

```julia
julia> size(b12)
(1590, 161552)
julia> @benchmark MixedModels.rankUpdate!($C, $b12, 1.0, 0.0)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 507.377 ms … 554.522 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 508.259 ms ┊ GC (median): 0.00%
Time (mean ± σ): 512.949 ms ± 14.625 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
██
██▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
507 ms Histogram: frequency by time 555 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark objective(updateL!(m4))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 683.161 ms … 687.051 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 685.542 ms ┊ GC (median): 0.00%
Time (mean ± σ): 685.528 ms ± 1.292 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █ █ █ ██
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁██ ▁
683 ms Histogram: frequency by time 687 ms <
Memory estimate: 1.25 MiB, allocs estimate: 81599.
julia> @benchmark syrkd!($cc, $A64, 'T', 1.0, 0.0)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.498 s … 1.503 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.501 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.501 s ± 2.060 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.5 s Histogram: frequency by time 1.5 s <
Memory estimate: 24 bytes, allocs estimate: 2.
```
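To make the operation being timed concrete: the downdate is a symmetric rank-k update C := αAᵀA + βC. Below is a minimal sketch of the two orientations using the stock `LinearAlgebra.BLAS.syrk!` (not the MKL `syrkd!` wrapper benchmarked above), with small stand-in sizes so the snippet runs on any machine:

```julia
using LinearAlgebra

k, n = 16, 40                      # small stand-ins for the 1590 × 161552 case
A = randn(k, n)                    # plays the role of b12
C = zeros(n, n)
BLAS.syrk!('L', 'T', 1.0, A, 0.0, C)    # lower triangle of C := A'A ('T' orientation)

AT = Matrix(A')                    # transposed storage, analogous to AT64
CU = zeros(n, n)
BLAS.syrk!('U', 'N', 1.0, AT, 0.0, CU)  # upper triangle of CU := AT*AT' ('N' orientation)

LowerTriangular(C) ≈ LowerTriangular(Matrix(CU'))  # same update, two orientations
```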
-
So I have good news and bad news. The good news is that I won't be embarking on rewriting the code to use the upper Cholesky factor instead of the lower factor in the penalized least squares calculation. I had hoped that by doing so the downdate of the [2,2] block from the [1,2] block (upper triangular factor) would be faster than downdating from the [2,1] block (lower triangular factor). As the benchmarks in the previous comment show, the MKL version is slower than the pure Julia code we already have. It doesn't have to be quite that slow, though. The equivalent operation using the transposed matrix (the 'N' rather than the 'T' orientation) is somewhat faster:

```julia
julia> @benchmark syrkd!($cc, $AT64, 'N', 1.0, 0.0)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.312 s … 1.410 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.384 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.372 s ± 46.373 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.31 s Histogram: frequency by time 1.41 s <
Memory estimate: 24 bytes, allocs estimate: 2.
```

But neither orientation using MKL is as fast as the current pure Julia code. There may still be some speedups if the form of the `BlockedSparse` matrix is modified to use a three-dimensional array in the …
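For the three-dimensional-array idea, here is a hypothetical sketch (the layout and names are assumptions, not the current `BlockedSparse` definition): if every block of nonzeros had a common p × q shape, the values could live in a contiguous p × q × nblocks array, and each block's contribution to [2,2] would become a small dense GEMM on a contiguous slab. Only the diagonal-block contributions are shown:

```julia
using LinearAlgebra

# Hypothetical layout: nz[:, :, b] holds the b-th dense block of nonzeros,
# and colblk[b] gives the block-column it occupies in the [2,1] block.
# Each block then downdates the matching diagonal block of [2,2] with a
# small dense GEMM on a contiguous slab of memory.
function downdate_diag!(C::Matrix{Float64}, nz::Array{Float64,3}, colblk::Vector{Int})
    p, q, nblocks = size(nz)
    for b in 1:nblocks
        j0 = (colblk[b] - 1) * q
        Bb = view(nz, :, :, b)                    # contiguous p × q slab
        @views mul!(C[j0+1:j0+q, j0+1:j0+q], Bb', Bb, -1.0, 1.0)  # C_jj .-= Bb'Bb
    end
    C
end
```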
-
Another approach would be to pre-compute and cache the positions of the row intersections in the [1,2] block. The reason that the update of [2,2] from [1,2] is slow is that the code determines, for each pair of columns, whether the nonzero positions overlap. But that information doesn't depend on the values of the parameters; it is simply a function of the pattern of the grouping factors. If we are doing thousands of function evaluations, that information can be computed just once and stored, and then the actual accumulation of the inner products will, I think, be much faster. I will try it out.
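A minimal sketch of that caching idea, assuming the [1,2] block is stored as a `SparseMatrixCSC` (function names here are illustrative, not MixedModels.jl internals): precompute, for each pair of columns, the pairs of positions in `nonzeros(A)` whose row indices coincide, then reuse the cache for every accumulation of the pairwise inner products.

```julia
using SparseArrays

# For each column pair (j, k), k ≤ j, record the index pairs into nonzeros(A)
# whose row indices coincide.  This depends only on the sparsity pattern, so
# it can be computed once per model, not once per objective evaluation.
function intersection_cache(A::SparseMatrixCSC)
    rv = rowvals(A)
    cache = Dict{Tuple{Int,Int},Vector{Tuple{Int,Int}}}()
    for j in 1:size(A, 2), k in 1:j
        pairs = Tuple{Int,Int}[]
        pj, pk = first(nzrange(A, j)), first(nzrange(A, k))
        lj, lk = last(nzrange(A, j)), last(nzrange(A, k))
        while pj ≤ lj && pk ≤ lk      # merge walk: rows are sorted within columns
            if rv[pj] == rv[pk]
                push!(pairs, (pj, pk))
                pj += 1; pk += 1
            elseif rv[pj] < rv[pk]
                pj += 1
            else
                pk += 1
            end
        end
        isempty(pairs) || (cache[(j, k)] = pairs)
    end
    cache
end

# Accumulate the lower triangle of C .+= A'A using only the cached positions;
# the per-evaluation work is a straight run over the stored index pairs.
function addAtA!(C::AbstractMatrix, A::SparseMatrixCSC, cache)
    nz = nonzeros(A)
    for ((j, k), pairs) in cache
        s = zero(eltype(C))
        for (pj, pk) in pairs
            s += nz[pj] * nz[pk]
        end
        C[j, k] += s
    end
    C
end
```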
-
I have a couple of methods added in …
-
Data from the English Lexicon Project, kindly provided by Melvin Yap and David Balota, who also answered numerous questions about it, gave us a chance to examine some large-scale fits of responses in a subject-item type of experiment. For the lexical decision task the reduced data (eliminating inaccurate responses and response times less than 250 ms or greater than 4 s) consists of 2,278,960 responses on 80,776 items by 795 subjects. Models with vector-valued random effects for both subject and item require about 1 second for each evaluation of the objective using MKL. The majority of the time is spent in the update of the [2,2] block of L after evaluating the [2,1] block. Because the [2,1] block is a `SparseMatrixCSC`, the natural approach in updating [2,2] from [2,1] involves assigning to positions all over the [2,2] dense block and is limited by memory bandwidth. The natural approach in updating from the [1,2] CSC block is more localized. At present we apply the `rmulλ!` operation down the first column of blocks, then evaluate the diagonal blocks in the [1,1] block, then solve triangular systems. We can delay the `rmulλ!` on the blocks below [1,1], evaluate the diagonal blocks, then combine the solution and multiplication into a single operation.
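A small sanity check of the algebra behind delaying `rmulλ!` (an illustration, not the planned implementation): scaling a block by λ in place and then solving the triangular system gives the same result as doing both in one expression, which is what allows the two passes over the blocks below [1,1] to be fused:

```julia
using LinearAlgebra

λ    = LowerTriangular([1.0 0.0; 0.5 1.0])   # relative covariance factor block
L11  = LowerTriangular([2.0 0.0; 1.0 3.0])   # diagonal block of L in [1,1]
L11t = UpperTriangular(Matrix(L11'))          # L11ᵀ as an explicit upper triangle
B    = randn(4, 2)                            # a block below [1,1]

# current flow: scale by λ in place (rmulλ!-style), then triangular solve
B1 = copy(B)
rmul!(B1, λ)        # B1 := B1 * λ
rdiv!(B1, L11t)     # B1 := B1 / L11ᵀ (triangular solve from the right)

# delayed flow: both steps in one pass over the block
B2 = (B * λ) / L11t

B1 ≈ B2             # true: the scaling can be delayed and fused with the solve
```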