-
I neglected to mention the most emphatic lesson from the English Lexicon Project: the PIRLS algorithm should be refined, especially the starting estimates for each PIRLS iteration. The verbose output shows that the initial steps are far from the conditional estimates and sometimes lead to …
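One way to act on that observation (my own sketch, not necessarily the refinement intended here) would be to warm-start each PIRLS run from the conditional estimates that converged at the previous objective evaluation rather than restarting from zero. A hypothetical sketch, with `pirls!` standing in for the real solver:

```julia
# Hypothetical sketch: cache the converged spherical random effects u and
# reuse them as the starting estimates for the next PIRLS run.  `pirls!`
# is a stand-in for the real solver, not the MixedModels.jl API.
mutable struct PIRLSCache{T<:AbstractFloat}
    u::Vector{T}    # converged u from the previous objective evaluation
end

function warmstarted_objective!(cache::PIRLSCache, θ, pirls!)
    # pirls! is assumed to iterate to convergence from the supplied start,
    # returning (objective, u_converged)
    obj, u = pirls!(θ, cache.u)
    copyto!(cache.u, u)    # keep the converged estimates for the next call
    obj
end
```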
-
So I've got good news and bad news. The good news is that it seems to be faster to do the rank-k update for the upper triangular blocked Cholesky factor than for the lower triangular factor. I created an interface, `syrkd!`, to the MKL rank-k update routine. The strange thing is that the MKL code is quite a bit slower than the not-highly-optimized code we already have in MixedModels.jl for doing the update. (We only have code for the lower triangular form at present.) The benchmarks are shown below.

```julia
julia> size(b12)
(1590, 161552)
julia> @benchmark MixedModels.rankUpdate!($C, $b12, 1.0, 0.0)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 507.377 ms … 554.522 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 508.259 ms ┊ GC (median): 0.00%
Time (mean ± σ): 512.949 ms ± 14.625 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
██
██▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
507 ms Histogram: frequency by time 555 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark objective(updateL!(m4))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 683.161 ms … 687.051 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 685.542 ms ┊ GC (median): 0.00%
Time (mean ± σ): 685.528 ms ± 1.292 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █ █ █ ██
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁██ ▁
683 ms Histogram: frequency by time 687 ms <
Memory estimate: 1.25 MiB, allocs estimate: 81599.
julia> @benchmark syrkd!($cc, $A64, 'T', 1.0, 0.0)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.498 s … 1.503 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.501 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.501 s ± 2.060 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.5 s Histogram: frequency by time 1.5 s <
Memory estimate: 24 bytes, allocs estimate: 2.
```
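To make the operation being timed concrete: the downdate is a symmetric rank-k update C := αAᵀA + βC. Below is a minimal sketch of the two orientations using the stock `LinearAlgebra.BLAS.syrk!` (not the MKL `syrkd!` wrapper benchmarked above), with small stand-in sizes so the snippet runs on any machine:

```julia
using LinearAlgebra

k, n = 16, 40                      # small stand-ins for the 1590 × 161552 case
A = randn(k, n)                    # plays the role of b12
C = zeros(n, n)
BLAS.syrk!('L', 'T', 1.0, A, 0.0, C)    # lower triangle of C := A'A ('T' orientation)

AT = Matrix(A')                    # transposed storage, analogous to AT64
CU = zeros(n, n)
BLAS.syrk!('U', 'N', 1.0, AT, 0.0, CU)  # upper triangle of CU := AT*AT' ('N' orientation)

LowerTriangular(C) ≈ LowerTriangular(Matrix(CU'))  # same update, two orientations
```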
-
So I have good news and bad news. The good news is that I won't be embarking on rewriting the code to use the upper Cholesky factor instead of the lower factor in the penalized least squares calculation. I had hoped that by doing so the downdate of the [2,2] block from the [1,2] block (upper triangular factor) would be faster than downdating from the [2,1] block (lower triangular factor). As the benchmarks in the previous comment show, the MKL version is slower than the pure Julia code we already have. It doesn't have to be quite that slow, though. The equivalent operation using the transposed matrix (the 'N' rather than the 'T' orientation) is somewhat faster:

```julia
julia> @benchmark syrkd!($cc, $AT64, 'N', 1.0, 0.0)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.312 s … 1.410 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.384 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.372 s ± 46.373 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.31 s Histogram: frequency by time 1.41 s <
Memory estimate: 24 bytes, allocs estimate: 2.
```

But neither orientation using MKL is as fast as the current pure Julia code. There may still be some speedups if the form of the `BlockedSparse` matrix is modified to use a three-dimensional array in the …
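For the three-dimensional-array idea, here is a hypothetical sketch (the layout and names are assumptions, not the current `BlockedSparse` definition): if every block of nonzeros had a common p × q shape, the values could live in a contiguous p × q × nblocks array, and each block's contribution to [2,2] would become a small dense GEMM on a contiguous slab. Only the diagonal-block contributions are shown:

```julia
using LinearAlgebra

# Hypothetical layout: nz[:, :, b] holds the b-th dense block of nonzeros,
# and colblk[b] gives the block-column it occupies in the [2,1] block.
# Each block then downdates the matching diagonal block of [2,2] with a
# small dense GEMM on a contiguous slab of memory.
function downdate_diag!(C::Matrix{Float64}, nz::Array{Float64,3}, colblk::Vector{Int})
    p, q, nblocks = size(nz)
    for b in 1:nblocks
        j0 = (colblk[b] - 1) * q
        Bb = view(nz, :, :, b)                    # contiguous p × q slab
        @views mul!(C[j0+1:j0+q, j0+1:j0+q], Bb', Bb, -1.0, 1.0)  # C_jj .-= Bb'Bb
    end
    C
end
```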
-
Another approach would be to pre-compute and cache the positions of the row intersections in the [1,2] block. The reason that the update of [2,2] from [1,2] is slow is that the code determines, for each pair of columns, whether the nonzero positions overlap. But that information doesn't depend on the values of the parameters; it is simply a function of the pattern of the grouping factors. If we are doing thousands of function evaluations, that information can be computed just once and stored, and then the actual accumulation of the inner products will, I think, be much faster. I will try it out.
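A minimal sketch of that caching idea, assuming the [1,2] block is stored as a `SparseMatrixCSC` (function names here are illustrative, not MixedModels.jl internals): precompute, for each pair of columns, the pairs of positions in `nonzeros(A)` whose row indices coincide, then reuse the cache for every accumulation of the pairwise inner products.

```julia
using SparseArrays

# For each column pair (j, k), k ≤ j, record the index pairs into nonzeros(A)
# whose row indices coincide.  This depends only on the sparsity pattern, so
# it can be computed once per model, not once per objective evaluation.
function intersection_cache(A::SparseMatrixCSC)
    rv = rowvals(A)
    cache = Dict{Tuple{Int,Int},Vector{Tuple{Int,Int}}}()
    for j in 1:size(A, 2), k in 1:j
        pairs = Tuple{Int,Int}[]
        pj, pk = first(nzrange(A, j)), first(nzrange(A, k))
        lj, lk = last(nzrange(A, j)), last(nzrange(A, k))
        while pj ≤ lj && pk ≤ lk      # merge walk: rows are sorted within columns
            if rv[pj] == rv[pk]
                push!(pairs, (pj, pk))
                pj += 1; pk += 1
            elseif rv[pj] < rv[pk]
                pj += 1
            else
                pk += 1
            end
        end
        isempty(pairs) || (cache[(j, k)] = pairs)
    end
    cache
end

# Accumulate the lower triangle of C .+= A'A using only the cached positions;
# the per-evaluation work is a straight run over the stored index pairs.
function addAtA!(C::AbstractMatrix, A::SparseMatrixCSC, cache)
    nz = nonzeros(A)
    for ((j, k), pairs) in cache
        s = zero(eltype(C))
        for (pj, pk) in pairs
            s += nz[pj] * nz[pk]
        end
        C[j, k] += s
    end
    C
end
```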
-
I have a couple of methods added in …
-
Data from the English Lexicon Project, kindly provided by Melvin Yap and David Balota, who also answered numerous questions about it, gave us a chance to examine some large-scale fits of responses in a subject-item type of experiment. For the lexical decision task the reduced data (eliminating inaccurate responses and response times less than 250 ms or greater than 4 s) consists of 2,278,960 responses on 80,776 items by 795 subjects. Models with vector-valued random effects for both subject and item require about 1 second for each evaluation of the objective using MKL. The majority of the time is spent in the update of the [2,2] block of L after evaluating the [2,1] block. Because the [2,1] block is a `SparseMatrixCSC`, the natural approach in updating [2,2] from [2,1] involves assigning to positions all over the [2,2] dense block and is limited by memory bandwidth. The natural approach in updating from the [1,2] CSC block is more localized. At present we apply the `rmulλ!` operation down the first column of blocks, then evaluate the diagonal blocks in the [1,1] block, then solve triangular systems. We can delay the `rmulλ!` on the blocks below [1,1], evaluate the diagonal blocks, then combine the solution and multiplication into a single operation.
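A small sanity check of the algebra behind delaying `rmulλ!` (an illustration, not the planned implementation): scaling a block by λ in place and then solving the triangular system gives the same result as doing both in one expression, which is what allows the two passes over the blocks below [1,1] to be fused:

```julia
using LinearAlgebra

λ    = LowerTriangular([1.0 0.0; 0.5 1.0])   # relative covariance factor block
L11  = LowerTriangular([2.0 0.0; 1.0 3.0])   # diagonal block of L in [1,1]
L11t = UpperTriangular(Matrix(L11'))          # L11ᵀ as an explicit upper triangle
B    = randn(4, 2)                            # a block below [1,1]

# current flow: scale by λ in place (rmulλ!-style), then triangular solve
B1 = copy(B)
rmul!(B1, λ)        # B1 := B1 * λ
rdiv!(B1, L11t)     # B1 := B1 / L11ᵀ (triangular solve from the right)

# delayed flow: both steps in one pass over the block
B2 = (B * λ) / L11t

B1 ≈ B2             # true: the scaling can be delayed and fused with the solve
```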