Multi-threaded parameter estimation #49

cortner · 2023-06-12T16:40:41Z

Distributed has a lot of overhead. I think we should return to providing also multi-threaded assembly of linear systems

wcwitt · 2023-06-12T16:57:28Z

Very open to this, but I would prefer to see some scalability tests first. Could be something for a student to do.

cortner · 2023-06-12T17:00:07Z

I was actually going to try myself because I'm struggling with the distributed assembly.

cortner · 2023-06-20T21:09:32Z

I wrote a first version of this - not yet exported. My distributed assembly problems are immediately gone and I can now assemble larger datasets fully utilising all cores I have. I'll make a PR so you can take a look.

There is a potential issue with how the slices of the design matrix are written into the design matrix using a lock. But something like that will always have to be done - distributed or not. This problem probably only arises when there are many very small structures. So then maybe one needs to write in batches...

wcwitt · 2023-06-20T21:21:26Z

This problem probably only arises when there are many very small structures

This makes me wonder if adjusting the pmap batch size argument might also fix your problem. I'm happy to merge a PR if it improves things, although if a one-size-fits-all distributed option can work I slightly prefer that

cortner · 2023-06-20T21:23:13Z

I appreciate that, but the other thing to keep in mind is that for people like me who don't work with clusters but large shared memory nodes, the overhead of working with processes is pretty annoying.

wcwitt · 2023-06-20T21:37:21Z

the overhead of working with processes is pretty annoying

Is it starting up the processes that is annoying? Or just the throughput? If the throughput, I think we might be able to address that by increasing the batch size to reduce communication latency. I also work on single nodes most of the time.

But by all means do submit the PR if important

cortner · 2023-06-20T23:05:33Z

Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.

wcwitt · 2023-06-26T13:30:51Z

I started to dig into this today. The first thing I wanted to test are drop-in threaded replacements for map from ThreadedIterables.jl and ThreadsX.jl. Some experiments in that direction are here and here.

I'm running some scalability tests now and will report the results here. Happy to do the same for your solution, and whatever else we come up with.

cortner · 2023-06-26T13:55:44Z

Make sure to test with datasets that have a lot of variation in structure size. I believe that's where we get the worst performance. A few huge structures and many unit cells.

wcwitt · 2023-06-26T17:09:59Z

Here's what I could manage today. It's for the PRX Silicon dataset (2475 configs of somewhat varying size), with a 4,16 basis (528978, 1445 design matrix). The threaded implementation is @cortner's ... the others I tried were comparable but slower.

Both plots are for the matrix assembly. The first plot includes Julia startup and compilation time (to look out for any extra distributed overhead). The second plot shows the time for a subsequent assembly (so, not affected by startup or compilation).

The main takeaways are how similar they are and that both hit diminishing returns fairly quickly. The distributed does seem to have marginally higher startup cost, but then slightly outperforms the threaded when that aspect is removed.

wcwitt · 2023-06-26T17:12:08Z

Do you want me to try one of your datasets? If the trend there is the same, then I suppose we need to look for system-dependent explanations?

cortner · 2023-06-26T17:15:52Z

Thank you for looking into this. I'll have to ask to share the dataset I'm using.

tjjarvinen · 2023-06-26T19:53:37Z

Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.

Because precompile does not work (for P4ML) Julia will compile everything for every process.

From what I read for performance above and here it seems that performance is limited by memory bandwidth. Most likely by random writes. This means that there is no difference between multithreading and multiprocessing.

In fact, if there is svd that takes a lot of time also. The option is to have multithreading with normal Lapack or multiprocessing with ScaLapack (this means also MPI). Basically having only multithreading is probably just easier.

cortner · 2023-06-26T22:11:13Z

seems that performance is limited by memory bandwidth

I didn't say this is always true and I would in fact be surprised if it is. I think this varies a lot between datasets.

cortner · 2023-06-26T22:14:56Z

In fact, if there is svd that takes a lot of time also

The SVD comes later. This discussion is only about assembling the design matrix. For most of our problems the SVD is much faster than the design matrix assembly.

cortner · 2023-06-27T16:13:44Z

Do you want me to try one of your datasets?

It was public after all. Would you mind trying this one?

https://archive.materialscloud.org/record/2022.102

If the performance is the same again, then I'll rerun your tests on my machine.

wcwitt · 2023-06-27T16:15:49Z

Sure - can you give me rough basis settings? Or even just the matrix size?

tjjarvinen · 2023-06-27T16:21:01Z

We should add proper benchmarking using PkgBenchmark.jl. I can add basics for of it, if you give me an example input. After that you can add more tests on the file.

cortner · 2023-06-27T22:27:53Z

TDEGS_4 = [ 
      [ 18, 14, 10, 6 ],  # 183
      [ 20, 16, 12, 8 ],  # 314 
      [ 22, 18, 14, 10],  # 539 
      [ 24, 20, 16, 12 ], # 929
      [ 26, 22, 18, 14 ], # 1609 
      [ 28, 24, 20, 16 ], # 2773 
      [ 30, 26, 22, 18 ], # 4789
   ]

make_model(tdeg; 
         transform = (:agnesi, 2, 4),
         pair_transform = (:agnesi, 1, 4),
         pure = false, ) = 
   ACE1x.acemodel(; 
         elements = [:Fe], 
         order = length(tdeg), 
         totaldegree = tdeg,
         transform = transform, 
         pair_transform = pair_transform,
         pure = pure, 
         rcut = r_cut,
         Eref = Dict(:Fe => -3455.6995339),  )

make_model(TDEGS_4[6])

cortner · 2023-06-27T22:28:56Z

That's the model that should work.

The one that failed is

make_model(TDEGS_4[3]; pure=true)

Don't do this with level larger than 3 since generating the model then takes very very long.

tjjarvinen · 2023-06-28T10:12:33Z

I started to dig into this today. The first thing I wanted to test are drop-in threaded replacements for map from ThreadedIterables.jl and ThreadsX.jl.

Use Folds.jl rather than ThreadsX.jl. Both use Transdusers.jl on the background, but Folds.jl is newer and can use distributed excecution also.

wcwitt · 2023-06-28T11:08:42Z

Ah, thank you! I was wondering about that, actually.

wcwitt · 2023-06-28T13:51:15Z

@cortner, I can confirm your distributed vs threading observation on the iron dataset, with one caveat, which is that I can't find the original data (from the earlier paper - whose links to the data appear broken), so I am only fitting to the "extensions" in the paper you linked (2306 configs, matrix size 36725,2773). Here are plots analogous to the above. I can't really explain them at this point.

wcwitt · 2023-06-28T13:53:36Z

We should add proper benchmarking using PkgBenchmark.jl.

I'm open to this, but don't have time to clean things up at the moment

cortner · 2023-06-28T14:06:37Z

that's much much smaller than the full dataset I'm working with, but perfectly fine if it repreduces the problem :)

wcwitt · 2023-08-10T16:07:46Z

I spent most of last night and today (without much success) trying to improve the parallel scaling of assembly, looking at both multiprocess and multithreaded. In particular, I set a goal to reduce assembly time for the silicon dataset (same model parameters as in the plots above) to below one minute using 50 threads and/or processes. I tried a lot of things, but I never got below 100 seconds.

A few observations:

At first, I thought the bottleneck might be that we have all processors/threads write to a single matrix, but that's wrong. If I simply compute the matrix elements and throw them away immediately - eliminating any synchronization concerns - the timings don't improve.
I expected threaded evaluation of the assembly loop over data (without writing/saving any of the results) to be extremely fast and parallelizable - but not so. I believe the explanation is related to garbage collection, as I saw high GC times; see also www.github.com/JuliaLang/julia/issues/32304.
I learned about CachingPools for pmap, which didn't solve my problem but may be useful in the future (www.github.com/JuliaLang/julia/pull/33892).
I did see some improvement from batching, although it wasn't dramatic.

All together, I think I will put add a batch_size option to the main assembly routine, but I'm running out of ideas for dramatic improvement.

Edit: I should also try with a newer Julia version that has multi-threaded GC.

tjjarvinen · 2023-08-10T21:13:40Z

That means that the reason for the bad performance is in feature_matrix, target_vector and/or weight_vector execution. They do calculate energy, forces and virials? So, it might be that those are the cause the issues.

Do those energy, forces and virial calls go back to JuLIP?

cortner · 2023-08-11T04:36:40Z

They do - and in the past we ensured that the JuLIP calculators switch to single-threaded mode. But maybe we need to double-check this here.

tjjarvinen · 2023-08-11T04:42:08Z

Also with JuLIP retirement incoming, we need to reimplement them to at some point. So, we could consider doing it now.

cortner · 2023-08-11T04:49:52Z

That's a very risky and huge change. We cannot do this before publishing. The problem is that JuLIP.Atoms deeply permeates ACE1.jl, ACE1x.jl and ACE1pack.jl.

Off-topic : ... but please do come up with a plan how this transition could be managed.

wcwitt · 2023-08-14T21:23:58Z

That means that the reason for the bad performance is in feature_matrix, target_vector and/or weight_vector execution.

Agree - at one point I confirmed it was basically all feature_matrix. If I were going to push this further (not likely in the short term), I would probably try to reduce the allocations there

cortner mentioned this issue Jun 26, 2023

Alternative Assembly Routine #56

Open

wcwitt mentioned this issue Jul 22, 2023

WIP: Multi-threaded assembly #54

Closed

wcwitt mentioned this issue Sep 19, 2023

Acefit extension ACEsuit/ACEmd.jl#46

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-threaded parameter estimation #49

Multi-threaded parameter estimation #49

cortner commented Jun 12, 2023

wcwitt commented Jun 12, 2023

cortner commented Jun 12, 2023

cortner commented Jun 20, 2023 •

edited

Loading

wcwitt commented Jun 20, 2023

cortner commented Jun 20, 2023

wcwitt commented Jun 20, 2023

cortner commented Jun 20, 2023

wcwitt commented Jun 26, 2023

cortner commented Jun 26, 2023

wcwitt commented Jun 26, 2023

wcwitt commented Jun 26, 2023

cortner commented Jun 26, 2023

tjjarvinen commented Jun 26, 2023

cortner commented Jun 26, 2023

cortner commented Jun 26, 2023 •

edited

Loading

cortner commented Jun 27, 2023 •

edited

Loading

wcwitt commented Jun 27, 2023

tjjarvinen commented Jun 27, 2023

cortner commented Jun 27, 2023

cortner commented Jun 27, 2023

tjjarvinen commented Jun 28, 2023

wcwitt commented Jun 28, 2023

wcwitt commented Jun 28, 2023

wcwitt commented Jun 28, 2023

cortner commented Jun 28, 2023

wcwitt commented Aug 10, 2023 •

edited

Loading

tjjarvinen commented Aug 10, 2023

cortner commented Aug 11, 2023

tjjarvinen commented Aug 11, 2023

cortner commented Aug 11, 2023 •

edited

Loading

wcwitt commented Aug 14, 2023

Multi-threaded parameter estimation #49

Multi-threaded parameter estimation #49

Comments

cortner commented Jun 12, 2023

wcwitt commented Jun 12, 2023

cortner commented Jun 12, 2023

cortner commented Jun 20, 2023 • edited Loading

wcwitt commented Jun 20, 2023

cortner commented Jun 20, 2023

wcwitt commented Jun 20, 2023

cortner commented Jun 20, 2023

wcwitt commented Jun 26, 2023

cortner commented Jun 26, 2023

wcwitt commented Jun 26, 2023

wcwitt commented Jun 26, 2023

cortner commented Jun 26, 2023

tjjarvinen commented Jun 26, 2023

cortner commented Jun 26, 2023

cortner commented Jun 26, 2023 • edited Loading

cortner commented Jun 27, 2023 • edited Loading

wcwitt commented Jun 27, 2023

tjjarvinen commented Jun 27, 2023

cortner commented Jun 27, 2023

cortner commented Jun 27, 2023

tjjarvinen commented Jun 28, 2023

wcwitt commented Jun 28, 2023

wcwitt commented Jun 28, 2023

wcwitt commented Jun 28, 2023

cortner commented Jun 28, 2023

wcwitt commented Aug 10, 2023 • edited Loading

tjjarvinen commented Aug 10, 2023

cortner commented Aug 11, 2023

tjjarvinen commented Aug 11, 2023

cortner commented Aug 11, 2023 • edited Loading

wcwitt commented Aug 14, 2023

cortner commented Jun 20, 2023 •

edited

Loading

cortner commented Jun 26, 2023 •

edited

Loading

cortner commented Jun 27, 2023 •

edited

Loading

wcwitt commented Aug 10, 2023 •

edited

Loading

cortner commented Aug 11, 2023 •

edited

Loading