Multi-threaded parameter estimation #49

Open · cortner opened this issue Jun 12, 2023 · 31 comments

@cortner (Member) commented Jun 12, 2023

Distributed has a lot of overhead. I think we should go back to also providing multi-threaded assembly of the linear systems.
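Roughly what I have in mind - a sketch only, with hypothetical `nrows` and `feature_rows` helpers standing in for the per-configuration row count and row block (not our actual API):

# Minimal sketch of threaded design-matrix assembly.
# `nrows(d)` and `feature_rows(basis, d)` are hypothetical stand-ins.
function assemble_threaded(basis, data)
    counts  = [nrows(d) for d in data]
    offsets = cumsum(vcat(0, counts))          # row offset of each block
    A = zeros(offsets[end], length(basis))
    Threads.@threads for i in eachindex(data)
        rows = offsets[i]+1 : offsets[i+1]     # disjoint row range per config
        A[rows, :] .= feature_rows(basis, data[i])
    end
    return A
end

Because each configuration owns a disjoint row range, the writes in this form need no synchronization.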

@wcwitt (Collaborator) commented Jun 12, 2023

Very open to this, but I would prefer to see some scalability tests first. Could be something for a student to do.

@cortner (Member, Author) commented Jun 12, 2023

I was actually going to try myself because I'm struggling with the distributed assembly.

@cortner (Member, Author) commented Jun 20, 2023

I wrote a first version of this - not yet exported. My distributed assembly problems are immediately gone, and I can now assemble larger datasets while fully utilising all the cores I have. I'll make a PR so you can take a look.

There is a potential issue with how the slices of the design matrix are written into the full matrix, guarded by a lock. But something like that will always have to be done - distributed or not. The problem probably only arises when there are many very small structures; in that case one may need to write in batches...
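For concreteness, the lock-guarded pattern looks roughly like this (`compute_slice` and `row_ranges` are stand-ins, not the actual code):

# Sketch of the lock-guarded variant: each task computes its slice,
# a single lock serializes the writes into the shared matrix.
function assemble_locked!(A, basis, data, row_ranges)
    write_lock = ReentrantLock()
    Threads.@threads for i in eachindex(data)
        slice = compute_slice(basis, data[i])   # thread-local work
        lock(write_lock) do
            A[row_ranges[i], :] .= slice        # short critical section
        end
    end
    return A
end

With many tiny structures, the critical section starts to dominate, which is why writing in larger batches should help.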

@wcwitt (Collaborator) commented Jun 20, 2023

> This problem probably only arises when there are many very small structures

This makes me wonder if adjusting the pmap batch_size argument might also fix your problem. I'm happy to merge a PR if it improves things, although if a one-size-fits-all distributed option can work, I slightly prefer that.
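i.e. something along these lines (toy kernel, just to illustrate the keyword):

using Distributed
addprocs(4)

# Hypothetical per-structure kernel standing in for the real assembly work.
@everywhere compute_block(d) = d .^ 2

data = [rand(8) for _ in 1:10_000]    # many small work items

# batch_size groups items into a single message per batch (default is 1),
# amortizing communication overhead when each item is cheap.
blocks = pmap(compute_block, data; batch_size = 50)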

@cortner (Member, Author) commented Jun 20, 2023

I appreciate that, but the other thing to keep in mind is that for people like me, who work not with clusters but with large shared-memory nodes, the overhead of working with processes is pretty annoying.

@wcwitt (Collaborator) commented Jun 20, 2023

> the overhead of working with processes is pretty annoying

Is it starting up the processes that is annoying, or just the throughput? If the throughput, I think we might be able to address that by increasing the batch size to reduce communication latency. I also work on single nodes most of the time.

But by all means do submit the PR if it's important.

@cortner (Member, Author) commented Jun 20, 2023

Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.
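One thing I should rule out - just a guess, not a confirmed diagnosis - is the workers not inheriting the parent's project environment, since mismatched environments force each worker to resolve and compile packages on its own:

using Distributed

# Start workers in the same project environment as the parent session, so
# they pick up the same precompiled caches instead of compiling afresh.
addprocs(4; exeflags = "--project=$(Base.active_project())")

@everywhere using ACEfit   # loaded once per worker from the shared caches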

@wcwitt (Collaborator) commented Jun 26, 2023

I started to dig into this today. The first thing I wanted to test was drop-in threaded replacements for map, from ThreadedIterables.jl and ThreadsX.jl. Some experiments in that direction are here and here.

I'm running some scalability tests now and will report the results here. Happy to do the same for your solution, and whatever else we come up with.
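The drop-in pattern, for reference (toy kernel, not the real assembly):

using ThreadsX
# using ThreadedIterables: tmap   # the other replacement being tested

data = [rand(10) for _ in 1:1000]
f(d) = sum(abs2, d)               # hypothetical per-item kernel

serial   = map(f, data)
threaded = ThreadsX.map(f, data)  # drop-in threaded replacement for map
@assert serial == threaded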

@cortner (Member, Author) commented Jun 26, 2023

Make sure to test with datasets that have a lot of variation in structure size. I believe that's where we get the worst performance. A few huge structures and many unit cells.

@wcwitt (Collaborator) commented Jun 26, 2023

Here's what I could manage today. It's for the PRX Silicon dataset (2475 configs of somewhat varying size), with a (4, 16) basis (528978 × 1445 design matrix). The threaded implementation is @cortner's ... the others I tried were comparable but slower.

Both plots are for the matrix assembly. The first plot includes Julia startup and compilation time (to look out for any extra distributed overhead). The second plot shows the time for a subsequent assembly (so, not affected by startup or compilation).

The main takeaways are how similar they are and that both hit diminishing returns fairly quickly. The distributed does seem to have marginally higher startup cost, but then slightly outperforms the threaded when that aspect is removed.

[plot: assembly time vs. worker count, including Julia startup and compilation]

[plot: assembly time vs. worker count for a subsequent assembly]

@wcwitt (Collaborator) commented Jun 26, 2023

Do you want me to try one of your datasets? If the trend there is the same, then I suppose we need to look for system-dependent explanations?

@cortner (Member, Author) commented Jun 26, 2023

Thank you for looking into this. I'll have to ask whether I may share the dataset I'm using.

@tjjarvinen (Collaborator)

> Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.

Because precompilation does not work (for P4ML), Julia will compile everything for every process.

From what I read of the performance above and here, it seems that performance is limited by memory bandwidth - most likely by random writes. This would mean that there is no difference between multithreading and multiprocessing.

Also, if there is an SVD, that takes a lot of time too. The options are multithreading with normal LAPACK or multiprocessing with ScaLAPACK (which also implies MPI). Having only multithreading is probably just easier.

@cortner (Member, Author) commented Jun 26, 2023

> seems that performance is limited by memory bandwidth

I didn't say this is always true, and I would in fact be surprised if it were. I think this varies a lot between datasets.

@cortner (Member, Author) commented Jun 26, 2023

> Also, if there is an SVD, that takes a lot of time too

The SVD comes later; this discussion is only about assembling the design matrix. For most of our problems the SVD is much faster than the design-matrix assembly.

@cortner (Member, Author) commented Jun 27, 2023

> Do you want me to try one of your datasets?

It was public after all. Would you mind trying this one?

https://archive.materialscloud.org/record/2022.102

If the performance is the same again, then I'll rerun your tests on my machine.

@wcwitt (Collaborator) commented Jun 27, 2023

Sure - can you give me rough basis settings? Or even just the matrix size?

@tjjarvinen (Collaborator)

We should add proper benchmarking using PkgBenchmark.jl. I can add the basics, if you give me an example input. After that you can add more tests to the file.
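The basics would look something like this - a sketch with a placeholder workload rather than the real assembly call:

# benchmark/benchmarks.jl - the entry point PkgBenchmark.jl expects.
using BenchmarkTools

const SUITE = BenchmarkGroup()
SUITE["assembly"] = BenchmarkGroup()

# placeholder workload; a real suite would call the assembly routines
assemble_demo(n) = sum(abs2, rand(n, n) * rand(n))
SUITE["assembly"]["small"] = @benchmarkable assemble_demo(100)

Running `using PkgBenchmark; benchmarkpkg("ACEfit")` then executes the suite, and `judge` can compare two commits.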

@cortner (Member, Author) commented Jun 27, 2023

using ACE1x   # provides acemodel

TDEGS_4 = [ 
      [ 18, 14, 10, 6 ],   # 183
      [ 20, 16, 12, 8 ],   # 314 
      [ 22, 18, 14, 10 ],  # 539 
      [ 24, 20, 16, 12 ],  # 929
      [ 26, 22, 18, 14 ],  # 1609 
      [ 28, 24, 20, 16 ],  # 2773 
      [ 30, 26, 22, 18 ],  # 4789
   ]

# NB: `r_cut` must be defined before calling this (value not shown here).
make_model(tdeg; 
         transform = (:agnesi, 2, 4),
         pair_transform = (:agnesi, 1, 4),
         pure = false, ) = 
   ACE1x.acemodel(; 
         elements = [:Fe], 
         order = length(tdeg), 
         totaldegree = tdeg,
         transform = transform, 
         pair_transform = pair_transform,
         pure = pure, 
         rcut = r_cut,
         Eref = Dict(:Fe => -3455.6995339), )

make_model(TDEGS_4[6])

@cortner (Member, Author) commented Jun 27, 2023

That's the model that should work.

The one that failed is

make_model(TDEGS_4[3]; pure = true)

Don't do this with a level larger than 3, since generating the model then takes very, very long.

@tjjarvinen (Collaborator)

> I started to dig into this today. The first thing I wanted to test was drop-in threaded replacements for map, from ThreadedIterables.jl and ThreadsX.jl.

Use Folds.jl rather than ThreadsX.jl. Both use Transducers.jl in the background, but Folds.jl is newer and can also use distributed execution.
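The switch is mechanical (toy kernel below; for DistributedEx the function must also be defined on the workers, e.g. via @everywhere):

using Folds
using Transducers: ThreadedEx, DistributedEx

data = [rand(10) for _ in 1:1000]
f(d) = sum(abs2, d)                  # hypothetical per-item kernel

Folds.map(f, data)                   # threaded by default
Folds.map(f, data, ThreadedEx())     # explicit thread-based executor
Folds.map(f, data, DistributedEx())  # same call, distributed execution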

@wcwitt (Collaborator) commented Jun 28, 2023

Ah, thank you! I was wondering about that, actually.

@wcwitt (Collaborator) commented Jun 28, 2023

@cortner, I can confirm your distributed-vs-threading observation on the iron dataset, with one caveat: I can't find the original data (from the earlier paper, whose links to the data appear broken), so I am only fitting to the "extensions" in the paper you linked (2306 configs, matrix size 36725 × 2773). Here are plots analogous to those above. I can't really explain them at this point.

[plot: assembly time vs. worker count, including Julia startup and compilation]

[plot: assembly time vs. worker count for a subsequent assembly]

@wcwitt (Collaborator) commented Jun 28, 2023

> We should add proper benchmarking using PkgBenchmark.jl.

I'm open to this, but don't have time to clean things up at the moment.

@cortner (Member, Author) commented Jun 28, 2023

That's much, much smaller than the full dataset I'm working with, but perfectly fine if it reproduces the problem :)

@wcwitt (Collaborator) commented Aug 10, 2023

I spent most of last night and today (without much success) trying to improve the parallel scaling of assembly, looking at both multiprocess and multithreaded. In particular, I set a goal to reduce assembly time for the silicon dataset (same model parameters as in the plots above) to below one minute using 50 threads and/or processes. I tried a lot of things, but I never got below 100 seconds.

A few observations:

  • At first, I thought the bottleneck might be that we have all processors/threads write to a single matrix, but that's wrong. If I simply compute the matrix elements and throw them away immediately - eliminating any synchronization concerns - the timings don't improve.

  • I expected threaded evaluation of the assembly loop over the data (without writing/saving any of the results) to be extremely fast and parallelizable - but not so. I believe the explanation is related to garbage collection, as I saw high GC times (a quick check is sketched below); see also www.github.com/JuliaLang/julia/issues/32304.

  • I learned about CachingPools for pmap, which didn't solve my problem but may be useful in the future (www.github.com/JuliaLang/julia/pull/33892).

  • I did see some improvement from batching, although it wasn't dramatic.

Altogether, I think I will add a batch_size option to the main assembly routine, but I'm running out of ideas for dramatic improvement.

Edit: I should also try with a newer Julia version that has multi-threaded GC.
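For reference, this is how I'd quantify the GC share (placeholder workload standing in for assembly; on Julia 1.10+ one could also try the --gcthreads=N startup flag):

# Measure what fraction of the runtime is garbage collection.
workload() = [rand(100, 100) * rand(100) for _ in 1:10_000]  # allocation-heavy stand-in
stats = @timed workload()
println("total: ", stats.time, " s,  gc: ", stats.gctime, " s (",
        round(100 * stats.gctime / stats.time; digits = 1), " %)")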

@tjjarvinen (Collaborator)

That means that the reason for the bad performance is in feature_matrix, target_vector and/or weight_vector execution. Those calculate energies, forces and virials, don't they? So it might be that those are what cause the issues.

Do those energy, force and virial calls go back to JuLIP?

@cortner (Member, Author) commented Aug 11, 2023

They do - and in the past we ensured that the JuLIP calculators switch to single-threaded mode. But maybe we need to double-check this here.

@tjjarvinen (Collaborator)

Also, with the JuLIP retirement incoming, we will need to reimplement them at some point anyway. So we could consider doing it now.

@cortner (Member, Author) commented Aug 11, 2023

That's a very risky and huge change. We cannot do this before publishing. The problem is that JuLIP.Atoms deeply permeates ACE1.jl, ACE1x.jl and ACE1pack.jl.

Off-topic: ... but please do come up with a plan for how this transition could be managed.

@wcwitt (Collaborator) commented Aug 14, 2023

> That means that the reason for the bad performance is in feature_matrix, target_vector and/or weight_vector execution.

Agree - at one point I confirmed it was basically all feature_matrix. If I were going to push this further (not likely in the short term), I would probably try to reduce the allocations there.
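If anyone picks this up, allocation profiling (Julia >= 1.8) would be my starting point - a sketch, with a placeholder workload standing in for a feature_matrix call:

using Profile

workload() = [rand(100, 100) * rand(100) for _ in 1:1000]   # allocation-heavy stand-in

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate = 0.1 workload()
prof = Profile.Allocs.fetch()       # inspect directly, or view with PProf.jl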
