Multi-threaded parameter estimation #49

Distributed has a lot of overhead. I think we should return to also providing multi-threaded assembly of linear systems.
Very open to this, but I would prefer to see some scalability tests first. Could be something for a student to do.
I was actually going to try myself because I'm struggling with the distributed assembly.
I wrote a first version of this - not yet exported. My distributed assembly problems are immediately gone, and I can now assemble larger datasets while fully utilising all the cores I have. I'll make a PR so you can take a look. There is a potential issue with how the slices are written into the design matrix using a lock, but something like that will always have to be done - distributed or not. This problem probably only arises when there are many very small structures, so then maybe one needs to write in batches...
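For concreteness, here is a minimal sketch of the lock pattern being described - not the PR's actual code. The helper `compute_rows` and the overall signature are hypothetical stand-ins for the real per-configuration work; each task computes its slice independently and the lock serializes the writes into the shared design matrix. With many very small structures the lock is acquired once per structure, which is where batching the writes would help.

```julia
using Base.Threads

# Hypothetical sketch: compute_rows(cfg, basis) stands in for the real
# per-configuration work and is assumed to return the rows for cfg
# together with their destination row range in the design matrix.
function assemble_threaded(configs, basis, nrows, ncols)
    A = zeros(nrows, ncols)
    lk = ReentrantLock()
    @threads for i in eachindex(configs)
        block, rows = compute_rows(configs[i], basis)  # thread-local work
        lock(lk) do
            A[rows, :] .= block   # serialized write into the shared matrix
        end
    end
    return A
end
```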
This makes me wonder if adjusting the
I appreciate that, but the other thing to keep in mind is that for people like me who don't work with clusters but with large shared-memory nodes, the overhead of working with processes is pretty annoying.
Is it starting up the processes that is annoying, or just the throughput? If the throughput, I think we might be able to address that by increasing the batch size to reduce communication latency. I also work on single nodes most of the time. But by all means, do submit the PR if it's important to you.
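For reference, `Distributed.pmap` already has a `batch_size` keyword that ships work to workers in chunks rather than one item at a time. A minimal sketch, where `assemble_one` is a made-up placeholder for the real per-configuration work:

```julia
using Distributed
addprocs(4)

# Placeholder for the real per-configuration assembly work.
@everywhere assemble_one(cfg) = sum(abs2, cfg)

configs = [rand(1000) for _ in 1:10_000]

# batch_size > 1 sends chunks of configs per message, amortizing
# the per-message communication latency across the batch.
results = pmap(assemble_one, configs; batch_size = 50)
```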
Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.
I started to dig into this today. The first thing I wanted to test is drop-in threaded replacements. I'm running some scalability tests now and will report the results here. Happy to do the same for your solution, and whatever else we come up with.
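The code reference in that comment did not survive extraction. Assuming the distributed map being replaced is `pmap`, one commonly used drop-in threaded counterpart (an assumption on my part, not necessarily what was tested here) is `ThreadsX.map`:

```julia
using ThreadsX

# Same call shape as map/pmap, but scheduled over threads instead of
# worker processes; assemble_one and configs are the placeholders above.
results = ThreadsX.map(assemble_one, configs)
```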
Make sure to test with datasets that have a lot of variation in structure size. I believe that's where we get the worst performance: a few huge structures and many unit cells.
Here's what I could manage today. It's for the PRX Silicon dataset (2475 configs of somewhat varying size), with a (4,16) basis, giving a 528978 × 1445 design matrix. The threaded implementation is @cortner's ... the others I tried were comparable but slower. Both plots are for the matrix assembly. The first plot includes Julia startup and compilation time (to look out for any extra distributed overhead). The second plot shows the time for a subsequent assembly (so, not affected by startup or compilation). The main takeaways are how similar they are, and that both hit diminishing returns fairly quickly. The distributed version does seem to have a marginally higher startup cost, but then slightly outperforms the threaded one once that aspect is removed.
Do you want me to try one of your datasets? If the trend there is the same, then I suppose we need to look for system-dependent explanations?
Thank you for looking into this. I'll have to ask for permission to share the dataset I'm using.
Because precompile does not work (for P4ML), Julia will compile everything for every process. From what I read about performance above and here, it seems that performance is limited by memory bandwidth, most likely by random writes. This means there is no real difference between multithreading and multiprocessing. In fact, if there is an SVD, that takes a lot of time as well. The options are multithreading with normal LAPACK, or multiprocessing with ScaLAPACK (which also means MPI). Basically, having only multithreading is probably just easier.
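On the LAPACK point: in a multithreading-only design, the dense factorizations can still be parallelized by giving threads to BLAS/LAPACK rather than bringing in MPI/ScaLAPACK. A minimal sketch of that knob (matrix sizes here are arbitrary illustrations):

```julia
using LinearAlgebra

BLAS.set_num_threads(16)   # let LAPACK use 16 threads for dense kernels
A = rand(20_000, 2_000)
F = svd(A)                 # the SVD now runs on the BLAS thread pool
```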
I didn't say this is always true and I would in fact be surprised if it is. I think this varies a lot between datasets.
The SVD comes later. This discussion is only about assembling the design matrix. For most of our problems the SVD is much faster than the design matrix assembly.
It was public after all. Would you mind trying this one? https://archive.materialscloud.org/record/2022.102 If the performance is the same again, then I'll rerun your tests on my machine.
Sure - can you give me rough basis settings? Or even just the matrix size?
We should add proper benchmarking using PkgBenchmark.jl. I can add the basics of it if you give me an example input. After that you can add more tests to the file.
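For orientation, PkgBenchmark.jl reads a `benchmark/benchmarks.jl` file that defines a BenchmarkTools `SUITE`. A rough sketch of the shape, with the dataset loader and assembly call as hypothetical placeholders since the example input isn't fixed yet:

```julia
# benchmark/benchmarks.jl
using BenchmarkTools

const SUITE = BenchmarkGroup()
SUITE["assembly"] = BenchmarkGroup()

data = load_example_dataset()       # hypothetical example input
SUITE["assembly"]["design_matrix"] =
    @benchmarkable assemble($data)  # hypothetical assembly entry point
```

Results across commits can then be compared with `PkgBenchmark.benchmarkpkg` and `PkgBenchmark.judge`.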
That's the model that should work. The one that failed is make_model(TDEGS_4[3]; pure=true). Don't do this with a level larger than 3, since generating the model then takes a very, very long time.
Use Folds.jl rather than
Ah, thank you! I was wondering about that, actually.
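A sketch of the suggested switch: Folds.jl exposes Base-like map/reduce with pluggable executors (from Transducers.jl), so the threaded and distributed code paths can share one body. `assemble_one` and `configs` remain the illustrative placeholders from above:

```julia
using Folds
using Transducers: ThreadedEx, DistributedEx

# Same body, different executor: threads...
results = Folds.map(assemble_one, configs, ThreadedEx())
# ...or worker processes:
# results = Folds.map(assemble_one, configs, DistributedEx())
```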
@cortner, I can confirm your distributed-vs-threading observation on the iron dataset, with one caveat: I can't find the original data (from the earlier paper, whose links to the data appear broken), so I am only fitting to the "extensions" in the paper you linked (2306 configs, matrix size 36725 × 2773). Here are plots analogous to the above. I can't really explain them at this point.
I'm open to this, but don't have time to clean things up at the moment.
That's much, much smaller than the full dataset I'm working with, but perfectly fine if it reproduces the problem :)
I spent most of last night and today (without much success) trying to improve the parallel scaling of assembly, looking at both the multiprocess and multithreaded paths. In particular, I set a goal of reducing assembly time for the silicon dataset (same model parameters as in the plots above) to below one minute using 50 threads and/or processes. I tried a lot of things, but I never got below 100 seconds. A few observations:
Altogether, I think I will add a batch_size option to the main assembly routine, but I'm running out of ideas for dramatic improvement. Edit: I should also try a newer Julia version that has multi-threaded GC.
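A hedged sketch of what that batch_size option could look like on the threaded path, reusing the hypothetical `compute_rows` helper from earlier: each task processes a whole batch of configurations locally and takes the lock once per batch, rather than once per (possibly tiny) structure.

```julia
using Base.Threads

# Hypothetical batched variant; compute_rows is the same stand-in as above.
function assemble_batched(configs, basis, nrows, ncols; batch_size = 16)
    A = zeros(nrows, ncols)
    lk = ReentrantLock()
    batches = collect(Iterators.partition(configs, batch_size))
    @threads for batch in batches
        blocks = [compute_rows(cfg, basis) for cfg in batch]  # local work
        lock(lk) do                    # one lock acquisition per batch
            for (block, rows) in blocks
                A[rows, :] .= block
            end
        end
    end
    return A
end
```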
That means that the reason for the bad performance is in the per-configuration evaluation. Do those energy, forces, and virial calls go back to JuLIP?
They do - and in the past we ensured that the JuLIP calculators switch to single-threaded mode. But maybe we need to double-check this here.
Also, with the JuLIP retirement incoming, we will need to reimplement them at some point. So we could consider doing it now.
That's a very risky and huge change. We cannot do this before publishing. The problem is that JuLIP.Atoms deeply permeates ACE1.jl, ACE1x.jl and ACE1pack.jl. Off-topic: ... but please do come up with a plan for how this transition could be managed.
Agree - at one point I confirmed it was basically all