forked from elemental/Elemental
SendRecv bugs, add support for annotations #2
Merged
* Add support for roctx annotations
* Add roctracer to the CMake export
* Protect imported library double-creation
* Don't throw exceptions anymore
* Fix typo
* Add support for Flux when identifying the local device
* Add debugging annotations to the MemoryPool. Control with H_MEMPOOL_DEBUG environment variable.
* Add env variable control for other mempool parameters

Maybe we want to further separate these based on pinned/not-pinned, but this seems fine for the current use-cases.
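A minimal sketch of environment-variable control of this kind, for illustration only. `H_MEMPOOL_DEBUG` is the variable named in the commit message; the helper names `env_flag` and `env_double`, the parsing rules, and any other variable name are assumptions and are not taken from Hydrogen's source.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: treats an unset variable as the default, and the
// values "", "0", "false", and "off" as disabled. The real MemoryPool may
// parse its variables differently.
inline bool env_flag(char const* name, bool default_value = false)
{
    char const* raw = std::getenv(name);
    if (raw == nullptr)
        return default_value;
    std::string const value(raw);
    return !(value.empty() || value == "0" || value == "false"
             || value == "off");
}

// Hypothetical helper for numeric pool parameters: falls back to the
// default when the variable is unset or does not start with a number.
inline double env_double(char const* name, double default_value)
{
    char const* raw = std::getenv(name);
    if (raw == nullptr)
        return default_value;
    char* end = nullptr;
    double const parsed = std::strtod(raw, &end);
    return (end == raw) ? default_value : parsed;
}
```

With helpers like these, a pool constructor could read `env_flag("H_MEMPOOL_DEBUG")` once at start-up and keep the result in a member, so the environment is not queried on every allocation.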
* Fix a bug in in-place sendrecv operation

  This provides a correct backward-compatible implementation of in-place SendRecv that we can use with old versions of Aluminum. When a new Aluminum is released with in-place SendRecv support, this will directly dispatch to that automatically, with no update necessary. Closes #146.
* Correct in-place Al call; test for the call directly instead of version
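The idea of falling back to a copy-based in-place exchange, while dispatching to a native in-place call when one exists, might be sketched as follows. This is not Aluminum's interface: `OldBackend`, `NewBackend`, `SendRecvInPlace`, `has_in_place`, and `InPlaceSendRecv` are all hypothetical names, the "exchange" is a single-process loopback stand-in, and the compile-time detection shown here is only one way to "test for the call directly".

```cpp
#include <algorithm>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Mock backend with only an out-of-place exchange. It overwrites the
// receive buffer before reading the send buffer, so aliasing the two
// buffers would lose the data being sent.
struct OldBackend
{
    static void SendRecv(int const* send, int* recv, std::size_t n)
    {
        std::fill(recv, recv + n, 0);
        for (std::size_t i = 0; i < n; ++i)
            recv[i] += send[i];
    }
};

// Mock backend that additionally offers a native in-place exchange.
struct NewBackend
{
    static inline int in_place_calls = 0;
    static void SendRecv(int const* send, int* recv, std::size_t n)
    {
        OldBackend::SendRecv(send, recv, n);
    }
    // Loopback stand-in: the buffer keeps its contents.
    static void SendRecvInPlace(int*, std::size_t) { ++in_place_calls; }
};

// Detects whether Backend::SendRecvInPlace(int*, size_t) is callable.
template <typename Backend, typename = void>
struct has_in_place : std::false_type {};

template <typename Backend>
struct has_in_place<
    Backend,
    std::void_t<decltype(Backend::SendRecvInPlace(std::declval<int*>(),
                                                  std::size_t{}))>>
    : std::true_type {};

template <typename Backend>
void InPlaceSendRecv(std::vector<int>& buffer)
{
    if constexpr (has_in_place<Backend>::value)
    {
        Backend::SendRecvInPlace(buffer.data(), buffer.size());
    }
    else
    {
        // Backward-compatible path: send from a temporary copy so the
        // send and receive buffers never alias.
        std::vector<int> const send_copy(buffer);
        Backend::SendRecv(send_copy.data(), buffer.data(), buffer.size());
    }
}
```

Calling `InPlaceSendRecv<OldBackend>(v)` takes the copy-based path, while `InPlaceSendRecv<NewBackend>(v)` dispatches to the native call without any change at the call site.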
* Reduce the number of event recordings when unnecessary to call
* Apply suggestions from code review

  Co-authored-by: Tom Benson <[email protected]>
* Formatting

---------

Co-authored-by: Tom Benson <[email protected]>
It's mildly unfortunate that the function for recording the event on a SyncInfo is called "AddSynchronizationPoint". Thus we need an explicit `if constexpr` to stop the recursion such that the event record is actually avoided when we don't need it. C'est la vie.
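One plausible reading of this note, sketched with mock types: a variadic helper records an event on a "master" sync object and makes the others wait on it, and a compile-time guard skips the record entirely when there is nobody to wait. `MockSyncInfo`, `WaitOn`, and `AllWaitOnMaster` are hypothetical, and the single-argument `AddSynchronizationPoint` signature is assumed rather than copied from Hydrogen; the actual recursion the author mentions is not reproduced here.

```cpp
#include <cstddef>

// Mock stand-in for a SyncInfo; counts what would be GPU event calls.
struct MockSyncInfo
{
    int records = 0;  // number of event records issued on this object
    int waits = 0;    // number of waits this object performed
};

// Assumed single-argument form: records an event on the sync object.
inline void AddSynchronizationPoint(MockSyncInfo& si)
{
    ++si.records;
}

// Hypothetical: makes `waiter` wait on the event recorded on `master`.
inline void WaitOn(MockSyncInfo& waiter, MockSyncInfo const& /*master*/)
{
    ++waiter.waits;
}

// Records on `master` only when at least one other object must wait.
// Without the guard, an empty pack would still pay for the event record.
template <typename... Others>
void AllWaitOnMaster(MockSyncInfo& master, Others&... others)
{
    if constexpr (sizeof...(others) > 0)
    {
        AddSynchronizationPoint(master);
        (WaitOn(others, master), ...);
    }
}
```

Because the guard is `if constexpr`, the call to `AddSynchronizationPoint` is not even instantiated for an empty pack, so the saving does not depend on the optimizer.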
* Update the event creation flags under hip
* Support ROCm version less than 5.6.0
* Modified HalfPrecision.hpp to use fp16 on systems with GPUs
* fix some CMake logic
* initial changes to fix __half->double issues
* Everything compiles
* Update the config
* Fix for LBANN compilation
* Revert whitespace/indentation changes
* Some additional cleanup; fp16 dot,nrm2,scal impls
* Fix a few other issues

---------

Co-authored-by: Tom Benson <[email protected]>
This only helps if El::Finalize is actually called.
Development version now reports as 1.5.4
* Port CUB binned memory allocator to Hydrogen
* Add tests
* Implement linear and custom binned allocation
* Modify human-readable output
* Implement mallocAsync backend
* Improve reporting and configurability
* Documentation update
* Use streams when freeing
* Revert "Use streams when freeing"

  This reverts commit 5fe11e5.
* Remove need for active stream in free
* Apply suggestions from code review

  Co-authored-by: Tom Benson <[email protected]>

---------

Co-authored-by: Tom Benson <[email protected]>
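As a rough illustration of what "binned" allocation means, the sketch below rounds a requested size up to a bin size in two ways: geometric bins (sizes of the form growth^k, the scheme used by CUB-style caching allocators) and linear bins (multiples of a fixed step). The function names, parameters, and the convention of returning 0 for oversized requests are assumptions made for this example, not Hydrogen's API.

```cpp
#include <cstddef>

// Geometric bins: rounds `bytes` up to growth^k for the smallest k in
// [min_bin, max_bin] that fits. Returns 0 when the request exceeds the
// largest bin; a caller would then allocate that request exactly and
// not cache it. Assumes growth >= 2 and that growth^max_bin fits in
// std::size_t.
inline std::size_t geometric_bin_size(std::size_t bytes,
                                      unsigned growth,
                                      unsigned min_bin,
                                      unsigned max_bin)
{
    std::size_t bin_bytes = 1;
    for (unsigned k = 0; k < min_bin; ++k)
        bin_bytes *= growth;
    for (unsigned k = min_bin; k <= max_bin; ++k)
    {
        if (bytes <= bin_bytes)
            return bin_bytes;
        bin_bytes *= growth;
    }
    return 0;
}

// Linear bins: rounds `bytes` up to the next multiple of `step`.
// Assumes step > 0.
inline std::size_t linear_bin_size(std::size_t bytes, std::size_t step)
{
    return ((bytes + step - 1) / step) * step;
}
```

Geometric bins keep the number of distinct block sizes small at the cost of more internal padding for large requests; linear bins waste at most `step - 1` bytes per block but produce many more distinct sizes.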
…178)
* Add Aluminum dispatch for send and recv within TranslateBetweenGrids

  This commit only addresses the synchronous calls in the (STAR,VC) variant. I'm using this as a bit of testbed before wholesale adding these into regular `El::mpi` dispatch as the point-to-point use-cases are a bit more delicate.
* Appease linguistic nits being picked offline
* Fix formatting
* Squash build warnings
* Aluminum-ize El::mpi::Send and El::mpi::Recv; Aluminum now required

  Note that the nonblocking Send/Recv are still not aluminum-ized. (They also don't check host/device-ness of buffers, so there's that...) It would be a bigger lift to incorporate Aluminum with the Request struct, so I'm disinclined to do that until there's programmatic need.
* Revert to now-Aluminum-ized El::mpi::Send and El::mpi::Recv; cleanup syncs
If this option is not set, the communication will land on the Hydrogen-default stream. This _could_ be a separate stream from the data compute stream, but in the LBANN use case, it generally will be the compute stream.
Clean up HIP IMPORTED targets in CMake (https://github.com/elemental/Elemental/pull/137), commit https://github.com/a0x8o/elemental/commit/4abe4ef0eacb1e86dc66d374da8a4b193cd8aa7d
Add support for roctx annotations (https://github.com/elemental/Elemental/pull/138), commit https://github.com/a0x8o/elemental/commit/ebfbc6418f29134d9696eb763472d964b6e9aaf2
Add debugging annotations to the MemoryPool (https://github.com/elemental/Elemental/issues/141), commit https://github.com/a0x8o/elemental/commit/570b4fc47011cb7814677a88c2f1415a25ebdd6e
Reenable the Al::NCCLBackend dispatch for SendRecv (https://github.com/elemental/Elemental/issues/144), commit https://github.com/a0x8o/elemental/commit/385877ec9d4afa67e20d5088502d995d1616f019
Revert "Reenable the Al::NCCLBackend dispatch for SendRecv (elemental#144
Fix a bug in in-place sendrecv operation (https://github.com/elemental/Elemental/issues/147), commit https://github.com/a0x8o/elemental/commit/007d37e4ebcff31ba1fe3b7a95e30cfb2476f8bd
Fix a small logic bug in El::copy::Translate (https://github.com/elemental/Elemental/pull/148), commit https://github.com/a0x8o/elemental/commit/08f3f6a32a9f86a08ff929455f885537c97e1332
Use version info rather than try_compile to determine Aluminum featur…
Add metadata support to TranslateBetweenGrid for Star VC (https://github.com/elemental/Elemental/pull/151), commit https://github.com/a0x8o/elemental/commit/2f7f309e76118ec9341ac24e67b9b5f814344054
Update ROCm includes (https://github.com/elemental/Elemental/pull/152), commit https://github.com/a0x8o/elemental/commit/6ff82fed5a3030037f262b7796225ae5f5f03cd1
Reduce the number of event record calls (https://github.com/elemental/Elemental/issues/153), commit https://github.com/a0x8o/elemental/commit/315bc6461d45690ff3dc962bb66b6bcd18a1be9a
Fix naming bug (https://github.com/elemental/Elemental/pull/154), commit https://github.com/a0x8o/elemental/commit/12f0180e2e773898ef27ada83b78707f8afd2a7b
Fix a subtle parameter expansion issue in sync info (https://github.com/elemental/Elemental/pull/155), commit https://github.com/a0x8o/elemental/commit/a0d3bbdd239ead42ef54b1ee57751aeb20933e54
Skip DifferentGridsGeneral tests. (https://github.com/elemental/Elemental/pull/158), commit https://github.com/a0x8o/elemental/commit/904f6290eb74eefa72f49dfad5401b7745bf7dc2
Remove warnings about function deprecation (https://github.com/elemental/Elemental/pull/160), commit https://github.com/a0x8o/elemental/commit/98db86b13d0a8e5dc381a97c1cfbbddfa0576bf4
Update the event creation flags under HIP (https://github.com/elemental/Elemental/pull/161), commit https://github.com/a0x8o/elemental/commit/6537d24fc7557aba3ad7ee88a3fd60be63392cea
Fix half includes for using the one that ships with ROCm (https://github.com/elemental/Elemental/pull/163), commit https://github.com/a0x8o/elemental/commit/40bfa0d4ce0b8e0861b9afca31da9acf14508247
Enable FP16 on ROCm systems (https://github.com/elemental/Elemental/issues/109), commit https://github.com/a0x8o/elemental/commit/2aa6443c84b0aa9dcf686d2236036a7968295ab5
Quick patch for build error without half (https://github.com/elemental/Elemental/issues/164), commit https://github.com/a0x8o/elemental/commit/edfdb2957b7c98698b1b544947b0ec5b4991834c
add function decls for fp16 GPU blas functions (https://github.com/elemental/Elemental/pull/165), commit https://github.com/a0x8o/elemental/commit/485ba43369419ecb418705473c1796c7cf239a12
Fix prototype of Nrm2 on CUBLAS (https://github.com/elemental/Elemental/issues/166), commit https://github.com/a0x8o/elemental/commit/e19ad6deb8b879043ad8d7113be588d7b634ec66
Force destruction of "special" comms before shutdown (https://github.com/elemental/Elemental/pull/167), commit https://github.com/a0x8o/elemental/commit/d8539d2eca3ca7161d45824bd6d4861b4920253b
Manage host returns from rocBLAS manually (https://github.com/elemental/Elemental/issues/168), commit https://github.com/a0x8o/elemental/commit/bc4ea53067ea9c5fd544a5170e05cb2e977257c0
Match cuBLAS sync behavior (https://github.com/elemental/Elemental/pull/169), commit https://github.com/a0x8o/elemental/commit/d883ac3d9367aab480b1c8763fa8cf6489198c1d
Update CMakeLists.txt
Extended GPU memory pool (https://github.com/elemental/Elemental/issues/172), commit https://github.com/a0x8o/elemental/commit/587ffa4f61d0277b4795432aaa2680b36b9d0806
Fix ROCm compile issues (https://github.com/elemental/Elemental/issues/174), commit https://github.com/a0x8o/elemental/commit/2a6b657c60fd5d05b379319081b34e06b49b71e0
Don't declare operator overloads for half with CUDA >= 12.2. (elemental#175
Update ElementalREADME.md
[intel MPI] handle quotes in MPI configuration (https://github.com/elemental/Elemental/issues/177), commit https://github.com/a0x8o/elemental/commit/f853bf7e5c352b83b175b5e5486f525d7f720e72
Add Aluminum dispatch for send and recv within TranslateBetweenGrids (elemental#178
Restore the matrix shape allreduce in TranslateBetweenGrids (elemental#181
Add an EnsureComm call to make sure things are sane (https://github.com/elemental/Elemental/issues/182), commit https://github.com/a0x8o/elemental/commit/e6aea8dec9ab091b99002fdcb8c335b86da3bfef
Make the separate communication stream optional (https://github.com/elemental/Elemental/pull/184), commit https://github.com/a0x8o/elemental/commit/039d681cc7e12cf2fdd8e5643f8ecc1beb3a25c7
Add macro protection to GPU code block (https://github.com/elemental/Elemental/issues/185), commit https://github.com/a0x8o/elemental/commit/a5db028869809b64932383f2c2665148ad2db5be
Fix uncaught_exception issue (https://github.com/elemental/Elemental/pull/186), commit https://github.com/a0x8o/elemental/commit/a7be48ca8c036489313fe3785e671f9e97975f5d