SendRecv bugs, add support for annotations #2

Merged
merged 36 commits into from
Nov 26, 2024


@a0x8o a0x8o commented Nov 11, 2024

Clean up HIP IMPORTED targets in CMake (https://github.com/elemental/Elemental/pull/137) [commit 4abe4ef](https://github.com/a0x8o/elemental/commit/4abe4ef0eacb1e86dc66d374da8a4b193cd8aa7d)

Add support for roctx annotations (https://github.com/elemental/Elemental/pull/138) [commit ebfbc64](https://github.com/a0x8o/elemental/commit/ebfbc6418f29134d9696eb763472d964b6e9aaf2)

Add debugging annotations to the MemoryPool (https://github.com/elemental/Elemental/issues/141) [commit 570b4fc](https://github.com/a0x8o/elemental/commit/570b4fc47011cb7814677a88c2f1415a25ebdd6e)

Reenable the Al::NCCLBackend dispatch for SendRecv (https://github.com/elemental/Elemental/issues/144) [commit 385877e](https://github.com/a0x8o/elemental/commit/385877ec9d4afa67e20d5088502d995d1616f019)

Revert "Reenable the Al::NCCLBackend dispatch for SendRecv (elemental#144

Fix a bug in in-place sendrecv operation (https://github.com/elemental/Elemental/issues/147) [commit 007d37e](https://github.com/a0x8o/elemental/commit/007d37e4ebcff31ba1fe3b7a95e30cfb2476f8bd)

Fix a small logic bug in El::copy::Translate (https://github.com/elemental/Elemental/pull/148) [commit 08f3f6a](https://github.com/a0x8o/elemental/commit/08f3f6a32a9f86a08ff929455f885537c97e1332)

Use version info rather than try_compile to determine Aluminum featur…

Add metadata support to TranslateBetweenGrid for Star VC (https://github.com/elemental/Elemental/pull/151) [commit 2f7f309](https://github.com/a0x8o/elemental/commit/2f7f309e76118ec9341ac24e67b9b5f814344054)

Update ROCm includes (https://github.com/elemental/Elemental/pull/152) [commit 6ff82fe](https://github.com/a0x8o/elemental/commit/6ff82fed5a3030037f262b7796225ae5f5f03cd1)

Reduce the number of event record calls (https://github.com/elemental/Elemental/issues/153) [commit 315bc64](https://github.com/a0x8o/elemental/commit/315bc6461d45690ff3dc962bb66b6bcd18a1be9a)

Fix naming bug (https://github.com/elemental/Elemental/pull/154) [commit 12f0180](https://github.com/a0x8o/elemental/commit/12f0180e2e773898ef27ada83b78707f8afd2a7b)

Fix a subtle parameter expansion issue in sync info (https://github.com/elemental/Elemental/pull/155) [commit a0d3bbd](https://github.com/a0x8o/elemental/commit/a0d3bbdd239ead42ef54b1ee57751aeb20933e54)

Skip DifferentGridsGeneral tests. (https://github.com/elemental/Elemental/pull/158) [commit 904f629](https://github.com/a0x8o/elemental/commit/904f6290eb74eefa72f49dfad5401b7745bf7dc2)

Remove warnings about function deprecation (https://github.com/elemental/Elemental/pull/160) [commit 98db86b](https://github.com/a0x8o/elemental/commit/98db86b13d0a8e5dc381a97c1cfbbddfa0576bf4)

Update the event creation flags under HIP (https://github.com/elemental/Elemental/pull/161) [commit 6537d24](https://github.com/a0x8o/elemental/commit/6537d24fc7557aba3ad7ee88a3fd60be63392cea)

Fix half includes for using the one that ships with ROCm (https://github.com/elemental/Elemental/pull/163) [commit 40bfa0d](https://github.com/a0x8o/elemental/commit/40bfa0d4ce0b8e0861b9afca31da9acf14508247)

Enable FP16 on ROCm systems (https://github.com/elemental/Elemental/issues/109) [commit 2aa6443](https://github.com/a0x8o/elemental/commit/2aa6443c84b0aa9dcf686d2236036a7968295ab5)

Quick patch for build error without half (https://github.com/elemental/Elemental/issues/164) [commit edfdb29](https://github.com/a0x8o/elemental/commit/edfdb2957b7c98698b1b544947b0ec5b4991834c)

add function decls for fp16 GPU blas functions (https://github.com/elemental/Elemental/pull/165) [commit 485ba43](https://github.com/a0x8o/elemental/commit/485ba43369419ecb418705473c1796c7cf239a12)

Fix prototype of Nrm2 on CUBLAS (https://github.com/elemental/Elemental/issues/166) [commit e19ad6d](https://github.com/a0x8o/elemental/commit/e19ad6deb8b879043ad8d7113be588d7b634ec66)

Force destruction of "special" comms before shutdown (https://github.com/elemental/Elemental/pull/167) [commit d8539d2](https://github.com/a0x8o/elemental/commit/d8539d2eca3ca7161d45824bd6d4861b4920253b)

Manage host returns from rocBLAS manually (https://github.com/elemental/Elemental/issues/168) [commit bc4ea53](https://github.com/a0x8o/elemental/commit/bc4ea53067ea9c5fd544a5170e05cb2e977257c0)

Match cuBLAS sync behavior (https://github.com/elemental/Elemental/pull/169) [commit d883ac3](https://github.com/a0x8o/elemental/commit/d883ac3d9367aab480b1c8763fa8cf6489198c1d)

Update CMakeLists.txt

Extended GPU memory pool (https://github.com/elemental/Elemental/issues/172) [commit 587ffa4](https://github.com/a0x8o/elemental/commit/587ffa4f61d0277b4795432aaa2680b36b9d0806)

Fix ROCm compile issues (https://github.com/elemental/Elemental/issues/174) [commit 2a6b657](https://github.com/a0x8o/elemental/commit/2a6b657c60fd5d05b379319081b34e06b49b71e0)

Don't declare operator overloads for half with CUDA >= 12.2. (elemental#175

Update ElementalREADME.md

[intel MPI] handle quotes in MPI configuration (https://github.com/elemental/Elemental/issues/177) [commit f853bf7](https://github.com/a0x8o/elemental/commit/f853bf7e5c352b83b175b5e5486f525d7f720e72)

Add Aluminum dispatch for send and recv within TranslateBetweenGrids (elemental#178

Restore the matrix shape allreduce in TranslateBetweenGrids (elemental#181

Add an EnsureComm call to make sure things are sane (https://github.com/elemental/Elemental/issues/182) [commit e6aea8d](https://github.com/a0x8o/elemental/commit/e6aea8dec9ab091b99002fdcb8c335b86da3bfef)

Make the separate communication stream optional (https://github.com/elemental/Elemental/pull/184) [commit 039d681](https://github.com/a0x8o/elemental/commit/039d681cc7e12cf2fdd8e5643f8ecc1beb3a25c7)

Add macro protection to GPU code block (https://github.com/elemental/Elemental/issues/185) [commit a5db028](https://github.com/a0x8o/elemental/commit/a5db028869809b64932383f2c2665148ad2db5be)

Fix uncaught_exception issue (https://github.com/elemental/Elemental/pull/186) [commit a7be48c](https://github.com/a0x8o/elemental/commit/a7be48ca8c036489313fe3785e671f9e97975f5d)

benson31 and others added 30 commits June 24, 2022 00:17
* Add support for roctx annotations

* Add roctracer to the CMake export

* Protect imported library double-creation

* Don't throw exceptions anymore

* Fix typo

* Add support for Flux when identifying the local device
* Add debugging annotations to the MemoryPool. Control with H_MEMPOOL_DEBUG environment variable.

* Add env variable control for other mempool parameters

Maybe we want to further separate these based on pinned/not-pinned,
but this seems fine for the current use-cases.
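
For readers unfamiliar with this kind of switch, environment-variable control of a memory pool can be sketched roughly as below. This is only an illustration, not Hydrogen's code: `H_MEMPOOL_DEBUG` is the one name taken from the commit message, while the helpers `env_flag` and `env_double`, and the rule that an empty value or "0" means "off", are assumptions made for the example.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: true when the variable is set to anything other
// than an empty string or "0". The real parsing rules may differ.
inline bool env_flag(char const* name)
{
    char const* const val = std::getenv(name);
    return val != nullptr && *val != '\0' && std::string(val) != "0";
}

// Hypothetical helper: read a floating-point tuning parameter, falling
// back to a default when the variable is unset or not a clean number.
inline double env_double(char const* name, double fallback)
{
    char const* const val = std::getenv(name);
    if (val == nullptr || *val == '\0')
        return fallback;
    char* end = nullptr;
    double const parsed = std::strtod(val, &end);
    return (end != val && *end == '\0') ? parsed : fallback;
}
```

A pool constructor could then call something like `env_flag("H_MEMPOOL_DEBUG")` once at startup and cache the result, which keeps the lookup off the allocation path.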
* Fix a bug in in-place sendrecv operation

This provides a correct backward-compatible implementation of in-place
SendRecv that we can use with old versions of Aluminum. When a new
Aluminum is released with in-place SendRecv support, this will
directly dispatch to that automatically, with no update necessary.

Closes #146.

* Correct in-place Al call; test for the call directly instead of version
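
The "test for the call directly instead of version" approach described above can be illustrated with the C++17 detection idiom. The sketch below is an assumption-laden stand-in, not the actual Hydrogen or Aluminum code: `OldBackend`, `NewBackend`, `HasInPlaceSendRecv`, `InPlaceSendRecv`, and the member names `SendRecv` / `SendRecvInPlace` are all hypothetical, and the mock backends just add 1 to each element in place of a real exchange.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Detection idiom: true only when Backend::SendRecvInPlace(T*, size_t)
// is a valid call. Every name in this sketch is hypothetical.
template <typename Backend, typename T, typename = void>
struct HasInPlaceSendRecv : std::false_type {};

template <typename Backend, typename T>
struct HasInPlaceSendRecv<
    Backend, T,
    std::void_t<decltype(Backend::SendRecvInPlace(
        std::declval<T*>(), std::declval<std::size_t>()))>>
    : std::true_type {};

inline int native_in_place_calls = 0; // counts uses of the native path

// Mock "old" backend: out-of-place exchange only.
struct OldBackend
{
    template <typename T>
    static void SendRecv(T const* send, T* recv, std::size_t count)
    {
        for (std::size_t i = 0; i < count; ++i)
            recv[i] = send[i] + 1; // stand-in for data from a peer
    }
};

// Mock "new" backend: offers a native in-place exchange.
struct NewBackend
{
    template <typename T>
    static void SendRecvInPlace(T* buf, std::size_t count)
    {
        ++native_in_place_calls;
        for (std::size_t i = 0; i < count; ++i)
            buf[i] = buf[i] + 1;
    }
};

// Dispatch on the presence of the call itself, not on a version number.
template <typename Backend, typename T>
void InPlaceSendRecv(T* buf, std::size_t count)
{
    if constexpr (HasInPlaceSendRecv<Backend, T>::value)
    {
        Backend::SendRecvInPlace(buf, count);
    }
    else
    {
        // Fallback: stage the outgoing data in a temporary so the
        // receive can safely overwrite the caller's buffer.
        std::vector<T> staged(buf, buf + count);
        Backend::SendRecv(staged.data(), buf, count);
    }
}
```

With this shape, a backend that later gains the in-place call is picked up at compile time with no change to the caller, which matches the "no update necessary" behavior the commit message describes.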
* Reduce the number of event record calls when they are unnecessary

* Apply suggestions from code review

Co-authored-by: Tom Benson <[email protected]>

* Formatting

---------

Co-authored-by: Tom Benson <[email protected]>
It's mildly unfortunate that the function for recording the event on a
SyncInfo is called "AddSynchronizationPoint". Thus we need an explicit
`if constexpr` to stop the recursion such that the event record is
actually avoided when we don't need it. C'est la vie.
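
A loose illustration of that pattern follows. It is not the Hydrogen implementation: the `SyncInfo` struct here is a counter-only mock, `WaitOn` is a hypothetical helper, and the rule "only GPU sync objects record or wait on events" is an assumption for the example. What it does show is a multi-argument overload sharing its name with the single-argument recorder, with an `if constexpr` guard so that the record is skipped entirely when nothing would wait on it.

```cpp
enum class Device { CPU, GPU };

// Mock sync object: counters stand in for real event operations.
template <Device D>
struct SyncInfo
{
    int events_recorded = 0;
    int waits = 0;
};

// Single-argument form: records an event (GPU only in this sketch).
template <Device D>
void AddSynchronizationPoint(SyncInfo<D>& si)
{
    if constexpr (D == Device::GPU)
        ++si.events_recorded;
}

// Hypothetical helper: make `waiter` wait on the recorded event.
template <Device D>
void WaitOn(SyncInfo<D>& waiter)
{
    if constexpr (D == Device::GPU)
        ++waiter.waits;
}

// Multi-argument form with the same name. The explicit `if constexpr`
// keeps it from calling the recorder when no GPU object would wait.
template <Device D, Device D1, Device... Ds>
void AddSynchronizationPoint(SyncInfo<D>& primary,
                             SyncInfo<D1>& first,
                             SyncInfo<Ds>&... rest)
{
    constexpr bool any_gpu_waiter =
        ((D1 == Device::GPU) || ... || (Ds == Device::GPU));
    if constexpr (D == Device::GPU && any_gpu_waiter)
    {
        AddSynchronizationPoint(primary); // exactly one event record
        WaitOn(first);
        (WaitOn(rest), ...);
    }
}
```

For example, `AddSynchronizationPoint(gpu, cpu)` records nothing in this sketch, while `AddSynchronizationPoint(gpu, cpu, other_gpu)` records once and makes only `other_gpu` wait.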
* Update the event creation flags under HIP

* Support ROCm versions older than 5.6.0
* Modified HalfPrecision.hpp to use fp16 on systems with GPUs

* fix some CMake logic

* initial changes to fix __half->double issues

* Everything compiles

* Update the config

* Fix for LBANN compilation

* Revert whitespace/indentation changes

* Some additional cleanup; fp16 dot, nrm2, scal impls

* Fix a few other issues

---------

Co-authored-by: Tom Benson <[email protected]>
This only helps if El::Finalize is actually called.
Development version now reports as 1.5.4
* Port CUB binned memory allocator to Hydrogen

* Add tests

* Implement linear and custom binned allocation

* Modify human-readable output

* Implement mallocAsync backend

* Improve reporting and configurability

* Documentation update

* Use streams when freeing

* Revert "Use streams when freeing"

This reverts commit 5fe11e5.

* Remove need for active stream in free

* Apply suggestions from code review

Co-authored-by: Tom Benson <[email protected]>

---------

Co-authored-by: Tom Benson <[email protected]>
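
To make the binning terminology in this commit concrete, here is a minimal sketch of how a request size might be rounded up under geometric, linear, and custom bin schemes. The geometric rule (smallest power of a growth factor that fits the request, starting from a minimum bin) follows the scheme used by CUB's caching allocator; the linear and custom rules, and all function names here, are assumptions for illustration and may not match what Hydrogen actually does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Geometric bins: smallest growth^k >= bytes with k >= min_bin.
// Assumes growth >= 2 (otherwise the loop would not terminate).
inline std::size_t geometric_bin(std::size_t bytes,
                                 std::size_t growth,
                                 unsigned min_bin)
{
    std::size_t size = 1;
    for (unsigned k = 0; k < min_bin; ++k)
        size *= growth;
    while (size < bytes)
        size *= growth;
    return size;
}

// Linear bins: round up to the next multiple of `step` (step > 0).
inline std::size_t linear_bin(std::size_t bytes, std::size_t step)
{
    return ((bytes + step - 1) / step) * step;
}

// Custom bins: smallest listed size >= bytes. `sorted_bins` must be in
// ascending order. Returns 0 when no bin fits, in which case a caller
// would presumably fall back to an exact-size allocation.
inline std::size_t custom_bin(std::size_t bytes,
                              std::vector<std::size_t> const& sorted_bins)
{
    auto const it =
        std::lower_bound(sorted_bins.begin(), sorted_bins.end(), bytes);
    return it == sorted_bins.end() ? 0 : *it;
}
```

Geometric bins waste more memory per allocation but need few distinct sizes to cover a wide range; linear and custom bins trade that the other way, which is presumably why the commit offers them as alternatives.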
benson31 added 6 commits June 10, 2024 16:15
…178)

* Add Aluminum dispatch for send and recv within TranslateBetweenGrids

This commit only addresses the synchronous calls in the (STAR,VC) variant. I'm
using this as a bit of a testbed before adding these wholesale into regular
`El::mpi` dispatch, as the point-to-point use-cases are a bit more delicate.

* Appease linguistic nits being picked offline

* Fix formatting

* Squash build warnings

* Aluminum-ize El::mpi::Send and El::mpi::Recv; Aluminum now required

Note that the nonblocking Send/Recv are still not Aluminum-ized. (They also
don't check host/device-ness of buffers, so there's that...) It would be a
bigger lift to incorporate Aluminum with the Request struct, so I'm disinclined
to do that until there's a programmatic need.

* Revert to now-Aluminum-ized El::mpi::Send and El::mpi::Recv; cleanup syncs
If this option is not set, the communication will land on the
Hydrogen-default stream. This _could_ be a separate stream from the
data compute stream, but in the LBANN use case, it generally will be
the compute stream.
@a0x8o a0x8o merged commit f6bce6c into a0x8o:master Nov 26, 2024
7 participants