Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nonblocking Collectives #456

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

manjugv
Copy link
Collaborator

@manjugv manjugv commented Oct 26, 2020

No description provided.

@manjugv manjugv force-pushed the topic/nb_collectives branch from 2683ab3 to a77c345 Compare January 21, 2022 19:10
@manjugv manjugv changed the title Nonblocking Collectives (Draft) Nonblocking Collectives Jan 21, 2022
@manjugv manjugv self-assigned this Jan 21, 2022
Like data exchange semantics, the entry and completion
criteria of blocking and nonblocking alltoall is similar.

{\bf Entry criteria}: Before any \ac{PE} calls a \FUNC{shmem\_alltoall\_nb} routine,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entry criteria is confusing - if we have the same entry criteria as blocking operations - then the users are supposed to make sure the src/dst buffers are available at the point of entry - meaning the users are expected to sync with themselves before launching the nb collective operation.

AFAIU - the src/dst needs to be available when everyone in the team reaches the state - which can be determined during runtime - without the need for explicit sync before launching the nb collective operation.

Please clarify the intended semantics.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naveen, my intent is to keep it consistent with the semantics of blocking collectives. I wanted to improve the description. I don’t see how the current wording is violating that. If the wording doesn’t reflect that, then we can fix it.

@manjugv
Copy link
Collaborator Author

manjugv commented May 26, 2023

Feedback from Jim (May Spec Meeting): Should we consider shmem_local_complete ?

\item Collective Types: The nonblocking variants supported include barrier, alltoall,
and broadcast collectives. Other collective operations such as
reductions, collect, barrier, alltoalls, and sync will not have nonblocking variants.

Copy link
Collaborator

@davidozog davidozog Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should sync_all be supported? My hunch is it seems possibly useful...

@jdinan
Copy link
Collaborator

jdinan commented Feb 27, 2024

Feedback from Feb. 27, 2024 discussion:

  1. Request handle should be passed by reference to wait/test to allow implementation to reset the request handle's value to SHMEM_REQUEST_INVALID.
  2. We discussed a counter proposal to make "nbi" collectives on the team that can be completed in bulk via shmem_team_sync or other similar operation. This would mirror the contexts behavior for nonblocking RMA/AMO.
  3. Discussion about how to manage internal psync space used by the collectives
  • Wait/test are local operations which prevents freeing symmetric memory in the wait/test. If a pool allocator is used, the pool will become inconsistent across PEs. This can be addressed through periodic synchronization when initiating a collective to "garbage collect" pSync operations.

threads but on different request objects. It is the responsibility of the
\openshmem user to ensure that proper synchronization is used to prevent race
conditions or deadlock. Specifically, the \FUNC{shmem\_req\_wait} operation should
be called after the collective operation to ensure that the request object is
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the requests are always deallocated in wait or test call, and if request handles are not reused, it will not run into the scenario of "premature deallocation".

\apidescription{
A call to \FUNC{shmem\_req\_test} returns immediately. If the
operation identified by the request is completed, it returns
zero, and the request object is deallocated and cannot be reused.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could request objects have value SHMEM_REQUEST_INVALID after they are deallocated? In which case this routine and shmem_req_wait would take an argument of type shmem_req_h *.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidozog Just updated. Please let me know if it's missing something.

@jdinan
Copy link
Collaborator

jdinan commented Jun 27, 2024

Notes from today's discussion -- 06-27-2024 -- NBI Collectives pSync Management.pptx

Add SHMEM_REQ_INVALID library constant

Address WG Feedback

Clarify collective start/execution

Remove barrier_all_nb

Address Feedback

Minor update to nb_collective_intro.tex

Clarify completion of broadcast_nb
\item Completion semantics: \openshmem programs can learn the status of the collective operations
using the \FUNC{shmem\_req\_test} routine. The operation is completed after
a call to \FUNC{shmem\_req\_test} or a call to \FUNC{shmem\_req\_wait}.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to what @kwaters4 mentioned. Add something like "Completion of the operation can be observed through one or more calls to \FUNC{shmem_req_test} or a single call to \FUNC{shmem_req_wait}."

@@ -0,0 +1,123 @@
\apisummary{
Exchanges a fixed amount of contiguous data blocks between all pairs
of \acp{PE} participating in the collective routine.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we mention later that "All PEs in the provided team must participate in the collective.", can we just say that here, instead of keeping it in the abstract?
i.e. "Exchanges a fixed amount of contiguous data blocks between all PEs on the specified team."

\item The \source{} data object may be safely reused.
\item The \dest{} data object is updated.
\item If the local \ac{PE} is \VAR{PE\_root}, the data has been copied
out of the \source{} data object.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this change to a separate PR and ask section committee to add it. @davidozog

@jdinan
Copy link
Collaborator

jdinan commented Aug 23, 2024

This proposal was brought forward for a vote on August 23, 2024 and the ballot did not pass. The proposal will need to be re-read before it can be voted on again.

@@ -0,0 +1,38 @@
\apisummary{
The routine waits until a operation identified by a request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "an operation" instead of "a operation"

@wfaderhold21
Copy link

Capturing feedback from specification meeting:

  1. General feedback including from HPE and DOD: not confident regarding the proposal and want to delay out of abundance of caution (get it right once rather than continually fix later). Would like to see some performance numbers with a working reference implementation on InfiniBand networks. Would also like to see some driving use cases.
  2. Intel Feedback: Need examples in proposal and they feel all proposals should have reference implementations.
  3. No feedback from AMD or ANL...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants