Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCC/CTX: passing cuda-check from tl ucp to mlx5 #1013

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Aug 21, 2024

Inside TL MLX5, we need to know if TL Service has cuda-support or not. Since we cannot call ucp_context_query() directly inside TL MLX5, we use a shared variable and pass it to TL MLX5. This can be extended to other capabilities in the future as well.

@MamziB MamziB self-assigned this Aug 21, 2024
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 6, 2024

@Sergei-Lebedev can you please take a look at this small patch?

@janjust
Copy link
Collaborator

janjust commented Sep 10, 2024

@MamziB please fix commit title to pass code style check

Copy link
Collaborator

@janjust janjust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a general question regarding the backwards compatibility
Also, since we're introducing the capability for a specific TL - do we have to update other TL's as well? (including private ones)

src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
Copy link
Contributor

@Sergei-Lebedev Sergei-Lebedev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my opinion it's more logical to return supported memory types as part of attributes. Since you need to query service team you can add additional field to attributes and check it in tl mlx5

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 25, 2024

@Sergei-Lebedev @janjust

After conducting a thorough code review of UCC, I found that implementing the desired feature without modifying the existing API seems unfeasible. Specifically, we need to make changes in the attribute section.

Currently, the ucc_context_attr_field lacks an attribute related to supported memory types. To address this, I propose adding a new MASK flag to ucc_context_attr_field to represent memory type support. This will allow us to introduce a new valid attribute indicating the supported memory types.

When the function ucc_tl_ucp_get_context_attr is invoked, it can check this MASK flag. If the corresponding flag is set, the memory type information can be populated accordingly. During TL MLX5 team creation, we can then call ucc_tl_ucp_get_context_attr to determine if TL UCP supports GPU memory, which would allow us to enable MCAST for GPU memory when supported.

  • Can you please advise if you have a more concrete resolution? if not, let's merge the PR as is for now. Thank you
*  @ref ucc_context_attr_t defines the attributes of the context. The bits in
 *  "mask" bit array is defined by @ref ucc_context_attr_field, which correspond to
 *  fields in structure @ref ucc_context_attr_t. The valid fields of the structure
 *  is specified by the setting the bit to "1" in the bit-array "mask". When
 *  bits corresponding to the fields is not set, the fields are not defined.
 *
 *  @endparblock
 *
 */
typedef struct ucc_context_attr {
    uint64_t                mask;
    ucc_context_type_t      type;
    ucc_coll_sync_type_t    sync_type;
    ucc_context_addr_h      ctx_addr;
    ucc_context_addr_len_t  ctx_addr_len;
    uint64_t                global_work_buffer_size;
} ucc_context_attr_t;

@MamziB
Copy link
Collaborator Author

MamziB commented Oct 7, 2024

@manjugv hey manjo, did you have any thoughts on this PR? I think you mentioned you would take a look at it

@janjust
Copy link
Collaborator

janjust commented Nov 14, 2024

@manjugv ping

@manjugv
Copy link
Contributor

manjugv commented Nov 14, 2024

@MamziB @janjust It seems like there is no agreement between you guys on the solution. For me, this is very odd - TL/MLX5 asking about TL/UCP. Also, you are creating a structure with one field.

@MamziB
Copy link
Collaborator Author

MamziB commented Nov 14, 2024

@manjugv At this stage, I'm focused on exploring all potential solutions for this issue, which is why Tommy, Sergey, and I haven't reached a conclustion yet. We're aiming to be thorough and consider various angles before deciding on a path forward. If you have any additional insights or alternative approaches in mind for this PR, we’d greatly appreciate your guidance.

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants