-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ucx build #938
Fix ucx build #938
Conversation
@eddy16112 Do you think this is ready to be merged? |
Can we have someone test it on aws to make sure that the ucx works? |
# option for using Python | ||
set(FF_GASNET_CONDUITS aries udp mpi ibv ucx) | ||
# option for using network | ||
set(FF_GASNET_CONDUITS aries udp mpi ibv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we remove ucx as an opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is prefer to use the realm ucx module over gastnet ucx conduit, so I think it is not necessary to provide the ucx option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So instead of automatically downloading, installing and building UCX in CMakeLists.txt, these steps will have to be done manually right? It seems like the expectation for UCX_DIR
is that UCX should be compiled and installed under ${UCX_DIR}/install
, not that UCX source code be merely downloaded and extracted to ${UCX_DIR}
. Previously this is automated, but now that it has to be done manually I would suggest to document this somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this PR cleans up UCX build process, but without changes to source code somewhere I can't see how it would solve the segfault problem. I have built FlexFlow with UCX release v1.14.1 but the segfault problem remains.
Yes, because UCX could be pre-installed, so I just let users manually specify its directory. BTW, what is the segfault? I have never seen it before. |
The segfault happens when building with native UCX on two g4dn.4xlarge instances and running |
Yes, I have tried on 2 nodes, but not AWS instances. Can you print the backtrace? Could you please also check ldd libflexflow.so, ldd librealm.so and ldd liblegion.so. I have seen weird errors that the flexflow/legion libraries are linked against a system ucx library, not the one I manually installed. |
Here is the backtrace:
A more detailed backtrace can be found at StanfordLegion/legion#1517.
It does look like the case that they are linked against a system ucx library. Any ideas on how to fix that? |
You can set the LD_LIBRARY_PATH by export LD_LIBRARY_PATH=/your-ucx-root/lib:$LD_LIBRARY_PATH |
I set LD_LIBRARY_PATH and made sure that ucx in the build dir is used instead of the system-level one. The backtrace indicates this is indeed the case, but I got a segmentation fault nevertheless:
What is the distribution or docker image you are using for both nodes, and are there any extra steps during installation? I'm using Deep Learning AMI from AWS and activated conda environment |
I was running it on the Stanford sapling machine, not AWS. Could you please rebuild Flexflow and Legion with debug mode and print the backtrace. |
Here is a more detailed backtrace which is the same as in the issue referenced above:
|
UCX has version 1.14.1, FlexFlow is built from latest master and Legion is version 23.06.0 which are all latest versions. |
It might be an issue of Legion UCX backend. Do you have an account on zulip? You can post this issue with the backtrace there. If not, I can post it for you. |
Not yet. Please post it there if you can, thanks. |
@vincent-163 you are using the master branch? Shouldn't we use the flexflow or control replication branch? |
Yes, I use |
@vincent-163 @eddy16112 What's the status of this PR? Are we blocked by any issue? |
There is a ucx error on aws, even though I am not able to reproduce it on sapling. |
There was a segfault in native UCX when run on AWS machines but it seems that the author did not run into such a problem on Stanford sapling machines. I haven’t figured out what configuration differences or code caused this uet. |
@@ -22,6 +22,9 @@ if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES) | |||
STRING "Choose the type of build." FORCE) | |||
endif() | |||
|
|||
# set std 11 | |||
set (CMAKE_CXX_STANDARD 11) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lockshaw Have we reached out the agreement that FlexFlow will use c++17 moving forward?
@vincent-163 I do not have access to AWS, could you please create a reproducer on an AWS instance, and give me the ssh access to it? Then I can login and take a look at it. |
Is your email [email protected]? I'll send SSH details to that email. |
Yes, it is my email. If you have an account on the FlexFlow slack channel, you can also ping me there. |
See also changes in #807 |
I have added documentatoin for UCX installation on top of this PR at https://github.com/vincent-163/FlexFlow/tree/fix-ucx. The installation guide for multi-node with UCX is at https://github.com/vincent-163/FlexFlow/blob/fix-ucx/MULTI-NODE.md. |
Co-authored-by: vincent-163 <[email protected]>
The PR is ready to be merged. |
Merged in #1230 . |
Description of changes:
Related Issues:
Linked Issues:
Issues closed by this PR:
Before merging:
This change is