
Fix ucx build #938

Status: Closed. Wants to merge 9 commits.

Conversation

@eddy16112 (Collaborator) commented Aug 9, 2023

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Before merging:

  • Did you update the flexflow-third-party repo, if modifying any of the CMake files, the build configs, or the submodules?


@jiazhihao (Collaborator)

@eddy16112 Do you think this is ready to be merged?

@eddy16112 (Collaborator, Author)

> @eddy16112 Do you think this is ready to be merged?

Can we have someone test it on AWS to make sure that UCX works?

-# option for using Python
-set(FF_GASNET_CONDUITS aries udp mpi ibv ucx)
+# option for using network
+set(FF_GASNET_CONDUITS aries udp mpi ibv)
Review comment (Collaborator):

Why do we remove ucx as an option?

@eddy16112 (Collaborator, Author)

It is preferable to use the Realm UCX module over the GASNet UCX conduit, so I think it is not necessary to provide the ucx option.
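
For context, a minimal sketch of what selecting Realm's UCX module at configure time can look like, assuming Legion's Legion_NETWORKS CMake option and the UCX_DIR variable discussed in this PR; the exact option names may differ by version, so treat this as an assumption rather than the project's build documentation:

# hedged sketch: select Realm's UCX networking module instead of a GASNet conduit,
# pointing UCX_DIR at a pre-installed UCX (path is a placeholder)
cmake -DLegion_NETWORKS=ucx -DUCX_DIR=/path/to/ucx/install ..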

@vincent-163 left a comment

So instead of automatically downloading, installing, and building UCX in CMakeLists.txt, these steps will have to be done manually, right? It seems like the expectation for UCX_DIR is that UCX should be compiled and installed under ${UCX_DIR}/install, not that the UCX source code be merely downloaded and extracted to ${UCX_DIR}. Previously this was automated; now that it has to be done manually, I would suggest documenting this somewhere.
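
For reference, a minimal sketch of such a manual UCX build, assuming UCX's standard autotools flow (the clone URL is upstream UCX; the flags and paths are illustrative, not part of this PR):

# build and install UCX under ${UCX_DIR}/install, as expected above
git clone https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
./contrib/configure-release --prefix=$UCX_DIR/install --with-cuda=/usr/local/cuda
make -j8 && make install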

@vincent-163 left a comment

Looks like this PR cleans up the UCX build process, but without changes to the source code somewhere, I can't see how it would solve the segfault problem. I have built FlexFlow with UCX release v1.14.1, but the segfault problem remains.

@eddy16112 (Collaborator, Author)

> So instead of automatically downloading, installing, and building UCX in CMakeLists.txt, these steps will have to be done manually, right? It seems like the expectation for UCX_DIR is that UCX should be compiled and installed under ${UCX_DIR}/install, not that the UCX source code be merely downloaded and extracted to ${UCX_DIR}. Previously this was automated; now that it has to be done manually, I would suggest documenting this somewhere.

Yes. Because UCX could be pre-installed, I just let users manually specify its directory. BTW, what is the segfault? I have never seen it before.
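
As a quick sanity check on a pre-installed UCX, the ucx_info tool that ships with UCX reports which build and configure flags are in use (a usage sketch; the path assumes the ${UCX_DIR}/install layout discussed above):

# print the UCX version and configuration of the manually specified install
$UCX_DIR/install/bin/ucx_info -v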

@vincent-163

> > So instead of automatically downloading, installing, and building UCX in CMakeLists.txt, these steps will have to be done manually, right? [...]
>
> Yes. Because UCX could be pre-installed, I just let users manually specify its directory. BTW, what is the segfault? I have never seen it before.

The segfault happens when building with native UCX on two g4dn.4xlarge instances and running scripts/mnist_mlp_run.sh with MPI. It does not happen when running on a single instance, nor when using GASNet with the UCX conduit. Have you tried running FlexFlow with MPI on multiple instances?
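
For reference, a hypothetical two-instance launch of the kind described, assuming an Open MPI-style mpirun; the host names are placeholders and this is not a documented FlexFlow invocation:

# run the example MNIST script across two nodes (hosts are placeholders)
mpirun -np 2 --host node1,node2 ./scripts/mnist_mlp_run.sh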

@eddy16112 (Collaborator, Author)

> > > So instead of automatically downloading, installing, and building UCX in CMakeLists.txt, these steps will have to be done manually, right? [...]
> >
> > Yes. Because UCX could be pre-installed, I just let users manually specify its directory. [...]
>
> The segfault happens when building with native UCX on two g4dn.4xlarge instances and running scripts/mnist_mlp_run.sh with MPI. It does not happen when running on a single instance, nor when using GASNet with the UCX conduit. Have you tried running FlexFlow with MPI on multiple instances?

Yes, I have tried on 2 nodes, but not AWS instances. Can you print the backtrace? Could you please also check ldd libflexflow.so, ldd librealm.so, and ldd liblegion.so? I have seen weird errors where the flexflow/legion libraries are linked against a system ucx library, not the one I manually installed.

@vincent-163

> Yes, I have tried on 2 nodes, but not AWS instances. Can you print the backtrace? Could you please also check ldd libflexflow.so, ldd librealm.so, and ldd liblegion.so? I have seen weird errors where the flexflow/legion libraries are linked against a system ucx library, not the one I manually installed.

Here is the backtrace:

[ip-172-31-40-19:34058:1:34065] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9)
==== backtrace (tid:  34065) ====
 0  /usr/local/lib/libucs.so.0(ucs_handle_error+0x2dc) [0x7fc1e6c589dc]
 1  /usr/local/lib/libucs.so.0(+0x30bbf) [0x7fc1e6c58bbf]
 2  /usr/local/lib/libucs.so.0(+0x30ef4) [0x7fc1e6c58ef4]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7fc1e8f20090]
 4  /usr/local/lib/libucp.so.0(ucp_mem_type_pack+0x70) [0x7fc1e6e75cb0]
 5  /usr/local/lib/libucp.so.0(ucp_dt_pack+0x69) [0x7fc1e6e75e69]
 6  /usr/local/lib/libucp.so.0(+0x80331) [0x7fc1e6ea3331]
 7  /usr/local/lib/libuct.so.0(uct_tcp_ep_am_bcopy+0x7e) [0x7fc1e6cb698e]
 8  /usr/local/lib/libucp.so.0(ucp_rndv_progress_am_bcopy+0x9b) [0x7fc1e6ea343b]
 9  /usr/local/lib/libucp.so.0(ucp_rndv_rtr_handler+0xc8) [0x7fc1e6ea6108]
10  /usr/local/lib/libuct.so.0(+0x23427) [0x7fc1e6cb5427]
11  /usr/local/lib/libuct.so.0(+0x23a08) [0x7fc1e6cb5a08]
12  /usr/local/lib/libuct.so.0(+0x25f2c) [0x7fc1e6cb7f2c]
13  /usr/local/lib/libucs.so.0(ucs_event_set_wait+0xf9) [0x7fc1e6c632f9]
14  /usr/local/lib/libuct.so.0(uct_tcp_iface_progress+0x7b) [0x7fc1e6cb7fdb]
15  /usr/local/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fc1e6e7228a]
16  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x740807) [0x7fc1e9a29807]
17  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x72ba9c) [0x7fc1e9a14a9c]
18  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x5dbc63) [0x7fc1e98c4c63]
19  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x5dc4b6) [0x7fc1e98c54b6]
20  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x6e3006) [0x7fc1e99cc006]
21  /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fc1e6f1a609]
22  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fc1e8ffc133]
=================================
/home/ubuntu/FlexFlow/build/flexflow_python: line 13: 34058 Segmentation fault      (core dumped) $BUILD_FOLDER/deps/legion/bin/legion_python "$@"

A more detailed backtrace can be found at StanfordLegion/legion#1517.

(flexflow) ubuntu@ip-172-31-40-19:~/FlexFlow/build$ ldd libflexflow.so
        linux-vdso.so.1 (0x00007fffe27f5000)
        libcublas.so.11 => /usr/local/cuda/lib64/libcublas.so.11 (0x00007f2892680000)
        libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f288ca86000)
        libcudnn.so.8 => /usr/local/cuda/lib/libcudnn.so.8 (0x00007f288c860000)
        libnccl.so.2 => /usr/local/cuda/lib/libnccl.so.2 (0x00007f287e9d1000)
        liblegion.so.1 => /usr/local/lib/liblegion.so.1 (0x00007f287d1bd000)
        libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007f287cf18000)
        librealm.so.1 => /usr/local/lib/librealm.so.1 (0x00007f287b306000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f287b10a000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f287afbb000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f287af9e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f287adac000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f289bf14000)
        libcublasLt.so.11 => /usr/local/cuda/lib64/libcublasLt.so.11 (0x00007f2866e0b000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2866e01000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2866dde000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2866dd8000)
        libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f28650f6000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f28650da000)
        libucp.so.0 => /usr/local/lib/libucp.so.0 (0x00007f2864ff5000)
        libuct.so.0 => /usr/local/lib/libuct.so.0 (0x00007f2864fb3000)
        libucs.so.0 => /usr/local/lib/libucs.so.0 (0x00007f2864f4b000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f2864f3c000)
        libucm.so.0 => /usr/local/lib/libucm.so.0 (0x00007f2864f1e000)
(flexflow) ubuntu@ip-172-31-40-19:~/FlexFlow/build$ ldd deps/legion/lib/liblegion.so
        linux-vdso.so.1 (0x00007fff604a3000)
        libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007f03a17fe000)
        libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f039fb04000)
        librealm.so.1 => /usr/local/lib/librealm.so.1 (0x00007f039def2000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f039ded4000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f039dcf2000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f039dcd7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f039dae5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f03a32b9000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f039dadf000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f039dabc000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f039dab0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f039d961000)
        libucp.so.0 => /usr/local/lib/libucp.so.0 (0x00007f039d87c000)
        libuct.so.0 => /usr/local/lib/libuct.so.0 (0x00007f039d83a000)
        libucs.so.0 => /usr/local/lib/libucs.so.0 (0x00007f039d7d2000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f039d7c3000)
        libucm.so.0 => /usr/local/lib/libucm.so.0 (0x00007f039d7a5000)
(flexflow) ubuntu@ip-172-31-40-19:~/FlexFlow/build$ ldd deps/legion/lib/librealm.so
        linux-vdso.so.1 (0x00007ffd15732000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6373dfa000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6373dd7000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6373dcd000)
        libucp.so.0 => /usr/local/lib/libucp.so.0 (0x00007f6373ce8000)
        libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f6372006000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6371e24000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6371cd5000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6371cba000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6371ac8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6375a2e000)
        libuct.so.0 => /usr/local/lib/libuct.so.0 (0x00007f6371a86000)
        libucs.so.0 => /usr/local/lib/libucs.so.0 (0x00007f6371a1c000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f6371a0f000)
        libucm.so.0 => /usr/local/lib/libucm.so.0 (0x00007f63719f1000)

It does look like they are indeed linked against the system UCX library. Any ideas on how to fix that?

@eddy16112 (Collaborator, Author)

You can set LD_LIBRARY_PATH before running: export LD_LIBRARY_PATH=/your-ucx-root/lib:$LD_LIBRARY_PATH
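
A quick way to verify the override took effect, reusing the ldd check from above (a usage sketch; /your-ucx-root is a placeholder):

# the ucx entries should now resolve into /your-ucx-root/lib
export LD_LIBRARY_PATH=/your-ucx-root/lib:$LD_LIBRARY_PATH
ldd libflexflow.so | grep ucx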

@vincent-163

> You can set LD_LIBRARY_PATH before running: export LD_LIBRARY_PATH=/your-ucx-root/lib:$LD_LIBRARY_PATH

I set LD_LIBRARY_PATH and made sure that the UCX in the build dir is used instead of the system-level one. The backtrace indicates this is indeed the case, but I got a segmentation fault nevertheless:

[ip-172-31-40-19:46418:1:46440] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9)
==== backtrace (tid:  46440) ====
 0  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucs.so.0(ucs_handle_error+0x2dc) [0x7f87c0b45c1c]
 1  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucs.so.0(+0x30dff) [0x7f87c0b45dff]
 2  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucs.so.0(+0x31134) [0x7f87c0b46134]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f87c2e15090]
 4  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucp.so.0(ucp_mem_type_pack+0x73) [0x7f87c0d64243]
 5  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucp.so.0(ucp_dt_pack+0x69) [0x7f87c0d643f9]
 6  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucp.so.0(+0x84767) [0x7f87c0d94767]
 7  /home/ubuntu/FlexFlow/build/ucx/install/lib/libuct.so.0(uct_tcp_ep_am_bcopy+0x7e) [0x7f87c0ba398e]
 8  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucp.so.0(ucp_rndv_progress_am_bcopy+0xa2) [0x7f87c0d94872]
 9  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucp.so.0(ucp_rndv_rtr_handler+0xd0) [0x7f87c0d97510]
10  /home/ubuntu/FlexFlow/build/ucx/install/lib/libuct.so.0(+0x23427) [0x7f87c0ba2427]
11  /home/ubuntu/FlexFlow/build/ucx/install/lib/libuct.so.0(+0x23a08) [0x7f87c0ba2a08]
12  /home/ubuntu/FlexFlow/build/ucx/install/lib/libuct.so.0(+0x25f2c) [0x7f87c0ba4f2c]
13  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucs.so.0(ucs_event_set_wait+0xf9) [0x7f87c0b50539]
14  /home/ubuntu/FlexFlow/build/ucx/install/lib/libuct.so.0(uct_tcp_iface_progress+0x7b) [0x7f87c0ba4fdb]
15  /home/ubuntu/FlexFlow/build/ucx/install/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f87c0d6063a]
16  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x740807) [0x7f87c391e807]
17  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x72ba9c) [0x7f87c3909a9c]
18  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x5dbc63) [0x7f87c37b9c63]
19  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x5dc4b6) [0x7f87c37ba4b6]
20  /home/ubuntu/FlexFlow/build/deps/legion/lib/librealm.so.1(+0x6e3006) [0x7f87c38c1006]
21  /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f87c0e0f609]
22  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f87c2ef1133]
=================================
/home/ubuntu/FlexFlow/build/flexflow_python: line 13: 46418 Segmentation fault      (core dumped) $BUILD_FOLDER/deps/legion/bin/legion_python "$@"

What distribution or docker image are you using for both nodes, and are there any extra steps during installation? I'm using the Deep Learning AMI from AWS, and I activated the flexflow conda environment from conda/environment.yml before building with config/config.linux.
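
For reference, a sketch of those setup steps as commands, assuming the repository layout referenced above (conda/environment.yml and config/config.linux come from the comment; the build-directory flow is an assumption):

# create and activate the conda environment, then configure and build
conda env create -f conda/environment.yml
conda activate flexflow
mkdir -p build && cd build
../config/config.linux
make -j8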

@eddy16112 (Collaborator, Author)

I was running it on the Stanford sapling machine, not AWS. Could you please rebuild FlexFlow and Legion in debug mode and print the backtrace?

@vincent-163

Here is a more detailed backtrace which is the same as in the issue referenced above:

#0  0x00007f26e5f4f23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x555d2a0b81e0, rem=rem@entry=0x555d2a0b81e0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78      ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) bt
#0  0x00007f26e5f4f23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x555d2a0b81e0, rem=rem@entry=0x555d2a0b81e0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007f26e5f54ec7 in __GI___nanosleep (requested_time=requested_time@entry=0x555d2a0b81e0, remaining=remaining@entry=0x555d2a0b81e0) at nanosleep.c:27
#2  0x00007f26e5f54dfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f26e73954fd in Realm::realm_freeze (signal=11) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/runtime_impl.cc:187
#4  <signal handler called>
#5  0x00007f26e3e04243 in ucp_ep_config (ep=<optimized out>, ep=<optimized out>) at dt/dt.c:80
#6  ucp_mem_type_pack (worker=0x555d29f60ec0, dest=dest@entry=0x7f2270076895, src=0x7f24db2a6840, length=8176, mem_type=UCS_MEMORY_TYPE_CUDA) at dt/dt.c:84
#7  0x00007f26e3e0392b in ucp_dt_contig_pack (worker=<optimized out>, dest=dest@entry=0x7f2270076895, src=<optimized out>, length=length@entry=8176, mem_type=<optimized out>) at dt/dt_contig.c:31
#8  0x00007f26e3e043f9 in ucp_dt_pack (worker=worker@entry=0x555d29f60ec0, datatype=<optimized out>, mem_type=<optimized out>, dest=dest@entry=0x7f2270076895, src=<optimized out>, 
    state=state@entry=0x555d29fe3910, length=8176) at dt/dt.c:118
#9  0x00007f26e3e34767 in ucp_rndv_pack_data (dest=0x7f2270076885, arg=0x555d29fe38b8) at rndv/rndv.c:1877
#10 0x00007f26e3c4398e in uct_tcp_ep_am_bcopy (uct_ep=0x555d2a068c30, am_id=<optimized out>, pack_cb=<optimized out>, arg=<optimized out>, flags=<optimized out>) at tcp/tcp_ep.c:1874
#11 0x00007f26e3e34993 in uct_ep_am_bcopy (flags=0, arg=0x555d29fe38b8, pack_cb=0x7f26e3e34650 <ucp_rndv_pack_data>, id=12 '\f', ep=0x555d2a068c30)
    at /home/ubuntu/FlexFlow/build/ucx/src/uct/api/uct.h:3020
#12 ucp_do_am_bcopy_multi (handle_user_hdr=0, enable_am_bw=1, pack_middle=0x7f26e3e34650 <ucp_rndv_pack_data>, pack_first=0x7f26e3e34650 <ucp_rndv_pack_data>, am_id_middle=12 '\f', 
    am_id_first=12 '\f', self=0x555d29fe3990) at /home/ubuntu/FlexFlow/build/ucx/src/ucp/proto/proto_am.inl:120
#13 ucp_rndv_progress_am_bcopy (self=0x555d29fe3990) at rndv/rndv.c:1897
#14 0x00007f26e3e37510 in ucp_request_try_send (req=0x555d29fe38b8) at /home/ubuntu/FlexFlow/build/ucx/src/ucp/core/ucp_request.inl:357
#15 ucp_request_send (req=0x555d29fe38b8) at /home/ubuntu/FlexFlow/build/ucx/src/ucp/core/ucp_request.inl:357
#16 ucp_rndv_rtr_handler (arg=<optimized out>, data=<optimized out>, length=<optimized out>, flags=<optimized out>) at rndv/rndv.c:2434
#17 0x00007f26e3c42427 in uct_iface_invoke_am (flags=0, length=<optimized out>, data=0x7f226c1e1085, id=<optimized out>, iface=0x555d29f574e0)
    at /home/ubuntu/FlexFlow/build/ucx/src/uct/base/uct_iface.h:904
#18 uct_tcp_ep_comp_recv_am (ep=<optimized out>, hdr=<optimized out>, iface=0x555d29f574e0) at tcp/tcp_ep.c:1289
#19 uct_tcp_ep_progress_am_rx (ep=<optimized out>, ep@entry=0x555d2a068c30) at tcp/tcp_ep.c:1421
#20 0x00007f26e3c42a08 in uct_tcp_ep_progress_data_rx (arg=0x555d2a068c30) at tcp/tcp_ep.c:1526
#21 0x00007f26e3c44f2c in uct_tcp_iface_handle_events (callback_data=0x555d2a068c30, events=<optimized out>, arg=0x7f26cc1c0df0) at tcp/tcp_iface.c:324
#22 0x00007f26e3bf0539 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f26cc1c0df4, timeout_ms=timeout_ms@entry=0, 
    event_set_handler=event_set_handler@entry=0x7f26e3c44f00 <uct_tcp_iface_handle_events>, arg=arg@entry=0x7f26cc1c0df0) at sys/event_set.c:215
#23 0x00007f26e3c44fdb in uct_tcp_iface_progress (tl_iface=0x555d29f574e0) at tcp/tcp_iface.c:341
#24 0x00007f26e3e0063a in ucs_callbackq_dispatch (cbq=<optimized out>) at /home/ubuntu/FlexFlow/build/ucx/src/ucs/datastruct/callbackq.h:211
#25 uct_worker_progress (worker=<optimized out>) at /home/ubuntu/FlexFlow/build/ucx/src/uct/api/uct.h:2768
#26 ucp_worker_progress (worker=0x555d29f60ec0) at core/ucp_worker.c:2814
#27 0x00007f26e7482af1 in Realm::UCP::UCPContext::progress_without_wakeup (this=0x555d29e5b550) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/ucx/ucp_context.cc:230
#28 0x00007f26e7482c94 in Realm::UCP::UCPContext::progress (this=0x555d29e5b550) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/ucx/ucp_context.cc:258
#29 0x00007f26e74676c3 in Realm::UCP::UCPPoller::do_work (this=0x555d294c06c0, work_until=...) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/ucx/ucp_internal.cc:305
#30 0x00007f26e720f243 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f26cc1c12c0, max_time_in_ns=-1, interrupt_flag=0x0)
    at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/bgwork.cc:621
#31 0x00007f26e720cba8 in Realm::BackgroundWorkThread::main_loop (this=0x555d2a068700) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/bgwork.cc:125
#32 0x00007f26e721093c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x555d2a068700)
    at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/threads.inl:97
#33 0x00007f26e740160c in Realm::KernelThread::pthread_entry (data=0x555d2a0668b0) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/threads.cc:781
#34 0x00007f26e3eaf609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#35 0x00007f26e5f91133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@vincent-163

UCX is version 1.14.1, FlexFlow is built from the latest master, and Legion is version 23.06.0; all are the latest versions.

@eddy16112 (Collaborator, Author)

It might be an issue in the Legion UCX backend. Do you have an account on Zulip? You can post this issue with the backtrace there. If not, I can post it for you.

@vincent-163

> It might be an issue in the Legion UCX backend. Do you have an account on Zulip? You can post this issue with the backtrace there. If not, I can post it for you.

Not yet. Please post it there if you can, thanks.

@eddy16112 (Collaborator, Author)

@vincent-163 you are using the master branch? Shouldn't we use the flexflow or control replication branch?

@vincent-163

> @vincent-163 you are using the master branch? Shouldn't we use the flexflow or control replication branch?

Yes, I used git submodule update --init --recursive to initialize the git submodules, and deps/legion seems to point to the master branch by default. I've updated deps/legion to point to the flexflow branch instead, but the segfault remains. The backtrace is nearly identical (down to source file lines; I've checked that all source files involved in the backtrace are the same between flexflow and master).
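
For the record, a sketch of pointing the submodule at the flexflow branch (assuming that branch exists on the submodule's remote):

# initialize submodules, then switch deps/legion to the flexflow branch
git submodule update --init --recursive
cd deps/legion
git fetch origin && git checkout flexflow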

@jiazhihao
Copy link
Collaborator

@vincent-163 @eddy16112 What's the status of this PR? Are we blocked by any issue?

@eddy16112 (Collaborator, Author)

> @vincent-163 @eddy16112 What's the status of this PR? Are we blocked by any issue?

There is a UCX error on AWS, even though I am not able to reproduce it on sapling.

@vincent-163

> @vincent-163 @eddy16112 What's the status of this PR? Are we blocked by any issue?

There was a segfault in native UCX when run on AWS machines, but it seems that the author did not run into such a problem on Stanford sapling machines. I haven't figured out what configuration differences or code caused this yet.

@@ -22,6 +22,9 @@ if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
         STRING "Choose the type of build." FORCE)
 endif()
+
+# set std 11
+set (CMAKE_CXX_STANDARD 11)
Review comment (Collaborator):

@lockshaw Have we reached an agreement that FlexFlow will use C++17 moving forward?
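
If the project does settle on C++17, one way to try it without editing CMakeLists.txt is to override the standard at configure time (a sketch of the alternative, not a change made in this PR):

# override the C++ standard from the command line
cmake -DCMAKE_CXX_STANDARD=17 ..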

@eddy16112 (Collaborator, Author)

@vincent-163 I do not have access to AWS. Could you please create a reproducer on an AWS instance and give me SSH access to it? Then I can log in and take a look.

@vincent-163

> @vincent-163 I do not have access to AWS. Could you please create a reproducer on an AWS instance and give me SSH access to it? Then I can log in and take a look.

Is your email [email protected]? I'll send the SSH details to that email.

@eddy16112 (Collaborator, Author)

> > @vincent-163 I do not have access to AWS. Could you please create a reproducer on an AWS instance and give me SSH access to it? Then I can log in and take a look.
>
> Is your email [email protected]? I'll send the SSH details to that email.

Yes, that is my email. If you have an account on the FlexFlow Slack channel, you can also ping me there.

@goliaro (Collaborator) commented Oct 19, 2023

See also changes in #807

@vincent-163

I have added documentation for UCX installation on top of this PR at https://github.com/vincent-163/FlexFlow/tree/fix-ucx. The installation guide for multi-node with UCX is at https://github.com/vincent-163/FlexFlow/blob/fix-ucx/MULTI-NODE.md.

Running multi-node with UCX seems to be extremely slow (around 0.3~0.5 lines per second for the example MNIST run, as opposed to several lines per second when running on a single node), so I think further optimization will be needed.
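
One hedged avenue for that optimization: UCX transport selection can be pinned through the standard UCX_TLS environment variable; the value below is a guess for the TCP-plus-CUDA setup seen in the backtraces above, not a verified fix:

# restrict UCX to the TCP transport with CUDA staging (illustrative value)
export UCX_TLS=tcp,cuda_copy
./scripts/mnist_mlp_run.sh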

@eddy16112 (Collaborator, Author)

The PR is ready to be merged.

@jiazhihao (Collaborator)

Merged in #1230.

@jiazhihao closed this Nov 17, 2023