Ethernet support #284

Merged
caiomcbr merged 20 commits into main from EthernetSupport
Apr 25, 2024

Conversation

chhwang (Contributor) commented Apr 8, 2024

No description provided.

chhwang linked an issue Apr 8, 2024 that may be closed by this pull request
caiomcbr marked this pull request as ready for review April 16, 2024 00:05
caiomcbr requested a review from Binyang2014 April 16, 2024 00:06
Sadewoabdi

This comment was marked as abuse.

src/connection.cc (outdated, resolved)
src/connection.cc (outdated, resolved)
src/include/connection.hpp (outdated, resolved)
src/connection.cc (outdated, resolved)
src/connection.cc (outdated, resolved)
src/connection.cc (outdated, resolved)
src/registered_memory.cc (outdated, resolved)
test/mp_unit/communicator_tests.cu (outdated, resolved)
test/mp_unit/communicator_tests.cu (outdated, resolved)
test/mp_unit/proxy_channel_tests.cu (outdated, resolved)

Binyang2014 (Contributor) left a comment

LGTM

caiomcbr merged commit d4ede48 into main Apr 25, 2024
15 checks passed
caiomcbr deleted the EthernetSupport branch April 25, 2024 18:06

Fjallraven-hc commented Sep 2, 2024

Sorry to bother you. Does this PR ("Ethernet support") mean that I can use mscclpp on a server without InfiniBand?
I built this project on an Ubuntu 20.04 server with only Ethernet, but got errors when running mp_unit_tests.

:~/mscclpp/build$ mpirun -np 2 ./test/mp_unit_tests
[==========] Running 33 tests from 7 test suites.
[----------] Global test environment set-up.
[==========] Running 33 tests from 7 test suites.
[----------] Global test environment set-up.
[----------] 3 tests from MultiProcessTest
[ RUN ] MultiProcessTest.Prelim
/home/yhc/mscclpp/test/mp_unit/mp_unit_tests.cc:98: Failure
Expected: (gEnv->worldSize) >= (2), actual: 1 vs 2
[ FAILED ] MultiProcessTest.Prelim (0 ms)
[ RUN ] MultiProcessTest.HostName
[ OK ] MultiProcessTest.HostName (0 ms)
[ RUN ] MultiProcessTest.HostHash
[ OK ] MultiProcessTest.HostHash (0 ms)
[----------] 3 tests from MultiProcessTest (0 ms total)
[----------] 7 tests from BootstrapTest
[ RUN ] BootstrapTest.WithId
[ OK ] BootstrapTest.WithId (2 ms)
[ RUN ] BootstrapTest.WithIpPortPair
[ OK ] BootstrapTest.WithIpPortPair (2 ms)
[ RUN ] BootstrapTest.ResumeWithId
[----------] 3 tests from MultiProcessTest
[ RUN ] MultiProcessTest.Prelim
/home/yhc/mscclpp/test/mp_unit/mp_unit_tests.cc:98: Failure
Expected: (gEnv->worldSize) >= (2), actual: 1 vs 2
[ FAILED ] MultiProcessTest.Prelim (0 ms)
[ RUN ] MultiProcessTest.HostName
[ OK ] MultiProcessTest.HostName (0 ms)
[ RUN ] MultiProcessTest.HostHash
[ OK ] MultiProcessTest.HostHash (0 ms)
[----------] 3 tests from MultiProcessTest (0 ms total)
[----------] 7 tests from BootstrapTest
[ RUN ] BootstrapTest.WithId
[ OK ] BootstrapTest.WithId (4 ms)
[ RUN ] BootstrapTest.WithIpPortPair
[ OK ] BootstrapTest.ResumeWithId (28 ms)
[ RUN ] BootstrapTest.ResumeWithIpPortPair
[ OK ] BootstrapTest.WithIpPortPair (3 ms)
[ RUN ] BootstrapTest.ResumeWithId
[ OK ] BootstrapTest.ResumeWithId (23 ms)
[ RUN ] BootstrapTest.ResumeWithIpPortPair
unknown file: Failure
C++ exception with description "TcpBootstrap connection timeout (Mscclpp failure: Timeout)" thrown in the test body.
[ FAILED ] BootstrapTest.ResumeWithIpPortPair (30000 ms)
[ RUN ] BootstrapTest.ExitBeforeConnect
[ OK ] BootstrapTest.ExitBeforeConnect (1 ms)
[ RUN ] BootstrapTest.TimeoutWithId
/home/yhc/mscclpp/test/mp_unit/bootstrap_tests.cc:106: Failure
Expected: (timer.elapsed()) > (1000000), actual: 4669 vs 1000000
[ FAILED ] BootstrapTest.TimeoutWithId (4 ms)
[ RUN ] BootstrapTest.MPIBootstrap
[ OK ] BootstrapTest.MPIBootstrap (0 ms)
[----------] 7 tests from BootstrapTest (30039 ms total)
[----------] 3 tests from IbPeerToPeerTest
[ RUN ] IbPeerToPeerTest.SimpleSendRecv
unknown file: Failure
C++ exception with description "TcpBootstrap connection timeout (Mscclpp failure: Timeout)" thrown in the test body.
[ FAILED ] BootstrapTest.ResumeWithIpPortPair (30001 ms)
[ RUN ] BootstrapTest.ExitBeforeConnect
[ OK ] BootstrapTest.ExitBeforeConnect (1 ms)
[ RUN ] BootstrapTest.TimeoutWithId
/home/yhc/mscclpp/test/mp_unit/bootstrap_tests.cc:106: Failure
Expected: (timer.elapsed()) > (1000000), actual: 4663 vs 1000000
[ FAILED ] BootstrapTest.TimeoutWithId (4 ms)
[ RUN ] BootstrapTest.MPIBootstrap
[ OK ] BootstrapTest.MPIBootstrap (0 ms)
[----------] 7 tests from BootstrapTest (30039 ms total)
[----------] 3 tests from IbPeerToPeerTest
[ RUN ] IbPeerToPeerTest.SimpleSendRecv
[ubuntu:16923] *** Process received signal ***
[ubuntu:16923] Signal: Floating point exception (8)
[ubuntu:16923] Signal code: Integer divide-by-zero (1)
[ubuntu:16923] Failing at address: 0x557a78cc5519
[ubuntu:16923] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f13409ba420]
[ubuntu:16923] [ 1] ./test/mp_unit_tests(+0x43519)[0x557a78cc5519]
[ubuntu:16923] [ 2] ./test/mp_unit_tests(+0x4369c)[0x557a78cc569c]
[ubuntu:16923] [ 3] ./test/mp_unit_tests(+0xaa7c1)[0x557a78d2c7c1]
[ubuntu:16923] [ 4] ./test/mp_unit_tests(+0x9573d)[0x557a78d1773d]
[ubuntu:16923] [ 5] ./test/mp_unit_tests(+0x95f0a)[0x557a78d17f0a]
[ubuntu:16923] [ 6] ./test/mp_unit_tests(+0x969eb)[0x557a78d189eb]
[ubuntu:16923] [ 7] ./test/mp_unit_tests(+0x9ef57)[0x557a78d20f57]
[ubuntu:16923] [ 8] ./test/mp_unit_tests(+0x96043)[0x557a78d18043]
[ubuntu:16923] [ 9] ./test/mp_unit_tests(+0x2f279)[0x557a78cb1279]
[ubuntu:16923] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f13402b1083]
[ubuntu:16923] [11] ./test/mp_unit_tests(+0x32a8e)[0x557a78cb4a8e]
[ubuntu:16923] *** End of error message ***
[ubuntu:16924] *** Process received signal ***
[ubuntu:16924] Signal: Floating point exception (8)
[ubuntu:16924] Signal code: Integer divide-by-zero (1)
[ubuntu:16924] Failing at address: 0x559e4c91e519
[ubuntu:16924] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f48318f4420]
[ubuntu:16924] [ 1] ./test/mp_unit_tests(+0x43519)[0x559e4c91e519]
[ubuntu:16924] [ 2] ./test/mp_unit_tests(+0x4369c)[0x559e4c91e69c]
[ubuntu:16924] [ 3] ./test/mp_unit_tests(+0xaa7c1)[0x559e4c9857c1]
[ubuntu:16924] [ 4] ./test/mp_unit_tests(+0x9573d)[0x559e4c97073d]
[ubuntu:16924] [ 5] ./test/mp_unit_tests(+0x95f0a)[0x559e4c970f0a]
[ubuntu:16924] [ 6] ./test/mp_unit_tests(+0x969eb)[0x559e4c9719eb]
[ubuntu:16924] [ 7] ./test/mp_unit_tests(+0x9ef57)[0x559e4c979f57]
[ubuntu:16924] [ 8] ./test/mp_unit_tests(+0x96043)[0x559e4c971043]
[ubuntu:16924] [ 9] ./test/mp_unit_tests(+0x2f279)[0x559e4c90a279]
[ubuntu:16924] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f48311eb083]
[ubuntu:16924] [11] ./test/mp_unit_tests(+0x32a8e)[0x559e4c90da8e]
[ubuntu:16924] *** End of error message ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 16923 RUNNING AT ubuntu
= KILLED BY SIGNAL: 8 (Floating point exception)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 16924 RUNNING AT ubuntu
= KILLED BY SIGNAL: 8 (Floating point exception)
===================================================================================
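
Because both ranks write to the same terminal, the log above interleaves two copies of the GoogleTest output, which makes the failures hard to pick out. The following is a minimal sketch, not part of mscclpp, that lists each failed test once. The helper name `failed_tests` is hypothetical, and the pattern assumes result lines shaped like `[ FAILED ] Suite.Test (N ms)` as they appear in this log.

```python
import re

# Matches result lines such as "[ FAILED ] BootstrapTest.TimeoutWithId (4 ms)".
# Requiring "Suite.Test" right after the tag skips summary lines such as
# "[  FAILED  ] 3 tests, listed below:".
FAILED_LINE = re.compile(r"\[\s*FAILED\s*\]\s+([\w/]+\.[\w/]+)")


def failed_tests(log_text: str) -> list:
    """Return failed test names in first-seen order, without duplicates."""
    seen = {}
    for line in log_text.splitlines():
        match = FAILED_LINE.search(line)
        if match:
            # dict preserves insertion order, so repeats from other ranks are dropped
            seen.setdefault(match.group(1), None)
    return list(seen)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python failed_tests.py < mp_unit_tests.log
    for name in failed_tests(sys.stdin.read()):
        print(name)
```

Applied to a saved copy of this log, the sketch would report each failing test a single time regardless of how many ranks printed it. It does not detect crashes such as the floating point exception, which ends the run before GoogleTest prints a result line.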

caiomcbr (Contributor) commented Sep 3, 2024 via email

Hi H.Yuan,

Indeed, mscclpp can be utilized on a server that doesn't support InfiniBand. The errors you encountered during the tests occurred because these tests were designed to operate with InfiniBand.

Best Regards,
Caio Rocha

Fjallraven-hc commented

Many thanks for the reply. I have another question: is NVLink necessary when using mscclpp for intra-node communication across GPUs?
If mscclpp can work without NVLink, are there any test scripts for GPU communication inside one server?

Best Regards,
Haochen Yuan

caiomcbr (Contributor) commented Sep 3, 2024 via email

Hi Haochen Yuan,

mscclpp should operate without NVLink, but our algorithm is optimized for NVLink, so performance will be suboptimal if you use PCIe. Additionally, we don't have a specific test for environments that lack InfiniBand but use PCIe. There is one test that uses only Ethernet, if you want to try it:

mpirun -np 2 ./test/mp_unit_tests --gtest_filter=ProxyChannelOneToOneTest.PingPongEthernet

Best Regards,
Caio Rocha

Fjallraven-hc commented

Got it. Thanks for your patient reply~
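
The single-test invocation suggested in the reply above relies on GoogleTest's `--gtest_filter` flag, which accepts positive patterns separated by `:` followed by an optional `-` and a list of negative patterns. As a minimal sketch, the helper below (the name `mp_unit_test_command` is hypothetical, not part of mscclpp) builds such an `mpirun` command line. Whether excluding `IbPeerToPeerTest.*` alone is enough to skip every InfiniBand-dependent test is an assumption, not something this thread confirms.

```python
import shlex


def mp_unit_test_command(include=("*",), exclude=(), nprocs=2,
                         binary="./test/mp_unit_tests"):
    """Build an mpirun argument list with a GoogleTest filter.

    GoogleTest filter syntax: positive patterns joined by ':', then an
    optional '-' followed by negative patterns joined by ':'.
    """
    gtest_filter = ":".join(include)
    if exclude:
        gtest_filter += "-" + ":".join(exclude)
    return ["mpirun", "-np", str(nprocs), binary,
            "--gtest_filter=" + gtest_filter]


if __name__ == "__main__":
    # The Ethernet-only test named in the reply above.
    ethernet_only = mp_unit_test_command(
        include=("ProxyChannelOneToOneTest.PingPongEthernet",))
    print(shlex.join(ethernet_only))

    # Assumption: the suite name comes from the log in this thread; other
    # suites may also need InfiniBand.
    without_ib = mp_unit_test_command(exclude=("IbPeerToPeerTest.*",))
    print(shlex.join(without_ib))
```

Returning an argument list rather than a single string keeps the filter pattern from being expanded by the shell when the list is passed to `subprocess.run`; `shlex.join` is used only to print a copy-pasteable form.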

Development

Successfully merging this pull request may close these issues.

[Feature] Ethernet connection
5 participants