Ethernet support #284
LGTM
…into EthernetSupport
…into EthernetSupport
Sorry to bother you. I just want to know: does this PR, "Ethernet support", mean that I can use mscclpp on a non-InfiniBand server?
Hi H.Yuan,
Indeed, mscclpp can be utilized on a server that doesn't support InfiniBand. The errors you encountered during the tests occurred because these tests were designed to operate with InfiniBand.
Best Regards,
Caio Rocha
From: H.Yuan
Sent: September 1, 2024 10:48 PM
To: microsoft/mscclpp
Cc: Caio Rocha; State change
Subject: Re: [microsoft/mscclpp] Ethernet support (PR #284)
Sorry to bother you. I just want to know: does this PR, "Ethernet support", mean that I can use mscclpp on a non-InfiniBand server?
I built this project on an Ubuntu 20.04 server with only Ethernet, but got errors when running mp_unit_tests:
`:~/mscclpp/build$ mpirun -np 2 ./test/mp_unit_tests
[==========] Running 33 tests from 7 test suites.
[----------] Global test environment set-up.
[==========] Running 33 tests from 7 test suites.
[----------] Global test environment set-up.
[----------] 3 tests from MultiProcessTest
[ RUN ] MultiProcessTest.Prelim
/home/yhc/mscclpp/test/mp_unit/mp_unit_tests.cc:98: Failure
Expected: (gEnv->worldSize) >= (2), actual: 1 vs 2
[ FAILED ] MultiProcessTest.Prelim (0 ms)
[ RUN ] MultiProcessTest.HostName
[ OK ] MultiProcessTest.HostName (0 ms)
[ RUN ] MultiProcessTest.HostHash
[ OK ] MultiProcessTest.HostHash (0 ms)
[----------] 3 tests from MultiProcessTest (0 ms total)
[----------] 7 tests from BootstrapTest
[ RUN ] BootstrapTest.WithId
[ OK ] BootstrapTest.WithId (2 ms)
[ RUN ] BootstrapTest.WithIpPortPair
[ OK ] BootstrapTest.WithIpPortPair (2 ms)
[ RUN ] BootstrapTest.ResumeWithId
[----------] 3 tests from MultiProcessTest
[ RUN ] MultiProcessTest.Prelim
/home/yhc/mscclpp/test/mp_unit/mp_unit_tests.cc:98: Failure
Expected: (gEnv->worldSize) >= (2), actual: 1 vs 2
[ FAILED ] MultiProcessTest.Prelim (0 ms)
[ RUN ] MultiProcessTest.HostName
[ OK ] MultiProcessTest.HostName (0 ms)
[ RUN ] MultiProcessTest.HostHash
[ OK ] MultiProcessTest.HostHash (0 ms)
[----------] 3 tests from MultiProcessTest (0 ms total)
[----------] 7 tests from BootstrapTest
[ RUN ] BootstrapTest.WithId
[ OK ] BootstrapTest.WithId (4 ms)
[ RUN ] BootstrapTest.WithIpPortPair
[ OK ] BootstrapTest.ResumeWithId (28 ms)
[ RUN ] BootstrapTest.ResumeWithIpPortPair
[ OK ] BootstrapTest.WithIpPortPair (3 ms)
[ RUN ] BootstrapTest.ResumeWithId
[ OK ] BootstrapTest.ResumeWithId (23 ms)
[ RUN ] BootstrapTest.ResumeWithIpPortPair
unknown file: Failure
C++ exception with description "TcpBootstrap connection timeout (Mscclpp failure: Timeout)" thrown in the test body.
[ FAILED ] BootstrapTest.ResumeWithIpPortPair (30000 ms)
[ RUN ] BootstrapTest.ExitBeforeConnect
[ OK ] BootstrapTest.ExitBeforeConnect (1 ms)
[ RUN ] BootstrapTest.TimeoutWithId
/home/yhc/mscclpp/test/mp_unit/bootstrap_tests.cc:106: Failure
Expected: (timer.elapsed()) > (1000000), actual: 4669 vs 1000000
[ FAILED ] BootstrapTest.TimeoutWithId (4 ms)
[ RUN ] BootstrapTest.MPIBootstrap
[ OK ] BootstrapTest.MPIBootstrap (0 ms)
[----------] 7 tests from BootstrapTest (30039 ms total)
[----------] 3 tests from IbPeerToPeerTest
[ RUN ] IbPeerToPeerTest.SimpleSendRecv
unknown file: Failure
C++ exception with description "TcpBootstrap connection timeout (Mscclpp failure: Timeout)" thrown in the test body.
[ FAILED ] BootstrapTest.ResumeWithIpPortPair (30001 ms)
[ RUN ] BootstrapTest.ExitBeforeConnect
[ OK ] BootstrapTest.ExitBeforeConnect (1 ms)
[ RUN ] BootstrapTest.TimeoutWithId
/home/yhc/mscclpp/test/mp_unit/bootstrap_tests.cc:106: Failure
Expected: (timer.elapsed()) > (1000000), actual: 4663 vs 1000000
[ FAILED ] BootstrapTest.TimeoutWithId (4 ms)
[ RUN ] BootstrapTest.MPIBootstrap
[ OK ] BootstrapTest.MPIBootstrap (0 ms)
[----------] 7 tests from BootstrapTest (30039 ms total)
[----------] 3 tests from IbPeerToPeerTest
[ RUN ] IbPeerToPeerTest.SimpleSendRecv
[ubuntu:16923] *** Process received signal ***
[ubuntu:16923] Signal: Floating point exception (8)
[ubuntu:16923] Signal code: Integer divide-by-zero (1)
[ubuntu:16923] Failing at address: 0x557a78cc5519
[ubuntu:16923] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f13409ba420]
[ubuntu:16923] [ 1] ./test/mp_unit_tests(+0x43519)[0x557a78cc5519]
[ubuntu:16923] [ 2] ./test/mp_unit_tests(+0x4369c)[0x557a78cc569c]
[ubuntu:16923] [ 3] ./test/mp_unit_tests(+0xaa7c1)[0x557a78d2c7c1]
[ubuntu:16923] [ 4] ./test/mp_unit_tests(+0x9573d)[0x557a78d1773d]
[ubuntu:16923] [ 5] ./test/mp_unit_tests(+0x95f0a)[0x557a78d17f0a]
[ubuntu:16923] [ 6] ./test/mp_unit_tests(+0x969eb)[0x557a78d189eb]
[ubuntu:16923] [ 7] ./test/mp_unit_tests(+0x9ef57)[0x557a78d20f57]
[ubuntu:16923] [ 8] ./test/mp_unit_tests(+0x96043)[0x557a78d18043]
[ubuntu:16923] [ 9] ./test/mp_unit_tests(+0x2f279)[0x557a78cb1279]
[ubuntu:16923] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f13402b1083]
[ubuntu:16923] [11] ./test/mp_unit_tests(+0x32a8e)[0x557a78cb4a8e]
[ubuntu:16923] *** End of error message ***
[ubuntu:16924] *** Process received signal ***
[ubuntu:16924] Signal: Floating point exception (8)
[ubuntu:16924] Signal code: Integer divide-by-zero (1)
[ubuntu:16924] Failing at address: 0x559e4c91e519
[ubuntu:16924] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f48318f4420]
[ubuntu:16924] [ 1] ./test/mp_unit_tests(+0x43519)[0x559e4c91e519]
[ubuntu:16924] [ 2] ./test/mp_unit_tests(+0x4369c)[0x559e4c91e69c]
[ubuntu:16924] [ 3] ./test/mp_unit_tests(+0xaa7c1)[0x559e4c9857c1]
[ubuntu:16924] [ 4] ./test/mp_unit_tests(+0x9573d)[0x559e4c97073d]
[ubuntu:16924] [ 5] ./test/mp_unit_tests(+0x95f0a)[0x559e4c970f0a]
[ubuntu:16924] [ 6] ./test/mp_unit_tests(+0x969eb)[0x559e4c9719eb]
[ubuntu:16924] [ 7] ./test/mp_unit_tests(+0x9ef57)[0x559e4c979f57]
[ubuntu:16924] [ 8] ./test/mp_unit_tests(+0x96043)[0x559e4c971043]
[ubuntu:16924] [ 9] ./test/mp_unit_tests(+0x2f279)[0x559e4c90a279]
[ubuntu:16924] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f48311eb083]
[ubuntu:16924] [11] ./test/mp_unit_tests(+0x32a8e)[0x559e4c90da8e]
[ubuntu:16924] *** End of error message ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 16923 RUNNING AT ubuntu
= KILLED BY SIGNAL: 8 (Floating point exception)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 16924 RUNNING AT ubuntu
= KILLED BY SIGNAL: 8 (Floating point exception)
===================================================================================`
Many thanks for the reply. I have another question: is NVLink necessary when using mscclpp for intra-node communication across GPUs?
Best Regards,
Hi Haochen Yuan,
mscclpp should work without NVLink, but our algorithm is optimized for NVLink, so performance will be suboptimal over PCIe. We also don't have a specific test for environments that lack InfiniBand and use PCIe. There is one test that uses only Ethernet, if you want to try it: `mpirun -np 2 ./test/mp_unit_tests --gtest_filter=ProxyChannelOneToOneTest.PingPongEthernet`
Best Regards,
Caio Rocha
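The `--gtest_filter` invocation in the reply above selects a single test. As a rough sketch, the same GoogleTest flag also accepts negative patterns after a leading `-`, which could be used to skip the InfiniBand suite seen in the log (`IbPeerToPeerTest`) on an Ethernet-only machine. The names `build_filter`, `run_ethernet_tests`, `SKIP_SUITES`, and `RUN_TESTS` are hypothetical helpers for illustration, not part of mscclpp, and the list of suites to skip is an assumption based only on the suite name in the log; other suites may still depend on InfiniBand.

```shell
#!/usr/bin/env bash
# Sketch only: print (or run) an mpirun command that skips InfiniBand suites.
# Assumption: SKIP_SUITES lists every suite that needs InfiniBand. The log in
# this thread only shows IbPeerToPeerTest; other suites may need it as well.
SKIP_SUITES="${SKIP_SUITES:-IbPeerToPeerTest}"

# Turn a space-separated list "A B" into the negative filter "-A.*:B.*".
# In GoogleTest, every colon-separated pattern after '-' is an exclusion.
build_filter() {
  local filter="" suite
  for suite in $1; do
    filter="${filter:+$filter:}${suite}.*"
  done
  printf '%s\n' "-${filter}"
}

# Dry run by default: print the command. Set RUN_TESTS=1 to execute it.
run_ethernet_tests() {
  local filter
  filter="$(build_filter "$SKIP_SUITES")"
  if [ "${RUN_TESTS:-0}" = "1" ]; then
    mpirun -np 2 ./test/mp_unit_tests "--gtest_filter=${filter}"
  else
    # The filter is quoted in the printed command so the shell does not expand '*'.
    echo "mpirun -np 2 ./test/mp_unit_tests '--gtest_filter=${filter}'"
  fi
}

run_ethernet_tests
```

Run from the build directory, the script prints the command by default; `./test/mp_unit_tests --gtest_list_tests` lists the available suites if more of them need to go into `SKIP_SUITES`.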
From: H.Yuan
Sent: September 3, 2024 12:11 AM
To: microsoft/mscclpp
Cc: Caio Rocha; State change
Subject: Re: [microsoft/mscclpp] Ethernet support (PR #284)
Many thanks for the reply. I have another question: is NVLink necessary when using mscclpp for intra-node communication across GPUs?
If it can work without NVLink, are there any test scripts for GPU-to-GPU communication within one server?
Best Regards,
Haochen Yuan
Got it. Thanks for your patient reply~
No description provided.