[Performance] Improve single-node AllReduce latency #164

Closed
chhwang opened this issue Aug 16, 2023 · 2 comments · Fixed by #169

chhwang commented Aug 16, 2023

Target latencies:

[LATENCY SUMMARY]:
|   n_ctx | size    |   latency (us) | bwidth    |
|---------|---------|----------------|-----------|
|       1 | 24.0kB  |            7.7 | 3.0GB/s   |
|       2 | 48.0kB  |            7.7 | 6.0GB/s   |
|       4 | 96.0kB  |            8   | 11.5GB/s  |
|       8 | 192.0kB |           12.6 | 14.5GB/s  |
|      12 | 288.0kB |           13   | 21.1GB/s  |
|      16 | 384.0kB |           13.3 | 27.6GB/s  |
|      32 | 768.0kB |           15.2 | 48.1GB/s  |
|      48 | 1.1MB   |           19.1 | 57.5GB/s  |
|      64 | 1.5MB   |           21.9 | 66.8GB/s  |
|      80 | 1.9MB   |           26.7 | 68.5GB/s  |
|      96 | 2.2MB   |           28.8 | 76.2GB/s  |
|     112 | 2.6MB   |           32   | 80.0GB/s  |
|     128 | 3.0MB   |           36.8 | 79.6GB/s  |
|     160 | 3.8MB   |           43.2 | 84.7GB/s  |
|     192 | 4.5MB   |           49.2 | 89.3GB/s  |
|     224 | 5.2MB   |           55.3 | 92.8GB/s  |
|     256 | 6.0MB   |           63.1 | 92.9GB/s  |
|     288 | 6.8MB   |           69.6 | 94.7GB/s  |
|     320 | 7.5MB   |           77   | 95.1GB/s  |
|     352 | 8.2MB   |           76.9 | 104.7GB/s |
|     384 | 9.0MB   |           83.6 | 105.2GB/s |
|     416 | 9.8MB   |           89.7 | 106.1GB/s |
|     448 | 10.5MB  |           95.9 | 106.9GB/s |
|     480 | 11.2MB  |          102.8 | 106.9GB/s |
|     512 | 12.0MB  |          109.4 | 107.1GB/s |
|     640 | 15.0MB  |          133.9 | 109.4GB/s |
|     768 | 18.0MB  |          158.7 | 110.7GB/s |
|     896 | 21.0MB  |          184.5 | 111.2GB/s |
|    1024 | 24.0MB  |          209.5 | 111.8GB/s |
|    1152 | 27.0MB  |          234.3 | 112.5GB/s |
|    1280 | 30.0MB  |          260   | 112.7GB/s |
|    1408 | 33.0MB  |          284.9 | 113.1GB/s |
|    1536 | 36.0MB  |          310.3 | 113.3GB/s |
|    1664 | 39.0MB  |          336.2 | 113.3GB/s |
|    1792 | 42.0MB  |          361.4 | 113.5GB/s |
|    1920 | 45.0MB  |          384.6 | 114.3GB/s |
|    2048 | 48.0MB  |          409.1 | 114.6GB/s |
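
The `bwidth` column can be reproduced from the other columns. The sketch below assumes each `n_ctx` unit is 24 kB, that kB/MB are binary (1 kB = 1024 B), and that "GB/s" means 2**30 bytes per second. These assumptions are inferred from the table values, not stated in the issue, and the function name is illustrative.

```python
# Assumption: size = n_ctx * 24 kB with binary units, as the table rows suggest.
BYTES_PER_CTX = 24 * 1024


def bandwidth_gib_per_s(n_ctx: int, latency_us: float) -> float:
    """Bytes moved divided by latency, expressed in 2**30 bytes per second."""
    size_bytes = n_ctx * BYTES_PER_CTX
    return size_bytes / (latency_us * 1e-6) / 2**30


# (n_ctx, latency in us) pairs copied from the table above.
for n_ctx, latency_us in [(1, 7.7), (512, 109.4), (2048, 409.1)]:
    print(n_ctx, round(bandwidth_gib_per_s(n_ctx, latency_us), 1))
```

With decimal units (1e9 bytes per second) the same rows would come out roughly 7% higher, so the unit convention matters when comparing against other tools.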

Binyang2014 commented Aug 23, 2023

Current performance numbers for the kernel 4 all-reduce, 12MB-48MB:

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    12582912       3145728    123.4  101.94  178.39      0
    15728640       3932160    147.3  106.81  186.91      0
    18874368       4718592    171.9  109.78  192.11      0
    22020096       5505024    195.6  112.58  197.01      0
    25165824       6291456    219.2  114.83  200.95      0
    28311552       7077888    245.8  115.20  201.60      0
    31457280       7864320    270.0  116.52  203.90      0
    34603008       8650752    296.9  116.56  203.97      0
    37748736       9437184    321.1  117.56  205.74      0
    40894464      10223616    347.1  117.81  206.17      0
    44040192      11010048    371.8  118.45  207.29      0
    47185920      11796480    398.4  118.44  207.26      0
    50331648      12582912    422.5  119.12  208.47      0

3MB-12MB:

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     3145728        786432    50.91   61.79  108.13      0
     3964928        991232    57.08   69.46  121.55      0
     4784128       1196032    64.54   74.13  129.73      0
     5603328       1400832    71.02   78.90  138.08      0
     6422528       1605632    76.05   84.46  147.80      0
     7241728       1810432    84.31   85.89  150.31      0
     8060928       2015232    89.92   89.64  156.87      0
     8880128       2220032    95.47   93.02  162.78      0
     9699328       2424832    103.2   94.03  164.55      0
    10518528       2629632    108.3   97.09  169.91      0
    11337728       2834432    114.0   99.47  174.07      0
    12156928       3039232    121.7   99.92  174.86      0
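
For reading the `algbw` and `busbw` columns, here is a minimal sketch that assumes the output follows the nccl-tests convention: `algbw` is size divided by time in decimal GB/s, and for all-reduce `busbw = algbw * 2 * (n - 1) / n`. The rank count of 8 is an assumption, inferred from the constant `busbw / algbw` ratio of 1.75 in the rows above; the function names are illustrative.

```python
def algbw_gb_per_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: message size over elapsed time, in 1e9 bytes/s."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gb_per_s(algbw: float, n_ranks: int) -> float:
    """All-reduce bus bandwidth under the nccl-tests convention."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# Last row of the 12MB-48MB table: 50331648 B in 422.5 us.
algbw = algbw_gb_per_s(50331648, 422.5)
busbw = busbw_gb_per_s(algbw, n_ranks=8)  # n_ranks=8 is an assumption
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Small differences from the reported values in the second decimal are expected, since the time column is itself rounded.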

Binyang2014 commented Aug 30, 2023

[LATENCY SUMMARY]:

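| n_ctx | size    | target latency (us) | allreduce5 / allreduce6 (us) |
|-------|---------|---------------------|------------------------------|
| 1     | 24.0kB  | 7.7                 | 7.23                         |
| 2     | 48.0kB  | 7.7                 | 7.69                         |
| 4     | 96.0kB  | 8                   | 8.34                         |
| 8     | 192.0kB | 12.6                | 9.75                         |
| 12    | 288.0kB | 13                  | 11.34                        |
| 16    | 384.0kB | 13.3                | 12.99                        |
| 32    | 768.0kB | 15.2                | 21.53                        |
| 48    | 1.1MB   | 19.1                |                              |
| 64    | 1.5MB   | 21.9                |                              |
| 80    | 1.9MB   | 26.7                |                              |
| 96    | 2.2MB   | 28.8                |                              |
| 112   | 2.6MB   | 32                  |                              |
| 128   | 3.0MB   | 36.8                |                              |
| 160   | 3.8MB   | 43.2                |                              |
| 192   | 4.5MB   | 49.2                |                              |
| 224   | 5.2MB   | 55.3                |                              |
| 256   | 6.0MB   | 63.1                |                              |
| 288   | 6.8MB   | 69.6                |                              |
| 320   | 7.5MB   | 77                  |                              |
| 352   | 8.2MB   | 76.9                |                              |
| 384   | 9.0MB   | 83.6                |                              |
| 416   | 9.8MB   | 89.7                |                              |
| 448   | 10.5MB  | 95.9                |                              |
| 480   | 11.2MB  | 102.8               |                              |
| 512   | 12.0MB  | 109.4               | 113.4                        |
| 640   | 15.0MB  | 133.9               | 136.9                        |
| 768   | 18.0MB  | 158.7               | 160.3                        |
| 896   | 21.0MB  | 184.5               | 183.8                        |
| 1024  | 24.0MB  | 209.5               | 207.5                        |
| 1152  | 27.0MB  | 234.3               | 231.9                        |
| 1280  | 30.0MB  | 260                 | 255.6                        |
| 1408  | 33.0MB  | 284.9               | 278.7                        |
| 1536  | 36.0MB  | 310.3               | 302.0                        |
| 1664  | 39.0MB  | 336.2               | 325.3                        |
| 1792  | 42.0MB  | 361.4               | 348.8                        |
| 1920  | 45.0MB  | 384.6               | 372.2                        |
| 2048  | 48.0MB  | 409.1               | 395.4                        |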
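
To see at a glance where the measured numbers meet the target, one option is to compute the relative gap per row. This is a rough sketch using a few rows copied from the summary above; the helper name and the choice of rows are illustrative, not part of the issue.

```python
def relative_gap_pct(target_us: float, measured_us: float) -> float:
    """Positive means slower than the target, negative means faster."""
    return (measured_us - target_us) / target_us * 100.0


# (n_ctx, target latency in us, measured latency in us), copied from the summary.
rows = [
    (1, 7.7, 7.23),
    (32, 15.2, 21.53),
    (512, 109.4, 113.4),
    (2048, 409.1, 395.4),
]

for n_ctx, target_us, measured_us in rows:
    gap = relative_gap_pct(target_us, measured_us)
    print(f"n_ctx={n_ctx:5d} gap={gap:+.1f}%")
```

On these rows the measured latency beats the target at the smallest and largest sizes, and is slower at 768 kB and 12 MB.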
