why exclude CUDA write lat test? #276

Closed
yangrudan opened this issue Aug 7, 2024 · 5 comments

Comments

@yangrudan

Can you explain the reason for excluding the CUDA write latency test? The RDMA write latency test, on the other hand, works fine.

fprintf(stderr,"Perftest supports CUDA latency tests with read/send verbs only\n");

@drossetti

For send_lat, the CPU can poll on the CQ irrespective of where the RX buffer is placed.
For write_lat instead, when using RDMA_WRITE, there are no CQEs on the RX side and the CPU cannot (officially) poll on the received data since that sits on CUDA device memory.
Considering #230, we could look at enabling CUDA when RDMA_WRITE_WITH_IMM is selected.
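
The distinction above can be sketched with a small toy model in plain C. This is not perftest code and does not use libibverbs; `mock_rx`, `mock_cq`, `mock_deliver` and `mock_detect` are hypothetical names introduced only for illustration. The assumption is that SEND and RDMA_WRITE_WITH_IMM produce a receive-side completion, while plain RDMA_WRITE can only be noticed by reading the buffer itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a completion queue: a count of pending entries. */
struct mock_cq { int pending; };

/* Hypothetical stand-in for the receive side of a latency test. */
struct mock_rx {
    uint8_t buf[64];       /* RX buffer; may conceptually live in GPU memory */
    int cpu_can_read_buf;  /* 0 models a buffer placed in CUDA device memory */
    struct mock_cq cq;
};

enum mock_op { MOCK_SEND, MOCK_WRITE, MOCK_WRITE_WITH_IMM };

/* Model one incoming message: the payload always lands in the buffer,
 * but only SEND and WRITE_WITH_IMM leave a completion on the receiver. */
static void mock_deliver(struct mock_rx *rx, enum mock_op op, uint8_t marker)
{
    assert(rx != NULL);
    rx->buf[sizeof(rx->buf) - 1] = marker;
    if (op != MOCK_WRITE)
        rx->cq.pending++;
}

/* Returns 1 if the CPU can detect the arrival, 0 if it cannot. */
static int mock_detect(struct mock_rx *rx, enum mock_op op, uint8_t marker)
{
    assert(rx != NULL);
    if (op != MOCK_WRITE) {
        /* Completion path: works wherever the RX buffer is placed. */
        if (rx->cq.pending > 0) {
            rx->cq.pending--;
            return 1;
        }
        return 0;
    }
    /* Data-polling path: the CPU must be able to read the buffer. */
    if (!rx->cpu_can_read_buf)
        return 0;
    return rx->buf[sizeof(rx->buf) - 1] == marker;
}
```

In this model a plain write into a buffer with `cpu_can_read_buf = 0` is never detected, while a write with immediate into the same buffer is, which is the case the comment proposes enabling.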

@yangrudan
Author

Thanks for your answer, @drossetti.

By the way,

  • Is the CUDA write latency about double the RDMA write latency for small payloads?

  • If this phenomenon is normal, why is there such a difference between CUDA and RDMA?

@yangrudan yangrudan reopened this Dec 20, 2024
@yangrudan
Author

Hi, when I enable CUDA with RDMA_WRITE_WITH_IMM, I still get this warning.
Can you help figure out the reason?

(base) root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/cq/perftest# ./ib_write_lat -d  mlx5_cx6_0 -a -F --report_gbit --write_with_imm --use_cuda=0
---------------------------------------------------------------------------------------
Perftest supports CUDA latency tests with read/send verbs only

By the way, the CPU side works fine:

(base) root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/cq/perftest# ./ib_write_lat -d  mlx5_cx6_0 -a -F --report_gbit --write_with_imm

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write_imm Latency Test
 Dual-port       : OFF          Device         : mlx5_cx6_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: OFF          Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 RX depth        : 512
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 220[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x27e6 PSN 0x195edb
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:16:03:16
 remote address: LID 0000 QPN 0x3c99 PSN 0x45ccc2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:16:04:04
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          5.49           12.27        5.69                5.65             0.00            5.81                    12.27  
 4       1000          5.49           8.09         5.55                5.55             0.03            5.60                    8.09   
 8       1000          5.51           7.09         5.55                5.55             0.03            5.61                    7.09   
 16      1000          5.50           8.91         5.55                5.56             0.00            5.61                    8.91   
 32      1000          5.53           16.20        5.59                5.59             0.00            5.64                    16.20  
 64      1000          5.55           6.95         5.60                5.60             0.00            5.66                    6.95   
 128     1000          5.59           6.57         5.65                5.65             0.00            5.70                    6.57   
 256     1000          6.26           7.41         6.33                6.33             0.00            6.51                    7.41   
 512     1000          6.35           8.46         6.41                6.43             0.05            6.64                    8.46   
 1024    1000          6.45           8.25         6.51                6.52             0.03            6.72                    8.25   
 2048    1000          6.63           8.59         6.70                6.72             0.00            6.89                    8.59   
 4096    1000          6.99           9.31         7.15                7.14             0.03            7.29                    9.31   
 8192    1000          7.18           10.01        7.35                7.35             0.03            7.54                    10.01  
 16384   1000          7.58           8.82         7.80                7.80             0.00            7.84                    8.82   
 32768   1000          8.20           10.46        8.45                8.44             0.00            8.67                    10.46  
 65536   1000          9.61           11.63        9.75                9.75             0.00            9.83                    11.63  
 131072  1000          12.27          15.31        12.48               12.47            0.00            12.64                   15.31  

@mrgolin
Contributor

mrgolin commented Dec 22, 2024

We can use ctx->memory->copy_buffer_to_host for data polling in run_iter_lat_write(). It should already point to the right implementations for reading from HBM.
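
A minimal sketch of what such data polling might look like, assuming the receiver detects arrival by watching a marker byte at the end of the RX buffer and that the copy callback has a memcpy-like signature. Both are assumptions, not confirmed here; `mock_memory`, `host_copy` and `poll_last_byte` are hypothetical names, and the host implementation below just calls `memcpy` where a CUDA backend would perform a device-to-host copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory abstraction; the memcpy-like signature is an assumption. */
struct mock_memory {
    void *(*copy_buffer_to_host)(void *dest, const void *src, size_t size);
};

/* Host-memory stand-in for the callback: a plain memcpy. */
static void *host_copy(void *dest, const void *src, size_t size)
{
    return memcpy(dest, src, size);
}

/* Poll the last byte of the RX buffer through the copy callback.
 * Returns the number of polls used, or -1 if max_polls is exhausted. */
static long poll_last_byte(const struct mock_memory *mem,
                           const uint8_t *rx_buf, size_t size,
                           uint8_t expected, long max_polls)
{
    assert(mem != NULL && mem->copy_buffer_to_host != NULL);
    assert(rx_buf != NULL && size > 0);

    for (long i = 1; i <= max_polls; i++) {
        uint8_t staged = 0;
        /* Stage one byte into host memory instead of dereferencing rx_buf. */
        mem->copy_buffer_to_host(&staged, rx_buf + size - 1, 1);
        if (staged == expected)
            return i;
    }
    return -1;
}
```

One design consideration: every poll now pays for a copy through the callback, so on a real GPU buffer the measured latency would likely include that copy overhead in addition to the wire latency.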

@yangrudan
Author

We can use ctx->memory->copy_buffer_to_host for data polling in run_iter_lat_write(). It should already point to the right implementations for reading from HBM.

Thanks for your reply. Maybe ctx->memory->copy_buffer_to_host is not suitable for the write latency test? (I don't quite understand. Can you explain more?)

Actually, my question is why CUDA cannot use run_iter_lat_write_imm() to test write latency.
