Marc S. Orr, Shuai Che, Bradford M. Beckmann, Mark Oskin, Steven K. Reinhardt, and David A. Wood. 2017. Gravel: fine-grain GPU-initiated network messages. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). Association for Computing Machinery, New York, NY, USA, Article 23, 1–12. DOI:https://doi.org/10.1145/3126908.3126914
- PGAS (Partitioned Global Address Space) style communication
- GPU-initiated communication is inefficient for small messages.
- Gravel offloads GPU-initiated messages to an aggregator through a single GPU-side queue.
- The aggregator is implemented with CPU threads. It repacks the messages into per-node queues and hands a queue to the NI once it fills up or a timeout expires (this is how Gravel builds large messages that amortize per-message network overhead); a minimal code sketch of this path appears a few bullets below.
- Leverages a diverged WG-level semantic to asynchronously offload messages to the NI.
- An alternative is to offload messages at wavefront granularity, as in prior work, but offloading at WG granularity is approximately 3x faster.
- Gravel’s “CPU-side aggregation” strategy scales better than the “GPU-side aggregation” strategy of the coprocessor model and coalesced APIs: as the number of destinations (and thus per-node queues) increases, GPU-side aggregation suffers low SIMT utilization because WIs in the same WG write to different queues.
- In Gravel, the GPU always writes messages to a single queue, regardless of destination.
- Gravel is similar to NVSHMEM, but it targets other interconnects such as Ethernet and InfiniBand (NVSHMEM works over PCIe and NVLink only).
- Main idea: combine small messages targeting the same node into larger messages, then initiate the network request.
- Network Interface access coordination: a dependency between WIs executing in lockstep can cause deadlock.
- Cost of producer/consumer synchronization: per-message synchronization is expensive relative to tiny payloads. For example, in a graph algorithm it is typical to initiate a small message (e.g., a few bytes) every time a vertex’s neighbor resides on a different machine.
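A minimal sketch of the offload path described above, assuming a hypothetical Msg record, a single host-visible MsgQueue, and a placeholder network_send(); flow control, queue overflow, and full memory-ordering details are omitted, so this only illustrates the single-queue/aggregator split, not Gravel's actual implementation:

```cuda
#include <cuda_runtime.h>
#include <atomic>
#include <chrono>
#include <vector>

struct Msg { int dst; int payload; };            // hypothetical small message

struct MsgQueue {                                // assumed to live in host-mapped memory
    Msg*               slots;
    unsigned long long capacity;
    unsigned long long tail;                     // bumped by GPU producers
};

void network_send(int dst_node, const std::vector<Msg>& buf);  // placeholder NI call

// GPU side: every WI with a message appends to the ONE shared queue,
// regardless of destination; per-node separation happens on the CPU.
__device__ void gravel_send(MsgQueue* q, int dst_node, int payload)
{
    unsigned long long slot = atomicAdd(&q->tail, 1ULL);       // reserve a slot
    q->slots[slot % q->capacity] = Msg{dst_node, payload};     // overflow ignored here
    __threadfence_system();                      // make the record visible to the CPU
}

// CPU side: an aggregator thread drains the shared queue, repacks records into
// per-node buffers, and flushes a buffer when it fills or a timeout expires.
void aggregator_loop(MsgQueue* q, int num_nodes, size_t flush_threshold,
                     std::chrono::milliseconds timeout, std::atomic<bool>& running)
{
    std::vector<std::vector<Msg>> per_node(num_nodes);
    unsigned long long head = 0;
    auto last_flush = std::chrono::steady_clock::now();
    while (running.load()) {
        // A real implementation needs system-scope atomics; a volatile read
        // of the GPU-updated tail stands in for that here.
        while (head < *(volatile unsigned long long*)&q->tail) {
            Msg m = q->slots[head % q->capacity];
            per_node[m.dst].push_back(m);
            if (per_node[m.dst].size() >= flush_threshold) {   // queue full: send now
                network_send(m.dst, per_node[m.dst]);
                per_node[m.dst].clear();
            }
            ++head;
        }
        if (std::chrono::steady_clock::now() - last_flush > timeout) {
            for (int n = 0; n < num_nodes; ++n)                // timeout: flush partial buffers
                if (!per_node[n].empty()) { network_send(n, per_node[n]); per_node[n].clear(); }
            last_flush = std::chrono::steady_clock::now();
        }
    }
}
```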
- Coprocessor model: the GPU is not allowed to access the NI; communication is written as CPU code that runs before and after a GPU kernel (a sketch of this pattern follows the challenge list below).
Examples: CUDA RDMA, CUDA-aware MPI
Challenges:
- Organize messages into per-node queues.
- Avoid overflowing the queue.
- Manually send and receive queues, and overlap sends/recvs with GPU execution.
- GPU code needs to insert into queues efficiently.
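A hedged sketch of the coprocessor pattern, assuming a hypothetical produce_messages kernel and plain MPI sends; with CUDA-aware MPI the staging copy could be reduced, but communication still cannot start until the kernel completes:

```cuda
#include <cuda_runtime.h>
#include <mpi.h>
#include <vector>

struct Msg { int dst; int payload; };

// Hypothetical compute phase: each thread emits one (dst, payload) record.
__global__ void produce_messages(Msg* out, int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int slot = atomicAdd(count, 1);
    out[slot] = Msg{ i % 4, i };                 // pretend node (i % 4) owns the data
}

void coprocessor_step(Msg* d_out, int* d_count, int num_nodes)
{
    cudaMemset(d_count, 0, sizeof(int));
    produce_messages<<<64, 256>>>(d_out, d_count);        // GPU compute phase
    cudaDeviceSynchronize();                              // kernel must finish first

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<Msg> msgs(count);
    cudaMemcpy(msgs.data(), d_out, count * sizeof(Msg), cudaMemcpyDeviceToHost);

    // CPU organizes records into per-node queues and issues one large send
    // per destination; none of this overlaps with the kernel above.
    std::vector<std::vector<Msg>> per_node(num_nodes);
    for (const Msg& m : msgs) per_node[m.dst].push_back(m);
    for (int n = 0; n < num_nodes; ++n)
        if (!per_node[n].empty())
            MPI_Send(per_node[n].data(),
                     (int)(per_node[n].size() * sizeof(Msg)),
                     MPI_BYTE, n, /*tag=*/0, MPI_COMM_WORLD);
}
```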
- Message-per-lane model: GPU threads access the NI independently, which may generate many small, high-overhead messages (a sketch follows the challenge list below).
Examples: NVSHMEM
Challenges:
- Manage GPU-initiated messages in a SIMT-efficient way, or rely on special hardware (e.g., NVSHMEM requires NVLink or a shared PCIe fabric).
- Messages generated by WIs are too small for efficient network transmission.
- WF width is too small to amortize synchronization overhead.
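For contrast, a sketch of the message-per-lane pattern using NVSHMEM's single-element put (the graph-push setting and all names besides nvshmem_int_p are illustrative assumptions):

```cuda
#include <nvshmem.h>

// One tiny put per lane: every WI whose neighbor lives on another PE issues
// its own few-byte message, so the NI sees many small requests.
__global__ void push_updates(const int* neighbor,   // remote vertex indices
                             const int* owner_pe,   // PE that owns each neighbor
                             int* inbox,            // symmetric array on every PE
                             int num_edges, int value)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;
    // nvshmem_int_p is NVSHMEM's single-element put; it is shown here only to
    // illustrate the per-lane model, not Gravel's approach.
    nvshmem_int_p(&inbox[neighbor[e]], value, owner_pe[e]);
}
```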
- Coalesced APIs: WIs coordinate with their neighbors to access the NI. Harder to program, but adjacent WIs combine their data into larger messages (larger than message-per-lane, smaller than the coprocessor model).
Examples: GPUnet, GPUrdma
Challenges:
- Aggregating across a WG (instead of the entire GPU) generates small per-node queues.
- Coalesced APIs are invoked once per destination, which degrades SIMT utilization (see the sketch below).
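A sketch (illustrative names only, not an API from GPUnet or GPUrdma) of why GPU-side aggregation into per-node queues loses SIMT efficiency as the number of destinations grows:

```cuda
#include <cuda_runtime.h>

// Work-items append directly to per-node buffers, so lanes in the same WF hit
// different queues: their atomics and stores scatter instead of forming one
// wide write, and each buffer only collects a small amount of traffic.
__global__ void gpu_side_aggregate(const int* dst_of, const int* payload, int n,
                                   int* per_node_buf,   // [num_nodes][CAP]
                                   int* per_node_len)   // [num_nodes]
{
    const int CAP = 1024;                      // per-node buffer capacity
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int node = dst_of[i];                      // neighboring lanes diverge here
    int slot = atomicAdd(&per_node_len[node], 1);
    if (slot < CAP)
        per_node_buf[node * CAP + slot] = payload[i];
    // Contrast with the gravel_send sketch above: a single shared queue keeps
    // every lane's write pattern uniform regardless of destination.
}
```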
- GPU thread = work item (WI)
- Network Interface = NI
- Wavefront (Warp) = WF
- Workgroup (Thread Block) = WG
- Compute Unit (Streaming Multiprocessor) = CU