Gravel: Fine-Grain GPU-Initiated Network Messages

Marc S. Orr, Shuai Che, Bradford M. Beckmann, Mark Oskin, Steven K. Reinhardt, and David A. Wood. 2017. Gravel: fine-grain GPU-initiated network messages. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). Association for Computing Machinery, New York, NY, USA, Article 23, 1–12. DOI: https://doi.org/10.1145/3126908.3126914

Learn more about this stuff

  • PGAS (Partitioned Global Address Space) style communication

Notes

  • GPU-initiated communication is inefficient for small messages.
  • Gravel offloads GPU-initiated messages through a GPU-queue to an aggregator.
  • The aggregator is implemented with CPU threads. It repacks the messages into per-node queues and hands a queue to the NI once it fills up or a timeout expires (this is where Gravel builds the large messages that amortize network overhead); see the sketch after this list.
  • Leverages a diverged WG-level semantic to asynchronously offload messages to the NI.
  • An alternative, used in prior work, is to offload messages at wavefront granularity, but offloading at WG granularity is approximately 3x faster.
  • Gravel’s “CPU-side aggregation” strategy scales better than the “GPU-side aggregation” of the coprocessor model and coalesced APIs: as the number of destinations (and per-node queues) increases, GPU-side aggregation suffers low SIMT utilization because WIs in the same WG write to different queues.
  • In Gravel, the GPU always writes messages to a single queue.
  • Gravel is similar to NVSHMEM, but it also targets interconnects such as Ethernet and InfiniBand (NVSHMEM is limited to PCIe and NVLink).
  • Main idea: combine small messages targeting the same node into larger messages, then initiate the network request.
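
A minimal sketch of the aggregation idea above, assuming hypothetical names (SmallMsg, pop_from_gpu_queue, network_send are illustrative, not Gravel’s actual API): CPU aggregator threads drain the single GPU-filled queue, repack messages into per-node buffers, and flush a buffer when it fills or its timeout expires.

    // Hypothetical CPU-side aggregator loop; all names are illustrative.
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct SmallMsg {            // a fine-grain GPU-initiated message
        uint32_t dst_node;       // destination machine
        uint32_t payload;        // e.g., a vertex id in a graph algorithm
    };

    struct NodeBuffer {          // per-destination staging buffer
        std::vector<SmallMsg> msgs;
        std::chrono::steady_clock::time_point first_enqueue;
    };

    // pop_from_gpu_queue drains the single GPU-to-CPU queue; network_send hands
    // one large, repacked message to the NI. Both are assumed to exist elsewhere.
    void aggregator_loop(std::vector<NodeBuffer> &bufs,
                         std::function<bool(SmallMsg &)> pop_from_gpu_queue,
                         std::function<void(uint32_t, const SmallMsg *, size_t)> network_send,
                         size_t flush_threshold,
                         std::chrono::milliseconds timeout) {
        for (;;) {
            SmallMsg m;
            while (pop_from_gpu_queue(m)) {                  // repack into per-node queues
                NodeBuffer &b = bufs[m.dst_node];
                if (b.msgs.empty())
                    b.first_enqueue = std::chrono::steady_clock::now();
                b.msgs.push_back(m);
                if (b.msgs.size() >= flush_threshold) {      // full: send one big message
                    network_send(m.dst_node, b.msgs.data(), b.msgs.size());
                    b.msgs.clear();
                }
            }
            auto now = std::chrono::steady_clock::now();
            for (uint32_t node = 0; node < bufs.size(); ++node) {   // timeout path
                NodeBuffer &b = bufs[node];
                if (!b.msgs.empty() && now - b.first_enqueue >= timeout) {
                    network_send(node, b.msgs.data(), b.msgs.size());
                    b.msgs.clear();
                }
            }
        }
    }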

Challenges of routing a message from a GPU thread to the Network Interface

  1. Network Interface access coordination: a dependency between WIs executing in lockstep can cause deadlock (see the sketch after this list).
  2. Cost of producer/consumer synchronization: for example, in a graph algorithm it is typical to initiate a small message (e.g., a few bytes) every time a vertex’s neighbor resides on a different machine, so the synchronization cost per message is hard to amortize.
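
To make challenge 1 concrete, here is a minimal CUDA-style sketch (not from the paper; ni_lock and the kernel are illustrative) of how a naive per-thread lock around the NI can hang a wavefront on hardware that schedules lanes in strict lockstep:

    // Hypothetical lock guarding the network interface.
    __device__ int ni_lock = 0;

    __global__ void per_thread_send() {
        // Every work item tries to grab the lock before touching the NI.
        while (atomicCAS(&ni_lock, 0, 1) != 0) {
            // Lanes that lose the CAS spin here. Under strict lockstep
            // scheduling (e.g., pre-Volta NVIDIA or GCN-era AMD wavefronts),
            // the winning lane may never be scheduled past this loop to reach
            // the unlock below, so the whole wavefront can deadlock.
        }
        // ... post a message descriptor to the NI (omitted) ...
        atomicExch(&ni_lock, 0);   // release the lock
    }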

Three programming abstraction proposals

  1. Coprocessor model: disallows the GPU from accessing the NI; you write CPU code that performs the communication before and after a GPU kernel (a sketch follows the challenges below).

Examples: CUDA RDMA, CUDA-aware MPI

Challenges:

  • Organize messages into per-node queues.
  • Avoid overflowing the queue.
  • Manually send and receive queues and overlap sends/recvs with GPU execution.
  • GPU code needs to insert into queues efficiently.
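
A hedged sketch of the coprocessor model, assuming a CUDA-aware MPI build and an even number of ranks (the buffer size and fill pattern are illustrative): the kernel only fills a device-side outbox, and all NI access happens on the CPU after the kernel returns.

    #include <cuda_runtime.h>
    #include <mpi.h>

    __global__ void fill_outbox(int *outbox, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) outbox[i] = i;              // stand-in for real per-vertex messages
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        int *outbox; cudaMalloc(&outbox, n * sizeof(int));

        fill_outbox<<<(n + 255) / 256, 256>>>(outbox, n);
        cudaDeviceSynchronize();               // communication cannot start until the kernel ends

        int peer = rank ^ 1;                   // pair up ranks; assumes an even world size
        if (rank % 2 == 0)
            MPI_Send(outbox, n, MPI_INT, peer, 0, MPI_COMM_WORLD);   // device pointer, CUDA-aware MPI
        else
            MPI_Recv(outbox, n, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(outbox);
        MPI_Finalize();
        return 0;
    }

Note that the queue management, overflow handling, and send/recv overlap listed in the challenges above would all have to be written by hand around this skeleton.
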
  2. Message-per-lane model: GPU threads independently access the NI, which may generate small, high-overhead messages (a sketch follows the challenges below).

Examples: NVSHMEM

Challenges:

  • Manage GPU-initiated messages in a SIMT-efficient way, or rely on special hardware (e.g., NVSHMEM relies on NVLink or a shared PCIe fabric).
  • Messages generated by WIs are too small for efficient network transmission.
  • WF width is too small to amortize synchronization overhead.
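
A hedged message-per-lane sketch using NVSHMEM (the data layout and scatter pattern are illustrative): each WI issues its own one-element put, so every network message is only a few bytes, which is exactly the overhead this model suffers from.

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void scatter_one_per_lane(int *remote, int n, int n_pes) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int dest_pe = i % n_pes;                        // each WI picks its own destination
            nvshmem_int_p(&remote[i / n_pes], i, dest_pe);  // a tiny per-lane message
        }
    }

    int main() {
        nvshmem_init();
        const int n = 1 << 20;
        int n_pes = nvshmem_n_pes();
        int per_pe = (n + n_pes - 1) / n_pes;

        // Symmetric-heap buffer: one slot per incoming element on each PE.
        int *remote = (int *)nvshmem_malloc(per_pe * sizeof(int));

        scatter_one_per_lane<<<(n + 255) / 256, 256>>>(remote, n, n_pes);
        cudaDeviceSynchronize();                            // wait for the kernel
        nvshmem_barrier_all();                              // ensure all puts have landed

        nvshmem_free(remote);
        nvshmem_finalize();
        return 0;
    }
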
  3. Coalesced APIs: WIs coordinate with their neighbors to access the NI. Harder to program, but adjacent WIs combine into larger messages (larger than message-per-lane, smaller than the coprocessor model); a sketch follows the challenges below.

Examples: GPUnet, GPUrdma

Challenges:

  • Aggregating across a WG (instead of the entire GPU) generates small per-node queues.
  • Coalesced APIs are invoked for each destination, which degrades SIMT utilization.
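
A hedged sketch of the coalescing idea (not GPUnet’s or GPUrdma’s actual API; post_to_ni is a hypothetical hook and the lane math assumes 32-wide warps): lanes that target the same node elect a leader that posts one combined message, and the loop body runs once per distinct destination, which is the serialization behind the SIMT-utilization challenge above.

    // Hypothetical NI hook: a real implementation would write a descriptor for
    // 'count' payloads destined to 'node' into an NI command queue.
    __device__ void post_to_ni(int node, int count) {}

    __device__ void coalesced_send(int my_node) {
        int lane = threadIdx.x & 31;
        unsigned full = __activemask();          // lanes participating in this call
        unsigned pending = full;                 // lanes whose message is not yet posted
        while (pending) {                        // one iteration per distinct destination
            int leader = __ffs(pending) - 1;     // lowest still-pending lane
            int node = __shfl_sync(full, my_node, leader);          // broadcast its destination
            unsigned same = __ballot_sync(full,
                ((pending >> lane) & 1u) && (my_node == node));     // lanes sharing that destination
            if (lane == leader)
                post_to_ni(node, __popc(same));  // leader posts one combined message
            pending &= ~same;
        }
    }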

Terminology and abbreviations used in the paper

  • GPU thread = work item (WI)
  • Network Interface = NI
  • Wavefront (Warp) = WF
  • Workgroup (Block) = WG
  • Compute Unit (Streaming Multiprocessor) = CU