Add ROCm as alternative to CUDA for plugin use #461

ryanhankins · 2024-06-27T15:00:00Z

Description of changes:

See commit messages for more detail. Add a --with-rocm flag to configure.ac to switch between CUDA and ROCm GPU calls, in order to support AMD GPUs. Add code to the source files to abstract the CUDA calls and, when the --with-rocm option is used, to call the ROCm alternatives instead.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

liralon · 2024-06-28T14:58:12Z

@ryanhankins Can you please add some information to the commit message about which platforms you have tested this functionality on?

The nccl_net_ofi_cu* calls map directly to CUDA methods. Instead of this mapping, insert indirection via nccl_net_ofi_gpu methods so that the implementation of the methods depends on CUDA, but the methods themselves can be called for different underlying frameworks (such as ROCm). Signed-off-by: Ryan Hankins <[email protected]>
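To illustrate the kind of indirection this commit message describes, here is a minimal C sketch. The names `example_gpu_get_device` and `stub_cuda_get_device` are hypothetical and are not taken from the plugin. The stub stands in for a real CUDA runtime call so that the sketch builds without any GPU toolkit installed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a backend-specific call. In a real build this
 * would be a CUDA runtime call; it is stubbed so the sketch compiles anywhere.
 */
static int stub_cuda_get_device(int *dev)
{
    *dev = 0;   /* pretend device 0 is the current device */
    return 0;   /* 0 signals success in this sketch */
}

/*
 * Framework-neutral wrapper (illustrative name). Callers use only this
 * function, so the backend behind it can be replaced without touching them.
 */
int example_gpu_get_device(int *dev)
{
    assert(dev != NULL);
    return stub_cuda_get_device(dev);
}
```

With this shape, supporting another framework would only require a different body behind the wrapper, while call sites stay unchanged.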

ROCm provides an interface similar to CUDA, to work with AMD GPUs. Provide a compile time option to build with ROCm instead of CUDA.

1. Add --with-rocm= flag to ./configure.
2. Make all CUDA calls "gpu" calls, which are independent of the underlying framework.
3. Switch between _rocm and _cuda files at compile time to make the appropriate calls.
4. When building for RCCL (AMD's NCCL), generate a rccl-net.so-named plugin for binary compatibility.

Tested on:

1. HPE Cray EX with EX235A BardPeak GPUs + 200 Gb Slingshot adapters.
2. HPE Cray EX with NVIDIA A100 SXM4 80GB GPUs + 200 Gb Slingshot adapters.

Signed-off-by: Ryan Hankins <[email protected]>
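The compile-time selection described in this commit message might look roughly like the following C sketch. Note that the commit switches between separate _rocm and _cuda source files at build time; this sketch substitutes a single preprocessor conditional so that the example fits in one file. The macro `EXAMPLE_HAVE_ROCM` and every identifier below are assumptions made for illustration, not names from the plugin or its configure script.

```c
#include <assert.h>

/* Hypothetical backend identifiers; not part of the plugin's API. */
enum example_gpu_backend {
    EXAMPLE_GPU_BACKEND_CUDA,
    EXAMPLE_GPU_BACKEND_ROCM
};

/*
 * EXAMPLE_HAVE_ROCM is an assumed macro that a configure script could
 * define when --with-rocm is given; the real macro name may differ.
 */
enum example_gpu_backend example_gpu_backend_selected(void)
{
#if defined(EXAMPLE_HAVE_ROCM)
    return EXAMPLE_GPU_BACKEND_ROCM;
#else
    return EXAMPLE_GPU_BACKEND_CUDA;
#endif
}

/* Map a backend identifier to a short human-readable name. */
const char *example_gpu_backend_name(enum example_gpu_backend backend)
{
    switch (backend) {
    case EXAMPLE_GPU_BACKEND_CUDA:
        return "cuda";
    case EXAMPLE_GPU_BACKEND_ROCM:
        return "rocm";
    }
    assert(0 && "unknown backend");
    return "unknown";
}
```

Compiling with `-DEXAMPLE_HAVE_ROCM` would select the ROCm branch; without it, the CUDA branch is used.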

ryanhankins changed the title ~~Merge6~~ Add ROCm as alternative to CUDA for plugin use. Jun 27, 2024

ryanhankins changed the title ~~Add ROCm as alternative to CUDA for plugin use.~~ Add ROCm as alternative to CUDA for plugin use Jun 27, 2024

ryanhankins force-pushed the merge6 branch 8 times, most recently from 98282b4 to 064fb2c Compare June 27, 2024 18:32

ryanhankins marked this pull request as ready for review June 28, 2024 11:19

ryanhankins requested review from bwbarrett and a team as code owners June 28, 2024 11:19

ryanhankins force-pushed the merge6 branch from 064fb2c to 9ef3a44 Compare June 28, 2024 11:58

ryanhankins force-pushed the merge6 branch from 9ef3a44 to 33b0ed8 Compare June 28, 2024 17:06

ryanhankins force-pushed the merge6 branch from 33b0ed8 to 5d8ebee Compare August 14, 2024 20:02

ryanhankins force-pushed the merge6 branch from 5d8ebee to b1a22d5 Compare August 15, 2024 18:07
