
MR: Set minimal access permissions for each MR based on its usage #240

Draft · wants to merge 2 commits into master
Conversation

@liralon (Contributor) commented Aug 15, 2023

This PR is composed of 2 commits.
The 1st commit fixes a bug in the MR access permissions set for RMA reads. It is a minimal change to the functions that initialize the MR attributes, and it only fixes the issue.
The 2nd commit changes the codebase so that each MR user determines the specific MR access permissions it requires, rather than setting the MR permissions permissively enough for all possible use cases. This is less error prone (as the bug fixed in the 1st commit demonstrates) and provides some protection against mistakenly confusing MR handles, since a mix-up would then lead to an MR access-permission failure rather than potential memory corruption.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rauteric (Contributor) left a comment

LGTM, is this tested yet?

@liralon (Contributor, Author) commented Aug 15, 2023

I have verified that a PyTorch-FSDP workload and nccl-tests run successfully, both with the standard configuration and with FI_EFA_USE_DEVICE_RDMA=0, FI_LOG_LEVEL=Warn, and OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK=1.

I have also verified that when one of the MR permissions is wrong (e.g. removing FI_REMOTE_READ from the GPU flush target), we get an RMA verification abort from libfabric.

@rauteric (Contributor) left a comment

Thanks. We should probably also test the SENDRECV protocol before committing.

Host buffer access permissions were missing FI_REMOTE_READ, which is needed to allow the buffer to be used as an RMA read source buffer on GPU flush.

CUDA buffer access permissions were missing FI_READ, which is needed to allow the buffer to be used as an RMA read result buffer on an eager local copy.

Neuron buffer access permissions were set to FI_REMOTE_READ rather than FI_READ, which is needed to allow the buffer to be used as an RMA read result buffer on an eager local copy. In addition, FI_REMOTE_READ was unnecessary for the Neuron buffer, given it is used with the send/recv protocol.

Use this opportunity to also add documentation to the MR permissions
setting code.

It's worth mentioning that this issue was easily discovered by running
with the following configuration:
* FI_EFA_USE_DEVICE_RDMA=0 (disables libfabric's EFA RDMA support)
* OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK=1
  (forces the plugin to use RMA even when RDMA write is not supported by the HW)
* FI_LOG_LEVEL=Warn (sets libfabric's log level to warning)

Signed-off-by: Liran Alon <[email protected]>
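As a usage sketch, the reproduction configuration above can be combined into a single invocation. The benchmark binary and its path are assumptions (any nccl-tests benchmark built locally would do):

```shell
# Illustrative reproduction run; assumes nccl-tests is built and
# ./build/all_reduce_perf exists at this (hypothetical) path.
FI_EFA_USE_DEVICE_RDMA=0 \
OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK=1 \
FI_LOG_LEVEL=Warn \
mpirun -n 2 ./build/all_reduce_perf -b 8 -e 128M -f 2
```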
MR access permissions are currently set in the internal function that
initializes MR attributes for all registered MRs (set_mr_req_attr() for
the RDMA protocol and register_mr_buffers() for the SENDRECV protocol).
This function therefore has to consider every possible usage of
registered MRs in the codebase to set the access permissions properly.

First, this is error prone, because one may fail to consider some MR
usage (see the previous commit, which fixes such a bug). Second, setting
minimal access permissions for each MR based on its usage gives us some
protection against mistakenly mixing MR handles: in that case the result
is an MR access-permission violation rather than memory corruption, which
is easier to debug and root-cause.

Signed-off-by: Liran Alon <[email protected]>
@bwbarrett (Contributor)
Converting to draft until we sort out some internal testing failures.

@rajachan (Member) commented Apr 9, 2024

There have been sizable changes to the plugin since this PR was cut, so the PR would need to be reworked. It is still a good change to take in though. @liralon, can you refresh the PR? The CI issues have been largely resolved.

@a-szegel (Contributor) commented Jun 3, 2024

Please merge master.


5 participants