Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fabtests/rdm_atomic, ubertest: move atomic verification to common code and use in functional test #10155

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

aingerson
Copy link
Contributor

Ubertest copies the software implementation of atomics from libfabric in order to add atomic operation data validation. This is done by saving the sent data and performing a local atomic in order to validate against the actual data calculated by the libfabric provider.
This patch set moves that implementation from ubertest to the common code for other tests (specifically the fi_rdm_atomic test) to leverage it.
Ubertest atomics with validation is not run in the CI by default but he rdm_atomic test is. Because of the lack of validation in the functional fabtest, an incorrect fetched buffer pointer was never caught. This patch set also enables running rdm_atomic with validation in the standard and short test sets in runfabtests.sh so we can catch this in the future.
During testing of these changes, it properly caught the previously missed bug and properly passes when it was fixed.

@aingerson aingerson requested a review from j-xiong July 8, 2024 20:53
Comment on lines 3733 to 3808
ret = ft_hmem_copy_from(opts.iface, opts.device, dev_host_comp,
cmp, count * datatype_to_size(type));
if (ret) {
FT_ERR("Failed to copy from atomic buffer\n");
return ret;
}

check_comp = dev_host_comp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is check_comp needed when atomic != FI_ATOMIC_COMPARE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think I left it because coverity gets really confused by the unused variables with the atomic cases. But I'm going to refactor this whole check which is looking really messy. I'll push an update when I figure out the other failures.

@j-xiong
Copy link
Contributor

j-xiong commented Jul 10, 2024

Looks like MS compiler doesn't like the complex types:

 dgram_pingpong.c
C:\projects\libfabric\fabtests\include\ofi_atomic.h(44): error C2061: syntax error: identifier 'ofi_complex_float' [C:\projects\libfabric\fabtests\fabtests.vcxproj]
C:\projects\libfabric\fabtests\include\ofi_atomic.h(44): error C2059: syntax error: ';' [C:\projects\libfabric\fabtests\fabtests.vcxproj]
C:\projects\libfabric\fabtests\include\ofi_atomic.h(45): error C2061: syntax error: identifier 'ofi_complex_double' [C:\projects\libfabric\fabtests\fabtests.vcxproj]
C:\projects\libfabric\fabtests\include\ofi_atomic.h(45): error C2059: syntax error: ';' [C:\projects\libfabric\fabtests\fabtests.vcxproj]
C:\projects\libfabric\fabtests\include\ofi_atomic.h(46): error C2061: syntax error: identifier 'ofi_complex_long_double' [C:\projects\libfabric\fabtests\fabtests.vcxproj]
C:\projects\libfabric\fabtests\include\ofi_atomic.h(46): error C2059: syntax error: ';' 

OFI_UNUSED(src); \
for (i = 0; i < cnt; i++) \
r[i] = d[i]; \
memcpy(res, dst, sizeof(type) * cnt); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use ofi_complex_##type here.

ofi_complex_##type *r = (res); \
for (i = 0; i < cnt; i++) { \
r[i] = d[i]; \
memcpy(res, dst, sizeof(type) * cnt); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ofi_complex_##type

ofi_complex_##type *r = (res); \
for (i = 0; i < cnt; i++) { \
r[i] = d[i]; \
memcpy(res, dst, sizeof(type) * cnt); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

This allows fabtests to make use of atomic validation code

There were many Windows atomics bugs, inconsistencies, and missing
definitions. This patch also cleans up the entire ofi_atomic.c
implementation for unix and windows

The following changes are included:
- Replace reads with memcpy as setting complexes on windows is not allowed
- Remove CHECK_LOCAL and do memcmp instead to reduce code and calls
- Remove duplicated ofi_complex definitions in ofi_atomic (already in osd.h file)
- Add general check_atomic and fill_atomic calls and use them in ubertest
- Replace FT_FILL set with memcpy to be compatible with windows complex types
- Add EXPAND ( x) x define to work nicely with windows VA_ARGS handling
- Fix inconsistency with ofi_complex_type/or naming ('complex' always should come first)
- Fix inconsistency with op names "equ" and "mul" -> "eq" and "prod"
- Add missing lxor complex op definitions on Windows

Signed-off-by: Alexia Ingerson <[email protected]>
To properly validate atomic data, we need host bounce buffers for the
result and compare buffers in addition to the regular bounce buffer for
the tx/rx bufs.
This adds two extra bufs allocated only for atomic purposes and adds hmem
support to the common atomic validation path.
It also renames the alloc/free_tx_buf calls to generic alloc/free_host_bufs
which allocates all three buffers at once.

Signed-off-by: Alexia Ingerson <[email protected]>
ft_post_atomic posted "buf" which is the base address for the entire
send and recv buffer allocation. The first half of the allocation is the
receive buffer and the second half is the send buffer. Posting just "buf"
meant it was sending the receive buffer.
This changes it to send the tx buf and do an atomic on the rx buf which allows
us to properly do atomic validation

Signed-off-by: Alexia Ingerson <[email protected]>
This allows us to post the rx buf without corrupting memory in case its needed
for validation

Signed-off-by: Alexia Ingerson <[email protected]>
Match the behavior of memset() where the value passed in
is an int, but it is interpreted as a char.
While ZE can technically handle this scenario, others may not
so we need to standardize across ifaces

Signed-off-by: Alexia Ingerson <[email protected]>
Add data validation to the atomic test by using the newly added atomic fill and check
support imported from ubertest. This code uses a macro that switches on datatype for
filling and checking the buffer contents.

The atomic validation path requires an extra buffer to copy the contents of the original
atomic buffer in order to recreate the atomic function locally and check the buffer against
the simulated atomic operation.

This patch also refactors the entire test to remove the extremely confuctin macros used
for the base/fetch/compare operations. The macros made the code extremely difficult to
read and debug and also made it difficult to add data validation. Separating it into
three explicit functions is about the same amount of code and significantly more readable

Synchronization messages are added in the validation case to ensure the atomic operation
completed on both sides before validation occurs. This requires the addition of the
FI_ORDER_SAW and FI_ORDER_SAR message ordering to ensure that we get the completion for
the send/recv sync after the atomic message is processed

Signed-off-by: Alexia Ingerson <[email protected]>
Run fi_rdm_atomic with data validation in standard and short test suites

Signed-off-by: Alexia Ingerson <[email protected]>
Comparison of complex numbers is undefined and not a valid
combination of atomic ops. Disable in psm3

Signed-off-by: Alexia Ingerson <[email protected]>
Report atomic op errors back to the application.
Some datatype/op combinations were falsely being reported
to the application but failing when the atomic was being
performed. These failures were silently trested as successful
because the errors were not passed back. Check the error code
to catch future issues

Signed-off-by: Alexia Ingerson <[email protected]>
psm3 advertises support for logical ops (lor, land, lxor)
with all datatypes but the functions are only defined for
integer types. When the atomic op is called with a non-integer
type, it drops down to the default case and returns an error
(FI_ENOTSUPP)

Signed-off-by: Alexia Ingerson <[email protected]>
Atomic long double and complex long double reads and swaps
are failing data verification. Only part of the data is
getting copied. Disable in psm3 until root caused.

Signed-off-by: Alexia Ingerson <[email protected]>
@aingerson
Copy link
Contributor Author

@acgoldma This PR is adding atomic verification to the regular fabtests runs. psm3 was failing a lot of them so I've added a few patches to fix most of them. The last patch disables the remaining failing combinations instead of fixing them because I couldn't completely root cause - long double and complex long double reads/swaps are failing and I'm not sure why. Switching the sets to a memcpy of the datatype size also fixes the issue but it takes a performance hit (and also affects all other data types for those ops) so I opted to disable instead of adding that fix. I did this by having all the reads and compares go through the PSM_CUDA | PSM_ONEAPI implementation which does the psmx3_memcpy (which internally switches to a memcpy for the same ifdef). Overall, something is off when doing r[i] = d[i] with those two datatypes but I'm not sure why. Let me know if you want me to take a different approach!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants