PR#18052 caused runtime crashes for all the MaxText training with multi-gpus #18214
Comments
I note that "Invalid opaque object size" is an error that comes from NVIDIA's TransformerEngine. Can you say a bit more about why you think this is an XLA bug?
This is probably because some of the frontend attributes got into the backend config and the descriptor size no longer matches (is it from this file: https://jax.readthedocs.io/en/latest/_downloads/6887b43f6c37e251530df2326372488f/kernel_helpers.h ?). I recommend migrating to FFI ASAP, because it is more robust to errors like this one: it should be able to decode only the relevant attributes, or at least give you a better error message. Can you compare the HLO module after optimization and see what is inside the backend configs for the TE custom calls?
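A rough sketch of how that comparison could be done: dump the HLO with `XLA_FLAGS="--xla_dump_to=<dir> --xla_dump_hlo_as_text"`, then scan the after-optimization text dumps for custom calls and print their backend configs. The script assumes that the textual HLO prints each custom call on one line with a `custom_call_target="..."` attribute and an optional `backend_config=...` attribute, that the dump file names contain `after_optimizations`, and that the TE targets start with `te_`. The helper names, the regexes, and the prefix are illustrative assumptions, not part of XLA or TE.

```python
import re
from pathlib import Path

# Assumed textual-HLO attribute shapes; adjust if your dump prints them differently.
TARGET_RE = re.compile(r'custom_call_target="([^"]+)"')
# backend_config may be printed as a quoted string or as a JSON-like {...} blob.
CONFIG_RE = re.compile(r'backend_config=("(?:[^"\\]|\\.)*"|\{.*\})')


def find_custom_call_configs(hlo_text, target_prefix="te_"):
    """Return (target, backend_config) pairs for custom calls matching the prefix.

    backend_config is None when the line carries no such attribute.
    """
    results = []
    for line in hlo_text.splitlines():
        target = TARGET_RE.search(line)
        if target is None or not target.group(1).startswith(target_prefix):
            continue
        config = CONFIG_RE.search(line)
        results.append((target.group(1), config.group(1) if config else None))
    return results


def main(dump_dir):
    # Assumption: optimized modules are dumped as *after_optimizations*.txt files.
    for path in sorted(Path(dump_dir).glob("*after_optimizations*.txt")):
        for target, config in find_custom_call_configs(path.read_text()):
            # Length of the printed text, not the decoded opaque size.
            printed_len = None if config is None else len(config)
            print(f"{path.name}: {target} printed_len={printed_len} config={config}")
```

Calling `main("<dir>")` on dumps taken before and after #18052 and diffing the output should show whether extra attributes ended up in the backend config of the TE custom calls.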
I think it is coming from TE: https://github.com/NVIDIA/TransformerEngine/blob/161b1d98f80243c78ddecdadd15b010549e4e3d0/transformer_engine/jax/csrc/extensions.h#L49
Can you patch TE to print a more helpful error message that includes the opaque object? It is a string, so it should be fine to print it as part of the error message.
Derived instruction might set its own backend config, and it's not safe to overwrite it with an original one. Fix for: #18214 PiperOrigin-RevId: 686553267
Derived instruction might set its own backend config, and it's not safe to overwrite it with an original one. Fix for: #18214 PiperOrigin-RevId: 686594662
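To illustrate the point in the commit message, here is a toy model of the difference between overwriting a derived instruction's backend config and preserving it. This is not XLA code: `Instr`, `propagate_config_unsafe`, and `propagate_config_safe` are hypothetical names, and whether the actual fix still copies the original config when the derived instruction has none is an assumption made only for this sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Instr:
    """Hypothetical stand-in for an HLO instruction."""
    name: str
    backend_config: Optional[str] = None


def propagate_config_unsafe(original, derived):
    # Always copies the original config, clobbering whatever the derived
    # instruction set for itself.
    return replace(derived, backend_config=original.backend_config)


def propagate_config_safe(original, derived):
    # Keeps the derived instruction's own config when it has one
    # (assumed behavior for this sketch).
    if derived.backend_config is not None:
        return derived
    return replace(derived, backend_config=original.backend_config)


original = Instr("fusion", backend_config="frontend-ish-attrs")
derived = Instr("te_custom_call", backend_config="te-opaque-descriptor")

clobbered = propagate_config_unsafe(original, derived)
preserved = propagate_config_safe(original, derived)
```

In this toy, `clobbered` carries a config the custom call's handler was never meant to decode, which is the kind of mismatch that could surface as a size check failure, while `preserved` keeps the descriptor intact.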
Should be fixed by #18399, to my understanding.
The issue started with #18052
Error log:
Minimal steps to reproduce on one node with at least 2 A100 (or H100) GPUs:
For the command above, we used a single node with 8 GPUs (fsdp is set equal to the number of GPUs).