
PR #18052 caused runtime crashes for all multi-GPU MaxText training #18214

Open
gpupuck opened this issue Oct 11, 2024 · 6 comments
Labels: bug (Something isn't working), NVIDIA-GPU (XLA on Nvidia GPU)

@gpupuck commented Oct 11, 2024

The issue started with #18052

Error log:

Per train step:
 Total TFLOPs: 377.53 
 split as 86.02% learnable weight flops and 13.98% attention flops
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/maxtext/MaxText/train.py", line 776, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/opt/maxtext/MaxText/train.py", line 772, in main
    train_loop(config)
  File "/opt/maxtext/MaxText/train.py", line 666, in train_loop
    state, metrics = p_train_step(state, example_batch, nextrng)
RuntimeError: Invalid opaque object size

Minimal steps to reproduce on one node with at least 2 A100 (or H100) GPUs:

docker run -it --rm --gpus=all --shm-size=2g ghcr.io/nvidia/jax:maxtext-2024-10-10
test-maxtext.sh -b 4 --model-name=llama2-7b --attn-type=cudnn_flash_te --remat-policy=minimal_flash --steps=10 --fsdp=8 --output train_output -a "scan_layers=true max_target_length=4096 use_iota_embed=true logits_dot_in_fp32=false"

For the above command, we used a single node with 8 GPUs (fsdp set equal to the number of GPUs).

gpupuck added the bug (Something isn't working) and NVIDIA-GPU (XLA on Nvidia GPU) labels on Oct 11, 2024
@hawkinsp (Member)

I note that "Invalid opaque object size" is an error that comes from NVIDIA's TransformerEngine. Can you say a bit more about why you think this is an XLA bug?

@ezhulenev (Member)

This is probably because some of the frontend attributes got into the backend config and the descriptor size doesn't match (is it from this file: https://jax.readthedocs.io/en/latest/_downloads/6887b43f6c37e251530df2326372488f/kernel_helpers.h ?).

I recommend migrating to FFI ASAP, because it's more robust to errors like this one: it should be able to decode only the relevant attributes, or at least give you a better error message.

Can you compare the HLO module after optimization and see what's inside the backend configs for the TE custom calls?
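
For context, the size check behind this error follows the pattern shown in the kernel_helpers.h linked above: the custom call's opaque/backend_config string is reinterpreted as a fixed-size descriptor struct, and any mismatch in length trips the check. A minimal sketch of that pattern (simplified; TE's extensions.h is assumed to implement something equivalent):

#include <cstddef>
#include <stdexcept>
#include <string>

// Pack a trivially-copyable descriptor into the opaque string that is
// attached to the custom call as its backend config.
template <typename T>
std::string PackDescriptorAsString(const T& descriptor) {
  return std::string(reinterpret_cast<const char*>(&descriptor), sizeof(T));
}

// Unpack it on the other side. If anything other than the packed struct ends
// up in the opaque string, its length no longer equals sizeof(T) and this
// throws the "Invalid opaque object size" error seen in the traceback above.
template <typename T>
const T* UnpackDescriptor(const char* opaque, std::size_t opaque_len) {
  if (opaque_len != sizeof(T)) {
    throw std::runtime_error("Invalid opaque object size");
  }
  return reinterpret_cast<const T*>(opaque);
}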

@Cjkkkk (Contributor) commented Oct 16, 2024

I think it is coming from TE: https://github.com/NVIDIA/TransformerEngine/blob/161b1d98f80243c78ddecdadd15b010549e4e3d0/transformer_engine/jax/csrc/extensions.h#L49
The before- and after-optimization HLO looks like this; the printed backend config appears to be the same:
custom-call.1167 = (bf16[32,4096,32,128]{3,2,1,0}, bf16[32,4096,32,128]{3,2,1,0}, bf16[32,4096,32,128]{3,2,1,0}, bf16[0]{0}) custom-call(custom-call.1047, custom-call.1048, custom-call.1049, constant.918, Arg_6.902, /*index=5*/Arg_7.903, Arg_8.904, custom-call.1166, reshape.1076, reduce.1079, /*index=10*/constant.918, constant.918), custom_call_target="CustomSPMDPartitioning", api_version=API_VERSION_STATUS_RETURNING, backend_config="140455436907088"

custom-call.40.0 = (bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[0]{0}, u8[270532736]{0}) custom-call(input_concatenate_fusion.3, input_concatenate_fusion.2, bitcast.7249.0, constant_1634_0, bitcast.7261.0, /*index=5*/bitcast.7270.0, bitcast.7279.0, bitcast.7291.0, input_scatter_fusion.2, input_scatter_fusion.3, /*index=10*/constant_1634_0, constant_1634_0), custom_call_target="te_fused_attn_backward", operand_layout_constraints={bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[0]{0}, f32[4,32,4096,1]{3,2,1,0}, u32[2,4]{1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, s32[5]{0}, s32[5]{0}, bf16[0]{0}, bf16[0]{0}}, api_version=API_VERSION_STATUS_RETURNING, backend_config="140455436907088"

@ezhulenev (Member)

Can you patch TE to print a more helpful error message that includes the opaque object? It's a string, so it should be fine to print it as part of the error message.
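
One way such a patch might look, sketched against the unpack pattern shown earlier (UnpackDescriptorVerbose is a hypothetical name; the real check lives in the TE extensions.h linked above):

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical, more verbose variant of the size check: report the expected
// and actual sizes and echo the opaque bytes, so a bogus payload is visible
// directly in the Python traceback.
template <typename T>
const T* UnpackDescriptorVerbose(const char* opaque, std::size_t opaque_len) {
  if (opaque_len != sizeof(T)) {
    throw std::runtime_error(
        "Invalid opaque object size: expected " + std::to_string(sizeof(T)) +
        " bytes, got " + std::to_string(opaque_len) + " bytes; opaque=\"" +
        std::string(opaque, opaque_len) + "\"");
  }
  return reinterpret_cast<const T*>(opaque);
}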

@ezhulenev (Member)

backend_config="140455436907088" can't be a correct backend config? it's supposed to be CustomCallFusedAttnDescriptor which is many many fields
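
To make the size mismatch concrete: a descriptor with many fields packs to sizeof(struct) bytes, while the string "140455436907088" seen in the dump is only 15 bytes long, so the strict size check cannot pass. A hypothetical illustration (the field names are invented, not TE's actual CustomCallFusedAttnDescriptor):

#include <cstdint>
#include <iostream>
#include <string>

// Invented stand-in with a handful of fields; the real descriptor has more.
struct FusedAttnDescriptorSketch {
  std::int64_t batch, q_seqlen, kv_seqlen, num_heads, head_dim;
  double scaling_factor;
  float dropout_probability;
  std::int32_t mask_type;
  bool is_training;
};

int main() {
  // Prints something like: packed descriptor bytes: 64, observed opaque bytes: 15
  std::cout << "packed descriptor bytes: " << sizeof(FusedAttnDescriptorSketch)
            << ", observed opaque bytes: "
            << std::string("140455436907088").size() << "\n";
}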

copybara-service bot pushed a commit that referenced this issue Oct 16, 2024
Derived instruction might set its own backend config, and it's not safe to overwrite it with an original one.

Fix for: #18214

PiperOrigin-RevId: 686553267
copybara-service bot pushed a commit that referenced this issue Oct 16, 2024
Derived instruction might set its own backend config, and it's not safe to overwrite it with an original one.

Fix for: #18214

PiperOrigin-RevId: 686594662
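
The commit message above describes the fix: when XLA derives an instruction from an original one, a backend config that the derived instruction has already set should be kept rather than overwritten. A rough sketch of that guard (type and function names are stand-ins, not the actual XLA code):

#include <string>

// Stand-in for the relevant bits of an HLO instruction.
struct InstructionSketch {
  std::string metadata;
  std::string backend_config;
};

// Propagate metadata from the original instruction to the derived one, but
// keep the derived instruction's own backend config if it already set one
// (e.g. the TE custom call's packed descriptor) instead of clobbering it.
void SetupDerivedSketch(const InstructionSketch& original,
                        InstructionSketch& derived) {
  derived.metadata = original.metadata;
  if (derived.backend_config.empty()) {
    derived.backend_config = original.backend_config;
  }
}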
@nouiz (Contributor) commented Oct 16, 2024

Should be fixed by #18399 to my understanding.
