
PR #18052 caused runtime crashes for all multi-GPU MaxText training #18214

Open
gpupuck opened this issue Oct 11, 2024 · 6 comments
Labels: bug (Something isn't working), NVIDIA-GPU (XLA on Nvidia GPU)

@gpupuck commented Oct 11, 2024

The issue started with #18052

Error log:

Per train step:
 Total TFLOPs: 377.53 
 split as 86.02% learnable weight flops and 13.98% attention flops
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/maxtext/MaxText/train.py", line 776, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/opt/maxtext/MaxText/train.py", line 772, in main
    train_loop(config)
  File "/opt/maxtext/MaxText/train.py", line 666, in train_loop
    state, metrics = p_train_step(state, example_batch, nextrng)
RuntimeError: Invalid opaque object size

Minimal steps to reproduce on one node with at least 2 A100 (or H100) GPUs:

docker run -it --rm --gpus=all --shm-size=2g ghcr.io/nvidia/jax:maxtext-2024-10-10
test-maxtext.sh -b 4 --model-name=llama2-7b --attn-type=cudnn_flash_te --remat-policy=minimal_flash --steps=10 --fsdp=8 --output train_output -a "scan_layers=true max_target_length=4096 use_iota_embed=true logits_dot_in_fp32=false"

For the above command, we used a single node with 8 GPUs (fsdp set equal to the number of GPUs).

gpupuck added the bug (Something isn't working) and NVIDIA-GPU (XLA on Nvidia GPU) labels on Oct 11, 2024
@hawkinsp (Member)

I note that "Invalid opaque object size" is an error that comes from NVIDIA's TransformerEngine. Can you say a bit more about why you think this is an XLA bug?

@ezhulenev (Member)

This is probably because some of the frontend attributes got into the backend config and the descriptor size doesn't match (is it from this file: https://jax.readthedocs.io/en/latest/_downloads/6887b43f6c37e251530df2326372488f/kernel_helpers.h ?).

I recommend migrating to FFI ASAP, because it's more robust to errors like this one: it should be able to decode only the relevant attributes, or at least give you a better error message.

Can you compare the HLO module after optimization and see what's inside the backend configs for the TE custom calls?
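
For context, the size check behind this error follows the pattern shown in the kernel_helpers.h linked above: the custom call's opaque/backend_config string is reinterpreted as a fixed-size descriptor struct, and any mismatch in length trips the check. A minimal sketch of that pattern (simplified; TE's extensions.h is assumed to implement something equivalent):

#include <cstddef>
#include <stdexcept>
#include <string>

// Pack a trivially-copyable descriptor into the opaque string that is
// attached to the custom call as its backend config.
template <typename T>
std::string PackDescriptorAsString(const T& descriptor) {
  return std::string(reinterpret_cast<const char*>(&descriptor), sizeof(T));
}

// Unpack it on the other side. If anything other than the packed struct ends
// up in the opaque string, its length no longer equals sizeof(T) and this
// throws the "Invalid opaque object size" error seen in the traceback above.
template <typename T>
const T* UnpackDescriptor(const char* opaque, std::size_t opaque_len) {
  if (opaque_len != sizeof(T)) {
    throw std::runtime_error("Invalid opaque object size");
  }
  return reinterpret_cast<const T*>(opaque);
}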

@Cjkkkk (Contributor) commented Oct 16, 2024

I think it is coming from TE: https://github.com/NVIDIA/TransformerEngine/blob/161b1d98f80243c78ddecdadd15b010549e4e3d0/transformer_engine/jax/csrc/extensions.h#L49
The before- and after-optimization HLO looks like this; the printed backend config appears to be the same:
custom-call.1167 = (bf16[32,4096,32,128]{3,2,1,0}, bf16[32,4096,32,128]{3,2,1,0}, bf16[32,4096,32,128]{3,2,1,0}, bf16[0]{0}) custom-call(custom-call.1047, custom-call.1048, custom-call.1049, constant.918, Arg_6.902, /*index=5*/Arg_7.903, Arg_8.904, custom-call.1166, reshape.1076, reduce.1079, /*index=10*/constant.918, constant.918), custom_call_target="CustomSPMDPartitioning", api_version=API_VERSION_STATUS_RETURNING, backend_config="140455436907088"

custom-call.40.0 = (bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[0]{0}, u8[270532736]{0}) custom-call(input_concatenate_fusion.3, input_concatenate_fusion.2, bitcast.7249.0, constant_1634_0, bitcast.7261.0, /*index=5*/bitcast.7270.0, bitcast.7279.0, bitcast.7291.0, input_scatter_fusion.2, input_scatter_fusion.3, /*index=10*/constant_1634_0, constant_1634_0), custom_call_target="te_fused_attn_backward", operand_layout_constraints={bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[0]{0}, f32[4,32,4096,1]{3,2,1,0}, u32[2,4]{1,0}, bf16[4,4096,32,128]{3,2,1,0}, bf16[4,4096,32,128]{3,2,1,0}, s32[5]{0}, s32[5]{0}, bf16[0]{0}, bf16[0]{0}}, api_version=API_VERSION_STATUS_RETURNING, backend_config="140455436907088"

@ezhulenev (Member)

Can you patch TE to print a more helpful error message that includes the opaque object? It's a string, so it should be fine to print it as part of the error message.
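
One way such a patch might look, sketched against the unpack pattern shown earlier (UnpackDescriptorVerbose is a hypothetical name; the real check lives in the TE extensions.h linked above):

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical, more verbose variant of the size check: report the expected
// and actual sizes and echo the opaque bytes, so a bogus payload is visible
// directly in the Python traceback.
template <typename T>
const T* UnpackDescriptorVerbose(const char* opaque, std::size_t opaque_len) {
  if (opaque_len != sizeof(T)) {
    throw std::runtime_error(
        "Invalid opaque object size: expected " + std::to_string(sizeof(T)) +
        " bytes, got " + std::to_string(opaque_len) + " bytes; opaque=\"" +
        std::string(opaque, opaque_len) + "\"");
  }
  return reinterpret_cast<const T*>(opaque);
}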

@ezhulenev (Member)

backend_config="140455436907088" can't be a correct backend config? it's supposed to be CustomCallFusedAttnDescriptor which is many many fields
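
To make the size mismatch concrete: a descriptor with many fields packs to sizeof(struct) bytes, while the string "140455436907088" seen in the dump is only 15 bytes long, so the strict size check cannot pass. A hypothetical illustration (the field names are invented, not TE's actual CustomCallFusedAttnDescriptor):

#include <cstdint>
#include <iostream>
#include <string>

// Invented stand-in with a handful of fields; the real descriptor has more.
struct FusedAttnDescriptorSketch {
  std::int64_t batch, q_seqlen, kv_seqlen, num_heads, head_dim;
  double scaling_factor;
  float dropout_probability;
  std::int32_t mask_type;
  bool is_training;
};

int main() {
  // Prints something like: packed descriptor bytes: 64, observed opaque bytes: 15
  std::cout << "packed descriptor bytes: " << sizeof(FusedAttnDescriptorSketch)
            << ", observed opaque bytes: "
            << std::string("140455436907088").size() << "\n";
}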

copybara-service bot pushed a commit that referenced this issue Oct 16, 2024
Derived instruction might set its own backend config, and it's not safe to overwrite it with an original one.

Fix for: #18214

PiperOrigin-RevId: 686553267
copybara-service bot pushed a commit that referenced this issue Oct 16, 2024
Derived instruction might set its own backend config, and it's not safe to overwrite it with an original one.

Fix for: #18214

PiperOrigin-RevId: 686594662
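
The commit message above describes the fix: when XLA derives an instruction from an original one, a backend config that the derived instruction has already set should be kept rather than overwritten. A rough sketch of that guard (type and function names are stand-ins, not the actual XLA code):

#include <string>

// Stand-in for the relevant bits of an HLO instruction.
struct InstructionSketch {
  std::string metadata;
  std::string backend_config;
};

// Propagate metadata from the original instruction to the derived one, but
// keep the derived instruction's own backend config if it already set one
// (e.g. the TE custom call's packed descriptor) instead of clobbering it.
void SetupDerivedSketch(const InstructionSketch& original,
                        InstructionSketch& derived) {
  derived.metadata = original.metadata;
  if (derived.backend_config.empty()) {
    derived.backend_config = original.backend_config;
  }
}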
@nouiz (Contributor) commented Oct 16, 2024

Should be fixed by #18399 to my understanding.
