Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() #47861

Merged
merged 10 commits into from
Oct 4, 2024

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Sep 30, 2024

Why are these changes needed?

The sequence of events that triggers this check failure is

  1. There is a transient network error and worker node's RayClientBidiReactor::OnDone is called with non-ok grpc status.
  2. cleanup_cb_ of RayClientBidiReactor is called which reconnects to the head node.
  3. The reconnection causes RaySyncerService::StartSync to be called.
  4. RaySyncer::Connect is called with a new RayServerBidiReactor.
  5. Because the head node still has the old RayServerBidiReactor from the worker node's previous connection, the check fails.

This PR fixes the issue by disconnecting the old RayServerBidiReactor first before adding a new one.

Related issue number

Closes #45639

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Sep 30, 2024
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
@jjyao jjyao marked this pull request as ready for review October 3, 2024 21:57
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG. one minor question

@@ -124,6 +125,77 @@ def check_task_pending(n=0):
wait_for_condition(lambda: check_task_pending(2))


head2 = gen_head_node(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test seems very useful. wonder if we should extend it to more e2e tests in some stress tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I'll use the same pattern for other tests. I'm also exploring chaos test tools for randomly injecting network errors.

// Disconnect exiting connection if there is any.
// This can happen when there is transient network error
// and the client reconnects.
syncer_.Disconnect(reactor->GetRemoteNodeID());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: why don't we call this in clean_cb?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syncer_.Disconnect triggers the call of cleanup_cb

@jjyao jjyao enabled auto-merge (squash) October 4, 2024 18:32
@jjyao jjyao merged commit 2532cca into ray-project:master Oct 4, 2024
6 checks passed
@jjyao jjyao deleted the jjyao/synnner branch October 4, 2024 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] GCS crashed with Check failed: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end()
2 participants