Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport: prevent deadlock in transport Close when GoAway write hangs #7662

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

aranjans
Copy link
Contributor

@aranjans aranjans commented Sep 23, 2024

Fixes #7606.

Couple of recent changes worth noting here:

  • Send GOAWAY to server on Client Transport Shutdown #7015 added logic to send a GOAWAY frame on client transport shutdown. Unfortunately the handler responsible for writing the GOAWAY frame was holding on to the client transport mutex when attempting to write the GOAWAY frame.
  • transport: add timeout for writing GOAWAY on http2Client.Close() #7371 introduced a timeout in the client transport shutdown handler to wait for loopyWriter to exit (after enqueueing the above GOAWAY frame on the controlbuf). This was done to ensure that the client transport shutdown can complete in the face of a hanging network connection that blocks forever when attempting to write the above GOAWAY frame

Description of the deadlock:

  • During client transport shutdown, after enqueueing the GOAWAY frame on the controlbuf, http2Client.Close calls http2Client.GetGoAwayReason to fetch the last GOAWAY's debug message, and the latter attempts to grab http2Client.mu.
  • http2Client.outgoingGoAwayHandler holds http2Client.mu when it is attempting to write the GOAWAY frame. So, if the underlying network connection is hanging, this method will not release the mutex, and therefore http2Client.GetGoAwayReason will not be able to grab the same mutex, and thereby http2Client.Close will deadlock.

RELEASE NOTES:

  • transport: prevent deadlock in client transport shutdown when writing the GOAWAY frame hangs.

@aranjans aranjans added the Area: Testing Includes tests and testing utilities that we have for unit and e2e tests within our repo. label Sep 23, 2024
@aranjans aranjans added this to the 1.68 Release milestone Sep 23, 2024
Copy link

codecov bot commented Sep 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.96%. Comparing base (1418e5e) to head (4eaf6ab).
Report is 12 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7662      +/-   ##
==========================================
+ Coverage   81.89%   81.96%   +0.06%     
==========================================
  Files         361      361              
  Lines       27818    27823       +5     
==========================================
+ Hits        22782    22805      +23     
+ Misses       3847     3833      -14     
+ Partials     1189     1185       -4     
Files with missing lines Coverage Δ
internal/transport/http2_client.go 92.12% <100.00%> (+0.50%) ⬆️

... and 28 files with indirect coverage changes

@aranjans aranjans changed the title transport/http2_client: fixed flaky test Test/ClientCloseReturnsEarlyWhenGoAwayWriteHangs transport/http2_client: fixed Test/ClientCloseReturnsEarlyWhenGoAwayWriteHangs Sep 23, 2024
@aranjans aranjans requested review from arvindbr8 and removed request for arvindbr8 September 24, 2024 16:59
@purnesh42H
Copy link
Contributor

Is there an existing issue filed which needs to be linked?

@aranjans
Copy link
Contributor Author

@purnesh42H yes, added it in description.

@purnesh42H purnesh42H self-assigned this Sep 30, 2024
@aranjans aranjans changed the title transport/http2_client: fixed Test/ClientCloseReturnsEarlyWhenGoAwayWriteHangs transport: fixed deadlock happening due to http2Client.mu Oct 1, 2024
@easwars
Copy link
Contributor

easwars commented Oct 1, 2024

Some general comments here:

  • Can you please use bullet points in the PR description to make it more readable. Also, use backticks wherever appropriate to highlight code symbols
  • The release note (and the PR title) needs to be more descriptive. Saying that a deadlock is happening due to http2Client.mu does not provide any useful information for the user.

@@ -1010,6 +1010,18 @@ func (t *http2Client) Close(err error) {
}
t.mu.Unlock()

// Append info about previous goaways if there were any, since this may be important
// for understanding the root cause for this connection to be closed.
_, goAwayDebugMessage := t.GetGoAwayReason()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetGoAwayReason is implemented as follows:

func (t *http2Client) GetGoAwayReason() (GoAwayReason, string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.goAwayReason, t.goAwayDebugMessage
}

This means that the instead of unlocking t.mu on line 1011, we can instead do the following:

// Don't unlock on line 1011
goAwayDebugMessage := t.goAwayDebugMessage
t.mu.Unlock()
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, we could do that.

Copy link
Contributor

@purnesh42H purnesh42H Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, may be I am getting confused with line number but t.controlBuf.put() aquire a lock as well so we need to unlock before that.

I think for the case of loopyWriter we need to make sure that we unlock mu in case of timeout as well which should be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@purnesh42H t.controlBuf.put() has a separate mu lock and is accessed with t.controlBuf.mu for control buffers resources, and here we are dealing with t.mu for htt2Client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but outgoingGoAwayHandler is acquiring lock on t.mu which is followed by t.controlBuf.put(). We need to just make sure we unlock the outgoingGoAwayHandler lock in case of timeout as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, please refer to this comment.
#7662 (comment)

// Append info about previous goaways if there were any, since this may be important
// for understanding the root cause for this connection to be closed.
_, goAwayDebugMessage := t.GetGoAwayReason()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation of http2Client.outgoingGoAwayHandler holds t.mu when calling WriteGoAway on the underlying framer, i.e. it is performing I/O when holding the lock. Is this the correct thing to do?

Can/should this be replaced with the following:

// OutgoingGoAwayHandler writes a GOAWAY to the connection.  Always returns
// (false, err) as we want the GoAway to be the last frame loopy writes to the
// transport.
func (t *http2Client) outgoingGoAwayHandler(g *goAway) (bool, error) {
	t.mu.Lock()
	nextID := t.nextID - 2
	t.mu.Unlock()
	if err := t.framer.fr.WriteGoAway(nextID, http2.ErrCodeNo, g.debugData); err != nil {
		return false, err
	}
	return false, g.closeConn
}

Copy link
Contributor Author

@aranjans aranjans Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that'd be a useful optimization and hence minimize acquiring the mutex. We could define maxStreamID as t.nextID - 2 and use it to write goAway frame as t.framer.fr.WriteGoAway(maxStreamID, http2.ErrCodeNo, g.debugData).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to t.mu.Unlock(), currently on line 1012, needs to be moved to be after this line because t.goAwayDebugMessage is guarded by that mutex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that'd be a useful optimization and hence minimize acquiring the mutex

I don't think this is a useful optimization. This is the root of the deadlock. I don't think we need to move code around in http2Client.Close at all if we don't hold the lock in outgoingGoAwayHandler when performing the actual write. Can you please confirm that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, yeah but we would need to release t.mu before we sent out goAway frame over to cbuf, bcz outgoingGoAwayHandler also needs to acquire t.mu to get the maxStreamID and we are waiting for t.writerDone with the timer. We can do something like

goAwayDebugMessage := t.goAwayDebugMessage
t.mu.Unlock()

// Per HTTP/2 spec, a GOAWAY frame must be sent before closing the
// connection. See https://httpwg.org/specs/rfc7540.html#GOAWAY. It
// also waits for loopyWriter to be closed with a timer to avoid the
// long blocking in case the connection is blackholed, i.e. TCP is
// just stuck.
t.controlBuf.put(&goAway{code: http2.ErrCodeNo, debugData: []byte("client transport shutdown"), closeConn: err})
timer := time.NewTimer(goAwayLoopyWriterTimeout)
defer timer.Stop()
select {
case <-t.writerDone: // success
case <-timer.C:
	t.logger.Infof("Failed to write a GOAWAY frame as part of connection close after %s. Giving up and closing the transport.", goAwayLoopyWriterTimeout)
}
t.cancel()
t.conn.Close()
channelz.RemoveEntry(t.channelz.ID)
var st *status.Status
if len(goAwayDebugMessage) > 0 {
	st = status.Newf(codes.Unavailable, "closing transport due to: %v, received prior goaway: %v", err, goAwayDebugMessage)
	err = st.Err()
} else {
	st = status.New(codes.Unavailable, err.Error())
}

WDYT?

@easwars easwars assigned aranjans and unassigned purnesh42H Oct 1, 2024
@aranjans aranjans changed the title transport: fixed deadlock happening due to http2Client.mu transport: prevent deadlock in transport Close when GoAway write hangs Oct 3, 2024
@aranjans aranjans assigned easwars and unassigned aranjans Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Testing Includes tests and testing utilities that we have for unit and e2e tests within our repo. Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: Test/ClientCloseReturnsEarlyWhenGoAwayWriteHangs
3 participants