transport: prevent deadlock in transport Close when GoAway write hangs #7662
…WhenGoAwayWriteHangs
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff             @@
##           master    #7662      +/-   ##
==========================================
+ Coverage   81.89%   81.96%   +0.06%
==========================================
  Files         361      361
  Lines       27818    27823       +5
==========================================
+ Hits        22782    22805      +23
+ Misses       3847     3833      -14
+ Partials     1189     1185       -4
Is there an existing issue filed which needs to be linked?
@purnesh42H yes, added it in the description.
Some general comments here:
internal/transport/http2_client.go
@@ -1010,6 +1010,18 @@ func (t *http2Client) Close(err error) {
	}
	t.mu.Unlock()

	// Append info about previous goaways if there were any, since this may be important
	// for understanding the root cause for this connection to be closed.
	_, goAwayDebugMessage := t.GetGoAwayReason()
GetGoAwayReason is implemented as follows:

	func (t *http2Client) GetGoAwayReason() (GoAwayReason, string) {
		t.mu.Lock()
		defer t.mu.Unlock()
		return t.goAwayReason, t.goAwayDebugMessage
	}

This means that instead of unlocking t.mu on line 1011, we can do the following:

	// Don't unlock on line 1011
	goAwayDebugMessage := t.goAwayDebugMessage
	t.mu.Unlock()
	...
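A minimal sketch of the suggestion above, assuming a stripped-down stand-in for the transport. The `transport` type, its `closed` field, and `closeSnapshot` are hypothetical names used only for illustration; they are not the actual grpc-go code. The point is that a field guarded by a mutex can be copied inside the critical section that is already open, rather than through an accessor that takes the same non-reentrant lock a second time.

```go
package main

import (
	"fmt"
	"sync"
)

// transport is a hypothetical stand-in for http2Client; only the fields
// needed for the illustration are modeled.
type transport struct {
	mu                 sync.Mutex
	closed             bool   // guarded by mu
	goAwayDebugMessage string // guarded by mu
}

// GoAwayDebugMessage is an accessor that takes mu itself. It must not be
// called while mu is already held, because sync.Mutex is not reentrant.
func (t *transport) GoAwayDebugMessage() string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.goAwayDebugMessage
}

// closeSnapshot marks the transport closed and copies the debug message in
// the same critical section, so no second lock acquisition is needed.
// It reports false if the transport was already closed.
func (t *transport) closeSnapshot() (string, bool) {
	t.mu.Lock()
	if t.closed {
		t.mu.Unlock()
		return "", false
	}
	t.closed = true
	msg := t.goAwayDebugMessage // read while mu is still held
	t.mu.Unlock()
	return msg, true
}

func main() {
	tr := &transport{goAwayDebugMessage: "server draining"}
	msg, ok := tr.closeSnapshot()
	fmt.Println(msg, ok)
	_, ok = tr.closeSnapshot()
	fmt.Println(ok)
}
```

Copying the field before `Unlock` avoids one lock round trip, but on its own it does not help if some other goroutine holds the mutex for a long time.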
You're right, we could do that.
Wait, maybe I am getting confused by the line numbers, but t.controlBuf.put() acquires a lock as well, so we need to unlock before that.
For the loopyWriter case, I think we need to make sure that we unlock mu on timeout as well, which should be enough?
@purnesh42H t.controlBuf.put() takes a separate lock, t.controlBuf.mu, which guards the control buffer's resources; here we are dealing with t.mu, which belongs to http2Client.
But outgoingGoAwayHandler acquires the lock on t.mu, which is followed by t.controlBuf.put(). We just need to make sure that the lock taken in outgoingGoAwayHandler is released in case of a timeout as well.
Exactly, please refer to this comment: #7662 (comment)
internal/transport/http2_client.go
	// Append info about previous goaways if there were any, since this may be important
	// for understanding the root cause for this connection to be closed.
	_, goAwayDebugMessage := t.GetGoAwayReason()
The current implementation of http2Client.outgoingGoAwayHandler holds t.mu when calling WriteGoAway on the underlying framer, i.e. it is performing I/O when holding the lock. Is this the correct thing to do?
Can/should this be replaced with the following:

	// OutgoingGoAwayHandler writes a GOAWAY to the connection. Always returns
	// (false, err) as we want the GoAway to be the last frame loopy writes to the
	// transport.
	func (t *http2Client) outgoingGoAwayHandler(g *goAway) (bool, error) {
		t.mu.Lock()
		nextID := t.nextID - 2
		t.mu.Unlock()
		if err := t.framer.fr.WriteGoAway(nextID, http2.ErrCodeNo, g.debugData); err != nil {
			return false, err
		}
		return false, g.closeConn
	}
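To illustrate why the proposed shape helps, here is a small self-contained sketch. It assumes a fake writer that simulates a hung connection; `blockingWriter`, `client`, and `lockAvailableDuringWrite` are hypothetical names and not part of grpc-go. The state needed for the write is copied under the mutex, the mutex is released, and only then does the blocking write start, so other goroutines can still take the mutex while the write hangs.

```go
package main

import (
	"fmt"
	"sync"
)

// blockingWriter is a hypothetical stand-in for a framer on a hung
// connection: WriteGoAway does not return until release is closed.
type blockingWriter struct {
	started chan struct{}
	release chan struct{}
}

func (w *blockingWriter) WriteGoAway(lastStreamID uint32) error {
	close(w.started) // signal that the write is in progress
	<-w.release      // simulate a write that hangs
	return nil
}

type client struct {
	mu     sync.Mutex
	nextID uint32 // guarded by mu
	w      *blockingWriter
}

// writeGoAway copies nextID under mu, releases mu, and only then performs
// the potentially blocking write.
func (c *client) writeGoAway() error {
	c.mu.Lock()
	lastStreamID := c.nextID - 2
	c.mu.Unlock()
	return c.w.WriteGoAway(lastStreamID)
}

// lockAvailableDuringWrite starts a write that hangs and reports whether mu
// can still be acquired while the write is blocked.
func lockAvailableDuringWrite() bool {
	c := &client{
		nextID: 5,
		w: &blockingWriter{
			started: make(chan struct{}),
			release: make(chan struct{}),
		},
	}
	done := make(chan error, 1)
	go func() { done <- c.writeGoAway() }()

	<-c.w.started // the write has started and is now blocked
	ok := c.mu.TryLock()
	if ok {
		c.mu.Unlock()
	}

	close(c.w.release) // let the simulated write finish
	<-done
	return ok
}

func main() {
	fmt.Println("mutex available while write hangs:", lockAvailableDuringWrite())
}
```

`TryLock` is used here only to probe the mutex from the demonstration; it is not being suggested for the transport code itself.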
Yeah that'd be a useful optimization and hence minimize acquiring the mutex. We could define maxStreamID as t.nextID - 2 and use it to write the GOAWAY frame as t.framer.fr.WriteGoAway(maxStreamID, http2.ErrCodeNo, g.debugData).
The call to t.mu.Unlock(), currently on line 1012, needs to be moved to after this line because t.goAwayDebugMessage is guarded by that mutex.
"Yeah that'd be a useful optimization and hence minimize acquiring the mutex"
I don't think this is just a useful optimization; this is the root of the deadlock. I don't think we need to move code around in http2Client.Close at all if we don't hold the lock in outgoingGoAwayHandler when performing the actual write. Can you please confirm that?
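The claim that holding the lock across the write is the root of the deadlock can be illustrated with a rough sketch. It assumes a simulated hung write; `client`, `writeHoldingLock`, `goAwayReason`, `readerBlocked`, and `probeWait` are hypothetical names, not grpc-go code. A goroutine holds the mutex for the whole duration of a write that never completes, and a second goroutine that only wants to read a guarded field cannot make progress until the write is released.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// probeWait is an arbitrary, illustrative wait used to observe blocking.
const probeWait = 50 * time.Millisecond

type client struct {
	mu          sync.Mutex
	goAwayDebug string // guarded by mu
}

// writeHoldingLock mimics the problematic shape: mu is held for the whole
// duration of a write that blocks until release is closed.
func (c *client) writeHoldingLock(started chan<- struct{}, release <-chan struct{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	close(started)
	<-release // simulated hung network write
}

// goAwayReason reads the guarded field and therefore needs mu.
func (c *client) goAwayReason() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.goAwayDebug
}

// readerBlocked reports whether goAwayReason fails to return within wait
// while the simulated write is hanging with mu held.
func readerBlocked(wait time.Duration) bool {
	c := &client{goAwayDebug: "prior goaway"}
	started := make(chan struct{})
	release := make(chan struct{})
	writerDone := make(chan struct{})
	go func() {
		c.writeHoldingLock(started, release)
		close(writerDone)
	}()
	<-started // the writer now holds mu and is blocked

	got := make(chan string, 1)
	go func() { got <- c.goAwayReason() }()

	blocked := false
	select {
	case <-got:
	case <-time.After(wait):
		blocked = true
	}

	close(release) // unblock the writer so both goroutines can exit
	<-writerDone
	if blocked {
		<-got // the reader completes once mu is released
	}
	return blocked
}

func main() {
	fmt.Println("reader blocked while write hangs:", readerBlocked(probeWait))
}
```

In the sketch the write is eventually released so the program can exit; on a connection that never unblocks, the reader would wait forever.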
Interesting. Yeah, but we would need to release t.mu before we send the GOAWAY frame over to the controlbuf, because outgoingGoAwayHandler also needs to acquire t.mu to get the maxStreamID, and we are waiting for t.writerDone with the timer. We can do something like:

	goAwayDebugMessage := t.goAwayDebugMessage
	t.mu.Unlock()
	// Per HTTP/2 spec, a GOAWAY frame must be sent before closing the
	// connection. See https://httpwg.org/specs/rfc7540.html#GOAWAY. It
	// also waits for loopyWriter to be closed with a timer to avoid the
	// long blocking in case the connection is blackholed, i.e. TCP is
	// just stuck.
	t.controlBuf.put(&goAway{code: http2.ErrCodeNo, debugData: []byte("client transport shutdown"), closeConn: err})
	timer := time.NewTimer(goAwayLoopyWriterTimeout)
	defer timer.Stop()
	select {
	case <-t.writerDone: // success
	case <-timer.C:
		t.logger.Infof("Failed to write a GOAWAY frame as part of connection close after %s. Giving up and closing the transport.", goAwayLoopyWriterTimeout)
	}
	t.cancel()
	t.conn.Close()
	channelz.RemoveEntry(t.channelz.ID)
	var st *status.Status
	if len(goAwayDebugMessage) > 0 {
		st = status.Newf(codes.Unavailable, "closing transport due to: %v, received prior goaway: %v", err, goAwayDebugMessage)
		err = st.Err()
	} else {
		st = status.New(codes.Unavailable, err.Error())
	}

WDYT?
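The timed wait in the snippet above can be isolated into a small runnable sketch. `waitForWriter` and `shortTimeout` are hypothetical names chosen for illustration, and the timeout values are arbitrary; the real code uses goAwayLoopyWriterTimeout and t.writerDone.

```go
package main

import (
	"fmt"
	"time"
)

// shortTimeout is an arbitrary, illustrative timeout for the hung case.
const shortTimeout = 10 * time.Millisecond

// waitForWriter waits for writerDone to be closed, giving up after timeout.
// It reports whether the writer finished before the timer fired.
func waitForWriter(writerDone <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-writerDone:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	// Writer that has already exited: the wait returns immediately.
	finished := make(chan struct{})
	close(finished)
	fmt.Println("writer exited in time:", waitForWriter(finished, time.Second))

	// Writer that never exits, e.g. because the connection is blackholed:
	// the wait gives up after the timeout instead of blocking forever.
	hung := make(chan struct{})
	fmt.Println("writer exited in time:", waitForWriter(hung, shortTimeout))
}
```

The timeout only bounds how long the caller waits; it does not release anything the writer goroutine may still be holding, which is why the lock handling discussed in this thread still matters.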
Fixes #7606.
Couple of recent changes worth noting here:
[…] loopyWriter to exit (after enqueueing the above GOAWAY frame on the controlbuf). This was done to ensure that the client transport shutdown can complete in the face of a hanging network connection that blocks forever when attempting to write the above GOAWAY frame.

Description of the deadlock:
[…] controlbuf, http2Client.Close calls http2Client.GetGoAwayReason to fetch the last GOAWAY's debug message, and the latter attempts to grab http2Client.mu.
http2Client.outgoingGoAwayHandler holds http2Client.mu when it is attempting to write the GOAWAY frame. So, if the underlying network connection is hanging, this method will not release the mutex, and therefore http2Client.GetGoAwayReason will not be able to grab the same mutex, and thereby http2Client.Close will deadlock.

RELEASE NOTES: