
gRPC Server Sends RST_STREAM without trailers when TCP Reassembly occurs #7623

DerekTBrown opened this issue Sep 11, 2024 · 10 comments

@DerekTBrown

What version of gRPC are you using?

v1.59.0

What version of Go are you using (go version)?

1.22.0

What operating system (Linux, Windows, …) and version?

alpine 3.18 AMD64

Error description

I have a grpc-js client that communicates via an Istio service mesh with a service implemented using the grpc-go backend. The client observes an error:

error_message_string: 13 INTERNAL: Received RST_STREAM with code 0 (Call ended without gRPC status)

Upon inspection, I see that the gRPC server is sending RST_STREAM mid-response:

[Screenshot (2024-09-11): packet capture showing the gRPC server sending RST_STREAM mid-response]

I would expect the gRPC server to send a HEADERS frame containing trailers indicating the cause of the failure.

A few additional observations:

  • Following the same TCP stream, I see numerous other requests succeeding before and after this method call. This generally leads me to rule out TCP-level issues.
  • This error happens almost exclusively with one particular gRPC method handler, so it's possible that there is something specific to the server behavior or the characteristics of this request (large request sizes).
  • All of the cases I have seen include TCP reassembly, so it's possible this is related.
@arjan-bal
Contributor

As the RST code is 0, it looks like the server successfully responded to the request. It seems strange that the gRPC Go server didn't write the HEADERS frame with the trailer metadata. This is the code that handles writing the trailers before sending RST_STREAM with code 0:

```go
success, err := t.controlBuf.execute(t.checkForHeaderListSize, trailingHeader)
if !success {
	if err != nil {
		return err
	}
	t.closeStream(s, true, http2.ErrCodeInternal, false)
	return ErrHeaderListSizeLimitViolation
}
// Send a RST_STREAM after the trailers if the client has not already half-closed.
rst := s.getState() == streamActive
t.finishStream(s, rst, http2.ErrCodeNo, trailingHeader, true)
```

I did find some existing issues related to the HTTP handler transport:

@DerekTBrown are you using an HTTP handler transport via ServeHTTP, or a regular one via Server.Serve?

@DerekTBrown
Author

@arjan-bal Thanks for the quick reply. This is using Server.Serve.
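
For context, a minimal sketch contrasting the two serving paths being asked about (listener addresses and certificate paths are placeholders, not taken from this issue):

```go
package main

import (
	"log"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

func main() {
	s := grpc.NewServer()
	// ... register services on s ...

	// Regular transport (what this report uses): gRPC owns the listener and
	// runs its own HTTP/2 framing internally.
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	go func() { log.Fatal(s.Serve(lis)) }()

	// Handler transport: *grpc.Server implements http.Handler, so it can be
	// mounted inside net/http; this goes through a different internal
	// transport and requires HTTP/2 to be configured (e.g. TLS).
	httpSrv := &http.Server{Addr: ":50052", Handler: s}
	log.Fatal(httpSrv.ListenAndServeTLS("server.crt", "server.key"))
}
```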

@arjan-bal
Contributor

arjan-bal commented Sep 18, 2024

There is another code path wherein an RST_STREAM with code 0 is sent when the server receives an RST_STREAM from the client first. This doesn't seem to be the case here.

I can't figure out a reason why gRPC Go would not send the trailer just by analysing the code.

@DerekTBrown can you please provide the gRPC Go server logs from the time this issue happens? See the docs for enabling logs. If you can provide a way to repro this issue, it would be even better.
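
For reference, a minimal sketch of one way to turn on those transport logs programmatically (equivalent to setting the GRPC_GO_LOG_SEVERITY_LEVEL and GRPC_GO_LOG_VERBOSITY_LEVEL environment variables; the writer choice and verbosity level here are illustrative):

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Route info/warning/error logs to stderr with verbosity 2 so that
	// [transport]-level events are emitted. Must run before any gRPC calls.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stderr, os.Stderr, os.Stderr, 2))
}
```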

@arjan-bal
Contributor

arjan-bal commented Sep 20, 2024

It may also be helpful to look at traffic for stream ID 0, which is used for connection-level communication. If a GOAWAY is received on stream 0, it can cause all other streams to terminate.

@DerekTBrown
Author

It may also be helpful to look at traffic for stream ID 0, which is used for connection-level communication. If a GOAWAY is received on stream 0, it can cause all other streams to terminate.

This doesn't appear to be happening. There aren't any messages on Stream 0 around this time:

[Screenshot (2024-09-20): packet capture showing no frames on stream 0 around the time of the failure]

@DerekTBrown
Author

@DerekTBrown can you please provide the gRPC Go server logs from the time this issue happens? See the docs for enabling logs. If you can provide a way to repro this issue, it would be even better.

To give you a sense of the cardinality of the issue, we see this ~8 times per day out of 3M requests to this particular service. This just happens to be a very stateful service where errors are difficult to manage. Given the request rate, it has been very hard to correlate the tcpdump with logs.

Looking at server-transport logs in particular, the most common errors I see are:

| message | cardinality |
|---------|-------------|
| [transport] [server-transport ] Closing: EOF | ~1k per day |
| [transport] [server-transport] loopyWriter exiting with error: transport closed by client | ~500 per day |
| [transport] [server-transport] loopyWriter exiting with error: connection error: desc = "transport is closing" | ~30 per day |
| [transport] [server-transport] loopyWriter exiting with error: second GOAWAY written and no active streams left to process | ~25 per day |
| [transport] [server-transport ] use of closed network connection | ~7 occurrences, but two lines per occurrence with different tracebacks |
| [transport] [server-transport ] loopyWriter exiting with error: finished processing active streams while in draining mode | ~7 per day |

Is there something in particular I should be looking for in the logs?

@arjan-bal
Contributor

arjan-bal commented Sep 25, 2024

I checked the code paths responsible for the logs in #7623 (comment), but can't find a place where gRPC Go sends RST_STREAM with code 0 without trailers.

I found this issue on the grpc-node repo with similar symptoms: grpc/grpc-node#2569
The Node issue points to this Envoy issue with a fix, which @DerekTBrown has already tried: envoyproxy/envoy#30149

The Envoy issue had the same symptoms as the one seen here, i.e. an RST_STREAM without a HEADERS frame.

At this point I don't know where to look next. Let me check with other maintainers.

@arjan-bal arjan-bal removed their assignment Sep 25, 2024
@easwars
Contributor

easwars commented Sep 26, 2024

@DerekTBrown : Are you seeing this behavior with a unary RPC or a streaming RPC? Thanks.

@DerekTBrown
Author

@DerekTBrown : Are you seeing this behavior with a unary RPC or a streaming RPC? Thanks.

Unary RPCs

@arjan-bal
Contributor

@DerekTBrown we don't have sufficient information to continue investigating.

Can you try to repro the issue without Envoy proxies between the client and server?
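
For anyone attempting that repro, a minimal sketch of a direct client-to-server unary-RPC loop with no proxy in the path (it uses the stock health service so no custom proto is needed; addresses and timing are illustrative, not taken from this issue):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Server: plain grpc.Server served via Server.Serve, as in the report.
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	healthpb.RegisterHealthServer(s, health.NewServer())
	go func() { log.Fatal(s.Serve(lis)) }()

	// Client: hammer a unary RPC directly (no Envoy/Istio in between) and
	// log any status that is not OK.
	conn, err := grpc.Dial(lis.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)
	for {
		if _, err := client.Check(context.Background(), &healthpb.HealthCheckRequest{}); err != nil {
			log.Printf("unary RPC failed: %v", err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```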
