
[Support]: sudden disappearance of user-space TCP streams #286

Closed · 3 tasks done
xguerin opened this issue Nov 14, 2023 · 12 comments
Labels: DPDK driver · FreeBSD driver · support (Ask a question or request support) · triage (Determine the priority and severity)

xguerin commented Nov 14, 2023

Preliminary Actions

Driver Type

DPDK PMD for Elastic Network Adapter (ENA)

Driver Tag/Commit

librte-net-ena23/mantic,now 22.11.3-1

Custom Code

No

OS Platform and Distribution

Linux 6.5.0-1009-aws #9-Ubuntu

Support request

This is probably a long shot, so I apologize in advance for the seemingly broad scope of my question.

I use the ENA PMD in conjunction with a user-space TCP/IP stack. Under heavy load (200+ connections), the device simply stops receiving packets for certain connections. By that I mean literally no packets: no retransmissions, no RST, nothing. Other active connections work just fine. Of course, a similar setup running on kernel sockets does not show this behavior. All the connections are long-lived and rarely reset.

I analyzed packet dumps collected off the device's DPDK interface (tx_burst and rx_burst), and the affected streams look kosher: ACKs happen on time (either quick or delayed) and the window is properly advertised.

The queue is configured with its maximum RX and TX descriptor counts (I'm running on a c6i, so 1024 TX and 2048 RX) and the mempools are on average underutilized (so no buffer overrun). Driver stats don't report any errors either. The igb_uio driver is loaded with wc_activate=1.
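
For reference, a minimal sketch (assumed function and variable names, not the reporter's actual code) of queue setup with the descriptor counts described above:

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Hypothetical illustration of the sizing described above: 2048 RX and
 * 1024 TX descriptors. 'pool' is assumed to be a pre-created mbuf pool. */
static int setup_queues(uint16_t port_id, uint16_t queue_id,
                        struct rte_mempool *pool)
{
    int rc = rte_eth_rx_queue_setup(port_id, queue_id, 2048,
                                    rte_eth_dev_socket_id(port_id),
                                    NULL, pool);
    if (rc != 0)
        return rc;
    return rte_eth_tx_queue_setup(port_id, queue_id, 1024,
                                  rte_eth_dev_socket_id(port_id), NULL);
}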

Packets for each connection are routed using their Toeplitz hash in the device's RETA. The read logic looks like this:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Try to drain the entire RX ring (2048 descriptors) in a single burst. */
struct rte_mbuf* mbufs[2048];
auto nbrx = rte_eth_rx_burst(m_portid, m_queueid, mbufs, 2048);

for (auto i = 0; i < nbrx; i += 1) {
  auto* buf = mbufs[i];
  /* do something */
  rte_pktmbuf_free(buf);
}

All IRQs are bound to CPU0 and the application is running on CPUs 1-3, using isolcpus. The instance is configured with 1 HW thread. There is virtually no starvation.

My question, basically, is: is there a chance, even a minute one, that such packet "disappearance" could have a low-level root cause, either in the HW or the PMD? Or could it be a misconfiguration or misuse of the PMD?

Thanks,

xguerin added the support (Ask a question or request support) and triage (Determine the priority and severity) labels on Nov 14, 2023
shaibran (Contributor) commented

Hi xguerin, and thanks for choosing EC2!

  1. Let's first rule out any issue on the EC2 side. Can you please share the instance ID and the region where you launched the instance, the ENI ID on which you ran the traffic, and the timeframe when the suspected packet loss happened? I can check for any HW-related issues and look at the device stats.
  2. The 6th-gen instance types (such as the c6i you are using) had a small gap in their stats regarding packet drop errors, which we closed in our latest driver v2.8.0, released to DPDK 23.11. We added a new xstat, 'rx_overrun', that increases when a packet arrives but there are not enough free buffers in the RX ring to receive it. This usually indicates that the application did not refill the ring, or did not refill it fast enough. I can prepare a small patch on top of your DPDK version; however, if you share the details from item 1 above, I believe I will be able to see it there too. (A way to read this xstat is sketched after this list.)
  3. You can always enable driver logs for the Rx flow and check for any issues; see the instructions here.
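
For what it's worth, a minimal sketch (assumed helper name; requires DPDK 23.11 with ENA PMD v2.8.0, or the backport mentioned above) of polling that xstat by name:

#include <rte_ethdev.h>

/* Hedged sketch: read the ENA 'rx_overrun' xstat by name. Returns 0 if the
 * stat is unavailable (e.g. on an older PMD without the new xstat). */
static uint64_t read_rx_overrun(uint16_t port_id)
{
    uint64_t id;
    uint64_t value = 0;

    if (rte_eth_xstats_get_id_by_name(port_id, "rx_overrun", &id) == 0)
        rte_eth_xstats_get_by_id(port_id, &id, &value, 1);
    return value;
}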

best regards,
Shai

xguerin (Author) commented Nov 14, 2023

Thanks @shaibran for your response! Here are the pieces of information requested:

  1. i-036848c32cab669e5
  2. eu-west-3
  3. eni-02a12366ae4d9bc1d

Some timestamps at which we lost activity (times in EST):

  • 2023-11-14T10:48:04.542407
  • 2023-11-14T10:48:29.723818
  • 2023-11-14T10:49:52.701375
  • 2023-11-14T10:50:13.436829
  • 2023-11-14T11:12:58.243238

I'll edit that comment with some other occurrences as they happen.

xguerin (Author) commented Nov 14, 2023

Could reprogramming the RETA dynamically explain this behavior? While the connections are live, I don't see any spurious loss of connectivity. But then, every once in a while, I recycle a subset of these connections, and all hell breaks loose. Edit: no material difference.
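
For context, a runtime RETA update of the kind described here would typically go through the generic ethdev API. A hedged sketch (assumed names; the 128-entry table size is an assumption, query the real size with rte_eth_dev_info_get()):

#include <string.h>
#include <rte_ethdev.h>

#define RETA_SIZE 128 /* assumed table size */

/* Hedged sketch: point a single RETA entry at a new queue at runtime. */
static int move_reta_entry(uint16_t port_id, uint16_t reta_idx, uint16_t queue)
{
    struct rte_eth_rss_reta_entry64 conf[RETA_SIZE / RTE_ETH_RETA_GROUP_SIZE];
    unsigned group = reta_idx / RTE_ETH_RETA_GROUP_SIZE;
    unsigned slot = reta_idx % RTE_ETH_RETA_GROUP_SIZE;

    memset(conf, 0, sizeof(conf));
    conf[group].mask = 1ULL << slot; /* touch only this entry */
    conf[group].reta[slot] = queue;
    return rte_eth_dev_rss_reta_update(port_id, conf, RETA_SIZE);
}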

xguerin (Author) commented Nov 15, 2023

I set up mirroring to analyze the traffic from the VPC standpoint. For some reason, I could not use my c6i, so I configured a c5n instead. The problem has virtually disappeared: I am seeing an order of magnitude fewer disconnections. (No problem on c5 or c6in either.)

shaibran (Contributor) commented

Hi, an update on the check on the EC2 side:
I did not see any HW-related issues in our dashboard. The interface dashboard shows that no instance limiters are being hit, but there are indeed some drop errors (rx_overruns) that correlate with some of the timestamps you mentioned. These Rx overruns mean that the HW did not have enough free Rx buffers to fill and had to drop ingress packets. From the code snippet you shared above, it seems that you attempt to read the full ring size in a single burst, and that is what might have led to the drops.
Regarding the RETA update: the driver and HW both support runtime updates of the table, and I am not familiar with any issues in that area, so we will need to check on different HW generations to verify this behavior.
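
To make the burst-size point concrete, here is a hedged sketch (assumed names, not a confirmed fix) of draining a queue in small bursts, freeing mbufs as we go so the mempool, and hence the RX ring, can be replenished between calls:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32 /* small burst size; an assumption, tune to the workload */

/* Hedged sketch: drain the RX queue in small bursts instead of requesting
 * the full ring size at once. */
static void poll_queue(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *mbufs[BURST];
    uint16_t nbrx;

    do {
        nbrx = rte_eth_rx_burst(port_id, queue_id, mbufs, BURST);
        for (uint16_t i = 0; i < nbrx; i++) {
            /* process mbufs[i] ... */
            rte_pktmbuf_free(mbufs[i]);
        }
    } while (nbrx == BURST); /* keep draining while the queue stays full */
}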

xguerin (Author) commented Nov 19, 2023

Thanks @shaibran for the update.

I will keep an eye on those queues and update to 23.11 so I can track the overruns. However, even if the reader ended up dropping packets, that does not explain why the flow stopped abruptly, nor why retransmissions and keep-alive requests would get no response from the peer.

More interestingly, I can say with a high degree of confidence that this issue is happening only on c6i instances. The other instances I tested, namely c5, c5n, and c6in, are not affected.

shaibran (Contributor) commented

Thanks for the important input; we will look into this issue.

alexissavin commented Dec 20, 2023

Hello,

Following up on this, thanks to the unexpected link created with issue #9.

We have been facing a similar issue with netmap for quite a while now (on FreeBSD):

Jun 9 14:05:03 solid kernel: ena0: [nm] Bad buff in slot
Jun 9 14:05:03 solid syslogd: last message repeated 3503 times
Jun 9 14:05:03 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:03 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function

This randomly affects our customers' virtual appliances deployed on EC2.
It does not seem limited to specific instance types yet, but we have focused our internal investigation on m5.
This mostly occurs under high traffic load (DNS packets over UDP), but not only: some customers have faced this issue on netmap-enabled interfaces with TCP traffic not caught by netmap.

Surprisingly, using netmap in emulated mode does not help, which leads us to think the issue is with the ENA driver itself.

Kind regards

shaibran (Contributor) commented Dec 21, 2023

The DPDK and FreeBSD drivers are completely different, so it is unlikely that this is driver-related. I will have the FreeBSD developers look into this. The log you shared might indicate that the threshold for refilling the Rx ring should be modified; this becomes more visible when the application does not release processed ingress packets fast enough, so the device is left with no free buffers in the Rx ring.
You can do initial triage using the general stats (specifically, drops when the HW does not have sufficient vacancy in the Rx ring; the stat name varies between operating systems), the customer metrics, which let you see whether instance limiters were reached, and by enabling the logger. The README of the relevant driver covers all of these.

akiyano (Contributor) commented Dec 21, 2023

Hi @alexissavin,
I'll quote the documentation of the function that emits the "Triggering the refill function" messages you see:

/* For the rare case where the device runs out of Rx descriptors and the
 * msix handler failed to refill new Rx descriptors (due to a lack of memory
 * for example).
 * This case will lead to a deadlock:
 * The device won't send interrupts since all the new Rx packets will be dropped
 * The msix handler won't allocate new Rx descriptors so the device won't be
 * able to send new packets.
 *
 * When such a situation is detected - execute rx cleanup task in another thread
 */

Also, the "Bad buff in slot" print comes from the function that allocates buffers for the Rx ring in netmap mode, and it indicates that the allocation failed.

So the driver is failing to refill the RX ring, and thus it retries over and over again.

Could you please share reproduction steps with [email protected] so that we can investigate further?
Please share the instance type, AMI (if public), driver version, kernel version, how you generate the traffic, and any other setup parameters you use.

alexissavin commented

Hello @shaibran and @akiyano,

Many thanks for your feedback.

Unfortunately, I'm unable to share reproduction steps at the moment without providing you with our own software.

I'm trying to reproduce using pkt-gen, and I will let you know the results by email or in a separate issue, to avoid polluting this thread, which seems unrelated.

I just want to share that the latest version of the ENA driver (Elastic Network Adapter (ENA) v2.6.3) no longer seems to crash in emulated mode (admode=2). However, it still crashes rapidly in native mode (admode=1) at about 20,000 QPS. So I do believe the issue is within the driver, most likely in a section related to the netmap support.

Kind regards

shaibran (Contributor) commented Mar 7, 2024

Resolving this issue; feel free to reopen (on the specific ENA driver) if needed.

@shaibran shaibran closed this as completed Mar 7, 2024