
[Support]: sudden disappearance of user-space TCP streams #286

Closed · 3 tasks done
xguerin opened this issue Nov 14, 2023 · 12 comments
Labels: DPDK driver · FreeBSD driver · support (Ask a question or request support) · triage (Determine the priority and severity)

xguerin commented Nov 14, 2023

Preliminary Actions

Driver Type

DPDK PMD for Elastic Network Adapter (ENA)

Driver Tag/Commit

librte-net-ena23/mantic,now 22.11.3-1

Custom Code

No

OS Platform and Distribution

Linux 6.5.0-1009-aws #9-Ubuntu

Support request

This is probably a long shot, so I apologize in advance for the seemingly broad scope of my question.

I use the ENA PMD in conjunction with a user-space TCP/IP stack. Under heavy load (200+ connections), the device simply stops receiving packets for certain connections. By that I mean literally no packets: no retransmissions, no RST, nothing. Other active connections work just fine. Of course, a similar setup running on kernel sockets does not show this behavior. All the connections are long-lived and rarely reset.

I analyzed packet dumps collected off the device's DPDK interface (tx_burst and rx_burst), and the affected streams look kosher: ACKs happen on time (either quick or delayed) and the window is properly advertised.

The queue is configured with its maximum RX and TX descriptor counts (I'm running on a c6i, so 1024 TX and 2048 RX) and the mempools are on average underutilized (so no buffer overrun). Driver stats don't report any errors either. The igb_uio driver is loaded with wc_activate=1.
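
For reference, a minimal sketch (assumed function and variable names, not the reporter's actual code) of queue setup with the descriptor counts described above:

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Hypothetical illustration of the sizing described above: 2048 RX and
 * 1024 TX descriptors. 'pool' is assumed to be a pre-created mbuf pool. */
static int setup_queues(uint16_t port_id, uint16_t queue_id,
                        struct rte_mempool *pool)
{
    int rc = rte_eth_rx_queue_setup(port_id, queue_id, 2048,
                                    rte_eth_dev_socket_id(port_id),
                                    NULL, pool);
    if (rc != 0)
        return rc;
    return rte_eth_tx_queue_setup(port_id, queue_id, 1024,
                                  rte_eth_dev_socket_id(port_id), NULL);
}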

Packets for each connection are routed using their Toeplitz hash in the device's RETA. The read logic looks like this:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Try to drain the entire RX ring (2048 descriptors) in a single burst. */
struct rte_mbuf* mbufs[2048];
auto nbrx = rte_eth_rx_burst(m_portid, m_queueid, mbufs, 2048);

for (auto i = 0; i < nbrx; i += 1) {
  auto* buf = mbufs[i];
  /* do something */
  rte_pktmbuf_free(buf);
}

All IRQs are bound to CPU0 and the application is running on CPUs 1-3, using isolcpus. The instance is configured with 1 HW thread. There is virtually no starvation.

My question, basically, is: is there a chance, even a minute one, that such packet "disappearance" could have a low-level root cause, either in the HW or the PMD? Or could it be a misconfiguration or misuse of the PMD?

Thanks,

xguerin added the support (Ask a question or request support) and triage (Determine the priority and severity) labels on Nov 14, 2023
shaibran (Contributor) commented

Hi xguerin, and thanks for choosing EC2!

  1. Let's first rule out any issue on the EC2 side. Can you please share the instance ID and the region where you launched the instance, the ENI ID on which you ran the traffic, and the timeframe when the suspected packet loss happened? I can check for any HW-related issues and look at the device stats.
  2. The 6th-gen instance types (such as the c6i you are using) had a small gap in their stats regarding packet drop errors, which we closed in our latest driver v2.8.0, released to DPDK 23.11. We added a new xstat, 'rx_overrun', that increases when a packet arrives but there are not enough free buffers in the RX ring to receive it. This usually indicates that the application did not refill the ring, or did not refill it fast enough. I can prepare a small patch on top of your DPDK version; however, if you share the details from item 1 above, I believe I will be able to see it there too. (A way to read this xstat is sketched after this list.)
  3. You can always enable driver logs for the Rx flow and check for any issues; see the instructions here.
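
For what it's worth, a minimal sketch (assumed helper name; requires DPDK 23.11 with ENA PMD v2.8.0, or the backport mentioned above) of polling that xstat by name:

#include <rte_ethdev.h>

/* Hedged sketch: read the ENA 'rx_overrun' xstat by name. Returns 0 if the
 * stat is unavailable (e.g. on an older PMD without the new xstat). */
static uint64_t read_rx_overrun(uint16_t port_id)
{
    uint64_t id;
    uint64_t value = 0;

    if (rte_eth_xstats_get_id_by_name(port_id, "rx_overrun", &id) == 0)
        rte_eth_xstats_get_by_id(port_id, &id, &value, 1);
    return value;
}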

best regards,
Shai

xguerin (Author) commented Nov 14, 2023

Thanks @shaibran for your response! Here are the pieces of information requested:

  1. i-036848c32cab669e5
  2. eu-west-3
  3. eni-02a12366ae4d9bc1d

Some timestamps at which we lost activity (times in EST):

  • 2023-11-14T10:48:04.542407
  • 2023-11-14T10:48:29.723818
  • 2023-11-14T10:49:52.701375
  • 2023-11-14T10:50:13.436829
  • 2023-11-14T11:12:58.243238

I'll edit that comment with some other occurrences as they happen.

xguerin (Author) commented Nov 14, 2023

Could reprogramming the RETA dynamically explain this behavior? While the connections are live, I don't see any spurious loss of connectivity. But then, every once in a while, I recycle a subset of these connections, and all hell breaks loose. Edit: no material difference.
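
For context, a runtime RETA update of the kind described here would typically go through the generic ethdev API. A hedged sketch (assumed names; the 128-entry table size is an assumption, query the real size with rte_eth_dev_info_get()):

#include <string.h>
#include <rte_ethdev.h>

#define RETA_SIZE 128 /* assumed table size */

/* Hedged sketch: point a single RETA entry at a new queue at runtime. */
static int move_reta_entry(uint16_t port_id, uint16_t reta_idx, uint16_t queue)
{
    struct rte_eth_rss_reta_entry64 conf[RETA_SIZE / RTE_ETH_RETA_GROUP_SIZE];
    unsigned group = reta_idx / RTE_ETH_RETA_GROUP_SIZE;
    unsigned slot = reta_idx % RTE_ETH_RETA_GROUP_SIZE;

    memset(conf, 0, sizeof(conf));
    conf[group].mask = 1ULL << slot; /* touch only this entry */
    conf[group].reta[slot] = queue;
    return rte_eth_dev_rss_reta_update(port_id, conf, RETA_SIZE);
}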

xguerin (Author) commented Nov 15, 2023

I set up mirroring to analyze the traffic from the VPC standpoint. For some reason, I could not use my c6i, so I configured a c5n instead. The problem has virtually disappeared: I am seeing an order of magnitude fewer disconnections. (No problem on c5 or c6in either.)

shaibran (Contributor) commented

Hi, an update on the check on the EC2 side:
I did not see any HW-related issues in our dashboard. The interface dashboard shows that no instance limiters are being hit, but there are indeed some drop errors (rx_overruns) that correlate with some of the timestamps you mentioned. These Rx overruns mean that the HW did not have enough free Rx buffers to fill and had to drop ingress packets. From the code snippet you shared above, it seems that you attempt to read the full ring size in a single burst, and that is what might have led to the drops.
Regarding the RETA update: the driver and HW both support runtime updates of the table, and I am not familiar with any issues in that area, so we will need to check on different HW generations to verify this behavior.
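
To make the burst-size point concrete, here is a hedged sketch (assumed names, not a confirmed fix) of draining a queue in small bursts, freeing mbufs as we go so the mempool, and hence the RX ring, can be replenished between calls:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32 /* small burst size; an assumption, tune to the workload */

/* Hedged sketch: drain the RX queue in small bursts instead of requesting
 * the full ring size at once. */
static void poll_queue(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *mbufs[BURST];
    uint16_t nbrx;

    do {
        nbrx = rte_eth_rx_burst(port_id, queue_id, mbufs, BURST);
        for (uint16_t i = 0; i < nbrx; i++) {
            /* process mbufs[i] ... */
            rte_pktmbuf_free(mbufs[i]);
        }
    } while (nbrx == BURST); /* keep draining while the queue stays full */
}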

xguerin (Author) commented Nov 19, 2023

Thanks @shaibran for the update.

I will keep an eye on those queues and update to 23.11 so I can track the overruns. However, even if the reader ended up dropping packets, that does not explain why the flow stopped abruptly, nor why retransmissions and keep-alive requests would get no response from the peer.

More interestingly, I can say with a high degree of confidence that this issue is happening only on c6i instances. The other instances I tested, namely c5, c5n, and c6in, are not affected.

shaibran (Contributor) commented

Thanks for the important input; we will look into this issue.

alexissavin commented Dec 20, 2023

Hello,

Following up on this, thanks to the unexpected link created with issue #9.

We have been facing a similar issue with netmap for quite a while now (on FreeBSD):

Jun 9 14:05:03 solid kernel: ena0: [nm] Bad buff in slot
Jun 9 14:05:03 solid syslogd: last message repeated 3503 times
Jun 9 14:05:03 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:03 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:08 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 0 is stalled. Triggering the refill function
Jun 9 14:05:13 solid kernel: ena0: Rx ring 1 is stalled. Triggering the refill function

This randomly affects our customers' virtual appliances deployed on EC2.
It does not seem limited to specific instance types yet, but we have focused our internal investigation on m5.
This mostly occurs under high traffic load (DNS packets over UDP), but not only: some customers have faced this issue on netmap-enabled interfaces with TCP traffic not caught by netmap.

Surprisingly, using netmap in emulated mode does not help, which leads us to think the issue is with the ENA driver itself.

Kind regards

shaibran (Contributor) commented Dec 21, 2023

The DPDK and FreeBSD drivers are completely different, so it is unlikely that this is driver-related. I will have the FreeBSD developers look into this. The log you shared might indicate that the threshold for refilling the Rx ring should be modified; this becomes more visible when the application does not release processed ingress packets fast enough, so the device is left with no free buffers in the Rx ring.
You can do initial triage using the general stats (specifically, drops when the HW does not have sufficient vacancy in the Rx ring; the stat name varies between operating systems), the customer metrics, which let you see whether instance limiters were reached, and by enabling the logger. The README of the relevant driver covers all of these.

akiyano (Contributor) commented Dec 21, 2023

Hi @alexissavin,
I'll quote the documentation of the function that emits the "Triggering the refill function" messages you see:

/* For the rare case where the device runs out of Rx descriptors and the
 * msix handler failed to refill new Rx descriptors (due to a lack of memory
 * for example).
 * This case will lead to a deadlock:
 * The device won't send interrupts since all the new Rx packets will be dropped
 * The msix handler won't allocate new Rx descriptors so the device won't be
 * able to send new packets.
 *
 * When such a situation is detected - execute rx cleanup task in another thread
 */

Also, the "Bad buff in slot" print comes from the function that allocates buffers for the Rx ring in netmap mode, and it indicates that the allocation failed.

So the driver is failing to refill the RX ring, and thus it retries over and over again.

Could you please share reproduction steps with [email protected] so that we can investigate further?
Please share the instance type, AMI (if public), driver version, kernel version, how you generate the traffic, and any other setup parameters you use.

alexissavin commented

Hello @shaibran and @akiyano,

Many thanks for your feedback.

Unfortunately, I'm unable to share reproduction steps at the moment without providing you with our own software.

I'm trying to reproduce using pkt-gen, and I will let you know the results by email or in a separate issue, to avoid polluting this thread, which seems unrelated.

I just want to share that the latest version of the ENA driver (Elastic Network Adapter (ENA) v2.6.3) no longer seems to crash in emulated mode (admode=2). However, it still crashes rapidly in native mode (admode=1) at about 20,000 QPS. So I do believe the issue is within the driver, most likely in a section related to the netmap support.

Kind regards

shaibran (Contributor) commented Mar 7, 2024

Resolving this issue; feel free to reopen (on the specific ENA driver) if needed.

@shaibran shaibran closed this as completed Mar 7, 2024