[ECS] [Container OOM]: Containers OOM with Amazon Linux 2023 ECS AMI #240
Comments
Hey @rixwan-sharif and all others with a 👍, thanks for the info and for sharing your experience! We asked the AWS ECS team via a support case whether there is a known error or behavior like the one you described here; they said no.
Did you also report the problem as a support case? Thanks, Robert
Hello, I have transferred this issue to the ECS/EC2 AMI repo from containers-roadmap, since this sounds more like a bug or change in behavior in the AMI than a feature request. @rixwan-sharif, could you let us know which AL2023 AMI version you used? Was it the latest available? Could you also provide the task and container limits that you have in your task definition(s)? Two differences that come to mind as potentially relevant: the latest AL2023 AMI uses Docker 25.0 and cgroups v2, whereas the latest AL2 AMI is currently on Docker 20.10 and cgroups v1.
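If you want to confirm which cgroup hierarchy a given AMI actually boots with, a minimal check (an illustrative sketch, not an AWS-provided tool) is to look for `cgroup.controllers` at the cgroup mount root, which exists only on the unified cgroup v2 hierarchy:

```python
import os

def detect_cgroup_version(root="/sys/fs/cgroup"):
    """Return 'v2' if the unified cgroup2 hierarchy is mounted, else 'v1'.

    On cgroup v2 hosts (e.g. recent AL2023 ECS AMIs) the file
    'cgroup.controllers' exists at the hierarchy root; on cgroup v1
    hosts (e.g. the AL2 ECS AMI) it does not.
    """
    if os.path.exists(os.path.join(root, "cgroup.controllers")):
        return "v2"
    return "v1"

print(detect_cgroup_version())
```

Running this on an AL2 instance should print `v1` and on a current AL2023 instance `v2`, which makes it easy to correlate OOM behavior with the cgroup version in play.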
If you were not using the latest AL2023 AMI, one thing to note is that the Amazon Linux team released a systemd fix in late September 2023 for a bug in the cgroup OOM-kill behavior. (Search "systemd" in the release notes here: https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.2.20230920.html)
If anyone has data to provide to the engineering team that can't be shared here, please feel free to email it to
Hi, this is the AMI version we are using:
- AMI version: al2023-ami-ecs-hvm-2023.0.20240409-kernel-6.1-x86_64

Task/container details:
- Base Docker image: adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12
- Resources: CPU: 0.125

[Screenshot: Docker stats on the Amazon Linux 2 AMI.]
[Screenshot: Docker stats on the Amazon Linux 2023 AMI.] (We increased the memory hard limit to 3 GB because the container was OOMing with 1 GB of memory.)

And yes, we already opened a support case too (Case ID 171387184800518). This is what we got from support:

[+] Upon further troubleshooting, we found that there seems to be an issue with the AL2023 AMI which our internal team is already working on; below is the wording shared by them:
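Since the two AMIs expose the container memory limit through different cgroup files, it can help to check what limit the containerized process actually sees. This is an illustrative sketch (the helper name and default paths are mine, not from the thread); it reads the cgroup v2 `memory.max` file first and falls back to the v1 `memory.limit_in_bytes` file:

```python
def container_memory_limit_bytes(candidates=(
    "/sys/fs/cgroup/memory.max",                    # cgroup v2 layout
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1 layout
)):
    """Return the effective cgroup memory limit in bytes, or None if
    no limit is set or no cgroup memory file is readable."""
    for path in candidates:
        try:
            with open(path) as f:
                raw = f.read().strip()
        except OSError:
            continue  # file absent on this cgroup version; try the next
        if raw == "max":  # cgroup v2 spelling for "unlimited"
            return None
        return int(raw)
    return None

print(container_memory_limit_bytes())
```

Run inside the container on both AMIs, this shows whether the hard limit from the task definition is actually being applied to the cgroup the process runs in.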
Hi @sparrc, after switching from AL2 to AL2023 we faced a similar issue as well. We haven't hit an OOM yet, but memory consumption has nearly doubled, and it increases steadily as the application runs, which looks like a memory leak.
We are also experiencing the same behavior with the latest AMI: going from AL2 to AL2023 results in a significant memory-consumption increase that seems to keep growing over time (apparently regardless of language/framework). This is especially troublesome since AWS recommends AL2023 over AL2. If the ECS internal team is aware of this, is there somewhere we can track it? This thread doesn't really indicate anything is being done to investigate or fix it. It seems this has been an issue for months, and we would like to stay current on any progress or updates.
We have memory limits set at the TaskDefinition level, and the JVM is now allocating heap based on the total physical RAM of the host system, which is blowing up our RAM usage. We were able to set memory limits on the individual ContainerDefinitions within the TaskDefinition, and that seems to have fixed it for us.
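The arithmetic behind that comment is worth spelling out. Modern container-aware JVMs default `MaxRAMPercentage` to 25%, so the default max heap is roughly a quarter of whatever RAM the JVM believes it has. If the container has no memory limit the JVM sees the whole host, and the heap scales with host RAM instead of the intended task size. A rough sketch of that calculation (the function is illustrative; real JVMs apply extra rules for very small containers):

```python
def default_jvm_max_heap(visible_ram_bytes, max_ram_percentage=25.0):
    """Approximate the JVM's default max heap size: MaxRAMPercentage
    (default 25%) of the RAM the JVM believes is available."""
    return int(visible_ram_bytes * max_ram_percentage / 100.0)

GiB = 1024 ** 3
# Container hard limit of 1 GiB -> roughly a 256 MiB default heap
print(default_jvm_max_heap(1 * GiB) // (1024 ** 2), "MiB")
# No container limit on a 16 GiB host -> roughly a 4 GiB default heap
print(default_jvm_max_heap(16 * GiB) // GiB, "GiB")
```

This is consistent with the fix described above: once a memory limit is set on the ContainerDefinition, the container-aware JVM sizes its heap from that limit rather than from the host's total RAM.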
Community Note
Tell us about your request
ECS Containers are getting killed due to Out of Memory with new Amazon Linux 2023 ECS AMI.
Which service(s) is this request for?
ECS - with EC2 (Autoscaling and Capacity Provider setup)
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We are deploying our workloads on AWS ECS (EC2 based). We recently migrated our underlying cluster instances AMI to Amazon Linux 2023 (previously using Amazon Linux 2). After the migration, we are facing a lot of "OOM Container Killed" for our services without any change on the service side.
Are you currently working around this issue?