OpenSearch Data Nodes memory exhaustion after upgrade from 2.9 to 2.12 (JDK 21 upgrade) #12454
Comments
Thanks @rlevytskyi for reporting the issue. Did you try a heap dump? It will help us debug further here. (You can try with a smaller heap; the issue might show up faster in that case.) A couple of questions:
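For reference, a heap dump is usually captured on a data node whose heap keeps climbing, once usage is well above the post-startup baseline. A minimal sketch against the official Docker image (container name and output path are assumptions; `jps`/`jcmd` ship with the bundled JDK):

```bash
# Capture a heap dump from an OpenSearch node running in Docker.
# "opensearch-node1" and the /tmp path are assumptions for this sketch.
JDK=/usr/share/opensearch/jdk/bin
PID=$(docker exec opensearch-node1 $JDK/jps | awk '/OpenSearch/ {print $1}')
docker exec opensearch-node1 $JDK/jcmd "$PID" GC.heap_dump /tmp/heap.hprof
docker cp opensearch-node1:/tmp/heap.hprof ./heap-$(date +%F).hprof
```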
Thank you @shwetathareja for your reply!
Re heap dump, where should we collect it and when?
@rlevytskyi one of the major changes in 2.12 is that it is bundled with JDK 21 by default. Any chance you could downgrade the JDK to 17 for your deployment (it may need altering the Docker image) to eliminate the JDK version change as a suspect? Thank you.
Thank you Andriy for your reply.
I think you need these: https://github.com/opensearch-project/opensearch-build/tree/main/docker/release/dockerfiles, but a simpler way may be to "inherit" from the 2.12 image and install/replace the JDK it runs with.
I am unable to build an OpenSearch image yet.
First of all, it says "1.x Only". So my question is: is there a way to build an image exactly like yours, to make sure we have the same configuration?
@rlevytskyi I believe the new file is right next to that Dockerfile. Take a look at the README.md; maybe that will help if you are looking to construct a Docker image from a custom configuration.
@rlevytskyi I'm not sure if you've managed to capture and investigate a heap dump of the OpenSearch process; see this guide.
Thank you Peter,
Re heap dump, I managed to capture and even sanitize it using PayPal's tool https://github.com/paypal/heap-dump-tool.
Thank you again @peternied for pointing out https://github.com/opensearch-project/opensearch-build/blob/main/docker/release/README.md
I managed to create an image based on 2.12 using the following Dockerfile:
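A minimal sketch of this kind of override (not the exact Dockerfile used here), assuming the official `opensearchproject/opensearch:2.12.0` image and a Temurin 17 JDK copied in from another image; `OPENSEARCH_JAVA_HOME` tells the startup scripts which JDK to use:

```bash
# Hypothetical sketch: run OpenSearch 2.12.0 on JDK 17 instead of the bundled JDK 21.
# Base images, paths, and ownership are assumptions to verify against your setup.
cat > Dockerfile <<'EOF'
FROM eclipse-temurin:17-jdk-jammy AS jdk17
FROM opensearchproject/opensearch:2.12.0
COPY --from=jdk17 --chown=1000:1000 /opt/java/openjdk /usr/share/opensearch/jdk17
ENV OPENSEARCH_JAVA_HOME=/usr/share/opensearch/jdk17
EOF
docker build -t opensearch-jdk17:2.12.0 .
```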
[Triage - attendees 1 2 3 4 5]
@rlevytskyi Without a root cause and bugfix it is hard to say what next steps to take. I would recommend doing testing and having a mitigation plan if something happens, but your mileage may vary.
Since it has been a week and there is no root cause, we are closing out this issue. Feel free to open a new issue if you find a proximal cause from a heap analysis or a way to reproduce the leak.
Want to chime in and say we were running into something similar after upgrading to 2.12. Suddenly all sorts of previously normal operations were causing the overall parent circuit breakers to trip, and there were significantly more GC logs emitted by OpenSearch overall. The problem was most exacerbated by the snapshot and reindex APIs. I applied the image changes from @rlevytskyi to use JDK 17 and it has completely solved the issues and symptoms we were seeing. Average heap dropped considerably and is much more stable.
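For anyone trying to quantify the same symptoms, the node stats API exposes the parent circuit-breaker limits, estimates, and trip counts directly; a quick sketch (endpoint and credentials are assumptions):

```bash
# Inspect per-node circuit-breaker state (limits, estimated sizes, trip counts).
# Endpoint and credentials are assumptions for a local demo setup.
curl -sk -u admin:admin 'https://localhost:9200/_nodes/stats/breaker?pretty'
```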
Sounds like upgrading to JDK 21 is the change that caused this. Seems like a real problem. I am going to reopen this and edit the title to say something to this effect. @tophercullen do you think you can help us debug what's going on? There are a few suggestions above to take some heap dumps and compare.
Using the above PayPal tool that sanitizes them, I've generated heap dumps from all nodes in a new standalone cluster (nothing else using it) while taking a full cluster snapshot: 1x JDK17 and 2x JDK21. This is 24 files and ~5GB compressed. I'm unsure what I'm supposed to be comparing between them. From the stdout logging for the cluster, there were no GC logs with JDK17 and a bunch with JDK21, so it seems to be repeatable in an otherwise idle cluster, assuming that is not just a red herring. Might also consider the reproducer in #12694. That seems fairly similar to our real use case and the operations where we were seeing circuit breakers tripped. Snapshots never directly tripped breakers and/or failed, and were seemingly just exacerbating the problem.
Maybe @backslasht has some ideas about what to do with this next?
Maybe sharing a class histogram first could help (even as a screenshot), thanks @tophercullen
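A class histogram can be taken without shipping a full heap dump; a sketch using the JDK tools bundled in the official image (container name is an assumption):

```bash
# Print a class histogram (top classes by instance count and size) from the
# running OpenSearch JVM. Container name is an assumption for this sketch.
JDK=/usr/share/opensearch/jdk/bin
PID=$(docker exec opensearch-node1 $JDK/jps | awk '/OpenSearch/ {print $1}')
docker exec opensearch-node1 $JDK/jcmd "$PID" GC.class_histogram | head -n 40
```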
#12694 could be related
This might be related to this issue in JDK: https://bugs.openjdk.org/browse/JDK-8297639
In JDK 20, this flag was set to false by default, and in JDK 21 it was completely removed in https://bugs.openjdk.org/browse/JDK-8293861. Summarizing the observations and reproduction efforts by the community around this JDK issue: removing this flag might have caused a memory increase when sending and receiving documents with chunks > 2 MB. In JDK 20 we can still add the flag back explicitly.
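For completeness, an extra G1 option of this kind can be passed to the official Docker image through OPENSEARCH_JAVA_OPTS, or appended to config/jvm.options; a sketch with a placeholder flag name (only applicable on JDKs where the option still exists):

```bash
# Pass an additional JVM flag to OpenSearch in Docker.
# -XX:+SomeG1Flag is a placeholder - substitute the actual option being discussed;
# an unknown flag will prevent the JVM from starting.
docker run -d --name opensearch-node1 \
  -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=<strong-password>" \
  -e "OPENSEARCH_JAVA_OPTS=-Xms4g -Xmx4g -XX:+SomeG1Flag" \
  opensearchproject/opensearch:2.12.0

# Or, inside a custom image, append it to the JVM options file:
# echo '-XX:+SomeG1Flag' >> /usr/share/opensearch/config/jvm.options
```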
@ansjcy that was suggested before (I think on the forum), but we did not use it.
@rlevytskyi @tophercullen Do you still have your repro? Care to try with JDK 21 and that flag?
@dblock I can do what I did before: create a new cluster, populate it with data, and run snapshots. However, based on what @ansjcy provided, that option is no longer available in JDK 21. The issue tracker for OpenJDK links to a similar issue with Elasticsearch in this regard, which also has no solution using JDK 21.
Yes, my bad for not reading carefully enough.
No, but if I'm understanding correctly, this flag was enabled by default in g1_globals.hpp for G1GC in JDK 17. Also, today I did some more experiments using https://github.com/kroepke/opensearch-jdk21-memory (thanks, @kroepke!). I ran bulk indexing (a 20MB workload per request, ~5MB per document) with a Docker-based setup, each for 1 hour in the following scenarios:
I captured the JVM usage results over the 1-hour runs:
The results show some, but not significant, impact from disabling the flag.
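For anyone who wants to approximate that workload locally, a rough sketch of the kind of bulk requests described (index name, field name, document sizes, and credentials are all assumptions; the linked repository contains the actual reproducer):

```bash
# Build a ~20 MB _bulk body out of four ~5 MB documents and send it repeatedly.
# Index/field names and credentials are assumptions for this sketch.
BLOB=$(head -c 5000000 /dev/zero | tr '\0' 'a')   # ~5 MB of filler text
for d in 1 2 3 4; do
  printf '{"index":{"_index":"jdk21-repro"}}\n'
  printf '{"blob":"%s"}\n' "$BLOB"
done > bulk.ndjson

for i in $(seq 1 60); do
  curl -sk -u admin:admin -H 'Content-Type: application/x-ndjson' \
       -XPOST 'https://localhost:9200/_bulk' --data-binary @bulk.ndjson > /dev/null
done
```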
@ansjcy - Do you think
@tophercullen - Can you please share the heap dumps?
@dblock - Is there a common share location where these heap dumps can be uploaded?
AFAIK no, we don't have a place to host outputs from individual runs - I would just make an S3 bucket and give access to the folks in this thread offline if they don't have a place to put these.
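One possible way to do that hand-off, assuming an S3 bucket you control (bucket and key names are placeholders):

```bash
# Share sanitized heap dumps via a private S3 bucket (names are placeholders).
aws s3 mb s3://my-heapdump-share
aws s3 cp ./heapdumps/ s3://my-heapdump-share/opensearch-12454/ --recursive
# Hand out time-limited download links (7 days) instead of opening the bucket:
aws s3 presign s3://my-heapdump-share/opensearch-12454/node-1.hprof.gz --expires-in 604800
```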
@zakisaad Since downgrading the JVM version over a month ago, we haven't had any more issues. I would check the JVM version AWS is using. If it's 21, you'll likely need to get in touch with AWS support to escalate this issue, because it has not been tested thoroughly enough for actual production use from what we've found first hand (see also opensearch-project/performance-analyzer-rca#545 (comment)). If AWS is unable or unwilling to escalate this, I think your only option is to (somehow) revert to a previous version of the hosted service.
We'll be reaching out to AWS support to get this resolved, as it's essentially unusable in its current state (we're rolling out a cluster reboot cron to mask GC issues until it is resolved). Upgrades to managed OS are one-way only, so downgrading our cluster will require a restore from a snapshot -- we may attempt this if AWS can't provide a remediation timeline. Thanks for confirming the JVM downgrade sorted this out for you; if we were self-hosting I'd jump on it. One day, we'll have the bandwidth to internally manage our OS cluster 🙇♂️
@zakisaad Yeah, there are pros and cons to the hosted service. My advice: don't hold your breath for AWS and/or this issue to be resolved. Create a new (older) cluster and determine if a snapshot restore is even possible, and plan an alternative data migration accordingly.
@tophercullen sadly JDK-21 provides no workaround for this issue (#12454 (comment)); downgrading is the best option, as suggested by @zakisaad
The issue seems to have been alleviated in Elasticsearch by stopping the unnecessary copying of byte arrays: elastic/elasticsearch#99592
@hogesako Appreciate any fixes you can make to OpenSearch. Please make sure not to look at / copy non-APLv2 open-source code.
Hi @dblock This issue is adversely affecting our clusters in production -- as I understand it, AWS maintains OpenSearch (and provides a managed OpenSearch service to monetise the product). As it stands, the default configuration of a fully up-to-date managed OS cluster on AWS exhibits memory-leak-like behaviour. There are hacky fixes such as scheduled cluster reboots ~once a week (with over-provisioned nodes to accommodate the leaking memory...), but this is for sure a short-term fix with various shortcomings. Our clusters aren't even that large, so I can bet other clients are seeing this issue too. As Amazon has forked ES specifically to be able to continue monetising the product via their managed service, I assume it is expected that AWS fixes, or at least addresses, this issue as important. We haven't bothered considering self-managed clusters yet as we assumed AWS would fix an issue of this magnitude, but if AWS won't prioritise it we'll be moving off the managed service for sure. If we were self-managed, we'd be able to downgrade the JVM and avoid this issue entirely, for instance.
We're seeing the same issue here. I've rebuilt an image with JDK 17 as specified above, but this didn't solve it for us; we also tested on 2.14.0. Even with JDK 17 and increasing memory by 400% we need to restart our cluster every few days because of the memory issues. So it's not just the JVM; maybe there's a memory leak and a GC problem.
@kroepke unfortunately, as far as I know, Amazon managed OpenSearch doesn't allow us to specify/pin JDKs. We're at the mercy of whatever the Amazon development team has rolled out as part of the managed service.
@zakisaad please reach out to AWS support to follow up on the fix for your clusters.
I've updated the 2.14.0 image with jackson-core 2.17.1, keeping the default Java (openjdk version "21.0.3" 2024-04-16 LTS) of that image. I've scaled the cluster back down to the original 100% and it has now been running for 72h without issues. Next week we'll upgrade to 2.15.0.
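A rough sketch of this kind of jar swap on top of the official image (the location and exact version of the bundled jackson-core jar are assumptions; locate it first, and keep any related Jackson jars at matching versions):

```bash
# Hypothetical sketch: replace the bundled jackson-core jar with 2.17.1.
# First locate the existing jar; this sketch assumes it lives under lib/.
docker run --rm --entrypoint find opensearchproject/opensearch:2.14.0 \
  /usr/share/opensearch -name 'jackson-core-*.jar'

cat > Dockerfile <<'EOF'
FROM opensearchproject/opensearch:2.14.0
USER root
RUN rm /usr/share/opensearch/lib/jackson-core-*.jar
ADD --chown=1000:1000 \
    https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.17.1/jackson-core-2.17.1.jar \
    /usr/share/opensearch/lib/
USER opensearch
EOF
docker build -t opensearch-jackson2171:2.14.0 .
```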
Thanks for the update @42wim. Please share the outcomes; it would help us pinpoint whether the issue is still there (JDK related) or gone (Jackson related).
Running 2.15.0 now for >72h; the issues are gone, so it seems Jackson related.
[Indexing Triage 09/16] Thanks @42wim for confirmation. Closing the issue now.
Describe the bug
Hello OpenSearch Team,
We’ve just updated our OpenSearch cluster from version 2.9.0 to 2.12.0.
Among other issues, we've noticed that OpenSearch is now consuming far more memory than the previous version; it became unusable with the same configuration, even after providing it with 15% more RAM. To make it responsive again, we had to close many indices.
Related component
Other
To Reproduce
Expected behavior
We didn't expect a significant memory usage increase from the version upgrade.
Additional Details
Plugins
Security plugin for SAML authn and authz
Screenshots
Please note the almost flat heap usage before the upgrade, the increase after the upgrade, and the flat line again after closing some indices.
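The same trend can also be watched from the command line; a sketch that polls per-node heap usage (endpoint and credentials are assumptions):

```bash
# Poll per-node heap usage once a minute to spot steadily climbing heap.
# Endpoint and credentials are assumptions for this sketch.
while true; do
  date
  curl -sk -u admin:admin \
    'https://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'
  sleep 60
done
```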
Host/Environment (please complete the following information):