[BUG] Egeria Metadata Server crashes silently while processing massive amount of elements #235
Do you have further indications of what may have happened? Was anything recorded in the statefulset or pod stats? Was the topic connector the only workload?
Also, how was the behaviour after the crash? Did the system recover? Was any data lost? Did the connector resume retrieving topics?
For example, I had an issue with my Jupyter container (a bug) which led to the container restarting and I'd see something like:
I'm not sure how useful it would be, but the exit code may help us confirm whether running out of heap (for example) could be the cause. What about the pod itself, rather than ...
If none of that helps, we probably either need to try and
Just checking the log again - I presume this is the metadata server, not where you are running the integration connector? What repository connector are you using? I'm guessing it could be XTDB? If we can't get any info from the suggestions above quickly, I wonder if it's worth testing also with local graph or in-memory - that wouldn't completely rule a core issue out (as this could relate to timing, concurrent requests etc) but may give some extra insight. If it is XTDB, what's the XTDB configuration like?
We have limited licenses for the 'YourKit' Java profiler (there are other tools available, including IntelliJ + async-profiler for example), which can help in pinning down memory issues. That may be an option if the other approaches don't get us answers.
We use a local JanusGraph repository. On the first restart after the crash, the server is unable to connect to JanusGraph. This looks like a problem with JanusGraph. Would you recommend using XTDB instead of JanusGraph?
@juergenhemelt I think generally we would suggest XTDB - it looks like a more scalable, reliable approach, and some performance analysis has been done on it. That being said, there's no evidence one way or the other yet to suggest your issue is specific to graph; we don't know. So whilst you could try it, equally it may be good to understand why it broke even with graph. Were you using the default JanusGraph setup (which uses BerkeleyDB)? I think capturing the data would help - interesting to see if we can figure out whether, for example, the JVM failed due to heap exhaustion. Or perhaps OpenShift killed it due to resource limits - perhaps you could also check the OpenShift dashboard to see what it's reporting? I'd be interested in trying to reproduce, though I may struggle to do this until later next week. I presume you just pointed your topic integrator at a Strimzi deployment hosting lots of topics? Do you have any assets to share that may speed up my attempt to reproduce (server config, chart config etc)?
I'm also intrigued by the fact you said the graph failed to work after the first restart? I think you are using customised helm charts? It's possibly a problem there, or perhaps not. Do you have any logs from this failure to work on restart? For example, if the metadata collection id stored in the server config document gets out of sync with the actual repository (graph), then Egeria will throw an error and, at least in >=3.10, fail to start the server.
This is an interesting kind of issue - how do we debug the JVM apparently going away? If the cause isn't external and nothing shows up in the k8s logs, I wonder if we need to do something like:
^ I'll raise an issue on this. OR maybe we can somehow modify the startup script of the container so that it does not exit (as you'd normally want) on this error condition, so that we could then interactively connect and analyze. But it still might be easier to just connect to the JVM using an agent, which should give us similar information (ie using the YourKit profiler or similar). For your issue, another alternative might be to try and run just the MDS server locally.
I have just switched to XTDB in-memory. The good news is that the performance is tremendously better, so I can test faster than before. The bad news is that I still have the same problem. I will continue analyzing.
Good to hear the performance is better. Please do share any data you capture, especially from 'describe' against the k8s objects.
Indeed, the
Obviously an out-of-memory issue.
The resource limits in Kubernetes were at 1GB memory. I increased them and now it works. For the 547 topics it requires 3.4GB memory. I think that's expected behaviour as I am still on in-memory XTDB.
Did you have explicit 'resources' entries in your modified charts? Is that where you were setting the 1GB limit? When the pod restarted, did it continue processing OK? In a k8s environment we need to expect pods to fail - for example, nodes may fail or be rebooted, and the pod rescheduled elsewhere. It's important that processing then continues normally - including resuming where it left off. Are there any aspects of that which are problematic with the testing you have done? I think this aspect is worth pursuing - we'd mostly expect the Java heap to be exceeded (for example if we had an effective memory leak), with an exception, before the memory allocated to the container ran out? Need to look at whether we are following best practice and/or how to do better.
Yes, I have explicitly set resource limits. This is a policy in my organization. It was set to 1GB. I have now switched from XTDB in-memory to RocksDB persistence. That brings the used memory down again to less than 1GB. So the in-memory XTDB was the one which took all the memory. It could be good to check that XTDB is not using more than the available memory and to log an exception if it does. But I can imagine that this is not easy to implement.
If we get killed by k8s for exceeding limits, there's nothing we can do in the Java application - only ensure recovery through the k8s deployment. If we hit a memory exception (ie out of heap) in the JVM, there's not much we can do (especially anything that allocates memory). An output message, maybe (and agreed). I'll keep the issue open to look into this behaviour. There may be something about the container/image we use, or its settings, that is allowing the JVM to continue unhindered (ie not running aggressive GC) even where physical memory is restricted. Could we do better? I don't know, but I see it as an important question for production. There are also questions around monitoring, our heap best practice, and which GC algorithm to use. We also need to check that operations continue normally after the pod is killed and restarted elsewhere, including, for example, how we handle commits of the Kafka offset. Likely a combination of understand, document, improve. I'll modify the title.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
I am seeing similar (if not the same) behavior when running the Lab helm charts with the notebook "working-with-standard-models". In particular, if I run with the in-memory repository, the pod silently fails. I am using Rancher Desktop with an allocation of 14GB and 4 cores. When I switch to the graph DB repository, it usually works but occasionally dies (it happened once or twice). I suspect that the JVM may need more resources?
I'm fairly sure the orchestration environment is killing our JVM - hence why we see no last runtime exception from the JVM. I was wondering what memory the JVM is seeing. In the Java 8 timeframe this was an issue. See https://docs.openshift.com/container-platform/3.11/dev_guide/application_memory_sizing.html for example, where it was essential to set flags like
Moving forward, we now have better JVM options. There's now a flag. But we also had a change in the kernel from cgroups v1 to cgroups v2, and it was only Java 15 that added support for cgroups v2, so on older JVMs we're back to the pre-container days of incorrect memory calculations. From Java 15 onwards this seems corrected. I expect OpenShift & other modern distros are using kernels with cgroups v2. So, now that we are on v4 of Egeria and using Java 17, it may be that these issues have been addressed by default. @dwolfson were you running with 3.15 or 4? I wonder if both fail in the same way.
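For illustration only (the specific flag referred to above isn't quoted here), a modern JVM can size its heap as a fraction of the container's memory limit rather than needing a fixed -Xmx. A hypothetical way of passing such options through an environment variable might look like:

```yaml
# Hypothetical sketch - JAVA_OPTS_APPEND is an assumed variable name
# (honoured by some Red Hat OpenJDK images); the flags themselves are
# standard HotSpot options.
env:
  - name: JAVA_OPTS_APPEND
    value: "-XX:MaxRAMPercentage=80.0 -XX:+ExitOnOutOfMemoryError"
```

With a percentage-based heap, the same image should size itself proportionally whatever memory limit the pod is given.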
Checking some real environments - Egeria 3.15, in OpenShift, 16GB node, default resource limits:
This may mean the problem is simpler - we're not setting a container size limit at all, so it really can be as big as what's available on a node, and the subtlety above isn't relevant. And each of our processes will see the same, so the potential is for a massive over-allocation. So we need to ensure limits are applied - and configurable. This isn't an Egeria or even container consideration; rather, it's at the deployment level, ie in the charts (or operator). We must allow this to be configurable, just as other apps do, or a user creating a deployment directly in YAML would add it there. This consideration is part of the sizing activity needed in any deployment: what exactly to set those values to. Platforms like OpenShift may also have the ability to dynamically override resource specifications - for example see https://docs.openshift.com/container-platform/4.11/nodes/clusters/nodes-cluster-overcommit.html - though we note that the default is not to overcommit.
Running in Rancher desktop on macOS is interesting:
as here it is telling me there is NO limit (unlimited), which is potentially even worse
In summary:
@planetf1 My testing has been with 3.15 so far - I'm happy to switch to testing with 4 if you think that would be helpful. I was running more tests with the Jupyter labs today and found several more ways to make Egeria crash - not just loading the huge archive in the working-with-standard-models lab, but also with the automated curation and other notebooks. It seems like garbage collection isn't happening (often enough?). It often crashes if I try to run 2 or 3 notebooks after configuring Egeria.
Thanks for testing. Can you try with v4? I think it would be most productive to focus there as we have changes in the base image and JVM level. That being said, I am almost certain the behaviour will be the same. I do suspect there's no garbage collection with the current sizing (so maybe experimenting as below is more fruitful). Our pods need to have properties similar to
Currently the charts use
'request' is the MINIMUM required to allow the pod to be scheduled (so adding up all the requests effectively defines the minimum RAM needed), whilst the LIMIT is the most that can be used. Units above are MB for RAM and 1/1000 of a CPU unit for CPU. CPU is less critical - the minimum can probably be pretty low, and the max can be higher. You can try editing the charts and adding these in if you wish - ie replace that section with something like the example below, and install the chart from disk (to do that, be in the directory which contains odpi-egeria-lab, and refer to the chart as just 'odpi-egeria-lab', not 'egeria/odpi-egeria-lab'). We'll need to update the charts to set these with sensible default values that are also configurable. So trying to figure out the limits is the next thing - we want them to be as small as possible. Some platforms allow defaults to be set, perhaps using an admission controller (this modifies the Kubernetes documents as they are ingested; it's a technique for validation & annotation, or enforcing any kind of change/override). There may also be a default setting in the k8s implementation itself - ie defined by the control plane. I don't know if Rancher Desktop has this - it looks like that's set to unlimited. At least on OpenShift it's the node's physical memory.
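The original snippet isn't reproduced in this extract; purely as an illustrative sketch (placeholder values, not tuned recommendations), the resources section might look something like:

```yaml
# Illustrative placeholder values only - not tuned recommendations
resources:
  requests:
    memory: "500Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"
    cpu: "2000m"
```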
The image is from charts/odpi-egeria-lab/templates/egeria-core.yaml - we have 4 distinct ones, one for each platform (some refactoring is needed really!)
An alternative is to edit the definitions once deployed. Perhaps the easiest way is to edit the deployment ie
Then edit one of these ie with
Make the same edit as above to the resources section, then save, then delete the pod, ie with ... There are other ways to edit too - you can read more at https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/use-resource-controller-to-dynamically-modify-the-upper-limit-of-resources-for-a-pod
The JVM says:
[jboss@lab-odpi-egeria-lab-core-0 ~]$ java -XshowSettings:system -version
openjdk version "11.0.18" 2023-01-17 LTS
I will try the above.
Memory is unlimited, so why would Java do garbage collection? That's certainly a likely route to having the container killed. The fix will be to ensure we have a resource declaration.
I've added support for setting memory limits to the 'lab' chart (prerelease of 4.0). Docs need to be added. Quick test - the usual data catalog lab worked fine with default values set, whilst for the standard models lab, both 'core' and 'dev' platforms crashed. There is still no exception logged at the end of the prior container run - ie it just stops.
Tested again and was able to run the standard models lab when the Egeria pods were allocated 2GB, with usage peaking around 1400+ MB. Max heap is 80%, so we're in that region. I've updated the defaults; it's in chart 4.0.0-prerelease.2. Perhaps you can find some better refinements & validate what values make sense - at least as a first pass. I think we'll need to return to analyze in more depth.
Thanks Nigel, did you test it with the "Working with Standard Models" notebook? Which repository was configured? I ran some quick tests this evening with the prerelease chart and had failures with all three repositories, even though the memory didn't seem to exceed 1400 MB. I will try to test a bit more carefully tomorrow.
Tested the std models lab using the in-memory repo. On a 3x16GB OpenShift cluster the test ran successfully with no pod restarts. For Rancher Desktop on Mac, a 12GB cluster would not allow all pods to be scheduled due to insufficient memory. This demonstrates a challenge in setting values.
Some reduction is likely possible with closer monitoring using metrics. On Rancher, after a test run the memory used was:
So
With the graph repository, this notebook took 16 minutes to run (instead of <2 minutes - Rancher Desktop). No failure, but rather unusable. It did complete with no pod restarts, and a similar amount of memory used (but we need JVM stats to properly analyse).
Resource limits & JVM options can currently be set by creating a file, ie ~/etc/rs.yaml, containing:
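(The file contents aren't reproduced in this extract; the following is a hypothetical reconstruction - the key names are assumptions, so check the chart's values.yaml for the real structure.)

```yaml
# Hypothetical sketch of ~/etc/rs.yaml - key names and values are assumptions,
# not copied from the chart
core:
  resources:
    requests:
      memory: "1Gi"
      cpu: "100m"
    limits:
      memory: "2Gi"
      cpu: "2000m"
```

Such a file would then be passed to helm with a standard values override, e.g. `-f ~/etc/rs.yaml`.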
This entry is for 'core'. For our other platforms add similar sections for 'datalake', 'dev', 'factory', and also for our UI components 'ui', 'uistatic', 'presentation'. Optionally, settings can be added for our other containers 'nginx', 'zookeeper', 'kafka' with the same resources section, ie:
etc. Once we understand the behaviour and have established sensible defaults, we can then consider how to make this simpler for the user. Another important part here is 'jvmopts', which applies to all of our containers. One useful value is to set it to ... Here's an example just to enable metrics (but no changes to the default resources):
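The example itself isn't shown in this extract; purely as a hypothetical illustration (a standard JVM diagnostic flag, not necessarily the 'metrics' setting meant above):

```yaml
# Hypothetical illustration only - standard JVM GC logging, not necessarily
# the 'metrics' value referred to above
jvmopts: "-verbose:gc"
```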
If this proves useful, we can make this a default setting - perhaps even in the Egeria server chassis itself, or else just the container or charts. See odpi/egeria-docs#718, where I've added some examples of using this URL, and through which we will provide further docs on this.
See #245, which adds some example snippets to help define resources for the containers used in the lab chart. I've been looking at the memory profile when importing CIM using the YourKit(tm) analyser. The default when run locally is to use G1 - this shows fairly smooth behaviour with minimal (<1ms) GC delays. It does allow memory to get close to the heap size before kicking in, leading to some sawtooth behaviour, but it's consistent. The default in our container is to use ParallelGC. This shows different behaviour - it cleans garbage much more quickly, and keeps the memory footprint smaller (as low as 250MB in a 1GB JVM), but we see large numbers of GC cycles. Additionally, GCs block threads for an extended period, with frequent delays approaching 100s. This was replicated locally using
Some differences from the container - it doesn't set -Xmx, but instead sets a fraction of RAM. This makes sense in the resource-limited container where we set a 1GB or 2GB limit, but on a 32GB machine locally we need to be more precise. Of these parameters, the option to exit the VM on OOM makes sense, as this is in a k8s environment where we would want the pod to restart rather than malfunction. These settings are inherited from the Red Hat ubi9 OpenJDK container we use - they are defaults. However, they seem inconsistent with our workload.
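A hedged sketch of the kind of experiment involved (the flags are standard HotSpot options inferred from the description above, not copied from the container's actual startup settings); they could be passed through the chart's jvm options or equally on a local java command line:

```yaml
# Hedged sketch - flags inferred from the description above, not copied from
# the image; swap -XX:+UseParallelGC for -XX:+UseG1GC to compare collectors
jvmopts: "-XX:+UseParallelGC -XX:MaxRAMPercentage=80.0 -XX:+ExitOnOutOfMemoryError"
```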
Further observation. As suspected, different platforms return varying values of physical memory, so for now, ie
for a container with a request/limit of 1GB. However, even with this set, the datalake container is killed by the OS with OOM, ie the last trace is
Note
So what is Java doing? At least in terms of the heap & GC:
This shows the heap at the time of failure was only 410MB - well under the 1GB system limit. This means we need to look at other memory consumption within the pod, both in terms of other Java data as well as any other processes. From local tests, the other Java usage should be small (which is why 80% for heap is a good starting point).
Restricting the metaspace to 100MB resulted in
200MB & Egeria runs fine, but even with the resource limit set at ~1.2GB (heap limit about 800, metaspace 800), we still exceed permitted memory usage & get terminated. Even if we don't enable ... This might suggest memory is allocated from the native process heap. Kafka has a buffer.memory, but this defaults to 32MB.
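For anyone repeating this, a hedged sketch of the kind of options involved (placeholder values; these are standard HotSpot flags, not necessarily the exact settings used above):

```yaml
# Hedged sketch - placeholder values; standard HotSpot options for capping
# metaspace and enabling native memory tracking. With NMT on, non-heap
# allocations can be inspected inside the container via:
#   jcmd <pid> VM.native_memory summary
jvmopts: "-XX:MaxMetaspaceSize=200m -XX:NativeMemoryTracking=summary"
```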
Retesting locally, this same Java process was reporting 3390MB at an OS level:
This was taken with the heap limited to 1GB (-Xmx1024m), but no other limits or JVM customization.
Kafka producers will use up to buffer.memory (ie 32MB) to store messages client-side if they cannot be sent to the message broker quickly enough; the client API will block for up to max.block.ms (60s) before returning a client API failure. This buffer size applies to each Kafka producer. OMRS instances will have the highest workload, and for Coco we have 3 cohorts (so ~100MB of RAM). Worth trying:
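As a sketch of that experiment (these are standard Kafka producer settings; exactly where they are wired into the Egeria event bus configuration isn't shown in this thread):

```yaml
# Standard Kafka producer settings shown as a hypothetical property map -
# halving the per-producer buffer to reduce client-side memory
producer:
  buffer.memory: "16777216"   # 16MB rather than the 32MB default
  max.block.ms: "60000"       # client blocks up to 60s before failing
```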
Tested again, 4GB resource limit, 1GB heap. Core & datalake still approach the real memory limit:
Core - maxes at 2.7GB
I'm going to pursue:
We need to understand memory usage in more detail. There is a lot of potential consumption in terms of file system buffers, even outside the heap & metaspace. For now, the prudent approach is to remove the memory limits & JVM configuration - the option to add them remains, and snippets are provided in config/values. This at least should mean most of the scenarios work most of the time on a decent configuration (the goal was to be more precise than this, but we don't have all the information to do that yet). @dwolfson New charts are published. Would you like to try with the defaults?
Is there an existing issue for this?
Current Behavior
While processing 500+ elements (KafkaTopics from StrimziMonitorIntegrationConnector) the Egeria Metadata Server crashes silently. Last few lines of log:
Expected Behavior
Should have processed every KafkaTopic element.
Steps To Reproduce
Environment
Any Further Information?
No response