Memory Continually Increasing Intermittently #267
Comments
Hey @ryorke1, I would appreciate it if you could share the Dockerfile + compose YAML for the worker with CPU/memory capped so that we could use them to reproduce the issue.
Hi @c4lm here is the dockerfile:
There is no compose file as we are running this in OpenShift. Once this image is built we push it into our image repository and deploy it using a generic Helm deployment manifest. As for the URL, there is no proxy; in this case the test code was running against localhost, specifically on port 3500. Here is the manifest we are using for this testing in OpenShift:
We have also tried to reproduce this locally in Docker; it takes a bit of restarting and leaving it running, but eventually it shows the same issue. It's not as consistent as it is on OpenShift. Here is a snippet of the logs received from the example script:
With regards to CPU utilization, we have given the pod enough resources, based on its metrics, that the CPU limit shouldn't be an issue, as shown in this graph. Right now we are allowing the pod to use 1/5 of a CPU, and it never exceeds this limit. That being said, I increased the pod to 4 CPUs, ran it for a short period of time, and still saw the same results.
With regards to different versions of the conductor-python-sdk, we have tried going back to 1.1.4 but still see the same issue. For the Python version itself, I had not yet tried downgrading from 3.11 to 3.10; I just tried that and there was no change. I also tried 3.12, with the same results. Please let me know if there is anything else I can provide to assist!
@ryorke1 Using my compose file, memory usage consistently stays under 97 MB (CPU under 10%). Everything is complicated by CPython having version-specific and platform-specific leak issues, and then the libraries having them too (e.g. this one, and it's not the only one). Regarding the libraries we use here: metrics are disabled, so it can't be prometheus-client, and the connection is HTTP, so it can't be SSL-related leaks.
Usually, whenever I encounter leaks where HTTP requests are involved, it's 1 or 3 (regardless of the language), so for now we could stick to that assumption. The REST client is per process, the session is per REST client, and we do not close it because it is reused, so we should not be leaking there...
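For anyone following along, the reuse pattern described above amounts to something like the sketch below. This is only an illustration of the idea, not the SDK's actual client code; the URL and function name are made up.

```python
import requests

# Illustration only: one long-lived Session per REST client, reused for every
# poll instead of being opened and closed per request.
session = requests.Session()

def poll_for_task(task_name: str):
    # Hypothetical polling call used only to show session reuse; the real SDK
    # builds its own URLs and parameters.
    response = session.get(
        f"http://localhost:3500/api/tasks/poll/{task_name}", timeout=10
    )
    response.raise_for_status()
    return response.json() if response.content else None
```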
Additionally, it would be great if you could try running the script with memray and capture the results. What I was using in my docker-based attempts:
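A minimal sketch of one way to capture such a profile, assuming memray's Python `Tracker` API; the module name and output file name below are placeholders, not necessarily the exact setup used above.

```python
# Sketch only: wrap the worker startup in memray's Tracker so allocations are
# written to a capture file. Afterwards, `memray flamegraph worker_profile.bin`
# turns the capture into an HTML flamegraph.
from memray import Tracker

# Hypothetical module name for the sample script from the issue description.
from silly_workers import start_workers

if __name__ == "__main__":
    with Tracker("worker_profile.bin"):
        start_workers()
```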
Good afternoon. Just want to follow up to let you know I am still doing some testing. So far I have tried with just a generic Session() object and haven't had any memory issues (it stayed around 400MB for a day). I also tried with a generic HTTPAdapter() and it also stayed flat. I am now trying with the Retry() object from urllib3 to see what happens. I am starting with the same settings as the codebase uses to see if it reproduces the same issue, and then I will try playing with those settings to see if there is one setting that is causing this. Just wanted to post to keep the thread alive; hopefully I will have more details very soon.
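For reference, the kind of isolation test described here might look roughly like the sketch below; the retry settings and URL are placeholders, not the SDK's actual values.

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Build a session that mirrors the general shape of the setup under test:
# a Session with an HTTPAdapter whose retries come from urllib3's Retry.
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)

session = requests.Session()
session.mount("http://", adapter)

while True:
    # Poll a local endpoint repeatedly and watch the process RSS over time to
    # see whether this combination alone leaks memory.
    try:
        session.get("http://localhost:3500/health", timeout=5)
    except requests.RequestException:
        pass
    time.sleep(0.1)
```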
Good day @c4lm. We tried what you suggested along with a few other modifications. Here are the results:
We also changed the worker to a single worker (removed the other 9), and oddly, when we ran memray it produced 14 files instead of the 3 I would have expected. The images attached are the graphs from each and the stats.
Hi @rydevops, do you have some time early next week for a working session on this? I would like to spend some time understanding the behavior you are seeing and your setup, and I would like to replicate the same setup to reproduce it.
Hi @rydevops ping!
Hello everyone,
My team seems to have stumbled upon a really weird issue with the conductor-python SDK where the memory of the worker continues to climb until the worker runs out of resources (in our case, our OpenShift cluster OOMKills the pod because its resource utilization exceeds its allocated quota). What is odd is that the timing of the memory increase is not consistent: sometimes it starts as soon as the worker starts, while other times it starts 30 to 45 minutes after the application starts. Additionally, once in a while it seems not to have the issue at all. I have tried tracing through the polling functionality within the SDK and cannot find anything that stands out as the cause of the memory growth. Is anyone else experiencing this issue, and what is the root cause?
This graph shows the example code below running and being restarted. The long flat lines are when we turned it off and were looking at the code. Considering this workflow does absolutely nothing (execute doesn't even get called), I wouldn't expect to see a bunch of spikes in memory; it should just be flat for the most part.
Steps to reproduce:
Sample Code:
```python
from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration
from conductor.client.http.models import Task, TaskResult
from conductor.client.worker.worker_interface import WorkerInterface

configuration = Configuration(
    server_api_url="http://localhost:3500/", debug=True
)


class SillyWorker(WorkerInterface):
    def execute(self, task: Task) -> TaskResult | None:
        return None


workers = [
    SillyWorker(task_definition_name="silly-worker-1"),
    SillyWorker(task_definition_name="silly-worker-2"),
    SillyWorker(task_definition_name="silly-worker-3"),
    SillyWorker(task_definition_name="silly-worker-4"),
    SillyWorker(task_definition_name="silly-worker-5"),
    SillyWorker(task_definition_name="silly-worker-6"),
    SillyWorker(task_definition_name="silly-worker-7"),
    SillyWorker(task_definition_name="silly-worker-8"),
    SillyWorker(task_definition_name="silly-worker-9"),
    SillyWorker(task_definition_name="silly-worker-10"),
    SillyWorker(task_definition_name="silly-worker-11"),
    SillyWorker(task_definition_name="silly-worker-12"),
]


def start_workers():
    # The function body was lost in the original formatting; the usual
    # TaskHandler startup pattern is assumed here.
    task_handler = TaskHandler(workers=workers, configuration=configuration)
    task_handler.start_processes()
    task_handler.join_processes()


if __name__ == "__main__":
    start_workers()
```
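One way to chart the same memory growth locally, outside of OpenShift, is to sample the worker process's RSS from a small sidecar script. The psutil-based sketch below is an assumption for local testing and was not part of the original setup.

```python
import sys
import time

import psutil

# Sample the RSS of a running worker process (PID passed on the command line)
# once per minute and print it, so the values can be plotted later.
proc = psutil.Process(int(sys.argv[1]))
while True:
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print(f"{time.strftime('%H:%M:%S')} rss={rss_mb:.1f} MiB", flush=True)
    time.sleep(60)
```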
Environment: