
(Potential) Memory leak after exception handling and pipeline restarts #535

Open

clumsy9 opened this issue Mar 4, 2024 · 2 comments

Labels: bug Something isn't working

@clumsy9
Collaborator

clumsy9 commented Mar 4, 2024

During testing, we observed a sudden increase in memory consumption across almost all of our logprep instances:

[screenshot: opensearch-ssl memory graph]

It turned out that we had configured an incorrect TLS certificate for the Opensearch cluster, so the OpensearchOutputConnector instances could not establish connections. This led to various FatalOutputError exceptions (and subsequent pipeline restarts):

2024-02-26 09:29:06,360 Logprep Pipeline 1 ERROR   : FatalOutputError in OpensearchOutput (opensearch) - Opensearch Output: ['os-cluster.opensearch-prod']: ConnectionError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)) caused by: SSLError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006))
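
For context, this is roughly the kind of misconfiguration involved, sketched with a plain opensearch-py client; the host name and CA bundle path below are illustrative and not taken from our configuration:

```python
# Minimal sketch (not logprep code): an opensearch-py client with certificate
# verification enabled fails like above when the configured CA bundle does not
# match the cluster's certificate. Host and path are hypothetical.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=["https://os-cluster.opensearch-prod:9200"],  # illustrative host
    use_ssl=True,
    verify_certs=True,
    ca_certs="/etc/logprep/ca-bundle.pem",  # mismatched CA -> CERTIFICATE_VERIFY_FAILED
)

# Any request then raises a ConnectionError wrapping an SSLError, which the
# output connector surfaces as a FatalOutputError, triggering a pipeline restart.
client.info()
```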

In our core system, we made a similar observation: the memory consumption of our logprep instances started to increase. Although the growth was not as steep as in the test system, some of the pods ran out of memory. We observed this twice, at two different points in time.

[screenshot: opensearch-urllib-mysql memory graph]

[screenshot: opensearch-urllib-mysql memory graph]

A short review revealed that we had network/DNS issues on both occasions. Our pipelines could not reach our Opensearch cluster, which led to a lot of FatalOutputError exceptions and pipeline restarts:

2024-02-25 20:05:44,131 opensearch WARNING : GET https://prod-os-cluster.opensearch:9200/ [status:N/A request:10.008s]
Traceback (most recent call last):
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.pex/installed_wheels/e55d5dac054d07afab930a0d5f3de8475381721e9eca3728fbdda611fa0ed070/opensearch_py-2.4.2-py2.py3-none-any.whl/opensearchpy/connection/http_urllib3.py", line 264, in perform_request
    response = self.pool.urlopen(
               ^^^^^^^^^^^^^^^^^^
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
2024-02-25 20:05:44,134 Logprep Pipeline 14 ERROR   : FatalOutputError in OpensearchOutput (opensearch) - Opensearch Output: ['os-cluster.opensearch']: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution)

It seems that when any of the above exceptions occurs and pipelines need to be restarted, something is not completely freed, which leads to the increasing memory usage. However, it is not yet clear what exactly is causing the memory issues.
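
A minimal diagnostic sketch (not part of logprep; the restart trigger below is a hypothetical stand-in) that could help narrow down which allocations survive repeated pipeline restarts:

```python
# Compare tracemalloc snapshots before and after many simulated restarts to see
# which allocation sites keep growing. restart_pipeline_once() is hypothetical.
import gc
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation
before = tracemalloc.take_snapshot()

for _ in range(100):
    restart_pipeline_once()  # hypothetical: trigger a FatalOutputError and restart

gc.collect()
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)  # top allocation sites that grew across restarts
```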

Expected behavior
Occurrence of the above exceptions and/or pipeline restarts should not cause logprep to consume more memory.

Environment

Logprep version: 2b16c19
Python version: 3.11

@clumsy9 clumsy9 added the bug Something isn't working label Mar 4, 2024
@clumsy9 clumsy9 assigned ppcad and unassigned ppcad Mar 4, 2024
@ekneg54
Collaborator

ekneg54 commented Mar 4, 2024

Thank you for this report.
In my opinion, we should consider redesigning logprep's failure handling.

Restarting processes in failure cases should not be the responsibility of the application.

Instead, we should exit with an appropriate exit code in all critical failure cases.

Restarting the application should be the task of an init system like systemd or of the container runtime.
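
A rough sketch of this pattern (the names, exit code, and pipeline object are illustrative, not logprep's actual API):

```python
# Sketch: on a fatal output error, exit with a non-zero code instead of
# restarting the pipeline in-process, and let systemd (Restart=on-failure)
# or the container runtime restart the whole process, so all memory is
# reliably returned to the OS.
import logging
import sys


class FatalOutputError(Exception):
    """Stand-in for logprep's FatalOutputError, for illustration only."""


EXIT_FATAL_OUTPUT_ERROR = 2  # assumed exit-code convention


def run_pipeline(pipeline) -> None:
    try:
        pipeline.run()  # hypothetical pipeline object with a run() method
    except FatalOutputError as error:
        logging.getLogger("Logprep").error("FatalOutputError: %s", error)
        sys.exit(EXIT_FATAL_OUTPUT_ERROR)
```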

@ekneg54
Collaborator

ekneg54 commented Mar 8, 2024

Possibly the log queue is not closed. This is fixed in logprep 10.0.2.
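
For illustration, closing such a queue on shutdown could look like the sketch below, assuming a multiprocessing.Queue is used for log records (an assumption, not the actual logprep implementation):

```python
# Sketch: drain and close a log queue so its feeder thread and buffers are
# released instead of accumulating with every pipeline restart.
import multiprocessing
import queue


def shutdown_log_queue(log_queue) -> None:
    try:
        while True:
            log_queue.get_nowait()  # discard any remaining records
    except queue.Empty:
        pass
    log_queue.close()        # no more puts; flush buffered data to the pipe
    log_queue.join_thread()  # wait for the feeder thread to exit


if __name__ == "__main__":
    q = multiprocessing.Queue()
    q.put("a log record")
    shutdown_log_queue(q)
```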
