Make forwardmodelrunner async #9198

jonathan-eq · 2024-11-12T15:41:24Z

Issue
Resolves #9041

Approach
We get a lot of these errors from the forward model runner (on the compute cluster) when we forcefully kill it with a signal and don't let it shut down gracefully; many of them are caused by the websocket connection being dropped suddenly instead of being closed. Making the runner async would allow us to return early from the runner's run_method by calling Task.cancel on it, and to do some cleanup on asyncio.CancelledError.
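A minimal sketch of the cancel-and-clean-up idea described above, not the actual runner code. `run_forward_model` and the `cleaned_up` list are hypothetical stand-ins for the runner's work and for closing the websocket connection.

```python
import asyncio


async def run_forward_model(cleaned_up: list[str]) -> None:
    # Hypothetical stand-in for the runner's long-running work.
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Cleanup would go here, e.g. closing the websocket connection.
        cleaned_up.append("closed connection")
        raise  # re-raise so the task is still marked as cancelled


async def main() -> list[str]:
    cleaned_up: list[str] = []
    task = asyncio.create_task(run_forward_model(cleaned_up))
    await asyncio.sleep(0)  # yield once so the task has started running
    task.cancel()  # request an early return from the runner
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled on purpose
    return cleaned_up


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Re-raising `CancelledError` after the cleanup keeps the task's cancelled state visible to whoever awaits it.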

(Screenshot of new behavior in GUI if applicable)

PR title captures the intent of the changes, and is fitting for release notes.
Added appropriate release note label
Commit history is consistent and clean, in line with the contribution guidelines.
Make sure unit tests pass locally after every commit (git rebase -i main --exec 'pytest tests/ert/unit_tests -n logical -m "not integration_test"')

When applicable

When there are user facing changes: Updated documentation
New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
Large PR: Prepare changes in small commits for more convenient review
Bug fix: Add regression test for the bug
Bug fix: Create Backport PR to latest release

xjules · 2024-11-25T08:34:24Z

src/_ert/forward_model_runner/cli.py

@@ -71,26 +79,26 @@ def _setup_logging(directory: str = "logs"):
 JOBS_JSON_RETRY_TIME = 30


-def _wait_for_retry():
-    time.sleep(JOBS_JSON_RETRY_TIME)
+async def _wait_for_retry():


Wondering if we need this helper function at all?

Yes, we need it for one of the tests. test_job_dispatch.py::test_retry_of_jobs_json_file_read

Hm, this usage of that function is a bit strange though.

We mock it with lock.acquire in a test, so that execution stops here.

xjules · 2024-11-25T08:42:09Z

src/_ert/forward_model_runner/cli.py

+    message_queue: asyncio.Queue[Message],
+    done: asyncio.Event,
+):
+    while not done.is_set() or not message_queue.empty():


not message_queue.empty() looks dangerous. I think this already gets handled here:
job_status = await asyncio.wait_for(message_queue.get(), timeout=2)
Or not?

I wanted it to process all the events in the queue before exiting, but I can rewrite it to be clearer ✏️
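The intent described in this reply, processing everything left in the queue before exiting, could be sketched roughly as below. This is an illustration under assumptions, not the code in the PR: the `drain` function, the string messages, and the short timeout are all made up for the example.

```python
import asyncio


async def drain(
    message_queue: asyncio.Queue[str],
    done: asyncio.Event,
    timeout: float = 0.1,
) -> list[str]:
    handled: list[str] = []
    # Keep going while the producer is still running OR messages remain queued.
    while not done.is_set() or not message_queue.empty():
        try:
            message = await asyncio.wait_for(message_queue.get(), timeout=timeout)
        except asyncio.TimeoutError:
            continue  # nothing arrived in time; re-check the loop condition
        handled.append(message)
    return handled


async def main() -> list[str]:
    message_queue: asyncio.Queue[str] = asyncio.Queue()
    done = asyncio.Event()
    consumer = asyncio.create_task(drain(message_queue, done))
    for message in ("start", "running", "exited"):
        await message_queue.put(message)
    done.set()  # producer is finished; queued messages must still be handled
    return await consumer


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The timeout on `get()` is what lets the loop notice that `done` was set while the queue is empty; without it the consumer could wait forever for a message that never comes.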

This should help the forward model runner shut down more gracefully and remove some of the errors we are seeing in the logs.

xjules · 2024-11-26T11:55:27Z

src/_ert/forward_model_runner/cli.py

+        nonlocal reporters, forward_model_runner_task
+        forward_model_runner_task.cancel()
+        for reporter in reporters:
+            reporter.cancel()


try:
    await reporter
except asyncio.CancelledError:
    pass

or maybe just asyncio.gather(*reporters, return ....)

The signal handler has to be synchronous, but we await the task anyway, so it should be fine.

To shutdown gracefully, this is what chatgpt suggests:

def setup_signal_handlers(loop):
    """Setup signal handlers for graceful shutdown."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        # Bind sig as a default argument so each handler keeps its own signal.
        loop.add_signal_handler(
            sig, lambda sig=sig: asyncio.create_task(shutdown(loop, signal=sig))
        )

wherein shutdown is an async function.
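A runnable sketch of that suggestion, under assumptions: the `shutdown` body, the `_shutdown_tasks` set, and the `demo` function are hypothetical and not taken from the PR. `loop.add_signal_handler` is only available on Unix and must be called from the main thread.

```python
import asyncio
import os
import signal

# Keep references to shutdown tasks so they are not garbage collected early.
_shutdown_tasks: set[asyncio.Task] = set()


async def shutdown(tasks: list[asyncio.Task], sig: signal.Signals) -> None:
    """Cancel all tasks and wait until they have actually finished."""
    for task in tasks:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising.
    await asyncio.gather(*tasks, return_exceptions=True)


def setup_signal_handlers(loop, tasks, signals=(signal.SIGINT, signal.SIGTERM)):
    def _on_signal(sig: signal.Signals) -> None:
        # The handler itself is synchronous; it only schedules the async shutdown.
        task = loop.create_task(shutdown(tasks, sig))
        _shutdown_tasks.add(task)
        task.add_done_callback(_shutdown_tasks.discard)

    for sig in signals:
        # Extra positional arguments are passed on to the callback.
        loop.add_signal_handler(sig, _on_signal, sig)


async def demo() -> list[bool]:
    loop = asyncio.get_running_loop()
    workers = [asyncio.create_task(asyncio.sleep(3600)) for _ in range(2)]
    setup_signal_handlers(loop, workers, signals=(signal.SIGTERM,))
    os.kill(os.getpid(), signal.SIGTERM)  # simulate an external `kill`
    await asyncio.wait_for(
        asyncio.gather(*workers, return_exceptions=True), timeout=5
    )
    loop.remove_signal_handler(signal.SIGTERM)
    return [worker.cancelled() for worker in workers]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Passing `sig` through `add_signal_handler` instead of capturing it in a lambda avoids the late-binding problem where every handler would see the last signal in the loop.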

xjules · 2024-11-26T12:03:23Z

src/_ert/forward_model_runner/reporting/event.py

+        await self._dump_event(fm_checksum)
+
+    def cancel(self) -> None:
+        self._event_publishing_task.cancel()


If this is supposed to be a "blocking" operation, then we should await the task afterwards so that it can finish, i.e.:

self._event_publishing_task.cancel()
try:
    await self._event_publishing_task
except asyncio.CancelledError:
    ...

This also has to be synchronous, as it is used in the event loop's signal handler.

right, I need to look deeper in the loop signal_handler 👍
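One way to reconcile the two points in this thread is to keep `cancel()` synchronous and add a separate coroutine that waits for the task to finish. The sketch below is only an illustration: the `Reporter` class, `_publish`, and `join` are hypothetical names, and only `_event_publishing_task` and `cancel` mirror the diff above.

```python
import asyncio


class Reporter:
    """Hypothetical reporter that owns a background publishing task."""

    def __init__(self) -> None:
        # Must be constructed while an event loop is running.
        self._event_publishing_task = asyncio.create_task(self._publish())

    async def _publish(self) -> None:
        await asyncio.sleep(3600)  # stand-in for publishing events

    def cancel(self) -> None:
        # Synchronous, so it can be called from the loop's signal handler.
        self._event_publishing_task.cancel()

    async def join(self) -> None:
        # Awaited later from async code to make sure the task has finished.
        # Note: this also swallows a cancellation of the caller itself.
        try:
            await self._event_publishing_task
        except asyncio.CancelledError:
            pass


async def main() -> bool:
    reporter = Reporter()
    reporter.cancel()
    await reporter.join()
    return reporter._event_publishing_task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(main()))
```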


jonathan-eq · 2024-11-26T12:33:55Z

src/_ert/forward_model_runner/reporting/event.py

            url=self._evaluator_url,
            token=self._token,
            cert=self._cert,
        ) as client:
            event = None
-            while True:


It was combined with the timeout_timestamp


jonathan-eq force-pushed the main4 branch from 710af8e to f1b9b5f Compare November 13, 2024 06:57

jonathan-eq changed the title ~~Add just command helper tool to repository~~ Make forwardmodelrunner async Nov 13, 2024

jonathan-eq force-pushed the main4 branch 4 times, most recently from 098b7d4 to 784237d Compare November 19, 2024 08:32

jonathan-eq force-pushed the main4 branch 2 times, most recently from 3be3a96 to 4de22d3 Compare November 21, 2024 12:42

jonathan-eq marked this pull request as ready for review November 25, 2024 08:05

jonathan-eq self-assigned this Nov 25, 2024

jonathan-eq added the release-notes:improvement and release-notes:bug-fix labels Nov 25, 2024

xjules reviewed Nov 25, 2024


xjules reviewed Nov 25, 2024


jonathan-eq added 15 commits November 26, 2024 12:35

Refactor forwardmodelrunner to be async

df42399

This should help the forward model runner shut down more gracefully and remove some of the errors we are seeing in the logs.

fix tests

e299ba6

fix more tests

f609f5e

fix even more tests

84e389a

Fix failed realization not being marked as failed

2b8c74a

Add ForwardModelRunnerException class

d2af117

Add more asyncio.sleep()

b4fb94a

Change pattern from generator to asyncio.Queue

e1c15df

Fix tests and remove statemachine

129642d

Fix asyncio.TimeoutError in event.py

5d3cac8

cleanup

fbd1c48

Fix no-such-process error

53f999a

further cleanup

aa5f66e

Fix process_tree test

a2a0b86

Fix tests

d01b408

jonathan-eq added 4 commits November 26, 2024 12:35

Have termination not raise asyncio.CancellationError for every real

22e625a

code review suggestions

c013b93

more cleanup

b797a75

Undo changes asyncio.Event->bool

6b66226

jonathan-eq force-pushed the main4 branch from 1c4959b to 6b66226 Compare November 26, 2024 11:46

jonathan-eq added 2 commits November 26, 2024 12:50

Remove import quotations of AsyncGenerator

408f9d9

Remove rogue prints

157d424

xjules reviewed Nov 26, 2024


jonathan-eq commented Nov 26, 2024


jonathan-eq added 2 commits November 27, 2024 09:24

Bump test_custom_weights_stored_and_retrieved_from_metadata_esmda timeout for simulation -> 60s

443584c

Bump test_field_param_update_using_heat_equation timeout -> 600s

1c46165
