-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eager workflow: use event loop instead of asyncio.run #2737
Conversation
Signed-off-by: Niels Bantilan <[email protected]>
flytekit/bin/entrypoint.py
Outdated
@@ -107,7 +107,7 @@ def _dispatch_execute( | |||
if inspect.iscoroutine(outputs): | |||
# Handle eager-mode (async) tasks | |||
logger.info("Output is a coroutine") | |||
outputs = asyncio.run(outputs) | |||
outputs = asyncio.get_event_loop().run_until_complete(outputs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good way to add a unit test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think get_running_loop
is preferred over get_event_loop
, can we try with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment here for more information, but it's the more thread-safe variant... doesn't really come into play here, but i think better practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(You shared a serverless execution. How can I view it?)
Based on the exception in the description, can you explain why the AsyncUnionMetaFS filesystem is failing? More specifically, Why aren't we seeing the exception being bubbled up from there? I'm assuming that there's something weird going on with these two lines in the implementation of the unionmeta fs, but I'm not sure what yet.
Is it easy to repro this? If so, can we try using a Runner in the entrypoint instead?
As a minor issue, we have other uses of asyncio.run
in the codebase. Does it make sense to go over each one and ensure that its use is appropriate? This can be done in a separate PR / investigation.
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
@eapolinario I added a unit test on this PR, the edit: seems like there are flaky unit test failures
Runners are only supported in Python 3.11+ |
Signed-off-by: Niels Bantilan <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2737 +/- ##
==========================================
- Coverage 45.42% 45.40% -0.03%
==========================================
Files 193 194 +1
Lines 19645 19685 +40
Branches 2844 2854 +10
==========================================
+ Hits 8924 8937 +13
- Misses 10274 10301 +27
Partials 447 447 ☔ View full report in Codecov by Sentry. |
eaff0e5
to
26cec32
Compare
@eapolinario I can confirm that the error message
only crops up in serverless (the eager workflow succeeds in flyte sandbox). Going to investigate a solution in the |
Okay merging this PR if we need to to unblock something on serverless, but yeah the solution should be investigated on the union side. I'm still trying to work through some of the async loop semantics, but I think the unit test as it's written is kinda doing the opposite of what we want. I don't fully understand async loop lifecycles yet and how to get them to not conflict, but I feel like in
|
I think this might be relevant: grpc/grpc#32480 (comment) Maybe the original issue was that there was no event loop, so we grabbed it from the loop in the grpc.aio channel? The OP in the issue suggests it might work if we delay instantiation of the channel until we're already in an async context? Like, don't call I think the reason this hasn't come up in regular flytekit is because we just use grpc there, not grpc.aio. I think? |
Why is the unit test you're adding failing in CI? |
No idea why it's flaky... failing tests will pass when I retry it enough times. |
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
@eapolinario I changed the unit testing strategy: basically I tried to recapitulate what the UnionFS classes were doing prior to @thomasjpfan's PR: https://github.com/unionai/unionai/pull/393. Just testing that the event loop set in the test is still available (and is equal to) the event loop after the dispatch call. |
loop = asyncio.new_event_loop() | ||
asyncio.set_event_loop(loop) | ||
asyncio.events.set_event_loop(loop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is setting the event loop safe to do in a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure, but perplexity seems to think so: https://www.perplexity.ai/search/is-asyncio-set-event-loop-safe-MXeyb8O3QPyddzGacsrRqA
A relevant reference is: pytest-dev/pytest-asyncio#80
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even better, the pytest-asyncio library has an event_loop
fixture 👇
Signed-off-by: Niels Bantilan <[email protected]>
) as ctx: | ||
_dispatch_execute(ctx, lambda: eager_wf, "inputs path", "outputs prefix") | ||
loop_after_execute = asyncio.get_event_loop_policy().get_event_loop() | ||
assert event_loop == loop_after_execute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming the fix goes in on the union side, i still feel like this assert should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
posted some thoughts on the other pr, which basically wraps the whole _dispatch_execute call in an event loop so one is there if necessary.
do we have to consider this? https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/asyn.py#L128 the windows specific policy. they're planning on deprecating policies in favor of runners right? but for older versions of python will we need to do something similar? |
Can we merge this? @wild-endeavor @eapolinario @thomasjpfan |
i was planning on merging on monday if that's okay? was going to test it some more with the union lib just to make sure if that's alright. |
still testing this. give me until tomorrow - running into other issues and having trouble repro'ing this original issue in serverless. |
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Why are the changes needed?
The
entrypoint._dispatch_execute
function usedasyncio.run
to run the coroutine. The problem with this is that it closes the event loop, which cause the following error related to task data persistence:Execution on Union serverless
What changes were proposed in this pull request?
Use
get_event_loop().run_until_complete(<coroutine>)
instead so that the event loop isn't closed.How was this patch tested?
Tested on serverless:
Run with: