Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for cancellation of potentially long-running operations #86

Open
aryanjassal opened this issue Dec 13, 2024 · 5 comments
Open
Labels
development Standard development

Comments

@aryanjassal
Copy link
Member

Specification

Currently, there is no way to cancel or abort an operation after it has been started. This is especially an issue for commands which can run for an arbitrarily long time. For example, the efs.readFile() can take extremely long to complete depending on the file.

If this operation is being used in, say, a RPC call, and the RPC call times out, then the operation will continue to run in the background. This will keep a connection and a file descriptor open in the background and not do anything with it, resulting in leaked resources and blocking the entire application from shutting down.

The easiest way to achieve this is to accept either a context object from js-contexts, or just accept an AbortController.signal.

Additional context

Tasks

  1. Implement support for abortion or cancellation
  2. Add tests to confirm this behaviour
@aryanjassal aryanjassal added the development Standard development label Dec 13, 2024
Copy link

linear bot commented Dec 13, 2024

Copy link
Member Author

This is pretty important to Polykey#846 as a lot of operations rely on the efs. From cloning vaults to reading secrets. Some secrets can be like PDFs, which can be extremely large. As such, ability to cancel them is critical.

I won't say that this is blocking to the PR, but this would need to be addressed to properly implement cancellation and avoid open file descriptors blocking connections.

@CMCDragonkai
Copy link
Member

Are you sure that EFS is what is taking the longest time? It's just a DB read of several blocks. And yes, if it is a large file it could take some time, but we are unlikely to have large files.

Furthermore, this may cause some delay in in other parts of the application, but I do not think EFS is likely cause of the problem.

Of course propagating the signal downwards is a good idea in general for any long running async operation, but it shouldn't be done without some comprehensive tests.

Copy link
Member Author

I remember a few months ago we were trying to upload a fairly large PDF in the vault, but it timed out. What would happen in this timeout? Won't a timed out RPC leave the operation in kind of a danging state and cause resource leaks? Or is that handled some other way in the handler?

Well, perhaps if i implement abortion after performing an efs operation, that might work on stopping the handler properly. Of course, when we add a context or a signal, then a comprehensive testing suite will be added to ensure this behaviour works as expected and doesn't break anything else in the process.

@CMCDragonkai
Copy link
Member

Well that's why I was talking about timer for long running operations is reset upon successful response.

So the problem with uploads is that client streaming which seems like the right RPC call, the response may be waiting for the client to finish.

So I believe we consider it to be a per-message timeout. The timeout should reset whether a message is sent or received.

This makes it a universal standard behaviour whether it is unary, client streaming, server streaming, duplex streaming.

This means when you do a client.methods.x({ timeout: ... }), that timeout has a different meaning than regular function calls.

However on the other hand, if we want to preserve the original ctx meaning of that "this operation" is completed. Then the issue is that for many streaming contexts, it has to be infinite, and instead you might swap for an idle timeout.

The reason for this is that the timeout is usually passed down to downstream operations on the handler side. So if the meaning changes from idle timeout to operational timeout, that can get very confusing.

However it's also important that this is different from transport level idle timeout.

The exact design isn't settled right now and I think we need to actually conclude on the proper design here: https://github.com/MatrixAI/MatrixAI-Graph/issues/43

Use chatgpt to help design a solution that makes sense, make suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development
Development

No branches or pull requests

2 participants