Replies: 1 comment
We had some discussions. We formalized potential async op outputs as channels, but in the midst of all this I found out that my initial reasoning was wrong. My thinking was that we would need bespoke streaming ops for some one-shot ops (as in …). But there is another way! We could design streaming as …. This means we can propagate the streaming calls entirely to clients. They decide what (and whether) to stream.

I think we can completely ditch op progress, too. It's too vague as it is and not applicable in most cases. We should rely on lower-level APIs which can be wrapped in higher-level ones independently. Progress can be implemented in terms of the stream and the domain-specific knowledge it contains (say, token 1 of 10, or intermediate image 3/5). By that I mean that defining …
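The "progress in terms of the stream" point above can be sketched as follows. This is a minimal illustration, not the actual API: the names `TokenChunk` and `progressOf` are hypothetical, assuming stream items that carry domain-specific position info such as "token 1 of 10".

```cpp
#include <cassert>

// Hypothetical stream item for a token-generating op: the item itself
// carries its position, so no separate progress channel is needed.
struct TokenChunk {
    int index; // 0-based token index
    int total; // total expected tokens ("token 1 of 10")
};

// Progress is derived from the domain-specific stream data,
// not reported through a dedicated op-progress mechanism.
float progressOf(const TokenChunk& c) {
    return float(c.index + 1) / float(c.total);
}
```

The same pattern would work for an image-generating op ("intermediate image 3/5"): progress becomes a client-side interpretation of stream items rather than a separate server-side concept.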
---
Currently op progress is provided via a callback¹. Streaming is not facilitated in any way; it is currently envisioned as a "pull": calling multiple ops.
Neither of these plays well with servers or with the current progress in Acord, especially the streaming.
Streaming
While "pull"-type streaming is OK for an edge app, it's a bad idea for a server. If we propagate the pull to clients, every streamed item adds a full round trip to its latency. If we hide it from clients, we would have to create an entirely different client-server API². In the context of schemas and interfaces, this is prohibitively expensive.
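To make the latency argument concrete, here is a minimal sketch of "pull"-type streaming; `PullStream` and `pullNext` are hypothetical names, not part of the actual API. The client drives the stream by calling an op once per item, so behind a client-server boundary every call is a full round trip.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical "pull"-type streaming: the client invokes an op once per
// item. Locally this is cheap, but on a server deployment each pullNext()
// crosses the network, so per-item latency includes a full round trip.
struct PullStream {
    std::vector<std::string> items;
    size_t next = 0;
    int calls = 0; // op invocations (~ round trips on a server)

    std::optional<std::string> pullNext() {
        ++calls;
        if (next >= items.size()) return std::nullopt;
        return items[next++];
    }
};
```

Draining a stream of N items costs N+1 calls (the last call is how the client learns the stream ended), each one a ping on a server deployment.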
Streaming is easily designed in an asynchronous API. This was the initial version of our Inference API, but we discarded it because it would mean hiding the parallelism from the implementers, and they may come with vastly different needs for it. For now we're keeping the synchronous API as a hard requirement.
How do we deal with streaming then?
(more on this below)
Progress
Op progress is more or less a stream of floating-point values. Whatever decisions we make for streaming op results would likely be applicable to progress as well. Let's keep that in mind when we discuss result streaming. Having a unified solution would be best.
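One way a unified solution could look is a single channel type whose items are either a progress value or a partial result. This is a sketch under that assumption; `StreamItem` and `Channel` are hypothetical names, not the formalized channels mentioned above.

```cpp
#include <deque>
#include <optional>
#include <string>
#include <variant>

// Hypothetical unified stream channel: a progress value is just another
// item type, so one streaming mechanism serves both results and progress.
using StreamItem = std::variant<float, std::string>; // progress | partial result

struct Channel {
    std::deque<StreamItem> queue;

    void push(StreamItem item) { queue.push_back(std::move(item)); }

    // Returns the next item, or nullopt when the channel is drained.
    std::optional<StreamItem> pop() {
        if (queue.empty()) return std::nullopt;
        StreamItem item = std::move(queue.front());
        queue.pop_front();
        return item;
    }
};
```

With this shape, an op that only reports progress and an op that streams partial results use the same plumbing, which is the unification argued for above.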
Ideas
... or rather, notes that are not yet complete ideas
getOpStreamResult
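One possible reading of the `getOpStreamResult` note: the op buffers partial results as it runs, and the caller drains them one at a time until an empty result signals the end. Everything below is a guess at the shape, assuming a synchronous op, not the actual API.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical shape for getOpStreamResult: the op accumulates partial
// results internally; the caller drains them until nullopt marks the end.
class Op {
public:
    void run() {
        // A real op would produce these incrementally (e.g. tokens).
        results_ = {"par", "tial", "done"};
    }

    std::optional<std::string> getOpStreamResult() {
        if (cursor_ >= results_.size()) return std::nullopt;
        return results_[cursor_++];
    }

private:
    std::vector<std::string> results_;
    size_t cursor_ = 0;
};
```

Note that this reading is still "pull"-shaped, so the server latency concern from the Streaming section would apply to it as well.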
Footnotes

¹ ...and our word that it's in the same call stack, but this doesn't really matter. We can remove the requirement with practically no repercussions as long as the calls are not concurrent and the call is synchronous.

² Entirely different from the current Inference API.