Improve WQ response stream format #707
will reopen the issue soon.
@pooja1pathak @NEC-Vishal hi guys, this is one of the most cryptic TODOs in our code. My fault: I was the one who wrote it and I don't think it can possibly make sense to anyone but me. Sorry about that! But I'm going to explain what I had in mind here so you guys can decide what to do next. So here goes.

Stream processing techniques in Work Queue

First off, a little digression about Work Queue performance. We designed WQ hoping to get an implementation with decent performance (well, for Python). Our approach was to put streams at the core of the design so that most computations run in linear time and constant space. This is key to getting decent performance, b/c WQ could potentially handle very large datasets, so sucking any such beast into RAM or iterating over it multiple times isn't an option. (I know we do exactly that all over the show in the rest of the code base, but eventually we should rewrite the rest of the code to adopt the same stream-based approach.)

In particular, queries stream data directly from the DB to the client in constant space and linear time. As the DB streams each row in the result set, WQ processes it, converts it to JSON and writes that JSON right away as a text line to the HTTP response stream. In what follows, r[n] is a DB row and j[n] is the corresponding JSON object.
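To make the "one row in, one line out" idea concrete, here is a minimal Python sketch. It is not the WQ code: `to_json_lines`, `fake_db_cursor` and `demo` are hypothetical names made up for illustration, and the fake cursor just stands in for a DB driver that yields rows lazily. The log it builds shows that each row is converted and "sent" before the next one is fetched, so only one row is in flight at any time.

```python
import json
from typing import Any, Iterable, Iterator, List


def to_json_lines(rows: Iterable[Any]) -> Iterator[str]:
    """Lazily turn each row r[n] into its JSON text j[n] (hypothetical helper)."""
    for row in rows:
        yield json.dumps(row)


def fake_db_cursor(n: int, log: List[str]) -> Iterator[dict]:
    """Stand-in for a DB cursor that hands out rows one at a time (assumption)."""
    for i in range(n):
        log.append(f"fetched r[{i}]")
        yield {"id": i}


def demo() -> List[str]:
    """Record the order in which rows are fetched and lines are sent."""
    log: List[str] = []
    for line in to_json_lines(fake_db_cursor(3, log)):
        # A real handler would write `line` to the HTTP response here.
        log.append(f"sent {line}")
    return log


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

The fetch and send entries alternate in the log, which is the property the stream-based design relies on.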
The JSON response is actually an array where each entry contains the JSON corresponding to a DB row. But how can we write a JSON array in the HTTP response body without holding all its entries in memory? One entry at a time? Sorry for the silly humour, but that's actually close to what happens. We start writing the response with a line containing the JSON array start delimiter, `[`.

JSON array streaming algo

Let's flesh out the actual algorithm to convert rows to JSON and stream them to the client as a JSON array. I'm going to use Haskell as a spec language b/c it is very close to maths notation and has the advantage that you can actually run the code. Yes, we're going to put together an executable spec!

Conceptually we'll think of a stream as a sequence of items, sequence as in maths, i.e. a function from the natural numbers to the set of items to stream. This maps cleanly to the Haskell list algebraic data type (think potentially infinite array, Python "iterable", etc.). The items we want to stream are DB rows, and let's call `json` the function that converts a row to its JSON text.

Now we're ready to give an exact definition of the JSON array streaming process we sketched out earlier. The key ingredient is the function `arr`:

    arr rs = "[" : go rs
      where
        go []       = [ "]" ]
        go [r]      = [ json r, "]" ]
        go (r:rest) = (json r ++ ",") : go rest

With that under our belt, all we need is a `write` function that sends the lines one by one, where `send` outputs a single line:

    write [] = send ""
    write (line:rest) = do
        send line
        write rest

Running it on a three-item list gives:

    > write (arr [1, 2, 3])
    [
    1,
    2,
    3
    ]

Notice that b/c Haskell evaluates expressions lazily, the argument to `write` is only computed as `write` consumes it:

    write (arr [1, 2, 3]) =
    write ("[" : go [1, 2, 3]) =
    send "["
    write (go [1, 2, 3]) =
    write ((json 1 ++ ",") : go [2, 3]) =
    write ("1," : go [2, 3]) =
    send "1,"
    write (go [2, 3]) = ...

So if the example list were much longer, each line would still be sent as soon as it is produced, without building the whole list of lines first.

Python implementation

Now you'd think you'd have an easy time translating the above executable spec into Python. In fact, Python 3 comes with iterators and generators that work much like Haskell's lazy lists. But I couldn't come up with a quick way of doing that, so I gave up and hacked together something similar, which is still based on iterators & generators but outputs an extra JSON item in the array.
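For what it's worth, one possible way to mirror the `arr` spec in Python is a generator that keeps a one-item lookahead, so it knows whether a row is the last one before deciding to append the comma. This is only a sketch under assumptions: `json_array_lines` and its `to_json` parameter are hypothetical names, not the functions in `flaskutils.py`, and it hasn't been tried against the real response code.

```python
import json
from typing import Any, Callable, Iterable, Iterator


def json_array_lines(
    rows: Iterable[Any],
    to_json: Callable[[Any], str] = json.dumps,
) -> Iterator[str]:
    """Yield the lines of a JSON array, one row at a time (hypothetical helper).

    Mirrors the Haskell spec: "[" first, then each row's JSON followed by a
    comma except for the last row, then "]".
    """
    yield "["
    it = iter(rows)
    try:
        prev = next(it)          # look ahead by one row
    except StopIteration:
        yield "]"                # empty input: same as `go []`
        return
    for current in it:
        yield to_json(prev) + ","  # `prev` is not the last row
        prev = current
    yield to_json(prev)          # last row: no trailing comma
    yield "]"


if __name__ == "__main__":
    for line in json_array_lines([1, 2, 3]):
        print(line)
```

At most two rows are held at once (the current one and the lookahead), so the space usage stays constant regardless of how many rows the iterable produces.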
@pooja1pathak @NEC-Vishal, the TODO in the code asks if there's any easy way to get rid of that extra JSON item. Another thing we can look into is streamlining the JSON response format, maybe adopting one of the formats Wikipedia mentions.
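As an illustration of what a simpler format could look like, here is a sketch of newline-delimited JSON (also known as JSON Lines), where each row is a complete JSON document on its own line and there are no array delimiters or commas to manage. Whether this is among the formats meant above is an assumption on my part, and `ndjson_lines` and `parse_ndjson` are hypothetical names; note too that switching formats would change what clients have to parse.

```python
import json
from typing import Any, Iterable, Iterator, List


def ndjson_lines(rows: Iterable[Any]) -> Iterator[str]:
    """Yield one newline-terminated JSON document per row (hypothetical helper)."""
    for row in rows:
        # json.dumps without `indent` escapes newlines inside strings,
        # so every document stays on a single line.
        yield json.dumps(row) + "\n"


def parse_ndjson(body: str) -> List[Any]:
    """Client-side counterpart: parse each non-empty line independently."""
    return [json.loads(line) for line in body.splitlines() if line]


if __name__ == "__main__":
    body = "".join(ndjson_lines([{"a": 1}, {"a": 2}]))
    print(body, end="")
    print(parse_ndjson(body))
```

With this shape the producer never needs a lookahead, since no row has to know whether it is the last one.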
https://github.com/orchestracities/ngsi-timeseries-api/blob/master/src/wq/ql/flaskutils.py#L16