Add parser support #49
Conversation
The recommendation in the Cython docs is to distribute the generated C files.
This pull request seems to have a couple of different changes in it, and I'm not sure the tests fully cover the changes. I think it'd be useful to break this apart and test it a bit more thoroughly. For instance, a good first piece might be the code that transforms jv values into Python values. This could be tested by constructing the jv values in a test for each value type (boolean, integer, float, empty array, simple array, nested array, etc.), and then just running the unpack code to make sure the value that is produced is correct.
Having complete test coverage also means you can run the tests with valgrind to check for memory leaks.
I broke the changes into three separate commits; not sure if you noticed that (some reviewers don't look at those). If that's not small enough, what other parts would you like me to break out?

Sure, I can add more tests. What other tests, except ones for …, would you suggest?

FWIW, I did a few crude runs of this through Valgrind, and caught and fixed a couple of issues already, but we can always run more :)

BTW, how do you deal with the huge amount of "invalid access" Valgrind messages that come from Python? Do you have a suppression file you use?
That makes sense, but is unfortunate, as installing from a Git repo could really be useful to start using this sooner. Or do you push to PyPI often, perhaps? Sorry, I'm not familiar with that process.
I'd prefer to deal with each change in isolation, e.g. having one pull request just for the unpacking. For me, it helps with cognitive load, especially when dealing with C (well, Cython).
I was (perhaps foolishly!) only checking for leaks, so I just checked that I got zero back at the end for "definitely lost" and "indirectly lost".
I normally update PyPI once the changes are ready.
I see. Whatever works for you :) I come from open-source projects where a PR/MR/patchset could be dealing with a feature which needs multiple changes, but those changes have to be broken into logically independent (and separately working and documented) commits. I.e. the unit of change is the commit, not the PR, and the PR is a discussion unit. That helps preserve larger context across changes, I think. But then neither GitLab nor GitHub supports CI for separate commits in an MR/PR.

Regardless, I'll do a separate PR with the unpacking first.
Yeah, that's what I had to resort to, as well. Perhaps I'll generate a suppression file for the merged version and run with that next time.
BTW, @mwilliamson, do you have an idea for how to implement those tests?
I was imagining something like:

```python
@istest
def jv_true_is_converted_to_python_true():
    jv_value = jv_true()
    result = jv_to_python(jv_value)
    assert_equal(True, result)
```

In other words, building up a jv value using the jv functions (which will need to be added to the Cython file), and then passing that jv value into the function that converts it to Python. I think the jv value will have to be wrapped in a class to allow it to be passed around, meaning there'll need to be a shim around … (I'd suggest …).
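For concreteness, here's a minimal Python-level sketch of what such a shim could look like. Everything below is hypothetical and illustrative: JvValue, the jv_true() shim, and this jv_to_python() are stand-ins, not existing jq.py code.

```python
# Hypothetical sketch only: in the real Cython module, JvValue would hold
# an actual C-level jv struct, and jv_true()/jv_to_python() would call
# into jq's C API. Plain Python objects stand in for the handles here.
class JvValue:
    """Opaque wrapper so a jv handle can be passed around in Python."""
    def __init__(self, handle):
        self._handle = handle


def jv_true():
    # Test-only shim mirroring jq's C-level jv_true() constructor.
    return JvValue(True)


def jv_to_python(jv_value):
    # Stand-in for the conversion under test; the Cython version would
    # dispatch on the jv kind and build the matching Python object.
    return jv_value._handle


assert jv_to_python(jv_true()) is True
```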
I think exposing functions to build jv values would be mostly useless to library users. To start with, we won't even have a use for any jv values at all in Python, because nothing in the library consumes them. I think an exposed …

I could implement and test those, along with the wrapper class (which I started writing earlier already). However, this is gonna take a while and kinda goes contrary to getting smaller changes merged sooner and iterating faster. The …

What do you say?
Ah, scratch that, of course. Still, how about testing it by feeding the parser, for a start?
I don't have a problem with having some functions that are only used in tests, seeing as they should be simple functions that just wrap the functionality that's already implemented in jq. Having said that, if we want to avoid directly testing the conversion function, we could use the identity program with different inputs and outputs.
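As a rough illustration of the identity-program approach, a sketch assuming jq.py's compile()/input() API (not part of this PR):

```python
import jq

def assert_round_trip(value):
    # Feed the value through the identity program; the output should be
    # an equal Python value, exercising both packing and unpacking.
    assert jq.compile(".").input(value).first() == value

for value in [True, 42, 1.5, "text", [], [1, [2, 3]], {"key": None}]:
    assert_round_trip(value)
```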
Alright :)
I'd love to test the function directly, I just think that there are too many drawbacks for that at the moment, and it's gonna take me a lot of extra effort, of course. Will do that in the new PR, and let's take it from there :)
Force-pushed dd73a5a to add7be8
Pushed #50 as the base for this one!
Force-pushed c70efdd to be9e197
Rebased, ready for review! :)
Force-pushed be9e197 to 6b5af09
Ah, rebased again on the latest "comment" commit. Ready for review! @mwilliamson, I still need this merged to complete the stream parsing feature.
Force-pushed 52b3762 to efdfee0
So the use case here is streaming text in, and getting a stream of JSON values out? I'd have imagined that an existing library would support the use case, although I suppose it's not a super common way of storing JSON.

In any case, a couple of high-level thoughts. It feels like the natural abstraction is to do something similar to the …:

```python
for value in jv.reader(fileobj):
    func(value)
```
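The reader-style shape being suggested matches iterator-over-file APIs already in the standard library; csv.reader is a close analogue, shown here purely for comparison:

```python
import csv
import io

# csv.reader takes a file-like object and yields one parsed record at a
# time, the same consumption pattern proposed for JSON values above.
fileobj = io.StringIO("a,b\n1,2\n")
for row in csv.reader(fileobj):
    print(row)  # ['a', 'b'] then ['1', '2']
```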
Yep. And no, there doesn't seem to be an existing Python library which can directly parse JSON out of a stream: https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python

None which don't require reading the whole stream in, or at least the complete text for a single object, and certainly none with a user base (and thus reliability) and compatibility comparable to jq's. jq covers three out of four popular streaming formats: https://en.wikipedia.org/wiki/JSON_streaming#Applications_and_tools
I don't want to require a file for reading a JSON stream. First, it might not come from a file, or even a file descriptor directly. It could be read from a socket, or be embedded in some other protocol, for example. In my case, a …

```python
import os

import jq  # assumes the proposed jq.parse(bytes_iter=...) interface

def json_load_stream_fd(stream_fd, chunk_size=4*1024*1024):
    """
    Load a series of JSON values from a stream file descriptor.

    Args:
        stream_fd:  The file descriptor for the stream to read.
        chunk_size: Maximum size of chunks to read from the file, bytes.

    Returns:
        An iterator returning loaded JSON values.
    """
    def read_chunk():
        # Yield chunks until os.read() reports end of stream.
        while True:
            chunk = os.read(stream_fd, chunk_size)
            if chunk:
                yield chunk
            else:
                break
    return jq.parse(bytes_iter=read_chunk())
```

It could be used like this:

```python
import sys

for value in json_load_stream_fd(sys.stdin.fileno()):
    func(value)
```
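As a side note on the chunk generator, the two-argument form of the built-in iter() expresses the same read-until-empty loop more compactly; an equivalent sketch:

```python
import os
import sys

CHUNK_SIZE = 4 * 1024 * 1024

# iter(callable, sentinel) calls os.read() until it returns the sentinel
# b"" (end of stream), yielding each non-empty chunk along the way.
chunks = iter(lambda: os.read(sys.stdin.fileno(), CHUNK_SIZE), b"")
```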
Could be; file-reading interfaces are generally perceived as such. However, I'd rather keep the generic interface available (for the above-mentioned reasons) and perhaps add a specific one, doing something similar to the above. E.g. call it …
Yeah, just …

Although the jq library does add another top-level namespace with its …
File objects don't necessarily need to be actual files on disk: for instance, sockets can be made into file objects using socket.makefile().
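For example (standard library only; the connection target is a placeholder):

```python
import socket

# makefile() returns a file object backed by the socket, so anything that
# consumes file objects can read from the connection directly.
sock = socket.create_connection(("example.com", 80))
fileobj = sock.makefile("rb")
```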
If you have a raw file descriptor, then I believe open() will accept it directly.
I think this would simplify to:

```python
def json_load_stream_fd(stream_fd):
    return jq.reader(open(stream_fd, "rb", buffering=0))
```

Admittedly untested, so I could be entirely wrong!
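One caveat worth noting with the open()-on-a-descriptor approach: by default, open() takes ownership of the descriptor and closes it along with the file object. A small sketch:

```python
import sys

# closefd=False leaves the underlying descriptor open after the file
# object is closed, which matters if the caller still needs the fd.
fileobj = open(sys.stdin.fileno(), "rb", buffering=0, closefd=False)
```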
Sure, you can wrap socket fds into file objects as you can any fd (in C as well, with fdopen()).
It wasn't easy to find and figure out, but yes, I can use …

My other point still stands, though: what if I need to do some processing after reading from the file, and before parsing incoming JSON? E.g. to unwrap the JSON data out of some other protocol/format? What if I'm just generating JSON some other way? Do I then need to create an "in-memory binary stream"? Memory buffers are the most fundamental communication method (let's not go into …).

So, I say, let's keep the most basic interface still available (…). I got stopped in my tracks by interfaces which only accept files so many times in the past, argh, I don't want that to happen again, if I can help it.

However, if you're not swayed by my arguments, I can survive with just a file-based parser, even though it won't be pleasant 😂. I need to move on with a PR which depends on this and which only keeps growing. So please tell me which way you'd prefer one last time, and I'll amend the PRs. Thank you 😃
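For the in-memory case being discussed, io.BytesIO is the stock way to present a byte buffer as a binary file object; a sketch:

```python
import io

# Wrap already-extracted JSON bytes in a file-like object, so a
# file-based parser interface (like the one proposed in this thread)
# could consume them without a real file or descriptor.
buffer = io.BytesIO(b'{"a": 1}\n{"b": 2}\n')
print(buffer.read(8))  # behaves like a real binary file
```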
Oh, and if we want the program interface to use the parser interface underneath, it will have to communicate via a file; or we'll have to have two separate parser implementations. See my further PRs for the current implementation.
Force-pushed efdfee0 to 1cf3225
I amended this PR to illustrate my proposal. Now it adds two functions: parse_json() and parse_json_file().

If we agree on the interface in principle, let's merge this, and improve the naive parse_json_file() implementation afterwards. If you'd prefer to have only the file parsing interface, I can change the PR to do that.

Meanwhile I'll start using my fork in production, so I can move on with dependent changes in another project. Thank you!
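For reference, usage of the two functions as described in this PR would look roughly like this; a sketch against the branch, with the exact argument spelling assumed, not released jq.py:

```python
import jq

# parse_json() accepts text (or a text iterator) and yields parsed values.
for value in jq.parse_json('{"a": 1} [1, 2] "three"'):
    print(value)

# parse_json_file() accepts a text file object instead.
with open("stream.json") as stream:
    for value in jq.parse_json_file(stream):
        print(value)
```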
Force-pushed 1ac6493 to 9a05413
Force-pushed 9a05413 to 6b8bf75
Hi Michael, it's been a while since we discussed this PR. Would you care to take a look again? I would love to have the parser support in your repo, as that would mean the PyPI package and much easier deployments for us. If you'd like only the file interface, without the iterator parser, that would be OK as well. Not ideal, but at least we would have the code we need right now upstream. If you'd like this done completely differently, I can consider that as well :)
Closing this outdated PR for another attempt.
Add an implementation of parse_json() function accepting either text or a text iterator and producing an iterable returning parsed values.
Add a naive implementation of parse_json_file() function accepting a text file object and producing an iterable returning parsed values.
This allows parsing JSON and JSON streams without passing them through a program.
This also adds unpacking jv values into Python values directly, bypassing JSON re-parsing, which is roughly twice as fast.
This is an initial PR fulfilling my requirements so far. I would also like to make the program interface use and/or accept the parser as the source of input data, to reduce code duplication and to support streaming JSON into programs. However, I'd like to get this in faster and also get your feedback on the direction, so I don't have to put a lot of work in the wrong direction.
I'm thinking maybe we can make the parser optionally return "native" jv values wrapped in Python objects, and make the program interface accept an iterator returning these. What do you think?
Also, is there a reason you don't run Cython in setup.py? I suppose there is, but if that were done, it would be possible to install jq.py with pip directly from a git repo.
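For context on that last question, the pattern the Cython docs recommend is to ship the generated C file and use Cython only when it's available; a hedged sketch of such a setup.py, not jq.py's actual one:

```python
from setuptools import setup, Extension

try:
    # When Cython is installed (e.g. a git checkout), build from the .pyx.
    from Cython.Build import cythonize
    extensions = cythonize([Extension("jq", ["jq.pyx"])])
except ImportError:
    # Otherwise fall back to the generated C file shipped in the sdist.
    extensions = [Extension("jq", ["jq.c"])]

setup(name="jq", ext_modules=extensions)
```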