Add parser support #49

Open
wants to merge 1 commit into master from add_parser_support

Conversation

@spbnick (Contributor) commented Sep 8, 2020

Add an implementation of parse_json() function accepting either text or a text iterator and producing an iterable returning parsed values.

Add a naive implementation of parse_json_file() function accepting a text file object and producing an iterable returning parsed values.

This allows parsing JSON and JSON streams without passing them through a program.

This also adds unpacking jv values into Python values directly, bypassing JSON re-parsing, which is roughly twice as fast.
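
(For illustration, a rough usage sketch of the two proposed functions; the exact signatures are defined in the PR, so treat the call forms below as assumptions:)

import io
import jq  # jq.py with this PR applied

# Parse a complete string holding a stream of JSON values.
for value in jq.parse_json('{"a": 1} [2, 3] "done"'):
    print(value)

# Parse lazily from a text file object.
for value in jq.parse_json_file(io.StringIO('{"a": 1}\n{"b": 2}\n')):
    print(value)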

This is an initial PR fulfilling my requirements so far. I would also like to make the program interface use and/or accept the parser as the source of input data, to reduce code duplication and to support streaming JSON into programs. However, I'd like to get this in sooner and get your feedback on the direction, so I don't put a lot of work into the wrong direction.

I'm thinking maybe we can make the parser optionally return "native" jv values wrapped in Python objects, and make the program interface accept an iterator returning these. What do you think?

Also, is there a reason you don't run Cython in setup.py? I suppose there is, but if that was done, then it would be possible to install jq.py with pip directly from a git repo.

@mwilliamson (Owner)

Also, is there a reason you don't run Cython in setup.py? I suppose there is, but if that was done, then it would be possible to install jq.py with pip directly from a git repo.

The recommendation in the Cython docs is to distribute the .c files:

It is strongly recommended that you distribute the generated .c files as well as your Cython sources, so that users can install your module without needing to have Cython available.

@mwilliamson (Owner)

This pull request seems to have a couple of different changes in, and I'm not sure the tests fully cover the changes. I think it'd be useful to break this apart, and test it a bit more thoroughly. For instance, a good first piece might be the code that transforms jv values into Python values. This could be tested by constructing the jv values in a test for each value type (boolean, integer, float, empty array, simple array, nested array, etc.), and then just running the unpack code to make sure the value that is produced is correct.

@mwilliamson (Owner)

Having complete test coverage also means you can run the tests with valgrind to check for memory leaks.

@spbnick (Contributor, Author) commented Sep 8, 2020

I broke the changes into three separate commits, not sure if you noticed that (some reviewers don't look at those). If that's not small enough, what other parts would you like me to break out?

Sure, I can add more tests. What other tests except ones for _jv_unpack() would you like me to add?

FWIW, I did a few crude runs of this through Valgrind, and caught and fixed a couple of issues already, but we can always run more :) BTW, how do you deal with the huge number of "invalid access" Valgrind messages that come from Python? Do you have a suppression file you use?

@spbnick (Contributor, Author) commented Sep 8, 2020

It is strongly recommended that you distribute the generated .c files as well as your Cython sources, so that users can install your module without needing to have Cython available.

That makes sense, but is unfortunate, as installing from a Git repo could really be useful to start using this sooner. Or do you push to pypi often, perhaps? Sorry, I'm not familiar with that process.

@mwilliamson (Owner)

I broke the changes into three separate commits, not sure if you noticed that (some reviewers don't look at those). If that's not small enough, what other parts would you like me to break out?

I'd prefer to deal with each change in isolation e.g. having one pull request just for the unpacking. For me, it helps with cognitive load, especially when dealing with C (well, Cython).

FWIW, I did a few crude runs of this through Valgrind, and caught and fixed a couple of issues already, but we can always run more :) BTW, how do you deal with the huge number of "invalid access" Valgrind messages that come from Python? Do you have a suppression file you use?

I was (perhaps foolishly!) only checking for leaks, so I just checked that I got a zero back at the end for definitely and indirectly lost.

@mwilliamson (Owner)

That makes sense, but is unfortunate, as installing from a Git repo could really be useful to start using this sooner. Or do you push to pypi often, perhaps? Sorry, I'm not familiar with that process.

I normally update PyPI once the changes are ready.

@spbnick (Contributor, Author) commented Sep 8, 2020

I'd prefer to deal with each change in isolation e.g. having one pull request just for the unpacking. For me, it helps with cognitive load, especially when dealing with C (well, Cython).

I see. Whatever works for you :)

I come from open-source projects where a PR/MR/patchset could be dealing with a feature which needs multiple changes, but those changes have to be broken into logically-independent (and separately working and documented) commits. I.e. the unit of change is the commit, not the PR, and the PR is the unit of discussion. That helps preserve larger context across changes, I think. But then neither GitLab nor GitHub supports CI for separate commits in an MR/PR. Regardless, I'll do a separate PR with the jv_free() and _jv_unpack() commits.

I was (perhaps foolishly!) only checking for leaks, so I just checked that I got a zero back at the end for definitely and indirectly lost.

Yeah, that's what I had to resort to, as well. Perhaps I'll generate a suppression file for the merged version and run with that next time.

@spbnick (Contributor, Author) commented Sep 8, 2020

BTW, @mwilliamson, do you have an idea for how to implement those _jv_unpack() tests? I can't call it from the Python tests. Should I implement _jv_pack() first, perhaps?

@mwilliamson (Owner)

I was imagining something like:

@istest
def jv_true_is_converted_to_python_true():
    jv_value = jv_true()
    
    result = jv_to_python(jv_value)

    assert_equal(True, result)

In other words, building up a jv value using the jv functions (which will need to be added to the Cython file), and then passing that jv value into the function that converts it to Python. I think the jv value will have to be wrapped in a class to allow it to be passed around, meaning there'll need to be a shim around jv_to_python to allow it to be called from Python (and unwrap the jv value). Admittedly ugly, but hopefully straightforward. (I'm no Cython expert, so if there's a way to accomplish the same with less shenanigans, great!)

(I'd suggest jv_to_python is a bit more explicit than _jv_unpack about what's going on.)
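
(For illustration only: a minimal Cython sketch of that shim idea, assuming jv and jv_true() are already declared in jq.pyx and using _jv_unpack as the conversion routine under test; the wrapper class and helper names here are hypothetical, and memory management is elided.)

cdef class _JvValue:
    # Thin wrapper so a raw jv value can be passed around as a Python object.
    # (A real wrapper would also need to jv_free() the value it owns,
    # e.g. in __dealloc__.)
    cdef jv value

def make_jv_true():
    cdef _JvValue wrapped = _JvValue()
    wrapped.value = jv_true()
    return wrapped

def jv_value_to_python(_JvValue wrapped):
    # Unwrap and hand off to the conversion routine under test.
    return _jv_unpack(wrapped.value)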

@spbnick (Contributor, Author) commented Sep 9, 2020

I think exposing functions to build jv values would be mostly useless to library users. To start with, we won't even have a use for any jv values at all in Python, because nothing in the library consumes them. I think an exposed jv_unpack() and the reverse jv_pack() functions (or jv_to_python() and jv_from_python(), if you'd like) could be useful eventually, if we implement jv value output from the parser and jv value input to the program.

I could implement and test those, along with the wrapper class (which I started writing earlier already). However, this is gonna take a while and kinda goes contrary to getting smaller changes merged sooner and iterating faster. The _jv_unpack() implementation is rather trivial, I tested it locally with a ton of data I need to process, and it's new and separate functionality, not affecting anything else. We can add direct _jv_unpack() tests once it makes sense to expose it, but in the meantime we can test it by feeding the parser with more JSON containing various values.

What do you say?

@spbnick (Contributor, Author) commented Sep 9, 2020

and it's new and separate functionality, not affecting anything else

Ah, scratch that, of course. Still, how about testing it by feeding the parser for a start?

@mwilliamson (Owner)

I think exposing functions to build jv values would be mostly useless to library users.

I don't have a problem with having some functions that are only used in tests, seeing as they should be simple functions that just wrap the functionality that's already implemented in jq.

Having said that, if we want to avoid directly testing the conversion function, we could use the identity program with the different input and outputs.

@spbnick (Contributor, Author) commented Sep 9, 2020

I don't have a problem with having some functions that are only used in tests, seeing as they should be simple functions that just wrap the functionality that's already implemented in jq.

Alright :)

Having said that, if we want to avoid directly testing the conversion function, we could use the identity program with the different input and outputs.

I'd love to test the function directly, I just think that there are too many drawbacks for that at the moment, and it's gonna take me a lot of extra effort, of course.

Will do that in the new PR, and let's take it from there :)

@spbnick (Contributor, Author) commented Sep 9, 2020

Pushed #50 as the base for this one!

@spbnick force-pushed the add_parser_support branch 5 times, most recently from c70efdd to be9e197 on September 13, 2020 15:24
@spbnick (Contributor, Author) commented Sep 13, 2020

Rebased, ready for review :)!

@spbnick (Contributor, Author) commented Sep 14, 2020

Ah, rebased again on the latest "comment" commit. Ready for review!

@mwilliamson, I still need this merged to complete the stream parsing feature.

@spbnick force-pushed the add_parser_support branch 2 times, most recently from 52b3762 to efdfee0 on September 15, 2020 15:01
@mwilliamson (Owner)

So the use case here is streaming text in, and getting a stream of JSON values out? I'd have imagined that an existing library would support the use case, although I suppose it's not a super common way of storing JSON.

In any case, a couple of high-level thoughts. It feels like the natural abstraction is to do something similar to the csv module, and create a reader around a file. That might also make it clearer that the parsing is lazy. It'd also be nice to make it clearer what's being parsed -- perhaps by making it clear that it's jv that's being used? So perhaps something like:

for value in jv.reader(fileobj):
    func(value)

@spbnick (Contributor, Author) commented Sep 16, 2020

So the use case here is streaming text in, and getting a stream of JSON values out? I'd have imagined that an existing library would support the use case, although I suppose it's not a super common way of storing JSON.

Yep. And no, there doesn't seem to be an existing Python library which can directly parse JSON out of a stream: https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python

None which don't require reading the whole stream in, or at least the complete text for a single object, and certainly none with a user-base (and thus reliability) and compatibility comparable to jq's. It covers three out of four popular streaming formats: https://en.wikipedia.org/wiki/JSON_streaming#Applications_and_tools
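
(For context, a "JSON stream" here is simply a sequence of JSON values written one after another, newline-delimited or just concatenated/whitespace-separated, for example:)

{"event": "start", "id": 1}
{"event": "stop", "id": 1} ["more", "values"] 42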

In any case, a couple of high-level thoughts. It feels like the natural abstraction is to do something similar to the csv module, and create a reader around a file.

I don't want to require a file for reading a JSON stream. First, it might not come from a file, or even a file descriptor directly. It could be read from a socket, or embedded in some other protocol, for example. In my case a fileobj certainly doesn't work, because you can't read only whatever is available from it: its read() method insists on filling up to the requested size. That doesn't work when processing what comes down a pipe interactively (e.g. processing an object as soon as it's written to the other end). For that I have to use file descriptors and os.read() at the moment:

import os
import jq  # jq.py with this PR applied

def json_load_stream_fd(stream_fd, chunk_size=4*1024*1024):
    """
    Load a series of JSON values from a stream file descriptor.
    
    Args:
        stream_fd:  The file descriptor for the stream to read.
        chunk_size: Maximum size of chunks to read from the file, bytes.

    Returns:
        An iterator returning loaded JSON values.
    """ 
    def read_chunk():
        while True:
            chunk = os.read(stream_fd, chunk_size)
            if chunk:
                yield chunk
            else:
                break

    return jq.parse(bytes_iter=read_chunk())

It could be used like this:

for value in json_load_stream_fd(sys.stdin.fileno()):
    func(value)

That might also make it clearer that the parsing is lazy.

Could be, file-reading interfaces are generally perceived as such. However, I'd rather keep the generic interface available (for above-mentioned reasons) and perhaps add a specific one, doing something similar to the above. E.g. call it parse_fd() or parse_file().

It'd also be nice to make it clearer what's being parsed -- perhaps by making it clear that it's jv that's being used? So perhaps something like:

for value in jv.reader(fileobj):
    func(value)

Yeah, just jq.parse() could be perceived as parsing a program. I'm not sure adding another top-level namespace (jv) is worth it, and I think it could be confusing and detrimental. How about naming the functions jq.json_parse(), jq.json_parse_file(), and so on? Or jq.parse_json(), jq.parse_json_file()?

The jq library does add another top-level namespace with its jv_ names, and we could do that just to make the translation more obvious, but how would that work with packaging and user expectations? Do you want to create a separate package/repo?

@mwilliamson (Owner)

In any case, a couple of high-level thoughts. It feels like the natural abstraction is to do something similar to the csv module, and create a reader around a file.

I don't want to require a file for reading a JSON stream. First, it might not come from a file, or even a file descriptor directly. It could be read from a socket, embedded in some other protocol, for example.

File objects don't necessarily need to be actual files on disk: for instance, sockets can be made into file objects using makefile().

In my case a fileobj certainly doesn't work, because you can't read only whatever is available from it: its read() method insists on filling up to the requested size. That doesn't work when processing what comes down a pipe interactively (e.g. processing an object as soon as it's written to the other end).

If you have a raw file, then I believe read() will read what's available up to the requested size. If you have a buffered file, then I believe read1() does the same (more or less).
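
(For illustration, a minimal sketch of that approach, assuming a buffered binary stream such as sys.stdin.buffer; untested, so the usual caveats apply:)

import sys

def iter_available(buffered, chunk_size=4 * 1024 * 1024):
    # read1() issues at most one underlying read() call and returns
    # whatever bytes are ready, so each value can be processed as soon
    # as it arrives on the pipe.
    while True:
        chunk = buffered.read1(chunk_size)
        if not chunk:
            break
        yield chunk

for chunk in iter_available(sys.stdin.buffer):
    pass  # feed the chunk into the parser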

For that I have to use file descriptors and os.read() at the moment:

def json_load_stream_fd(stream_fd, chunk_size=4*1024*1024):
    """
    Load a series of JSON values from a stream file descriptor.
    
    Args:
        stream_fd:  The file descriptor for the stream to read.
        chunk_size: Maximum size of chunks to read from the file, bytes.

    Returns:
        An iterator returning loaded JSON values.
    """ 
    def read_chunk():
        while True:
            chunk = os.read(stream_fd, chunk_size)
            if chunk:
                yield chunk
            else:
                break

    return jq.parse(bytes_iter=read_chunk())

I think this would simplify to:

def json_load_stream_fd(stream_fd):
    return jq.reader(open(stream_fd, "rb", buffering=0))

Admittedly untested, so I could be entirely wrong!

@spbnick (Contributor, Author) commented Sep 18, 2020

File objects don't necessarily need to be actual files on disk: for instance, sockets can be made into file objects using makefile().

Sure, you can wrap socket fds into file objects as you can any fd (in C as well, with fdopen()).

If you have a raw file, then I believe read() will read what's available up to the requested size. If you have a buffered file, then I believe read1() does the same (more or less).

It wasn't easy to find and figure out, but yes, I can use read1() on an instance of BufferedIOBase (which can e.g. be available in sys.stdin.buffer) to execute just one read() syscall and read only what's ready.

My other point still stands, though: what if I need to do some processing after reading from the file, and before parsing incoming JSON? E.g. to unwrap the JSON data out of some other protocol/format? What if I'm just generating JSON some other way? Do I then need to create an "in-memory binary stream"?

Memory buffers are the most fundamental communication method (let's not go into in and out assembly instructions and memory-mapped registers here), and that's why they're used in the underlying jq library, rather than e.g. file descriptors or FILE objects (with which you can have string streams too). Imagine how much more complicated the jq.pyx file would be if it did that. You can build a file parser on top of a memory-buffer parser easily, but the other way around is not so easy and introduces unnecessary complication and abstraction layers.

So, I say, let's keep the most basic interface still available (jq.parse_json()), and add a file interface on top of that (jq.parse_json_file(), or whatever other names you'd prefer). This way you get the convenience and stream-handling appearance, and I get the flexibility and simplicity 😁

I got stopped in my tracks by interfaces which only accept files so many times in the past, argh, I don't want that to happen again, if I can help it. However, if you're not swayed by my arguments, I can survive with just a file-based parser, even though it won't be pleasant 😂. I need to move on with a PR which depends on this and which only keeps growing. So please tell me which way you'd prefer one last time, and I'll amend the PRs. Thank you 😃

@spbnick (Contributor, Author) commented Sep 18, 2020

Oh, and if we want the program interface to use the parser interface underneath, it will have to communicate via a file. Or we'll have to have two separate parser implementations. See my further PRs for the current implementation.

@spbnick (Contributor, Author) commented Sep 21, 2020

I amended this PR to illustrate my proposal. Now it adds two functions: jq.parse_json() and jq.parse_json_file(). The implementation of the latter is trivial for now (but at the same time similar to json.load()). This is how I propose we have both the generic parsing interface in jq.parse_json() and the file parsing interface in jq.parse_json_file(). I also renamed the _Parser class to _JSONParser to match the function rename.
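
(For reference, the naive file wrapper amounts to roughly this, mirroring how json.load() wraps json.loads(); a sketch of the idea rather than the exact code in the PR:)

def parse_json_file(fp):
    # Naive for now: read the whole file object and hand the text to the
    # generic parser.
    return parse_json(fp.read())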

If we agree on the interface in principle, let's merge this, and improve the jq.parse_json_file() implementation later, and expand its interface, if needed.

If you'd prefer to have only the file parsing interface, I can change the PR to do that.

Meanwhile I'll start using my fork in production, so I can move on with dependent changes in another project.
I'll stop bothering you and let you take your time deciding how you want the parser support done, or even if you want it at all :)

Thank you!

Add an implementation of parse_json() function accepting either text or
a text iterator and producing an iterable returning parsed values.

Add a naive implementation of parse_json_file() function accepting a
text file object and producing an iterable returning parsed values.

This allows parsing JSON and JSON streams without passing them through a
program.
@spbnick force-pushed the add_parser_support branch 2 times, most recently from 1ac6493 to 9a05413 on February 10, 2021 15:53
@spbnick (Contributor, Author) commented Feb 24, 2021

Hi Michael, it's been a while since we discussed this PR. Would you care to take a look again?

I would love to have the parser support in your repo, as that would mean getting it into the PyPI package and much easier deployments for us.

If you'd like only the file interface, without the iterator parser, that would be OK as well. Not ideal, but at least we would have the code we need right now upstream. If you'd like this done completely differently, I can consider that as well :)

@spbnick (Contributor, Author) commented May 30, 2024

Closing this outdated PR for another attempt.
