Tolerant mode #7

Open
ales-t opened this issue Dec 17, 2021 · 2 comments

Comments

@ales-t
Owner

ales-t commented Dec 17, 2021

Currently, rjp will stop when encountering any error, such as:

  • Select/rename not finding the required fields in an instance.
  • serde failing to parse an input line.
  • Join not finding the keys to join on in an instance.
  • Merge when the stream lengths are mismatched.
  • ...

Oftentimes, JSON lines files are noisy and contain lines with problems. It would be helpful to have some way to request "tolerant" behavior. For example, rename_field would not change an instance if the input fields are not found. Alternatively, problematic instances may be skipped in the output stream.
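To make the rename example concrete, here is a minimal sketch of strict versus tolerant renaming. The names `Instance`, `rename_strict` and `rename_tolerant` are hypothetical and not taken from rjp, and a flat string map stands in for a parsed JSON object purely to keep the sketch dependency-free.

```rust
use std::collections::HashMap;

/// Stand-in for one parsed JSON line (assumption: a real implementation
/// would presumably operate on parsed JSON values instead).
type Instance = HashMap<String, String>;

/// Strict behavior: a missing source field is an error.
fn rename_strict(mut inst: Instance, from: &str, to: &str) -> Result<Instance, String> {
    match inst.remove(from) {
        Some(value) => {
            inst.insert(to.to_string(), value);
            Ok(inst)
        }
        None => Err(format!("field '{}' not found", from)),
    }
}

/// Tolerant behavior: a missing source field leaves the instance unchanged.
fn rename_tolerant(mut inst: Instance, from: &str, to: &str) -> Instance {
    if let Some(value) = inst.remove(from) {
        inst.insert(to.to_string(), value);
    }
    inst
}
```

With the tolerant variant, an instance that lacks the source field passes through to the output untouched, while the strict variant reports it as an error.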

It's currently not clear to me how the various processors should behave in this tolerant setting, or whether there should be more than one mode (for instance --skip-bad-instances, --stop-on-bad-instance, --keep-bad-instances).

@zouharvi
Collaborator

Sounds like every processor should have its own handling of errors (the default currently being panic). They should be modified by these flags but the question is whether we should also have flags that are specific to certain processors. I'd somewhat prefer if the number of optional command line arguments was kept as low as possible to make the CLI easier to use.

Also maybe they should not be flags but rather enums like --bad-instance {stop,skip,keep}? The three options are mutually exclusive, and we would want to error on a combination such as rjp --skip-bad-instances --stop-on-bad-instance anyway.
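A rough sketch of what the enum-valued option could look like on the parsing side. The type name `BadInstancePolicy` and the error message are assumptions made for illustration; only the three values stop, skip and keep come from the proposal above.

```rust
use std::str::FromStr;

/// Hypothetical policy behind the proposed `--bad-instance {stop,skip,keep}` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BadInstancePolicy {
    /// Abort processing at the first bad instance.
    Stop,
    /// Drop the bad instance from the output stream.
    Skip,
    /// Pass the bad instance through unchanged.
    Keep,
}

impl FromStr for BadInstancePolicy {
    type Err = String;

    /// Accepts exactly one of the three values, so conflicting
    /// combinations cannot be expressed in the first place.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "stop" => Ok(Self::Stop),
            "skip" => Ok(Self::Skip),
            "keep" => Ok(Self::Keep),
            other => Err(format!(
                "invalid value '{}' for --bad-instance (expected stop, skip or keep)",
                other
            )),
        }
    }
}
```

Because the option carries a single value, the conflicting-flags case disappears: there is nothing to validate beyond the value itself.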

What is going to be the default? Would it be --stop-on-bad-instance? It's intuitive, but then we should expect half-processed outputs, which does not seem good.

The count of bad instances should definitely go into the final summary on stderr, which already exists.
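As an illustration of counting bad instances for the summary, here is a small sketch of a skip-and-count loop. The function names and the summary wording are hypothetical; the actual format of rjp's stderr summary is not shown in this thread.

```rust
/// Applies `step` to every input line, skipping lines on which it fails.
/// Returns the successful outputs together with the number of skipped lines.
fn process_skipping_bad<F>(lines: &[&str], step: F) -> (Vec<String>, usize)
where
    F: Fn(&str) -> Result<String, String>,
{
    let mut outputs = Vec::new();
    let mut bad = 0;
    for line in lines {
        match step(line) {
            Ok(value) => outputs.push(value),
            Err(_) => bad += 1,
        }
    }
    (outputs, bad)
}

/// Builds a summary line (hypothetical wording) that a caller could
/// write to stderr with `eprintln!`.
fn summary_line(good: usize, bad: usize) -> String {
    format!("{} instances written, {} bad instances skipped", good, bad)
}
```

A caller would run the loop, emit the outputs on stdout, and pass `summary_line(outputs.len(), bad)` to `eprintln!` at the end.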

@ales-t
Owner Author

ales-t commented Dec 17, 2021

> Sounds like every processor should have its own handling of errors (the default currently being panic).

If you find that rjp panics, please create an issue. I'm under the impression that all errors are now transformed into RjpError and propagated into main in a clean way.

> Sounds like every processor should have its own handling of errors (the default currently being panic). They should be modified by these flags but the question is whether we should also have flags that are specific to certain processors.

I agree. Each processor should have its own way of interpreting the flag, but for simplicity the flag itself should be global. If you need to handle errors differently in different parts of the processing pipeline, you can always call rjp twice and connect the two invocations with a Unix pipe.

> Also maybe they should not be flags but rather enums like --bad-instance {stop,skip,keep}?

I like that solution a lot.
