Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to a string.replace approach #50

Closed
wants to merge 2 commits into from

Conversation

swansontec
Copy link

This approach is vastly simpler and faster, since it leverages the regular expression engine to to most of the work.

To handle comments inside of strings, simply search for strings as well, but then leave the strings as-is in the replacer function.

The benchmarks do run quite a bit faster:

  • strip JSON comments is 80% faster
  • strip JSON comments without whitespace is 180% faster
  • strip Big JSON comments is 50% faster

This approach is vastly simpler and faster, since it leverages the regular expression engine to to most of the work.

To handle comments inside of strings, simply search for strings as well, but then leave the strings as-is in the replacer function.

The benchmarks do run quite a bit faster:

- strip JSON comments is 80% faster
- strip JSON comments without whitespace is 180% faster
- strip Big JSON comments is 50% faster
@sindresorhus
Copy link
Owner

While I appreciate the pull request, I am unfortunately not going to accept this for multiple reasons:

  1. This package is already more than fast enough for what it's used for.
  2. The parser already works well and is more maintainable.
  3. Every change is a risk. There might be untested cases that the parser handles that the regex misses.
  4. But most importantly, regexes have a lot of problems. I have gotten an endless amount of ReDoS vulnerability reports in the past year for seemingly simple regexes and I don't want to risk that here. I am actually often doing the opposite of what's done here, moving an existing regex to a simple parser.

@swansontec
Copy link
Author

swansontec commented Nov 4, 2021

Thanks, I can appreciate this reasoning, and respect your decision.

However, regarding point 3, the current implementation still has at least two existing bugs not covered by tests:

  • The first bug is that malformed input like [] /* will be stripped to just [] , which is now valid JSON. This library should leave the invalid comment as-is, so users get a proper error.
  • The second bug is that a plain \r should terminate line comments (//), as it does in the ECMA-262 specification. Few modern editors use the \r line ending (it was an old Mac thing), but this is still a bug.

However, the new regular expression version covers these cases without any additional effort, because regular expressions are basically just a transliteration of comment grammar and sting grammar in those respective specs.

So, besides being shorter & simpler, this regular expression is actually safer and less bug-prone that what's already there.

Unfortunately, regular expressions can sometimes be slow. It's unfortunate that we live in a world where "bad performance" can now mean "security flaw", but here we are... Nevertheless, I did design the regular expression with performance in mind, so it doesn't contain any exponential cases. However, after digging deeper, it now seems that even quadratic slowdowns are getting the ReDoS label these days. I do see one case where that can happen, so I will go ahead and address that, even if you don't want to accept this PR.

@swansontec
Copy link
Author

swansontec commented Nov 4, 2021

I have pushed a fixup commit, to document the needed changes.

This regular expression now executes in linear time regardless of input text.

Why? Well, once the engine picks one of the three paths (" | // | /*), it enters a state every possible input character (including end-of-string) is a valid path forward towards a match. Therefore, there will be no situation where we start down a path, realize it doesn't match at the end, and then back-track.

In the case of strings, I just made the final quote optional. That way, if it's missing, or if the last character in the file is a dangling \, we can still treat that as a string match. The three choices inside the string body are mutually exclusive (quote, backslash, or anything else), so the path forward is unambiguous with no need for internal back-tracking.

Line comments were already safe, since the final newline was never part of the match.

Block comments are basically the same as strings. As with strings, my only change was to make the final */ optional, so an unterminated block quote will just match to the end of the file. The two options inside the block quote are mutually exclusive (any non-star character, or a star followed by a non-slash character), so there is no internal back-tracking.

Since the regular expression now matches broken block quotes, I have to filter those out too, just like quoted strings. Icky, but necessary.

Finally, I added tests & benchmarks for these cases.

I know you don't plan to merge this, but I sill want it to be correct & of high quality.

@sindresorhus
Copy link
Owner

The first bug is that malformed input like [] /* will be stripped to just [] , which is now valid JSON. This library should leave the invalid comment as-is, so users get a proper error.

Thanks for catching that: #51

The second bug is that a plain \r should terminate line comments (//), as it does in the ECMA-262 specification. Few modern editors use the \r line ending (it was an old Mac thing), but this is still a bug.

\r is dead. I don't plan to add any fix/workaround for it.

@sindresorhus
Copy link
Owner

I'm happy to link to your version if you decide to decide to publish it as separate package :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants