Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fuzzer: excersizes query code on xml doc #551

Merged
merged 3 commits into from
Oct 21, 2023
Merged

Conversation

nathaniel-brough
Copy link
Contributor

@nathaniel-brough nathaniel-brough commented Mar 5, 2023

This change increases code coverage significantly during fuzzing as it excersizes the query code against an arbitrary file. I've also updated the fuzzer to use LLVM's FuzzedDataProvider which makes the fuzzer a little easier to understand/reason about.

@zeux
Copy link
Owner

zeux commented Mar 6, 2023

It would be better if this change didn’t change pugixml.hpp; if the enumeration gets extended it will require changes in fuzzing code anyway so it seems better to explicitly switch/case or hardcode boolean as last enum member in the fuzzing harness.

Also the change should be converted to Allman brace style to fit the convention in the rest of the code.

Copy link
Contributor Author

@nathaniel-brough nathaniel-brough left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the enumeration gets extended it will require changes in fuzzing code

I might be missing something, but it was my intention that this would not be required. As long as kMaxValue is aliased to the last element in the enum it shouldn't require updates to the fuzzer (see comments on ConsumeEnum).

I'm more than happy to refactor all of this slightly to not require changes to the main library header files. Although the alternatives are slightly less readable, and would definitely require updates to the fuzzer if the enum is extended.

std::vector<std::string> var_name_storage = {};
for (size_t i = 0; i < var_count; i++) {
var_name_storage.push_back(fdp.ConsumeRandomLengthString(128));
vars.add(var_name_storage.back().c_str(), fdp.ConsumeEnum<pugi::xpath_value_type>());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kMaxValue (modification to pugixml.hpp) is required ConsumeEnum to work correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity here is where kMaxValue is actually used by FuzzedDataProvider::ConsumeEnum https://github.com/llvm/llvm-project/blob/main/compiler-rt/include/fuzzer/FuzzedDataProvider.h#L298

@nathaniel-brough
Copy link
Contributor Author

Also the change should be converted to Allman brace style to fit the convention in the rest of the code.

Not a problem, do you mind if I add a .clang-format file as a separate commit with the Allman style braces? This way most editors will autoformat to Allman style rather than some other default.

@zeux
Copy link
Owner

zeux commented Mar 7, 2023

I might be missing something, but it was my intention that this would not be required. As long as kMaxValue is aliased to the last element in the enum it shouldn't require updates to the fuzzer (see comments on ConsumeEnum).

Sorry what I meant is that the fuzzing code right now has explicit calls to evaluate_X for each X, which is of course correct. If new types are added, this will expand the surface of the API which would require changing the fuzzing code anyway. So it seems preferable to avoid changing the header in this case and just hardcode the enum selection via an array lookup or something similar.

As an aside, just out of curiosity, is fuzzing efficiency better/worse/the same when using a specific fixed XML document? (eg some existing .xml file from the repository, or we could add a new one for this specific purpose)

I'm wondering this because it feels like a lot of generated documents will not really be valid or interesting XML, and as such a lot of queries would terminate early or not run at all, whereas if we knew the document was valid and had a variety of nodes with different types and structure, we would be able to directly fuzz the evaluation engine on known-interesting document and unknown query.

@nathaniel-brough
Copy link
Contributor Author

Oh yeah I see what you are saying. I'll make some changes so that the header file isn't modified.

As an aside, just out of curiosity, is fuzzing efficiency better/worse/the same when using a specific fixed XML document? (eg some existing .xml file from the repository, or we could add a new one for this specific purpose)

The overall performance increase in code coverage >20%. But it's worth noting that this is using some other changes that I've made in the google/oss-fuzz configuration for this repo (I'll open a PR with oss-fuzz soon).

  • Concatenate fuzz_parse.dict and fuzz_xpath.dict so that the XML and query components are in the dictionary.
  • Merge the fuzz_parse and fuzz_xpath seed corpus

I haven't tested this change as standalone (i.e. without corpus/dictionary updates), but with the fairly minimal modifications (mentioned above), it's a fairly decent improvement.

But you raise a valid point that to really get the best performance a good set of handwritten seed corpus xml files would certainly help cover the most code for a query.

@nathaniel-brough nathaniel-brough force-pushed the master branch 2 times, most recently from 8511086 to 0d8c457 Compare March 10, 2023 01:41
@nathaniel-brough
Copy link
Contributor Author

Ok so this PR no longer changes the pugixml.hpp file and instead fuzzes this by casting a Consumed Integral back to the enum type.

You can find the PR with OSS-fuzz I mentioned earlier here google/oss-fuzz#9900.

I'll have a go at writing a few specific entries for the seed corpus later this week. But with the combination of these two PRs and fuzzing for a couple of hours on my laptop I've been able to increase code coverage by a bit over 20%.

zeux added 2 commits October 21, 2023 11:11
Fix code style, no exceptions, other tweaks.
Remove unused variables
@zeux zeux merged commit 33f7093 into zeux:master Oct 21, 2023
21 checks passed
@zeux
Copy link
Owner

zeux commented Oct 21, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants