Save the corpus and use later as seed #633

Open
dank-cruise opened this issue Oct 12, 2023 · 4 comments

@dank-cruise

dank-cruise commented Oct 12, 2023

Hi there!

A. Is it possible to save or dump the corpus that's been found so far? E.g. when I terminate the fuzzing run, it should save the corpus that's been discovered so far. Presumably the corpus path would be a command line flag.
B. When I fuzz the same target again later, using the same Domains and all that, can I reuse a previously saved corpus?

Obviously, this is not a new idea. For example, Chromium fuzzing talks about it.

A on its own is useful, even if B isn't done. I think it would be very useful to take the corpus from A, create a unit test for every corpus element, and add those tests to continuous integration and pre-commit testing.
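
A minimal sketch of the "one check per corpus element" idea, assuming each saved corpus file can be fed to the code under test as raw bytes (FuzzTest's on-disk format may need decoding first, which is not shown). `ReadFile` and `ReplayCorpus` are hypothetical helper names, not part of FuzzTest:

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Reads a whole file as raw bytes.
std::string ReadFile(const fs::path& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

// Hypothetical helper: calls `target` once per regular file in `corpus_dir`
// and returns how many files were replayed. A missing directory yields 0.
std::size_t ReplayCorpus(
    const fs::path& corpus_dir,
    const std::function<void(const std::string&)>& target) {
  std::size_t count = 0;
  if (!fs::is_directory(corpus_dir)) return count;
  for (const auto& entry : fs::directory_iterator(corpus_dir)) {
    if (!entry.is_regular_file()) continue;
    target(ReadFile(entry.path()));
    ++count;
  }
  return count;
}
```

In a CI setup, a regular unit test could call `ReplayCorpus` with the function under test and fail if any input crashes or trips an assertion.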

@dank-cruise dank-cruise changed the title Save/dump corpus and use later as seed Save the corpus and use later as seed Oct 12, 2023
@irowebbn

There appears to be a command line flag for this (I found it by running my test binary with the --helpfull flag).

--corpus_database (The directory containing all corpora for all fuzz tests
      in the project. For each test binary, there's a corresponding
      <binary_name> subdirectory in `corpus_database`, and the <binary_name>
      directory has the following structure: (1) For each fuzz test
      `SuiteName.TestName` in the binary, there's a sub-directory with the name
      of that test ('<binary_name>/SuiteName.TestName'). (2) For each fuzz test,
      there are three directories containing `regression`, `crashing`, and
      `coverage` directories. Files in the `regression` directory will always be
      used. Files in `crashing` directory will be used when
      --reproduce_findings_as_separate_tests flag is true. And finally, all
      files in `coverage` directory will be used when --replay_coverage_inputs flag is
      true.); default: "~/.cache/fuzztest";

Unfortunately, I have not been able to get it to work.
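
For reference, the layout described in the flag text can be sketched as follows. This only mirrors the directory names quoted in the help output; `CreateCorpusLayout` is a hypothetical helper, not something FuzzTest provides, and whether FuzzTest expects exactly these directories to pre-exist is an assumption:

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: creates
//   <corpus_database>/<binary_name>/<SuiteName.TestName>/{regression,crashing,coverage}
// and returns the three directories in that order.
std::vector<fs::path> CreateCorpusLayout(const fs::path& corpus_database,
                                         const std::string& binary_name,
                                         const std::string& test_name) {
  const std::vector<std::string> kinds = {"regression", "crashing", "coverage"};
  std::vector<fs::path> created;
  for (const std::string& kind : kinds) {
    fs::path dir = corpus_database / binary_name / test_name / kind;
    fs::create_directories(dir);  // No-op if the directory already exists.
    created.push_back(dir);
  }
  return created;
}
```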

@racko

racko commented Jan 19, 2024

There is an undocumented environment variable that helps us along one step: FUZZTEST_TESTSUITE_OUT_DIR

$ FUZZTEST_TESTSUITE_OUT_DIR=/some/path my_fuzztest --fuzz My.Test

will create /some/path and create lots of beautiful corpus files in it.

FUZZTEST_TESTSUITE_IN_DIR could be used in the same way to reuse the corpus later. (This is a separate mechanism from the --corpus_database stuff.)

However, the directory structure described in the --corpus_database flag documentation is not created.
As a workaround, you can create the directory structure yourself, e.g. by running

$ FUZZTEST_TESTSUITE_OUT_DIR=~/.cache/fuzztest/<binary_name>/SuiteName.TestName/coverage <binary_name> --fuzz SuiteName.TestName

Later, to use the corpus, run

$ <binary_name> --fuzz SuiteName.TestName --corpus_database ~/.cache/fuzztest --replay_coverage_inputs

You cannot skip the --corpus_database ~/.cache/fuzztest argument: fuzztest does try to use ~/.cache/fuzztest as a default, but this doesn't actually work because ~ is not resolved by the C++ library code. Your shell, however, does expand it when you pass the argument on the command line.
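
To illustrate why the default fails: file APIs treat `~` as a literal directory name, so something has to expand it explicitly. A minimal sketch of such an expansion, assuming a POSIX-style `HOME` environment variable (`ExpandLeadingTilde` is a hypothetical helper, not FuzzTest code):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: expands a leading "~" or "~/" using $HOME, roughly as
// a POSIX shell would. Other forms (e.g. "~user/...") and paths without a
// leading tilde are returned unchanged, as is everything when HOME is unset.
std::string ExpandLeadingTilde(const std::string& path) {
  if (path.empty() || path[0] != '~') return path;
  if (path.size() > 1 && path[1] != '/') return path;
  const char* home = std::getenv("HOME");
  if (home == nullptr) return path;
  return std::string(home) + path.substr(1);
}
```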

As far as I can tell, we cannot make fuzztest write samples to the --corpus_database just by passing the argument. The path is exclusively used in

  std::string binary_corpus = absl::StrCat(
      absl::GetFlag(FUZZTEST_FLAG(corpus_database)), "/", binary_identifier);
  if (getenv("TEST_SRCDIR")) {
    binary_corpus = absl::StrCat(getenv("TEST_SRCDIR"), "/", binary_corpus);
  }
  return internal::Configuration{
      .corpus_database = internal::CorpusDatabase(
          binary_corpus, absl::GetFlag(FUZZTEST_FLAG(replay_coverage_inputs)),
          absl::GetFlag(FUZZTEST_FLAG(reproduce_findings_as_separate_tests))),

to create a CorpusDatabase object:
class CorpusDatabase {
 public:
  explicit CorpusDatabase(absl::string_view database_path,
                          bool use_coverage_inputs, bool use_crashing_inputs)
      : database_path_(std::string(database_path)),
        use_coverage_inputs_(use_coverage_inputs),
        use_crashing_inputs_(use_crashing_inputs) {}

  // Returns set of all regression inputs from `corpus_database` for a fuzz
  // test.
  std::vector<std::string> GetRegressionInputs(
      absl::string_view test_name) const;

  // Returns set of all corpus inputs from `corpus_database` for a fuzz test.
  // Returns an empty set when `use_coverage_inputs_` is false.
  std::vector<std::string> GetCoverageInputsIfAny(
      absl::string_view test_name) const;

  // Returns set of all crashing inputs from `corpus_database` for a fuzz test.
  // Returns an empty set when `use_crashing_inputs_` is false.
  std::vector<std::string> GetCrashingInputsIfAny(
      absl::string_view test_name) const;

 private:
  std::string database_path_;
  bool use_coverage_inputs_ = false;
  bool use_crashing_inputs_ = false;
};

As you can see, CorpusDatabase has no public API for reading database_path_, which would be necessary to write new corpus files to it.
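
Purely as an illustration of what is missing, here is a stripped-down, hypothetical variant of the class with such an accessor. `CorpusDatabaseSketch`, `database_path()`, and `CoverageDirFor()` are invented for this sketch and do not exist in FuzzTest; `std::string_view` stands in for `absl::string_view` to keep the example self-contained:

```cpp
#include <string>
#include <string_view>

// Hypothetical variant, NOT the FuzzTest class: shows the kind of read access
// that code writing new corpus files would need.
class CorpusDatabaseSketch {
 public:
  explicit CorpusDatabaseSketch(std::string_view database_path)
      : database_path_(database_path) {}

  // Hypothetical accessor exposing the database root.
  const std::string& database_path() const { return database_path_; }

  // Hypothetical helper: directory where new coverage inputs for `test_name`
  // could be written, following the layout from the flag description.
  std::string CoverageDirFor(std::string_view test_name) const {
    return database_path_ + "/" + std::string(test_name) + "/coverage";
  }

 private:
  std::string database_path_;
};
```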

@chandlerc
Contributor

Some way of seeding with a corpus and minimizing a corpus of seeds is really needed.

For example, these workflows are well supported with libFuzzer already:
https://github.com/google/fuzzing/blob/master/tutorial/libFuzzerTutorial.md#seed-corpus
https://github.com/google/fuzzing/blob/master/tutorial/libFuzzerTutorial.md#minimizing-a-corpus

I'm trying to migrate from libFuzzer to FuzzTest, and currently this is the biggest issue I'm facing.

@davidben

davidben commented May 4, 2024

Same. FuzzTest's model of putting all the fuzzers in one build target would be really attractive for BoringSSL (it would simplify keeping the same build across multiple build systems). But one of our workflows is that we record transcripts from our tests (a good sample of different TLS protocol flows and other hand-crafted interesting cases) and then minimize them as the starting corpus for the fuzzer, so it doesn't need to discover how the TLS protocol works from scratch.
