Important: Centipede is merged into FuzzTest to consolidate fuzzing development - see documentation here for user migration.
See also: Rich coverage signal and consequences for scaling (slides)
Why not? We are currently trying to fuzz some very large and very slow targets for which libFuzzer, AFL, and the like do not necessarily scale well. For one of our motivating examples see SiliFuzz. While working on Centipede we plan to experiment with new approaches to massive-scale differential fuzzing, that the existing fuzzing engines don't try to do.
Notable features:
-
Out-of-the-box support for libFuzzer-based fuzz targets. In order to use your favourite
LLVMFuzzerTestOneInput()
you only need to build your target with Centipede's compiler and linker options. -
Work-in-progress. We test centipede within a small team on a couple of targets. Unless you are part of the Centipede project, or want to help us, you probably don't want to read further just yet.
-
Scale. The intent is to be able to run any number of jobs concurrently, with very little communication overhead. We currently test with 100 local jobs and with 10k jobs on a cluster.
-
Out of process. The target runs in a separate process. Any crashes in it will not affect the fuzzer. Centipede can be used in-process as well, but this mode is not the main goal. If your target is small and fast you probably still want libFuzzer.
-
Integration with the sanitizers is achieved via separate builds. If during fuzzing you want to find bugs with ASAN, MSAN, or TSAN, you will need to provide separate binaries for every sanitizer as well as one main binary for Centipede itself. The main binary should not use any of the sanitizers.
-
No part of the internal interface is stable. Anything may change at this stage.
A program that produces an infinite stream of inputs for a target and orchestrates the execution.
A binary, a library, an API, or rather anything that can consume bytes for input and produce some sort of coverage data as an output. A libFuzzer's target can be a Centipede's target. Read more here.
A sequence of bytes that can be fed to a target. The input can be an arbitrary bag of bytes, or some structured data, e.g. serialized proto.
A number that represents some unique behavior of the target. E.g. a feature 1234567 may represent the fact that a basic block number 987 in the target has been executed 7 times. When executing an input with the target, the fuzzer collects the features that were observed during execution.
A set of features associated with one specific input.
Some information about the behaviour of the target when it executes a given input. Coverage is usually represented as feature set that the input has triggered in the target.
A function that takes bytes as input and outputs a small random mutation of the input. See also: structure-aware fuzzing.
A function that knows how to feed an input into a target and get coverage in return (i.e. to execute).
A customizable fuzzing engine that allows the user to substitute the Mutator and the Executor.
A library that implements the executor interface expected by the Centipede fuzzer. The runner knows how to run a sancov-instrumented target, collect the resulting coverage, and pass it back to Centipede. Prospective Centipede fuzz targets can be linked with this library to make them runnable by Centipede.
A set of inputs.
A process of choosing a subset of a larger corpus, such that the subset has the same coverage features as the original corpus.
A file representing a subset of the corpus and another file representing feature sets for that same subset of the corpus.
To merge shard B into shard A means: for every input in shard B that has features missing in shard A, add that input to A.
A single fuzzer process. One job writes only to one shard, but may read multiple shards.
A local or remote directory that contains data produced or consumed by a fuzzer.
git clone https://github.com/google/fuzztest.git
cd fuzztest
CENTIPEDE_SRC=`pwd`/centipede
BIN_DIR=`pwd`/bazel-bin/centipede
bazel build -c opt centipede:all
What you will need for the subsequent steps:
$BIN_DIR/centipede
- the binary of the engine (the fuzzer).$BIN_DIR/libcentipede_runner.pic.a
- the library you need to link with your fuzz target (the runner).$CENTIPEDE_SRC/clang-flags.txt
- recommended clang compilation flags for the target.
You can keep these files where they are or copy them somewhere.
We provide two examples of building the target: one tiny single-file target and libpng. Once you've built your target, proceed to the fuzz target running step.
This example uses one of the simple example fuzz targets, a.k.a. puzzles,
included in $CENTIPEDE_SRC/puzzles
of the repo.
NOTE: The commands below use the flags from $CENTIPEDE_SRC/clang-flags.txt. You may choose to use some other set of instrumentation flags: clang-flags.txt only provides a simple default option.
FUZZ_TARGET=byte_cmp_4 # or any other source under $CENTIPEDE_SRC/puzzles
clang++ @$CENTIPEDE_SRC/clang-flags.txt -c $CENTIPEDE_SRC/puzzles/$FUZZ_TARGET.cc -o $BIN_DIR/$FUZZ_TARGET.o
This step links the just-built fuzz target with libcentipede_runner.pic.a and other required libraries.
clang++ $BIN_DIR/$FUZZ_TARGET.o $BIN_DIR/libcentipede_runner.pic.a \
-ldl -lrt -lpthread -o $BIN_DIR/$FUZZ_TARGET
Skip to the fuzz target running step.
LIBPNG_BRANCH=v1.6.37 # You can experiment with other branches if you'd like
git clone --branch $LIBPNG_BRANCH --single-branch https://github.com/glennrp/libpng.git
cd libpng
CC=clang CFLAGS=@$CENTIPEDE_SRC/clang-flags.txt ./configure --disable-shared
make -j
FUZZ_TARGET=libpng_read_fuzzer
clang++ -include cstdlib \
./contrib/oss-fuzz/$FUZZ_TARGET.cc \
./.libs/libpng16.a \
$BIN_DIR/libcentipede_runner.pic.a \
-ldl -lrt -lpthread -lz \
-o $BIN_DIR/$FUZZ_TARGET
Running locally will not give the full scale, but it could be useful during the fuzzer development stage. We recommend that both the fuzzer and the target are copied to a local directory before running in order to avoid stressing a network file system.
WD=$HOME/centipede_run
mkdir -p $WD
NOTE: You may need to add
llvm-symbolizer
to your $PATH
for some of the Centipede functionality to work. The
symbolizer can be installed as part of the LLVM
distribution:
sudo apt install llvm
which llvm-symbolizer # normally /usr/bin/llvm-symbolizer
rm -rf $WD/*
$BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --num_runs=100
See what's in the working directory
tree $WD
...
├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
│ └── features.000000
└── corpus.000000
WARNING: Do not exceed the number of cores on your machine for the --j
flag.
rm -rf $WD/*
$BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --num_runs=100 --j=5
See what's in the working directory:
tree $WD
...
├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
│ ├── features.000000
│ ├── features.000001
│ ├── features.000002
│ ├── features.000003
│ └── features.000004
├── corpus.000000
├── corpus.000001
├── corpus.000002
├── corpus.000003
└── corpus.000004
Each Centipede shard typically does not cover all features that the entire corpus covers. Besides, all shards combined will have plenty of redundancy. In order to distill the corpus, a Centipede process will need to read all shards. Distillation works like this:
First, run fuzzing as described above, so that all shards have their feature sets computed. Stop fuzzing.
Then, run the same command line, but with --distill --total_shards=N --num_threads=K
. This will read N
corpus shards and produce K
independent
distilled corpus files and K
corresponding feature files. Each of the
distilled corpora should have the same features as the N
shards combined, but
the inputs might be different between the K
distilled corpora. In most cases
K==1
is sufficient, i.e. you simply omit --num_threads=K
.
The --distill
flag requires that you pass the --binary
or
--coverage_binary
so that it knows where to look for the features
files, but
it will not execute the binary. By default, when you pass --binary
or
--coverage_binary
, Centipede computes a hash of the binary file. If the binary
is not present on disk, you need to additionally pass --binary_hash=<HASH>
and
then you only need to pass the base name of the binary. E.g. if you fuzzed with
--binary=/path/to/foo
, and /path/to/foo
is not present on disk during
distillation, you can still pass --binary=/path/to/foo --binary_hash=<HASH>
,
but you can also pass --binary=foo --binary_hash=<HASH>
or
--binary=/invalid/path/foo --binary_hash=<HASH>
.
Unlike fuzzing, the distillation step is not distributed and needs to run on a single machine. The distillation is a much lighter-weight process than fuzzing because it does not require executing the target, and thus it doesn't need to be distributed. Distillation is however IO-bound.
$BIN_DIR/centipede --binary=$FUZZ_TARGET --workdir=$WD \
--binary_hash=a5e87c9b6057e5ffd3b32a5b9a9ef3978527e9cd --distill \
--total_shards=5 --num_threads=3
Note: --binary=$FUZZ_TARGET
in this example does not point to a real file and
so we also pass --binary_hash=<HASH>
.
The result of this command is that $WD
will now contain 3 distilled versions
of the corpus, while the features subdirectory, $WD/<fuzz target name>-<fuzz target hash>
, will contain 3 distilled versions of the features:
tree $WD
...
├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
│ ├── distilled-features-byte_cmp_4.000000
│ ├── distilled-features-byte_cmp_4.000001
│ ├── distilled-features-byte_cmp_4.000002
│ ├── features.000000
│ ...
├── corpus.000000
│ ...
├── distilled-byte_cmp_4.000000
├── distilled-byte_cmp_4.000001
├── distilled-byte_cmp_4.000002
Centipede provides two ways to write coverage reports: a simple text-based report and a human-readable HTML report. It is important to note that the HTML report requires additional setup and may impact performance.
The simple text coverage report is generated by default and is available as soon
as Centipede begins fuzzing. It is saved as a text file in the workdir
directory with the name coverage-report-BINARY.SHARD.txt
where BINARY
is
the name of the target and SHARD
is the shard number. The report reflects
the coverage as observed by the shard after loading the corpus.
The report shows functions that are fully covered (all control flow edges are observed at least once), not covered, or partially covered. For partially covered functions the report contains symbolic information for all covered and uncovered edges.
The report will look something like this:
FULL: FUNCTION_A a.cc:1:0
NONE: FUNCTION_BB bb.cc:1:0
PARTIAL: FUNCTION_CCC ccc.cc:1:0
+ FUNCTION_CCC ccc.cc:1:0
- FUNCTION_CCC ccc.cc:2:0
- FUNCTION_CCC ccc.cc:3:0
Centipede also provides the option to generate a human-readable HTML coverage report by taking advantage of source level coverage instrumentation provided by clang. The HTML report provides a more interactive way to analyze code coverage by providing a browsable view of the instrumented source tree with hit counts for lines of code. Source-level information from the compiler is used to track coverage, so it can be more precise than the text-based output.
The existing fuzz target used with Centipede contains instrumentation that is
helpful for fuzzing, but lacks some detail needed for the HTML coverage report.
In order to include this information in the binary, you need to build an
additional binary for your fuzz target with this source-level coverage
instrumentation. You can build this additional fuzz target with options
-fprofile-instr-generate -fcoverage-mapping
. Ensure llvm-cov
and
llvm-profdata
are available in your $PATH
, then pass your original fuzz
target binary with --binary
and pass this new binary using
--clang_coverage_binary
. A full command line invocation might look like this:
$BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --clang_coverage_binary=$BIN_DIR/$FUZZ_TARGET_CLANG_COVERAGE
Report generation assumes that your current working directory is the root of
the source files for your built binary. Otherwise, you may encounter a
No such file or directory
error. We may add a flag to specify your source
directory in a future version of Centipede.
Note that generating the HTML report may impact performance since the additional coverage binary must be run on new inputs to collect coverage. Currently, this feature only works for local fuzz jobs. It does not merge coverage reports from remote fuzzing instances.
For more information about clang's source-based coverage reports, please see https://clang.llvm.org/docs/SourceBasedCodeCoverage.html.
TBD