Centipede - a distributed fuzzing engine. Work-in-progress.

Important: Centipede is being merged into FuzzTest to consolidate fuzzing development - see documentation here for user migration.

Why Centipede

Why not? We are currently trying to fuzz some very large and very slow targets for which libFuzzer, AFL, and similar engines do not necessarily scale well. For one of our motivating examples, see SiliFuzz. While working on Centipede, we also plan to experiment with new approaches to massive-scale differential fuzzing, which the existing fuzzing engines do not attempt.

Notable features:

Out-of-the-box support for libFuzzer-based fuzz targets. In order to use your favourite LLVMFuzzerTestOneInput() you only need to build your target with Centipede's compiler and linker options.
Work-in-progress. We test Centipede within a small team on a couple of targets. Unless you are part of the Centipede project, or want to help us, you probably don't want to read further just yet.
Scale. The intent is to be able to run any number of jobs concurrently, with very little communication overhead. We currently test with 100 local jobs and with 10k jobs on a cluster.
Out of process. The target runs in a separate process. Any crashes in it will not affect the fuzzer. Centipede can be used in-process as well, but this mode is not the main goal. If your target is small and fast you probably still want libFuzzer.
Integration with the sanitizers is achieved via separate builds. If during fuzzing you want to find bugs with ASAN, MSAN, or TSAN, you will need to provide separate binaries for every sanitizer as well as one main binary for Centipede itself. The main binary should not use any of the sanitizers.
No part of the internal interface is stable. Anything may change at this stage.

Terminology

Fuzzing engine a.k.a. fuzzer

A program that produces an infinite stream of inputs for a target and orchestrates the execution.

Fuzz target

A binary, a library, an API, or anything else that can consume bytes as input and produce some sort of coverage data as output. A libFuzzer target can also be a Centipede target. Read more here.

Input

A sequence of bytes that can be fed to a target. The input can be an arbitrary bag of bytes, or some structured data, e.g. serialized proto.

Feature

A number that represents some unique behavior of the target. For example, feature 1234567 may represent the fact that basic block number 987 in the target has been executed 7 times. When executing an input with the target, the fuzzer collects the features observed during execution.

Feature set

A set of features associated with one specific input.
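To make the idea of a feature as "one number per observed behavior" concrete, here is a minimal sketch. The bit layout and the names `encode_feature`, `decode_feature`, and `features_of_run` are hypothetical and are not Centipede's real encoding; they only illustrate how a (basic block, hit count) pair can be packed into a single integer and how one execution yields a feature set.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout, NOT Centipede's real encoding: the low COUNT_BITS bits
# hold a capped hit count, the remaining bits hold the basic-block index.
COUNT_BITS = 4
MAX_COUNT = (1 << COUNT_BITS) - 1  # counts above 15 are capped


def encode_feature(block_index: int, hit_count: int) -> int:
    """Pack (block index, capped hit count) into one integer feature."""
    capped = min(hit_count, MAX_COUNT)
    return (block_index << COUNT_BITS) | capped


def decode_feature(feature: int) -> Tuple[int, int]:
    """Invert encode_feature: recover (block index, capped hit count)."""
    return feature >> COUNT_BITS, feature & MAX_COUNT


def features_of_run(hit_counts: Dict[int, int]) -> Set[int]:
    """Turn the per-block hit counts of one execution into a feature set."""
    return {encode_feature(b, n) for b, n in hit_counts.items() if n > 0}
```

Under this assumed layout, two inputs that execute the same block a different number of times produce different features, which is what lets the fuzzer treat them as distinct behaviors.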

Coverage

Some information about the behavior of the target when it executes a given input. Coverage is usually represented as the feature set that the input has triggered in the target.

Mutator

A function that takes bytes as input and outputs a small random mutation of the input. See also: structure-aware fuzzing.
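As an illustration of what "a small random mutation" can mean, the sketch below applies exactly one bit flip, byte insertion, or byte erasure. The function name `mutate` and the choice of these three operations are assumptions made for this example; Centipede's own mutator is not described here and may work differently.

```python
import random


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of `data` with one small random change applied."""
    buf = bytearray(data)
    # An empty input can only grow, so force an insertion in that case.
    choice = rng.choice(("flip", "insert", "erase")) if buf else "insert"
    if choice == "flip":
        pos = rng.randrange(len(buf))
        buf[pos] ^= 1 << rng.randrange(8)      # flip a single bit
    elif choice == "insert":
        pos = rng.randrange(len(buf) + 1)
        buf.insert(pos, rng.randrange(256))    # insert one random byte
    else:
        pos = rng.randrange(len(buf))
        del buf[pos]                           # erase one byte
    return bytes(buf)
```

Passing the random generator explicitly keeps the mutation reproducible for a given seed.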

Executor

A function that knows how to feed an input into a target and get coverage in return (i.e. to execute).

Centipede

A customizable fuzzing engine that allows the user to substitute the Mutator and the Executor.
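The relationship between the engine, the Mutator, and the Executor can be sketched as a loop that takes both as plain callables. This is a simplified, single-process illustration with hypothetical names (`fuzz`, `Mutator`, `Executor`); it is not Centipede's actual interface or scheduling logic.

```python
import random
from typing import Callable, Iterable, List, Set

Mutator = Callable[[bytes, random.Random], bytes]   # input -> mutated input
Executor = Callable[[bytes], Set[int]]              # input -> feature set


def fuzz(seeds: Iterable[bytes], mutate: Mutator, execute: Executor,
         num_runs: int, rng: random.Random) -> List[bytes]:
    """Grow a corpus by keeping every mutant that triggers an unseen feature."""
    corpus: List[bytes] = list(seeds) or [b""]
    seen: Set[int] = set()
    for inp in corpus:
        seen |= execute(inp)
    for _ in range(num_runs):
        candidate = mutate(rng.choice(corpus), rng)
        features = execute(candidate)
        if not features <= seen:       # at least one new feature
            seen |= features
            corpus.append(candidate)
    return corpus
```

Because the engine only sees the two callables, swapping in a structure-aware mutator or an executor for a different kind of target does not change the loop.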

Centipede runner

A library that implements the executor interface expected by the Centipede fuzzer. The runner knows how to run a sancov-instrumented target, collect the resulting coverage, and pass it back to Centipede. Prospective Centipede fuzz targets can be linked with this library to make them runnable by Centipede.

Corpus (plural: corpora)

A set of inputs.

Distillation (creating a distilled corpus)

A process of choosing a subset of a larger corpus, such that the subset has the same coverage features as the original corpus.

Shard

A file representing a subset of the corpus and another file representing feature sets for that same subset of the corpus.

Merging shards

To merge shard B into shard A means: for every input in shard B that has features missing in shard A, add that input to A.
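The merge rule can be sketched with each shard modeled as a dictionary from input to feature set. The function name `merge_shard` and the in-memory representation are hypothetical; the sketch also assumes that features contributed by an input just taken from B count as present in A for the remaining inputs of B, which the definition above does not spell out.

```python
from typing import Dict, Set


def merge_shard(shard_a: Dict[bytes, Set[int]],
                shard_b: Dict[bytes, Set[int]]) -> None:
    """Merge B into A in place: copy inputs that add features missing in A."""
    features_in_a: Set[int] = set().union(*shard_a.values())
    for inp, features in shard_b.items():
        if not features <= features_in_a:   # B's input has something new
            shard_a[inp] = features
            features_in_a |= features
```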

Job

A single fuzzer process. One job writes only to one shard, but may read multiple shards.

Workdir or WD

A local or remote directory that contains data produced or consumed by a fuzzer.

Build Centipede

git clone https://github.com/google/fuzztest.git
cd fuzztest
CENTIPEDE_SRC=`pwd`/centipede
BIN_DIR=`pwd`/bazel-bin/centipede
bazel build -c opt centipede:all

What you will need for the subsequent steps:

$BIN_DIR/centipede - the binary of the engine (the fuzzer).
$BIN_DIR/libcentipede_runner.pic.a - the library you need to link with your fuzz target (the runner).
$CENTIPEDE_SRC/clang-flags.txt - recommended clang compilation flags for the target.

You can keep these files where they are or copy them somewhere.

Build your fuzz target

We provide two examples of building the target: one tiny single-file target and libpng. Once you've built your target, proceed to the fuzz target running step.

The simple example

This example uses one of the simple example fuzz targets, a.k.a. puzzles, included in $CENTIPEDE_SRC/puzzles of the repo.

Compile

NOTE: The commands below use the flags from $CENTIPEDE_SRC/clang-flags.txt. You may choose to use some other set of instrumentation flags: clang-flags.txt only provides a simple default option.

FUZZ_TARGET=byte_cmp_4  # or any other source under $CENTIPEDE_SRC/puzzles
clang++ @$CENTIPEDE_SRC/clang-flags.txt -c $CENTIPEDE_SRC/puzzles/$FUZZ_TARGET.cc -o $BIN_DIR/$FUZZ_TARGET.o

Link

This step links the just-built fuzz target with libcentipede_runner.pic.a and other required libraries.

clang++ $BIN_DIR/$FUZZ_TARGET.o $BIN_DIR/libcentipede_runner.pic.a \
    -ldl -lrt -lpthread -o $BIN_DIR/$FUZZ_TARGET

Skip to the fuzz target running step.

The libpng example

Download and compile libpng


LIBPNG_BRANCH=v1.6.37  # You can experiment with other branches if you'd like
git clone --branch $LIBPNG_BRANCH --single-branch https://github.com/glennrp/libpng.git
cd libpng
CC=clang CFLAGS=@$CENTIPEDE_SRC/clang-flags.txt ./configure --disable-shared
make -j

Link libpng's own fuzz target with libcentipede_runner.pic.a

FUZZ_TARGET=libpng_read_fuzzer
clang++ -include cstdlib \
    ./contrib/oss-fuzz/$FUZZ_TARGET.cc \
    ./.libs/libpng16.a \
    $BIN_DIR/libcentipede_runner.pic.a \
    -ldl -lrt -lpthread -lz \
    -o $BIN_DIR/$FUZZ_TARGET

Run Centipede locally

Running locally will not give the full scale, but it could be useful during the fuzzer development stage. We recommend that both the fuzzer and the target be copied to a local directory before running in order to avoid stressing a network file system.

Prepare for a run

WD=$HOME/centipede_run
mkdir -p $WD

NOTE: You may need to add llvm-symbolizer to your $PATH for some of the Centipede functionality to work. The symbolizer can be installed as part of the LLVM distribution:

sudo apt install llvm
which llvm-symbolizer  # normally /usr/bin/llvm-symbolizer

Run one fuzzing job

rm -rf "${WD:?}"/*  # ${WD:?} aborts if WD is unset or empty
$BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --num_runs=100

See what's in the working directory

tree $WD

...
├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
│   └── features.000000
└── corpus.000000

Run 5 concurrent fuzzing jobs

WARNING: Do not set the --j flag higher than the number of cores on your machine.

rm -rf "${WD:?}"/*  # ${WD:?} aborts if WD is unset or empty
$BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --num_runs=100 --j=5
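If the command line is generated by a script, the core-count limit can be enforced automatically. The helper below is a hypothetical wrapper (the name `centipede_argv` and its parameters are made up for this example); it only uses the flags shown above and caps `--j` at the number of cores reported by the operating system.

```python
import os
from typing import List, Optional


def centipede_argv(centipede: str, binary: str, workdir: str,
                   num_runs: int, jobs: int,
                   cores: Optional[int] = None) -> List[str]:
    """Build a Centipede command line with --j capped at the core count."""
    # `cores` can be passed explicitly (e.g. in tests); otherwise ask the OS.
    available = cores if cores is not None else (os.cpu_count() or 1)
    capped_jobs = max(1, min(jobs, available))
    return [
        centipede,
        f"--binary={binary}",
        f"--workdir={workdir}",
        f"--num_runs={num_runs}",
        f"--j={capped_jobs}",
    ]
```

The resulting list can be handed to `subprocess.run(argv, check=True)`.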

See what's in the working directory:

tree $WD

...
├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
│   ├── features.000000
│   ├── features.000001
│   ├── features.000002
│   ├── features.000003
│   └── features.000004
├── corpus.000000
├── corpus.000001
├── corpus.000002
├── corpus.000003
└── corpus.000004
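When inspecting a workdir from a script, it can help to pair each corpus file with its features file. The sketch below assumes the layout shown in the listings above (files named `corpus.NNNNNN` and `features.NNNNNN` with a six-digit shard index); the function name `pair_shards` is hypothetical, and the file contents are not parsed.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Iterable, Optional, Tuple

# Assumed naming, taken from the tree listings: corpus.000000, features.000000.
SHARD_FILE = re.compile(r"(corpus|features)\.(\d{6})")


def pair_shards(
    relative_paths: Iterable[str],
) -> Dict[int, Tuple[Optional[str], Optional[str]]]:
    """Map shard index -> (corpus path, features path); None marks a gap."""
    corpus: Dict[int, str] = {}
    features: Dict[int, str] = {}
    for path in relative_paths:
        match = SHARD_FILE.fullmatch(PurePosixPath(path).name)
        if match is None:
            continue  # not a shard file, e.g. a report or a distilled corpus
        kind, index = match.group(1), int(match.group(2))
        (corpus if kind == "corpus" else features)[index] = path
    indices = sorted(set(corpus) | set(features))
    return {i: (corpus.get(i), features.get(i)) for i in indices}
```

The input can be produced with, for example, `[str(p.relative_to(wd)) for p in pathlib.Path(wd).rglob("*")]`.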

Corpus distillation

Each Centipede shard typically does not cover all features that the entire corpus covers. Besides, all shards combined will have plenty of redundancy. In order to distill the corpus, a Centipede process will need to read all shards. Distillation works like this:

First, run fuzzing as described above, so that all shards have their feature sets computed. Stop fuzzing.

Then, run the same command line, but with --distill --total_shards=N --num_threads=K. This will read N corpus shards and produce K independent distilled corpus files. Each of the distilled corpora should have the same features as the N shards combined, but the inputs might be different between the K distilled corpora. In most cases K==1 is sufficient, i.e. you simply omit --num_threads=K.
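The following sketch illustrates why K independent distillations can contain different inputs while covering the same features. It uses a simple greedy pass over a shuffled corpus; this is an illustrative algorithm chosen for the example, not necessarily the one Centipede implements, and the name `distill` and the in-memory shard representation are hypothetical.

```python
import random
from typing import Dict, List, Set


def distill(shards: List[Dict[bytes, Set[int]]],
            rng: random.Random) -> List[bytes]:
    """Pick a subset of inputs that covers every feature seen in any shard."""
    # Flatten all N shards into one list of (input, feature set) pairs.
    entries = [(inp, feats) for shard in shards for inp, feats in shard.items()]
    rng.shuffle(entries)            # a different order may keep different inputs
    covered: Set[int] = set()
    kept: List[bytes] = []
    for inp, feats in entries:
        if not feats <= covered:    # keep only inputs that add a new feature
            kept.append(inp)
            covered |= feats
    return kept
```

Running it K times with different random seeds mimics producing K distilled corpora: each result covers the union of all features, but the chosen inputs depend on the visiting order.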

The --distill flag requires that you pass the --binary or --coverage_binary so that it knows where to look for the features files, but it will not execute the binary. By default, when you pass --binary or --coverage_binary, Centipede computes a hash of the binary file. If the binary is not present on disk, you need to additionally pass --binary_hash=<HASH> and then you only need to pass the base name of the binary. E.g. if you fuzzed with --binary=/path/to/foo, and /path/to/foo is not present on disk during distillation, you can still pass --binary=/path/to/foo --binary_hash=<HASH>, but you can also pass --binary=foo --binary_hash=<HASH> or --binary=/invalid/path/foo --binary_hash=<HASH>.
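If the binary is gone but the workdir is still around, the hash can often be read back from the workdir itself. The sketch below assumes that the 40-character hexadecimal suffix of the `<fuzz target name>-<hash>` directory shown in the workdir listings is the value expected by `--binary_hash`; that correspondence is an assumption of this example, and the helper name `find_binary_hash` is hypothetical.

```python
import re
from typing import Iterable, Optional


def find_binary_hash(dir_names: Iterable[str],
                     binary_name: str) -> Optional[str]:
    """Return the hex suffix of the '<binary_name>-<hash>' entry, if any."""
    # Assumption: the suffix is 40 lowercase hex digits, as in the listings.
    pattern = re.compile(re.escape(binary_name) + r"-([0-9a-f]{40})")
    for name in dir_names:
        match = pattern.fullmatch(name)
        if match:
            return match.group(1)
    return None
```

For a local workdir, `dir_names` can be `os.listdir(workdir)`.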

Unlike fuzzing, distillation is not distributed and must run on a single machine. It is much lighter-weight than fuzzing because it does not execute the target, so it does not need to be distributed. It is, however, IO-bound.

$BIN_DIR/centipede --binary=$FUZZ_TARGET --workdir=$WD \
  --binary_hash=a5e87c9b6057e5ffd3b32a5b9a9ef3978527e9cd  --distill \
  --total_shards=5 --num_threads=3

Note: --binary=$FUZZ_TARGET in this example does not point to a real file and so we also pass --binary_hash=<HASH>.

The result of this command is that $WD will now contain 3 distilled versions of the corpus:

tree $WD

...
├── distilled-byte_cmp_4.000000
├── distilled-byte_cmp_4.000001
├── distilled-byte_cmp_4.000002

Deprecated but still works

Use the deprecated flag --distill_shards=N to produce N independent distilled corpus files. This option requires the target binary to be present on disk. If you need to also export the distilled corpus to a libFuzzer-style directory (local dir with one file per input), add --corpus_dir=DIR.

Coverage Reports

Centipede provides two ways to write coverage reports: a simple text-based report and a human-readable HTML report. The HTML report requires additional setup and may impact performance.

Simple Text Coverage Report

The simple text coverage report is generated by default and is available as soon as Centipede begins fuzzing. It is saved as a text file in the workdir with the name coverage-report-BINARY.SHARD.txt, where BINARY is the name of the target and SHARD is the shard number. The report reflects the coverage as observed by the shard after loading the corpus.

The report shows functions that are fully covered (all control flow edges are observed at least once), not covered, or partially covered. For partially covered functions the report contains symbolic information for all covered and uncovered edges.

The report will look something like this:

FULL: FUNCTION_A a.cc:1:0
NONE: FUNCTION_BB bb.cc:1:0
PARTIAL: FUNCTION_CCC ccc.cc:1:0
+ FUNCTION_CCC ccc.cc:1:0
- FUNCTION_CCC ccc.cc:2:0
- FUNCTION_CCC ccc.cc:3:0
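A report in this shape is easy to summarize with a small script. The sketch below relies only on the sample above and assumes that each line ends with a `file:line:column` location preceded by the function name; the names `summarize_report` and `SAMPLE` are hypothetical, and real reports may contain details this parser does not handle.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

SAMPLE = """\
FULL: FUNCTION_A a.cc:1:0
NONE: FUNCTION_BB bb.cc:1:0
PARTIAL: FUNCTION_CCC ccc.cc:1:0
+ FUNCTION_CCC ccc.cc:1:0
- FUNCTION_CCC ccc.cc:2:0
- FUNCTION_CCC ccc.cc:3:0
"""


def summarize_report(text: str) -> Tuple[Counter, Dict[str, Dict[str, List[str]]]]:
    """Count FULL/NONE/PARTIAL functions and collect edges of partial ones."""
    counts: Counter = Counter()
    edges: Dict[str, Dict[str, List[str]]] = {}
    current: Optional[str] = None   # function whose +/- lines we are reading
    for line in text.splitlines():
        head, _, rest = line.strip().partition(" ")
        # The location is the last token; everything before it is the name.
        function, _, location = rest.rpartition(" ")
        if head in ("FULL:", "NONE:", "PARTIAL:"):
            counts[head.rstrip(":")] += 1
            current = function if head == "PARTIAL:" else None
            if current is not None:
                edges[current] = {"covered": [], "uncovered": []}
        elif head in ("+", "-") and current is not None:
            key = "covered" if head == "+" else "uncovered"
            edges[current][key].append(location)
    return counts, edges
```

Calling `summarize_report(SAMPLE)` yields one function per status and, for `FUNCTION_CCC`, one covered and two uncovered locations.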

HTML Coverage Report

Centipede also provides the option to generate a human-readable HTML coverage report by taking advantage of source level coverage instrumentation provided by clang. The HTML report provides a more interactive way to analyze code coverage by providing a browsable view of the instrumented source tree with hit counts for lines of code. Source-level information from the compiler is used to track coverage, so it can be more precise than the text-based output.

The fuzz target used with Centipede contains instrumentation that is helpful for fuzzing but lacks some detail needed for the HTML coverage report. To provide that detail, build an additional binary of your fuzz target with source-level coverage instrumentation, using the options -fprofile-instr-generate -fcoverage-mapping. Ensure llvm-cov and llvm-profdata are available in your $PATH, then pass your original fuzz target binary with --binary and the new binary with --clang_coverage_binary. A full command-line invocation might look like this:

$BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --clang_coverage_binary=$BIN_DIR/$FUZZ_TARGET_CLANG_COVERAGE
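Before launching a run like this from a script, it may be worth checking that the required LLVM tools can be found. The helper below is a hypothetical convenience (the name `missing_tools` is made up for this example); it only looks the tool names up on `$PATH` and does not verify their versions.

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the tool names that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]
```

For example, `missing_tools(["llvm-cov", "llvm-profdata"])` returns an empty list when both tools are installed and visible.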

Report generation assumes that your current working directory is the root of the source files for your built binary. Otherwise, you may encounter a No such file or directory error. We may add a flag to specify your source directory in a future version of Centipede.

Note that generating the HTML report may impact performance since the additional coverage binary must be run on new inputs to collect coverage. Currently, this feature only works for local fuzz jobs. It does not merge coverage reports from remote fuzzing instances.

For more information about clang's source-based coverage reports, please see https://clang.llvm.org/docs/SourceBasedCodeCoverage.html.

Customization

TBD