

 # Centipede - a distributed fuzzing engine. Work-in-progress.

 **Important:** Centipede has been merged into
 [FuzzTest](https://github.com/google/fuzztest)
 to consolidate fuzzing development. See the
 [user migration guide](https://github.com/google/fuzztest/blob/main/centipede/USER_MIGRATION.md)
 for details.

 See also:
 [Rich coverage signal and consequences for scaling (slides)](https://docs.google.com/presentation/d/16Zov-QmGZjGrEoTr6Qh2fyp-bvgOa4cmqxPrXw4fNeE/)

 ## Why Centipede

 Why not? We are currently trying to fuzz some very large and very slow targets
 for which [libFuzzer](https://llvm.org/docs/LibFuzzer.html),
 [AFL](https://lcamtuf.coredump.cx/afl/), and the like do not necessarily scale
 well. For one of our motivating examples,
 see [SiliFuzz](https://arxiv.org/abs/2110.11519). While working on Centipede we
 plan to experiment with new approaches to massive-scale differential fuzzing,
 which the existing fuzzing engines do not attempt.

 Notable features:

 * Out-of-the-box support for libFuzzer-based fuzz targets. In order to use your
   favourite `LLVMFuzzerTestOneInput()` you only need to build your target with
   Centipede's compiler and linker options.

 * **Work-in-progress**. We test Centipede within a small team on a couple
   of targets. Unless you are part of the Centipede project, or want to help us,
   **you probably don't want to read further just yet**.

 * Scale. The intent is to be able to run any number of jobs concurrently, with
   very little communication overhead. We currently test with 100 local jobs and
   with 10k jobs on a cluster.

 * Out of process. The target runs in a separate process. Any crashes in it will
   not affect the fuzzer. Centipede can be used in-process as well, but this mode
   is not the main goal. If your target is small and fast you probably still want
   libFuzzer.

 * Integration with the sanitizers is achieved via separate builds. If during
   fuzzing you want to find bugs with
   [ASAN](https://github.com/google/sanitizers/wiki/AddressSanitizer),
   [MSAN](https://github.com/google/sanitizers/wiki/MemorySanitizer), or
   [TSAN](https://github.com/google/sanitizers/wiki/ThreadSanitizer), you will
   need to provide separate binaries for every sanitizer as well as one main
   binary for Centipede itself. The main binary should not use any of the
   sanitizers.

 * No part of the internal interface is stable. Anything may change at this
   stage.

 ## Terminology

 #### Fuzzing engine a.k.a. fuzzer <a id='fuzzer'></a>

 A program that produces an infinite stream of inputs for a target and
 orchestrates the execution.

 #### Fuzz target <a id='target'></a>

 A binary, a library, an API, or anything else that can consume bytes as input
 and produce some sort of coverage data as output.
 A [libFuzzer](https://llvm.org/docs/LibFuzzer.html)
 target can also serve as a Centipede target. Read more about
 [what makes a good fuzz target](https://github.com/google/fuzzing/blob/master/docs/good-fuzz-target.md).

 #### Input <a id='input'></a>

 A sequence of bytes that can be fed to a target. The input can be an arbitrary
 bag of bytes, or some structured data, e.g. a serialized proto.

 #### Feature

 A number that represents some unique behavior of the target. For example,
 feature 1234567 may represent the fact that basic block number 987 in the target
 has been executed 7 times. When executing an input with the target, the fuzzer
 collects the features that were observed during execution.

 #### Feature set

 A set of features associated with one specific input.

 #### Coverage

 Some information about the behaviour of the target when it executes a given
 input. Coverage is usually represented as the feature set that the input has
 triggered in the target.
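
To make the terms *feature*, *feature set*, and *coverage* concrete, here is a small Python sketch. The encoding `block_index * 8 + bucket` and the helper names are hypothetical and chosen only for illustration; Centipede's real feature encoding may look entirely different.

```python
# Hypothetical encoding, for illustration only: pack a basic-block index and
# a small execution-count bucket into one integer feature.
COUNTER_BUCKETS = 8  # assumed number of count buckets per basic block


def make_feature(block_index: int, hit_count: int) -> int:
    """Maps (block, count) to one number; counts saturate at the top bucket."""
    bucket = min(hit_count, COUNTER_BUCKETS - 1)
    return block_index * COUNTER_BUCKETS + bucket


def feature_set(block_hits: dict[int, int]) -> frozenset[int]:
    """Feature set of one input: one feature per executed basic block."""
    return frozenset(make_feature(b, n) for b, n in block_hits.items() if n > 0)


# Two inputs executed against the same target (block index -> hit count).
features_a = feature_set({0: 1, 987: 7})
features_b = feature_set({0: 1, 987: 2, 42: 1})

# The same block with a different hit count yields a different feature.
assert make_feature(987, 7) != make_feature(987, 2)

# One way to view the coverage of a whole corpus: the union of its feature sets.
corpus_coverage = features_a | features_b
print(sorted(corpus_coverage))  # [1, 337, 7898, 7903]
```

The point of the sketch is that a feature is just a number, so comparing the behavior of two inputs reduces to cheap set operations.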

 #### Mutator

 A function that takes bytes as input and outputs a small random mutation of the
 input. See also:
 [structure-aware fuzzing](https://github.com/google/fuzzing/blob/master/docs/structure-aware-fuzzing.md).

 #### Executor <a id='executor'></a>

 A function that knows how to feed an input into a target and get coverage in
 return (i.e. to **execute**).

 #### Centipede

 A customizable fuzzing engine that allows the user to substitute the Mutator and
 the Executor.

 #### Centipede runner <a id='runner'></a>

 A library that implements the executor interface expected by the Centipede
 fuzzer. The runner knows how to run
 a [sancov](https://clang.llvm.org/docs/SanitizerCoverage.html)-instrumented
 target, collect the resulting coverage, and pass it back to Centipede.
 Prospective Centipede fuzz targets can be linked with this library to make them
 runnable by Centipede.

 #### Corpus (_plural: corpora_)

 A set of inputs.

 #### Distillation (creating a _distilled corpus_)

 A process of choosing a subset of a larger corpus, such that the subset has the
 same coverage features as the original corpus.

 #### Shard

 A file representing a subset of the corpus and another file representing feature
 sets for that same subset of the corpus.

 #### Merging shards

 To merge shard B into shard A means: for every input in shard B that has
 features missing in shard A, add that input to A.
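
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not Centipede's implementation: the `Shard` type and `merge_shard` are hypothetical names, and the sketch assumes that each input from B is compared against A as it grows during the merge.

```python
Shard = list[tuple[bytes, frozenset[int]]]  # (input, feature set) pairs


def merge_shard(a: Shard, b: Shard) -> None:
    """Merges shard b into shard a in place."""
    known = set().union(*(features for _, features in a))
    for data, features in b:
        if features - known:  # at least one feature is missing in a
            a.append((data, features))
            # Assumption: later inputs are compared against the grown shard.
            known |= features


shard_a: Shard = [(b"aa", frozenset({1, 2}))]
shard_b: Shard = [
    (b"bb", frozenset({2})),     # nothing new for A
    (b"cc", frozenset({2, 3})),  # feature 3 is new
    (b"dd", frozenset({3})),     # feature 3 is already known by now
]
merge_shard(shard_a, shard_b)
print([data for data, _ in shard_a])  # [b'aa', b'cc']
```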

 #### Job

 A single fuzzer process. One job writes only to one shard, but may read multiple
 shards.

 #### Workdir or WD

 A local or remote directory that contains data produced or consumed by a fuzzer.

 ## Build Centipede

 ```shell
 git clone https://github.com/google/fuzztest.git
 cd fuzztest
 CENTIPEDE_SRC=`pwd`/centipede
 BIN_DIR=`pwd`/bazel-bin/centipede
 bazel build -c opt centipede:all
 ```

 What you will need for the subsequent steps:

 * `$BIN_DIR/centipede` - the binary of the engine (the fuzzer).
 * `$BIN_DIR/libcentipede_runner.pic.a` - the library you need to link with your
   fuzz target (the runner).
 * `$CENTIPEDE_SRC/clang-flags.txt` - recommended clang compilation flags for the
   target.

 You can keep these files where they are or copy them somewhere.

 ## Build your fuzz target

 We provide two examples of building the target: one tiny single-file target and
 libpng. Once you've built your target, proceed to [Run Centipede locally](#run-step).

 ### The simple example

 This example uses one of the simple example fuzz targets, a.k.a. *puzzles*,
 included in `$CENTIPEDE_SRC/puzzles` of the repo.

 #### Compile

 NOTE: The commands below use the flags from `$CENTIPEDE_SRC/clang-flags.txt`.
 You may choose a different set of instrumentation flags;
 `clang-flags.txt` only provides a simple default.

 ```shell
 FUZZ_TARGET=byte_cmp_4  # or any other source under $CENTIPEDE_SRC/puzzles
 clang++ @$CENTIPEDE_SRC/clang-flags.txt -c $CENTIPEDE_SRC/puzzles/$FUZZ_TARGET.cc -o $BIN_DIR/$FUZZ_TARGET.o
 ```

 #### Link

 This step links the just-built fuzz target with `libcentipede_runner.pic.a` and
 other required libraries.

 ```shell
 clang++ $BIN_DIR/$FUZZ_TARGET.o $BIN_DIR/libcentipede_runner.pic.a \
     -ldl -lrt -lpthread -o $BIN_DIR/$FUZZ_TARGET
 ```

 Skip to [Run Centipede locally](#run-step).

 ### The libpng example

 #### Download and compile libpng

 ```shell

 LIBPNG_BRANCH=v1.6.37  # You can experiment with other branches if you'd like
 git clone --branch $LIBPNG_BRANCH --single-branch https://github.com/glennrp/libpng.git
 cd libpng
 CC=clang CFLAGS=@$CENTIPEDE_SRC/clang-flags.txt ./configure --disable-shared
 make -j
 ```

 #### Link libpng's own fuzz target with libcentipede_runner.pic.a

 ```shell
 FUZZ_TARGET=libpng_read_fuzzer
 clang++ -include cstdlib \
     ./contrib/oss-fuzz/$FUZZ_TARGET.cc \
     ./.libs/libpng16.a \
     $BIN_DIR/libcentipede_runner.pic.a \
     -ldl -lrt -lpthread -lz \
     -o $BIN_DIR/$FUZZ_TARGET
 ```

 ## Run Centipede locally <a id='run-step'></a>

 Running locally will not give the full scale, but it could be useful during the
 fuzzer development stage. We recommend that both the fuzzer and the target are
 copied to a local directory before running in order to avoid stressing a network
 file system.

 ### Prepare for a run

 ```shell
 WD=$HOME/centipede_run
 mkdir -p $WD
 ```

 NOTE: You may need to add
 [`llvm-symbolizer`](https://llvm.org/docs/CommandGuide/llvm-symbolizer.html)
 to your `$PATH` for some of the Centipede functionality to work. The
 symbolizer can be installed as part of the [LLVM](https://releases.llvm.org)
 distribution:

 ```shell
 sudo apt install llvm
 which llvm-symbolizer  # normally /usr/bin/llvm-symbolizer
 ```

 ### Run one fuzzing job

 ```shell
 rm -rf $WD/*
 $BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --num_runs=100
 ```

 See what's in the working directory:

 ```shell
 tree $WD
 ```

 ```
 ...
 ├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
 │   └── features.000000
 └── corpus.000000
 ```

 ### Run 5 concurrent fuzzing jobs

 WARNING: Do not set `--j` higher than the number of cores on your machine.

 ```shell
 rm -rf $WD/*
 $BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --num_runs=100 --j=5
 ```
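
When scripting such runs, one way to respect the core-count limit is to cap the job count before building the command line. The Python sketch below is only an illustration: `centipede_argv` is a hypothetical helper, the paths are placeholders, and the flags are the ones used in the command above.

```python
import os
import shlex


def centipede_argv(bin_dir: str, target: str, workdir: str,
                   requested_jobs: int) -> list[str]:
    """Builds the command line shown above, capping --j at the core count."""
    # os.cpu_count() may return None if the count cannot be determined.
    cores = os.cpu_count() or 1
    jobs = max(1, min(requested_jobs, cores))
    return [
        f"{bin_dir}/centipede",
        f"--binary={bin_dir}/{target}",
        f"--workdir={workdir}",
        "--num_runs=100",
        f"--j={jobs}",
    ]


argv = centipede_argv("bazel-bin/centipede", "byte_cmp_4", "/tmp/centipede_run", 5)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the binaries exist at those paths.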

 See what's in the working directory:

 ```shell
 tree $WD
 ```

 ```
 ...
 ├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
 │   ├── features.000000
 │   ├── features.000001
 │   ├── features.000002
 │   ├── features.000003
 │   └── features.000004
 ├── corpus.000000
 ├── corpus.000001
 ├── corpus.000002
 ├── corpus.000003
 └── corpus.000004
 ```
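
The layout above can also be inspected programmatically, for example to find out how many corpus shards a workdir holds before passing `--total_shards`. The Python sketch below assumes the `corpus.NNNNNN` naming seen in the listing; `corpus_shard_indices` is a hypothetical helper, and the demo runs on a mock directory, not on real Centipede output.

```python
import re
import tempfile
from pathlib import Path

# Assumption: shard files are named corpus.NNNNNN, as in the listing above.
CORPUS_SHARD = re.compile(r"corpus\.(\d{6})")


def corpus_shard_indices(workdir: Path) -> list[int]:
    """Returns the sorted shard indices of the corpus files found in workdir."""
    indices = []
    for path in workdir.iterdir():
        match = CORPUS_SHARD.fullmatch(path.name)
        if match and path.is_file():
            indices.append(int(match.group(1)))
    return sorted(indices)


# Demo on a mock workdir that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    wd = Path(tmp)
    features = wd / "byte_cmp_4-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5"
    features.mkdir()
    for shard in range(5):
        (wd / f"corpus.{shard:06d}").touch()
        (features / f"features.{shard:06d}").touch()
    shards = corpus_shard_indices(wd)

print(len(shards))  # 5
```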

 ## Corpus distillation

 Each Centipede shard typically does not cover all features that the entire
 corpus covers. Besides, all shards combined will have plenty of redundancy. In
 order to distill the corpus, a Centipede process will need to read all shards.
 Distillation works like this:

 First, run fuzzing as described above, so that all shards have their feature
 sets computed. Stop fuzzing.

 Then, run the same command line, but with `--distill --total_shards=N
 --num_threads=K`. This will read `N` corpus shards and produce `K` independent
 distilled corpus files and `K` corresponding feature files. Each of the
 distilled corpora should have the same features as the `N` shards combined, but
 the inputs might be different between the `K` distilled corpora. In most cases
 `K==1` is sufficient, i.e. you simply omit `--num_threads=K`.
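
Why can `K` distilled corpora contain different inputs yet the same features? The Python sketch below illustrates the idea with a simple greedy pass. It is not Centipede's algorithm: the `distill` helper is hypothetical, and varying the visiting order is only an assumed stand-in for whatever makes the `K` results independent.

```python
def distill(corpus: dict[str, set[int]], order: list[str]) -> list[str]:
    """Greedy pass: keep an input only if it adds a feature not seen so far."""
    seen: set[int] = set()
    kept: list[str] = []
    for name in order:
        if not corpus[name] <= seen:
            kept.append(name)
            seen |= corpus[name]
    return kept


def features_of(corpus: dict[str, set[int]], names: list[str]) -> set[int]:
    """Union of the feature sets of the named inputs."""
    return set().union(*(corpus[n] for n in names))


corpus = {"in0": {1, 2}, "in1": {2}, "in2": {2, 3}, "in3": {1, 3}}
names = list(corpus)

# Two passes (think K=2) that differ only in the order of visiting inputs.
forward = distill(corpus, names)
backward = distill(corpus, names[::-1])

print(forward)   # ['in0', 'in2']
print(backward)  # ['in3', 'in2']
```

Both results cover features `{1, 2, 3}`, the same as the full corpus, while sharing only one input.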

 The `--distill` flag requires that you pass the `--binary` or
 `--coverage_binary` so that it knows where to look for the `features` files, but
 it will not execute the binary. By default, when you pass `--binary` or
 `--coverage_binary`, Centipede computes a hash of the binary file. If the binary
 is not present on disk, you need to additionally pass `--binary_hash=<HASH>` and
 then you only need to pass the base name of the binary. E.g. if you fuzzed with
 `--binary=/path/to/foo`, and `/path/to/foo` is not present on disk during
 distillation, you can still pass `--binary=/path/to/foo --binary_hash=<HASH>`,
 but you can also pass `--binary=foo --binary_hash=<HASH>` or
 `--binary=/invalid/path/foo --binary_hash=<HASH>`.
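
The reason all three spellings work can be sketched as follows, assuming the features directory is named `<base name of binary>-<hash>` inside the workdir, as the listings in this document suggest. `features_dir` is a hypothetical helper written for this illustration, not a Centipede API.

```python
from pathlib import PurePosixPath


def features_dir(workdir: str, binary: str, binary_hash: str) -> str:
    """Hypothetical helper: <workdir>/<base name of binary>-<hash>."""
    name = PurePosixPath(binary).name  # only the base name matters
    return str(PurePosixPath(workdir) / f"{name}-{binary_hash}")


# Example hash taken from the directory listings earlier in this document.
HASH = "d9d90139ee2ccc687f7c9d5821bcc04b8a847df5"

dirs = {
    features_dir("/wd", binary, HASH)
    for binary in ("/path/to/foo", "foo", "/invalid/path/foo")
}
print(dirs)  # {'/wd/foo-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5'}
```

Under this assumption the directory part of `--binary` never enters the lookup, so any path ending in `foo` resolves to the same features directory.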

 Unlike fuzzing, distillation is not distributed and runs on a single machine.
 It is much lighter-weight than fuzzing because it does not execute the target,
 so it does not need to be distributed. It is, however, IO-bound.

 ```shell
 $BIN_DIR/centipede --binary=$FUZZ_TARGET --workdir=$WD \
   --binary_hash=a5e87c9b6057e5ffd3b32a5b9a9ef3978527e9cd --distill \
   --total_shards=5 --num_threads=3
 ```

 Note: `--binary=$FUZZ_TARGET` in this example does not point to a real file and
 so we also pass `--binary_hash=<HASH>`.

 The result of this command is that `$WD` will now contain 3 distilled versions
 of the corpus, while the features subdirectory, `$WD/<fuzz target name>-<fuzz
 target hash>`, will contain 3 distilled versions of the features:

 ```shell
 tree $WD
 ```

 ```
 ...
 ├── <fuzz target name>-d9d90139ee2ccc687f7c9d5821bcc04b8a847df5
 │   ├── distilled-features-byte_cmp_4.000000
 │   ├── distilled-features-byte_cmp_4.000001
 │   ├── distilled-features-byte_cmp_4.000002
 │   ├── features.000000
 │   ...
 ├── corpus.000000
 │   ...
 ├── distilled-byte_cmp_4.000000
 ├── distilled-byte_cmp_4.000001
 └── distilled-byte_cmp_4.000002
 ```

 ## Coverage Reports

 Centipede provides two ways to write coverage reports: a simple text-based
 report and a human-readable HTML report. The HTML report requires additional
 setup and may slow down fuzzing.

 ### Simple Text Coverage Report

 The simple text coverage report is generated by default and is available as soon
 as Centipede begins fuzzing. It is saved as a text file in the `workdir`
 directory with the name `coverage-report-BINARY.SHARD.txt` where `BINARY` is
 the name of the target and `SHARD` is the shard number. The report reflects
 the coverage as observed by the shard after loading the corpus.

 The report shows functions that are fully covered (all control flow edges are
 observed at least once), not covered, or partially covered. For partially
 covered functions the report contains symbolic information for all covered and
 uncovered edges.

 The report will look something like this:

 ```
 FULL: FUNCTION_A a.cc:1:0
 NONE: FUNCTION_BB bb.cc:1:0
 PARTIAL: FUNCTION_CCC ccc.cc:1:0
 + FUNCTION_CCC ccc.cc:1:0
 - FUNCTION_CCC ccc.cc:2:0
 - FUNCTION_CCC ccc.cc:3:0
 ```
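
Because the report is plain text, it is easy to post-process. The Python sketch below counts functions per status and edges per marker. It assumes the line format of the sample above and that `+` marks a covered edge and `-` an uncovered one; `summarize` is a hypothetical helper, not part of Centipede.

```python
from collections import Counter

SAMPLE_REPORT = """\
FULL: FUNCTION_A a.cc:1:0
NONE: FUNCTION_BB bb.cc:1:0
PARTIAL: FUNCTION_CCC ccc.cc:1:0
+ FUNCTION_CCC ccc.cc:1:0
- FUNCTION_CCC ccc.cc:2:0
- FUNCTION_CCC ccc.cc:3:0
"""


def summarize(report_text: str) -> tuple[Counter, Counter]:
    """Counts functions by status and edges by their +/- marker."""
    functions: Counter = Counter()
    edges: Counter = Counter()
    for raw_line in report_text.splitlines():
        marker, _, rest = raw_line.strip().partition(" ")
        if marker in ("FULL:", "NONE:", "PARTIAL:"):
            functions[marker.rstrip(":")] += 1
        elif marker in ("+", "-") and rest:
            edges[marker] += 1
    return functions, edges


functions, edges = summarize(SAMPLE_REPORT)
print(dict(functions))  # {'FULL': 1, 'NONE': 1, 'PARTIAL': 1}
print(dict(edges))      # {'+': 1, '-': 2}
```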

 ### HTML Coverage Report

 Centipede also provides the option to generate a human-readable HTML coverage
 report by taking advantage of source-level coverage instrumentation provided
 by clang. The HTML report provides a more interactive way to analyze code
 coverage by providing a browsable view of the instrumented source tree with hit
 counts for lines of code. Source-level information from the compiler is used to
 track coverage, so it can be more precise than the text-based output.

 The existing fuzz target used with Centipede contains instrumentation that is
 helpful for fuzzing, but lacks some detail needed for the HTML coverage report.
 In order to include this information in the binary, you need to build an
 additional binary for your fuzz target with this source-level coverage
 instrumentation. You can build this additional fuzz target with options
 `-fprofile-instr-generate -fcoverage-mapping`. Ensure `llvm-cov` and
 `llvm-profdata` are available in your `$PATH`, then pass your original fuzz
 target binary with `--binary` and pass this new binary using
 `--clang_coverage_binary`. A full command line invocation might look like this:

 ```shell
 $BIN_DIR/centipede --binary=$BIN_DIR/$FUZZ_TARGET --workdir=$WD --clang_coverage_binary=$BIN_DIR/$FUZZ_TARGET_CLANG_COVERAGE
 ```

 Report generation assumes that your current working directory is the root of
 the source files for your built binary. Otherwise, you may encounter a
 `No such file or directory` error. We may add a flag to specify your source
 directory in a future version of Centipede.

 Note that generating the HTML report may impact performance since the additional
 coverage binary must be run on new inputs to collect coverage. Currently, this
 feature only works for local fuzz jobs. It does not merge coverage reports from
 remote fuzzing instances.

 For more information about clang's source-based coverage reports, please see
 https://clang.llvm.org/docs/SourceBasedCodeCoverage.html.

 ## Customization

 TBD

 ## Related Reading

 * [Centipede Design](doc/DESIGN.md)
