In programming and software development, fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program. The program is then monitored for exceptions such as crashes, failing built-in code assertions, or potential memory leaks. Typically, fuzzers are used to test programs that take structured inputs. This structure is specified, for example in a file format or protocol, and distinguishes valid from invalid input. An effective fuzzer generates semi-valid inputs that are "valid enough" in that they are not directly rejected by the parser, but do create unexpected behaviors deeper in the program and are "invalid enough" to expose corner cases that have not been properly dealt with.
For the purpose of security, input that crosses a trust boundary is often the most useful. For example, it is more important to fuzz code that handles a file uploaded by any user than it is to fuzz the code that parses a configuration file that is accessible only to a privileged user.
According to Prof. Barton Miller, "In the process of writing the project description, I needed to give this kind of testing a name. I wanted a name that would evoke the feeling of random, unstructured data. After trying out several ideas, I settled on the term fuzz."
A key contribution of this early work was its simple (almost simplistic) oracle: a program failed its test if it crashed or hung under the random input, and was considered to have passed otherwise. While test oracles can be challenging to construct, this early oracle was simple and universally applicable.
In April 2012, Google announced ClusterFuzz, a cloud-based fuzzing infrastructure for security-critical components of the Chromium web browser. Security researchers can upload their own fuzzers and collect bug bounties if ClusterFuzz finds a crash with the uploaded fuzzer.
In September 2014, Shellshock was disclosed as a family of security bugs in the widely used UNIX Bash shell; most vulnerabilities of Shellshock were found using the fuzzer AFL. (Many Internet-facing services, such as some web server deployments, use Bash to process certain requests, allowing an attacker to cause vulnerable versions of Bash to execute arbitrary commands. This can allow an attacker to gain unauthorized access to a computer system.)
In April 2015, Hanno Böck showed how the fuzzer AFL could have found the 2014 Heartbleed vulnerability. (The Heartbleed vulnerability was disclosed in April 2014. It is a serious vulnerability that allows adversaries to decipher otherwise encrypted communication. The vulnerability was accidentally introduced into OpenSSL which implements TLS and is used by the majority of the servers on the internet. Shodan reported 238,000 machines still vulnerable in April 2016; 200,000 in January 2017.)
In August 2016, the Defense Advanced Research Projects Agency (DARPA) held the finals of the first Cyber Grand Challenge, a fully automated capture-the-flag competition that lasted 11 hours. The objective was to develop automatic defense systems that can discover, exploit, and correct software flaws in real time. Fuzzing was used as an effective offense strategy to discover flaws in the software of the opponents. It showed tremendous potential in the automation of vulnerability detection. The winner was a system called "Mayhem" developed by the team ForAllSecure led by David Brumley.
In September 2016, Microsoft announced Project Springfield, a cloud-based fuzz testing service for finding security critical bugs in software.
In December 2016, Google announced OSS-Fuzz which allows for continuous fuzzing of several security-critical open-source projects.
At Black Hat 2018, Christopher Domas demonstrated the use of fuzzing to expose the existence of a hidden RISC core in a processor. This core was able to bypass existing security checks to execute Ring 0 commands from Ring 3.
In September 2020, Microsoft released OneFuzz, a self-hosted fuzzing-as-a-service platform that automates the detection of software bugs. It supports Windows and Linux. It was archived three years later on November 1, 2023.
The execution of random inputs is also called random testing or monkey testing.
In 1981, Duran and Ntafos formally investigated the effectiveness of testing a program with random inputs. While random testing had been widely perceived to be the worst means of testing a program, the authors could show that it is a cost-effective alternative to more systematic testing techniques.
In 1983, Steve Capps at Apple developed "The Monkey", a tool that would generate random inputs for classic Mac OS applications, such as MacPaint. The figurative "monkey" refers to the infinite monkey theorem which states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will eventually type out the entire works of Shakespeare. In the case of testing, the monkey would write the particular sequence of inputs that would trigger a crash.
In 1991, the crashme tool was released, which was intended to test the robustness of Unix and Unix-like operating systems by randomly executing system calls with randomly chosen parameters.
A generation-based fuzzer generates inputs from scratch. For instance, a smart generation-based fuzzer takes the input model that was provided by the user to generate new inputs. Unlike mutation-based fuzzers, a generation-based fuzzer does not depend on the existence or quality of a corpus of seed inputs.
Some fuzzers can do both: generate inputs from scratch and generate inputs by mutating existing seeds.
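As a minimal sketch (not taken from any particular tool), a dumb mutation-based fuzzing loop might look like the following; `process_input` is a hypothetical stand-in for the program under test:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Stand-in for the hypothetical program under test. */
static void process_input(const uint8_t *data, size_t len) {
    (void)data; (void)len;    /* a real target would parse the bytes here */
}

/* Mutate the buffer in place by flipping a few randomly chosen bits. */
static void mutate(uint8_t *buf, size_t len) {
    int flips = 1 + rand() % 8;
    for (int i = 0; i < flips; i++) {
        size_t pos = (size_t)rand() % len;
        buf[pos] ^= (uint8_t)(1u << (rand() % 8));
    }
}

int main(void) {
    const uint8_t seed[] = "GIF89a";        /* one seed input from a corpus */
    uint8_t buf[sizeof(seed)];

    srand((unsigned)time(NULL));
    for (long i = 0; i < 100000; i++) {     /* the fuzzing loop */
        memcpy(buf, seed, sizeof(seed));    /* copy the seed ...           */
        mutate(buf, sizeof(seed));          /* ... apply random mutations  */
        process_input(buf, sizeof(seed));   /* ... and watch for crashes   */
    }
    return 0;
}
```

A generation-based fuzzer would instead produce the input buffer from scratch, for example from a model of the file format, rather than mutating a seed.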
A smart (model-based, grammar-based, or protocol-based) fuzzer leverages the input model to generate a greater proportion of valid inputs. For instance, if the input can be modelled as an abstract syntax tree, then a smart mutation-based fuzzer would employ random transformations to move complete subtrees from one node to another. If the input can be modelled by a formal grammar, a smart generation-based fuzzer would instantiate the production rules to generate inputs that are valid with respect to the grammar. However, generally the input model must be explicitly provided, which is difficult to do when the model is proprietary, unknown, or very complex. If a large corpus of valid and invalid inputs is available, a grammar induction technique, such as Dana Angluin's L* algorithm, would be able to generate an input model.
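As an illustration of generation-based fuzzing, the following sketch instantiates the production rules of a toy arithmetic-expression grammar; the grammar and function names are hypothetical and chosen only for the example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy grammar:  expr := NUMBER | expr OP expr | '(' expr ')'  */
static void gen_expr(int depth) {
    switch (depth <= 0 ? 0 : rand() % 3) {
    case 0:                                    /* terminal: a number */
        printf("%d", rand() % 100);
        break;
    case 1:                                    /* expr OP expr */
        gen_expr(depth - 1);
        printf(" %c ", "+-*/"[rand() % 4]);
        gen_expr(depth - 1);
        break;
    case 2:                                    /* '(' expr ')' */
        printf("(");
        gen_expr(depth - 1);
        printf(")");
        break;
    }
}

int main(void) {
    srand((unsigned)time(NULL));
    /* Emit ten random but grammatically valid inputs, one per line. */
    for (int i = 0; i < 10; i++) {
        gen_expr(4);
        printf("\n");
    }
    return 0;
}
```

Because every generated input is valid with respect to the grammar, such inputs tend to pass the parser and exercise deeper program logic.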
A dumb fuzzer does not require the input model and can thus be employed to fuzz a wider variety of programs. For instance, AFL is a dumb mutation-based fuzzer that modifies a seed file by flipping random bits, by substituting random bytes with "interesting" values, and by moving or deleting blocks of data. However, a dumb fuzzer might generate a lower proportion of valid inputs and stress the parser code rather than the main components of a program. The disadvantage of dumb fuzzers can be illustrated by the construction of a valid checksum for a cyclic redundancy check (CRC). A CRC is an error-detecting code that ensures that the integrity of the data contained in the input file is preserved during transmission. A checksum is computed over the input data and recorded in the file. When the program processes the received file and the recorded checksum does not match the re-computed checksum, then the file is rejected as invalid. Now, a fuzzer that is unaware of the CRC is unlikely to generate the correct checksum. However, there are attempts to identify and re-compute a potential checksum in the mutated input after a dumb mutation-based fuzzer has modified the protected data.
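The checksum-repair idea can be sketched as follows, assuming a hypothetical file layout in which a CRC-32 over the payload is stored in the last four bytes:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE polynomial, reflected); slow but compact. */
static uint32_t crc32(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* After mutating the payload, re-compute the checksum and patch it back
 * into the trailing four bytes so the integrity check still passes. */
void fix_checksum(uint8_t *file, size_t len) {
    if (len < 4) return;
    uint32_t crc = crc32(file, len - 4);
    file[len - 4] = (uint8_t)(crc & 0xFF);          /* stored little-endian */
    file[len - 3] = (uint8_t)((crc >> 8) & 0xFF);
    file[len - 2] = (uint8_t)((crc >> 16) & 0xFF);
    file[len - 1] = (uint8_t)((crc >> 24) & 0xFF);
}
```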
A black-box fuzzer treats the program as a black box and is unaware of internal program structure. For instance, a random testing tool that generates inputs at random is considered a blackbox fuzzer. Hence, a blackbox fuzzer can execute several hundred inputs per second, can be easily parallelized, and can scale to programs of arbitrary size. However, blackbox fuzzers may only scratch the surface and expose "shallow" bugs. Hence, there are attempts to develop blackbox fuzzers that can incrementally learn about the internal structure (and behavior) of a program during fuzzing by observing the program's output given an input. For instance, LearnLib employs active learning to generate an automaton that represents the behavior of a web application.
A white-box fuzzer leverages program analysis to systematically increase code coverage or to reach certain critical program locations. For instance, SAGE leverages symbolic execution to systematically explore different paths in the program (a technique known as concolic execution). If the program's specification is available, a whitebox fuzzer might leverage techniques from model-based testing to generate inputs and check the program outputs against the program specification. A whitebox fuzzer can be very effective at exposing bugs that hide deep in the program. However, the time used for analysis (of the program or its specification) can become prohibitive. If the whitebox fuzzer takes too long to generate an input, a blackbox fuzzer will be more efficient. Hence, there are attempts to combine the efficiency of blackbox fuzzers and the effectiveness of whitebox fuzzers.
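The difference can be illustrated with a small constructed example: a branch guarded by a "magic" comparison is practically unreachable for random input, whereas a white-box fuzzer can derive the branch condition symbolically and solve it directly.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A black-box fuzzer has roughly a 1 in 2^32 chance of guessing the magic
 * value; a white-box fuzzer extracts the constraint x == 0x4B4D4721 from
 * the branch and solves for an input that satisfies it. */
void parse(const uint8_t *data, size_t size) {
    uint32_t x;
    if (size < sizeof(x)) return;
    memcpy(&x, data, sizeof(x));
    if (x == 0x4B4D4721u)
        abort();              /* deeply hidden bug */
}
```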
A gray-box fuzzer leverages instrumentation rather than program analysis to glean information about the program. For instance, AFL and libFuzzer utilize lightweight instrumentation to trace the basic block transitions exercised by an input. This incurs a reasonable performance overhead but informs the fuzzer about the increase in code coverage during fuzzing, which makes gray-box fuzzers extremely efficient vulnerability detection tools.
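For example, a libFuzzer fuzz target is a single entry-point function with the signature shown below; `parse_input` is a hypothetical function under test, and compiling the file with `clang -fsanitize=fuzzer` links the coverage-guided driver:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function under test (e.g., a file-format parser). */
extern void parse_input(const uint8_t *data, size_t size);

/* libFuzzer calls this entry point once for every generated input. The
 * coverage instrumentation inserted by the compiler reports which basic
 * block transitions each input exercised, guiding further mutations. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_input(data, size);
    return 0;   /* non-zero return values are reserved by libFuzzer */
}
```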
In the absence of a specification, a fuzzer typically distinguishes only between crashing and non-crashing inputs, because this is a simple and objective measure of failure. Crashes can be easily identified and might indicate potential vulnerabilities (e.g., denial of service or arbitrary code execution). However, the absence of a crash does not indicate the absence of a vulnerability. For instance, a program written in C may or may not crash when an input causes a buffer overflow; rather, the program's behavior is undefined.
To make a fuzzer more sensitive to failures other than crashes, sanitizers can be used to inject assertions that crash the program when a failure is detected. Different sanitizers target different kinds of bugs, for example AddressSanitizer for memory-safety errors, UndefinedBehaviorSanitizer for undefined behavior, and ThreadSanitizer for data races.
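For example, the heap buffer overflow below may silently run to completion in a normal build, but when compiled with AddressSanitizer (`clang -fsanitize=address`) the out-of-bounds write aborts the program immediately, turning a silent memory-safety failure into a crash that the fuzzer can detect:

```c
#include <stdlib.h>
#include <string.h>

/* Copies the input into a fixed-size heap buffer without checking its
 * length. A normal build may survive a slightly-too-long input, but an
 * AddressSanitizer build aborts on the first out-of-bounds byte. */
void handle(const char *input) {
    char *buf = malloc(16);
    if (!buf) return;
    strcpy(buf, input);                       /* heap buffer overflow */
    free(buf);
}

int main(void) {
    handle("AAAAAAAAAAAAAAAAAAAAAAAA");       /* 24 bytes overflow buf */
    return 0;
}
```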
Fuzzing can also be used to detect "differential" bugs if a reference implementation is available. For automated regression testing, the generated inputs are executed on two versions of the same program; if the two versions produce different outputs for the same input, one of them may be buggy and should be examined more closely.
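The differential idea can be sketched as follows, assuming two interchangeable implementations `parse_v1` (the reference) and `parse_v2` (the version under test); both names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reference implementation and implementation under test. */
extern int parse_v1(const uint8_t *data, size_t size);
extern int parse_v2(const uint8_t *data, size_t size);

/* Run the same fuzzed input through both versions; any divergence in the
 * observable result is flagged as a potential bug or regression, even if
 * neither version crashed. */
int check_input(const uint8_t *data, size_t size) {
    int r1 = parse_v1(data, size);
    int r2 = parse_v2(data, size);
    if (r1 != r2) {
        fprintf(stderr, "output mismatch: %d vs %d\n", r1, r2);
        return 1;
    }
    return 0;
}
```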
Previously unreported, triaged bugs might be automatically reported to a bug tracking system. For instance, OSS-Fuzz runs large-scale, long-running fuzzing campaigns for several security-critical software projects where each previously unreported, distinct bug is reported directly to a bug tracker. The OSS-Fuzz bug tracker automatically informs the maintainer of the vulnerable software and checks at regular intervals whether the bug has been fixed in the most recent revision, using the uploaded minimized failure-inducing input.
| Fuzzer | Black/White/Gray-box | Smart/Dumb | Written in | License |
| --- | --- | --- | --- | --- |
| AFL | Gray | Dumb | C | Apache License 2.0 |
| AFL++ | Gray | Dumb | C | Apache License 2.0 |
| AFLFast | Gray | Dumb | C | Apache License 2.0 |
| Angora | Gray | Dumb | C++ | Apache License 2.0 |
| honggfuzz | Gray | Dumb | C | Apache License 2.0 |
| QSYM | | | | |
| SymCC | White | | C++ | GPL, LGPL |
| T-Fuzz | | | | |
| VUzzer | | | | |
| Syzkaller | Gray | | Go | Apache License 2.0 |
| LibFuzzer | White | Dumb | C++ | Apache License 2.0 |