DL Library Testing and Using LLMs for Fuzzing

Speakers:
Chenyuan Yang, University of Illinois Urbana-Champaign
Yinlin Deng, University of Illinois Urbana-Champaign

Date/Time: Thursday, May 4, 16:30 - 18:00

Talk 1:
Title: Fuzzing Automatic Differentiation in Deep-Learning Libraries (Chenyuan Yang, 16:30 - 17:00)

Abstract: Deep learning (DL) has attracted wide attention and has been widely deployed in recent years. As a result, more and more research effort has been dedicated to testing DL libraries and frameworks. However, existing work largely overlooks one crucial component of any DL system, automatic differentiation (AD), which is the basis for the recent development of DL. To this end, we propose ∇Fuzz, the first general and practical approach specifically targeting the critical AD component in DL libraries. Our key insight is that each DL library API can be abstracted into a function processing tensors/vectors, which can be differentially tested under various execution scenarios (for computing outputs/gradients with different implementations). We have implemented ∇Fuzz as a fully automated API-level fuzzer targeting AD in DL libraries, which uses differential testing across execution scenarios to check both first-order and higher-order gradients, and also includes automated filtering strategies to remove false positives caused by numerical instability. We have performed an extensive study on four of the most popular and actively maintained DL libraries: PyTorch, TensorFlow, JAX, and OneFlow. The results show that ∇Fuzz substantially outperforms state-of-the-art fuzzers in terms of both code coverage and bug detection. To date, ∇Fuzz has detected 173 bugs for the studied DL libraries, with 144 already confirmed by developers (117 of which are previously unknown bugs and 107 are related to AD). Remarkably, ∇Fuzz contributed 58.3% (7/12) of all high-priority AD bugs for PyTorch and JAX during a two-month period. None of the confirmed AD bugs were detected by existing fuzzers.
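The core differential-testing idea can be illustrated with a minimal Python sketch (not the authors' implementation): treat a DL API as a function over tensors and compare the gradient computed under one execution scenario (reverse-mode autograd) against another (a numerical finite-difference estimate). The choice of torch.sin as the API under test is an arbitrary example for illustration.

import torch

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences over each input element; this scenario does not use autograd at all.
    x = x.detach().clone()
    grad = torch.zeros_like(x)
    flat = x.view(-1)
    for i in range(flat.numel()):
        orig = flat[i].item()
        flat[i] = orig + eps
        plus = f(x).sum().item()
        flat[i] = orig - eps
        minus = f(x).sum().item()
        flat[i] = orig
        grad.view(-1)[i] = (plus - minus) / (2 * eps)
    return grad

# API under test; torch.sin is just an arbitrary example.
f = torch.sin
x = torch.randn(4, dtype=torch.float64, requires_grad=True)

# Scenario 1: first-order gradient via reverse-mode autograd.
(auto_grad,) = torch.autograd.grad(f(x).sum(), x)

# Scenario 2: numerical finite-difference estimate of the same gradient.
num_grad = numerical_grad(f, x)

# A mismatch beyond a numerical-instability tolerance flags a potential AD bug.
print(torch.allclose(auto_grad, num_grad, atol=1e-4))

In the actual approach, many more execution scenarios (and higher-order gradients) are compared, and false positives due to numerical instability are filtered automatically.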

Talk 2: 
Title: Large Language Models are Effective Fuzzers (speaker: Yinlin Deng, ~17:15 - 17:45)

Abstract: Deep Learning (DL) systems have become ubiquitous in our everyday lives. Detecting bugs in DL libraries (e.g., TensorFlow and PyTorch) is critical since they provide the building blocks for almost all downstream DL systems. Meanwhile, traditional fuzzing techniques are hardly effective in such a challenging domain, since the input DL programs need to satisfy both the input language's syntax/semantics and the DL APIs' input/shape constraints for tensor computations.
In this talk, we will present our work on directly leveraging Large Language Models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are titanic models trained on billions of code snippets and can autoregressively generate human-like code. Our key insight is that the training corpora of modern LLMs include numerous code snippets invoking DL library APIs, so the models implicitly learn both language syntax/semantics and intricate DL API constraints, enabling valid DL program generation. We will present TitanFuzz, which uses both generative and infilling LLMs (such as Codex and InCoder) to generate and mutate valid, diverse input DL programs for fuzzing. To date, our work has detected over 41 confirmed new bugs in popular DL libraries (such as TensorFlow and PyTorch). This demonstrates that modern titanic LLMs can directly perform both generation-based and mutation-based fuzzing, which have been studied for decades, while being fully automated, generalizable, and applicable to domains that are challenging for traditional approaches (such as DL systems). We hope our work stimulates more research in this exciting direction of LLMs for fuzzing.
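As a rough illustration of the idea (not the TitanFuzz implementation), the following Python sketch asks a code LLM for short programs exercising a target API and runs each program in a fresh interpreter, reporting crashes and hangs. The llm_complete helper is a hypothetical stand-in for an actual model client (e.g., a Codex- or InCoder-style model); the prompt text and the 60-second timeout are illustrative choices.

import subprocess
import tempfile

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in for a code-generating LLM client; replace with a real model call.
    raise NotImplementedError

def fuzz_api(api_name: str, rounds: int = 10):
    # Seed prompt asking the model for a short program that exercises the target API.
    prompt = (
        "# Write a short, self-contained PyTorch program that calls "
        + api_name
        + " with interesting tensor shapes and dtypes.\nimport torch\n"
    )
    suspicious = []
    for _ in range(rounds):
        program = llm_complete(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
            tmp.write(program)
            path = tmp.name
        try:
            # Execute the generated program in a fresh interpreter process.
            result = subprocess.run(["python", path], capture_output=True, timeout=60)
        except subprocess.TimeoutExpired:
            suspicious.append(program)  # hangs are also worth reporting
            continue
        if result.returncode != 0:
            # Crashes (e.g., segfaults or internal assertion failures) mark potential library bugs.
            suspicious.append(program)
    return suspicious

The actual approach additionally mutates generated programs with infilling models and applies oracles beyond crashes (e.g., differential testing across CPU and GPU execution), but the generate-then-execute loop above captures the basic workflow.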
