ISSTA 2024
Mon 16 - Fri 20 September 2024 Vienna, Austria
co-located with ISSTA/ECOOP 2024

As bugs are inevitable and prevalent in real-world programs, many Automated Program Repair (APR) techniques have been proposed to generate patches for them. However, due to the lack of a standard for evaluating APR techniques, prior works tend to use different settings and benchmarks in evaluation, threatening the trustworthiness of the evaluation results. Additionally, they typically adopt only plausibility and genuineness as evaluation metrics, which may mask underlying issues in APR techniques. To overcome these issues, in this paper, we conduct an extensive and multi-dimensional evaluation of nine learning-based and three traditional state-of-the-art APR techniques under the same environment and settings. We employ the widely studied Defects4J V2.0.0 benchmark, as well as a newly constructed large-scale mutation-based benchmark, which is derived from Defects4J and includes 1,700 artificial bugs generated by a variety of mutators, to uncover potential limitations in these APR techniques. We also introduce multi-dimensional metrics, including compilability/plausibility/genuineness metrics, as well as SYE (syntactic equivalence) and TCE (trivial compiler equivalence) metrics, to analyze the 1,814,652 generated patches. The paper presents noteworthy findings from the evaluation of diverse APR techniques: First, recent APR techniques harnessing Large Language Models (LLMs) are less susceptible to overfitting on the Defects4J V1.2.0 dataset and fix the largest number of bugs. Second, the study suggests a promising future for combining traditional and learning-based APR techniques, as they exhibit complementary advantages in fixing different types of bugs. Additionally, this work highlights the necessity of further enhancing the patch compilability of learning-based techniques, despite the presence of various existing strategies.
The study also reveals other guidelines for enhancing APR, including the need for handling unresolvable symbol compilability issues and reducing duplicate/no-op patch generation. Finally, our study uncovers seven implementation issues in the studied techniques, with five of them confirmed and fixed by the corresponding authors.
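
As a rough illustration of what counting duplicate and no-op patches can look like, the following Python sketch classifies candidate patches by comparing token sequences after stripping comments and whitespace. It is a simplified stand-in and not the SYE or TCE procedure used in the study; the function names `normalize` and `classify_patches`, the tokenization rule, and the sample snippets are all hypothetical, and string literals containing comment markers are not handled.

```python
import re
from collections import Counter


def normalize(java_src: str) -> str:
    """Crude normalization (an assumption, not the paper's SYE definition):
    drop comments, then reduce the code to a space-joined token sequence."""
    no_block = re.sub(r"/\*.*?\*/", " ", java_src, flags=re.DOTALL)
    no_line = re.sub(r"//[^\n]*", " ", no_block)
    # Identifiers/keywords, numbers, or any single non-space character.
    tokens = re.findall(r"[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|\S", no_line)
    return " ".join(tokens)


def classify_patches(buggy: str, patches: list[str]) -> Counter:
    """Label each patch as 'no-op' (same tokens as the buggy code),
    'duplicate' (same tokens as an earlier patch), or 'unique'."""
    buggy_key = normalize(buggy)
    seen = set()
    counts = Counter()
    for patch in patches:
        key = normalize(patch)
        if key == buggy_key:
            counts["no-op"] += 1
        elif key in seen:
            counts["duplicate"] += 1
        else:
            seen.add(key)
            counts["unique"] += 1
    return counts


if __name__ == "__main__":
    buggy = "if (a > b) { return a; }"
    patches = [
        "if (a > b) { return a; } // unchanged",  # differs only by a comment
        "if (a >= b) { return a; }",
        "if (a>=b) {\n    return a;\n}",          # same tokens, new layout
        "if (a < b) { return a; }",
    ]
    print(classify_patches(buggy, patches))
```

A token-level comparison like this only catches patches that differ in layout or comments; patches that are semantically identical but written differently would need a compiler-based check, which this sketch does not attempt.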