
Recently, there has been a growing interest in automatic software vulnerability detection.
Pre-trained model-based approaches have demonstrated superior performance to other Deep Learning (DL)-based approaches in detecting vulnerabilities.
However, existing pre-trained model-based approaches generally employ code sequences as input during prediction and may ignore vulnerability-related structural information, as reflected in the following two aspects.
First, they tend to fail to infer the semantics of code statements with complex logic, such as those containing multiple operators and pointers.
Second, they struggle to comprehend the various code execution sequences, an ability that is essential for precise vulnerability detection.

To mitigate these challenges, we propose a {\textbf{S}tructured Natural Language \textbf{C}omment tree-based} vulner\textbf{A}bi\textbf{L}ity d\textbf{E}tection framework built on pre-trained models, named \textbf{\tool}. The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences based on Abstract Syntax Trees (ASTs). Specifically, \tool comprises three main modules:
(1) \textit{Comment Tree Construction}, which enhances the model's ability to infer the semantics of code statements by first employing Large Language Models (LLMs) to generate comments and then adding the comment nodes to the ASTs.
(2) \textit{Structured Natural Language Comment Tree Construction}, which explicitly encodes code execution sequences by combining code syntax templates with the comment tree.
(3) \textit{SCT-Enhanced Representation}, which finally incorporates the constructed SCTs to better capture vulnerability patterns; a minimal sketch of the pipeline follows this list.
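
The abstract does not include implementation details; the following minimal Python sketch only illustrates how the first two modules could fit together. The `ASTNode` class, the `generate_comment` stub (standing in for the actual LLM call), and the `IF`/`END_IF` syntax template are hypothetical names chosen for illustration, not \tool's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ASTNode:
    kind: str                      # e.g., "if_statement", "expression_statement"
    text: str                      # the node's own source text (header only, for brevity)
    children: List["ASTNode"] = field(default_factory=list)

def generate_comment(statement: str) -> str:
    # Stub standing in for the LLM that summarizes a statement in natural language.
    return f"// summarizes: {statement}"

def build_comment_tree(root: ASTNode) -> ASTNode:
    """Module (1): attach an LLM-generated comment node to every statement."""
    for child in root.children:
        build_comment_tree(child)
    if root.kind.endswith("statement"):
        root.children.append(ASTNode("comment", generate_comment(root.text)))
    return root

def linearize_sct(node: ASTNode) -> List[str]:
    """Module (2): flatten the comment tree into a structured natural-language
    sequence, wrapping control-flow nodes in syntax templates (here, assumed
    IF/END_IF markers) so the execution order stays explicit."""
    comments = [c.text for c in node.children if c.kind == "comment"]
    body = [tok for c in node.children if c.kind != "comment"
            for tok in linearize_sct(c)]
    if node.kind == "if_statement":
        return ["IF", node.text] + comments + body + ["END_IF"]
    return [node.text] + comments + body

# Toy example: a null-checked pointer write.
tree = ASTNode("if_statement", "if (p != NULL)",
               [ASTNode("expression_statement", "*p = 0;")])
print(" ".join(linearize_sct(build_comment_tree(tree))))
# IF if (p != NULL) // summarizes: if (p != NULL) *p = 0; // summarizes: *p = 0; END_IF
```

Module (3) would then feed such a linearized SCT sequence to a pre-trained encoder (e.g., CodeBERT or UniXcoder) to learn vulnerability patterns.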
Experimental results demonstrate that \tool outperforms the best-performing baselines, including pre-trained models and LLMs, with F1 score improvements of 2.96%, 13.47%, and 3.75% on the FFmpeg+Qemu, Reveal, and SVulD datasets, respectively.
Furthermore, \tool can be applied to different pre-trained models, such as CodeBERT and UniXcoder, yielding F1 score improvements ranging from 1.37% to 10.87%.