ISSTA 2024
Mon 16 - Fri 20 September 2024 Vienna, Austria
co-located with ISSTA/ECOOP 2024

This program is tentative and subject to change.

Fri 20 Sep 2024 13:50 - 14:10 at EI 7 - LLMs for Code

Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, because they are pretrained on massive, uncritically filtered data, prior studies have shown that code LLMs are prone to generating code with potential vulnerabilities. Existing approaches to mitigate this risk involve crafting vulnerability-free data and then retraining or fine-tuning the model. As model sizes exceed a billion parameters, the computation and data demands of these approaches become enormous. Moreover, a growing number of code LLMs are distributed as services, where the internal representation is not accessible and the API is the only way to reach the LLM, making the prior mitigation strategies inapplicable.

To cope with this, we propose CoSec, an on-the-fly Security hardening method for code LLMs based on security model-guided Co-decoding, which reduces the likelihood that code LLMs generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained security model assigns higher confidence to secure tokens, it guides the target base model toward more secure code generation. By adjusting the probability distribution over tokens at each step of the decoding process, our approach effectively influences the generation tendencies without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameter scales in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective in security hardening. Specifically, our approach improves the average security ratio of six base models by 5.02%-37.14% while maintaining the functional correctness of the target model.
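The abstract's core mechanism — adjusting the token distribution at each decoding step using a smaller security model — can be illustrated with a minimal sketch. The log-linear mixture below, the `alpha` weight, and the token distributions are illustrative assumptions, not the paper's exact formulation; the point is only that tokens the security model prefers (e.g., `strncpy` over `strcpy`) get boosted without touching the base model's parameters.

```python
import math

def co_decode_step(base_probs, sec_probs, alpha=0.5):
    """Combine base-model and security-model next-token distributions.

    base_probs, sec_probs: dicts mapping token -> probability.
    alpha: hypothetical weight on the security model's guidance.
    Returns the adjusted, renormalized distribution.
    """
    adjusted = {}
    for tok, p in base_probs.items():
        q = sec_probs.get(tok, 1e-12)  # tiny floor for unseen tokens
        # Log-linear mixture: boost tokens the security model favors.
        adjusted[tok] = math.exp((1 - alpha) * math.log(p) + alpha * math.log(q))
    total = sum(adjusted.values())
    return {tok: v / total for tok, v in adjusted.items()}

# Toy example: the base model slightly prefers the unsafe strcpy,
# but the security model strongly prefers the bounded strncpy.
base = {"strcpy": 0.6, "strncpy": 0.4}
sec = {"strcpy": 0.1, "strncpy": 0.9}
mixed = co_decode_step(base, sec, alpha=0.5)
# In the mixed distribution, strncpy now outranks strcpy.
```

Because the combination only needs the base model's per-step token probabilities, this style of guidance is compatible with API-served models that expose output distributions but not internal weights, matching the deployment constraint the abstract describes.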


Fri 20 Sep

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

13:30 - 14:50
LLMs for Code (Technical Papers) at EI 7
13:30
20m
Talk
Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code
Technical Papers
Yujia Chen (Harbin Institute of Technology, Shenzhen); Cuiyun Gao (Harbin Institute of Technology); Zezhou Yang (Harbin Institute of Technology); Hongyu Zhang (Chongqing University); Qing Liao (Harbin Institute of Technology)
13:50
20m
Talk
CoSec: On-the-Fly Security Hardening of Code LLMs via Supervised Co-Decoding
Technical Papers
Dong Li (Chongqing University, China); Meng Yan (Chongqing University); Yaosheng Zhang (Chongqing University); Zhongxin Liu (Zhejiang University); Chao Liu (Chongqing University); Xiaohong Zhang (Chongqing University); Ting Chen (University of Electronic Science and Technology of China); David Lo (Singapore Management University)
14:10
20m
Talk
Oracle-guided Program Selection from Large Language Models
Technical Papers
Zhiyu Fan (National University of Singapore); Haifeng Ruan (National University of Singapore); Sergey Mechtaev (University College London); Abhik Roychoudhury (National University of Singapore)
14:30
20m
Talk
How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation
Technical Papers
Cen Zhang (Nanyang Technological University); Yaowen Zheng (Nanyang Technological University); Mingqiang Bai (Institute of Information Engineering, CAS; School of Cyber Security, UCAS); Yeting Li (Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences); Wei Ma (Nanyang Technological University, Singapore); Xiaofei Xie (Singapore Management University); Yuekang Li (The University of New South Wales); Limin Sun (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences); Yang Liu (Nanyang Technological University)