Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code
In the field of code intelligence, effectively modeling long-range code poses a significant challenge. Existing pre-trained language models (PLMs) such as UniXcoder have achieved remarkable success, but they still struggle with long code inputs, mainly due to their limited capacity to maintain contextual continuity and memorize key information over long-range code. To alleviate these difficulties, we propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code. EXPO incorporates two memory mechanisms proposed in this paper: Bridge Memory and Hint Memory. Bridge Memory uses a tagging mechanism to connect disparate snippets of long-range code, helping the model maintain contextual coherence. Hint Memory focuses on crucial code elements throughout the global context, such as package imports, by integrating a kNN attention layer that adaptively selects the relevant code elements. This dual-memory approach bridges the gap between understanding local code snippets and maintaining global code coherence, thereby enhancing the model's overall comprehension of long code sequences. We validate the effectiveness of EXPO on five popular pre-trained language models, including UniXcoder, and on two code intelligence tasks: API recommendation and vulnerability detection. Experimental results demonstrate that EXPO significantly improves the pre-trained language models.
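To make the Hint Memory idea concrete, below is a minimal sketch of a kNN attention layer over a memory of globally important code elements (e.g., import statements). All names, tensor shapes, and the top-k selection rule are illustrative assumptions, not EXPO's actual implementation.

```python
# Minimal sketch of kNN attention over a "Hint Memory" of code-element
# embeddings. Illustrative only: names, shapes, and the scoring rule are
# assumptions; this is not the paper's implementation.
import torch
import torch.nn.functional as F

def knn_attention(query, hint_keys, hint_values, k=8):
    """Attend only over the k memory entries nearest to each query.

    query:       (batch, d)       hidden states of the current code window
    hint_keys:   (num_hints, d)   embeddings of globally important code elements
    hint_values: (num_hints, d)   values carried by those elements
    """
    # Scaled similarity between each query and every memory entry.
    scores = query @ hint_keys.T / hint_keys.shape[-1] ** 0.5   # (batch, num_hints)

    # Keep only the top-k most similar hints per query.
    topk_scores, topk_idx = scores.topk(k=min(k, hint_keys.shape[0]), dim=-1)
    weights = F.softmax(topk_scores, dim=-1)                    # (batch, k)

    # Gather the selected values and take their weighted sum.
    selected = hint_values[topk_idx]                            # (batch, k, d)
    return (weights.unsqueeze(-1) * selected).sum(dim=1)        # (batch, d)

# Usage: fuse hints into the representation of a long-code window.
hidden = torch.randn(4, 256)    # current window representations
keys = torch.randn(32, 256)     # e.g., embeddings of import statements
values = torch.randn(32, 256)
hint_context = knn_attention(hidden, keys, values, k=8)
```

Restricting attention to the k nearest memory entries keeps the cost independent of how many global code elements are stored, which is what makes this kind of adaptive selection attractive for long inputs.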
Fri 20 Sep (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
13:30 - 14:50
13:30 (20m) Talk: Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code (Technical Papers)
Yujia Chen (Harbin Institute of Technology), Cuiyun Gao (Harbin Institute of Technology), Zezhou Yang (Harbin Institute of Technology), Hongyu Zhang (Chongqing University), Qing Liao (Harbin Institute of Technology)

13:50 (20m) Talk: CoSec: On-the-Fly Security Hardening of Code LLMs via Supervised Co-decoding (Technical Papers)
Dong Li (Chongqing University), Meng Yan (Chongqing University), Yaosheng Zhang (Chongqing University), Zhongxin Liu (Zhejiang University), Chao Liu (Chongqing University), Xiaohong Zhang (Chongqing University), Ting Chen (University of Electronic Science and Technology of China), David Lo (Singapore Management University)

14:10 (20m) Talk: Oracle-Guided Program Selection from Large Language Models (Technical Papers)
Zhiyu Fan (National University of Singapore), Haifeng Ruan (National University of Singapore), Sergey Mechtaev (Peking University), Abhik Roychoudhury (National University of Singapore)

14:30 (20m) Talk: How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation (Technical Papers)
Cen Zhang (Nanyang Technological University), Yaowen Zheng (Nanyang Technological University), Mingqiang Bai (Institute of Information Engineering at Chinese Academy of Sciences / University of Chinese Academy of Sciences), Yeting Li (Institute of Information Engineering at Chinese Academy of Sciences / University of Chinese Academy of Sciences), Wei Ma (Nanyang Technological University), Xiaofei Xie (Singapore Management University), Yuekang Li (UNSW), Limin Sun (Institute of Information Engineering at Chinese Academy of Sciences / University of Chinese Academy of Sciences), Yang Liu (Nanyang Technological University)