Guardian: A Runtime Framework for LLM-Based UI Exploration
Feature-based UI tests have been indispensable for ensuring the quality of mobile applications (\textit{apps} for short).
The high manual labor cost of creating such tests has led to strong interest in \textit{automated feature-based UI testing}, where, given only a high-level \emph{test objective description}, an approach automatically explores the app under test (AUT) to find a correct sequence of UI events that achieves the target test objective.
Given that automated feature-based UI testing resembles conventional AI planning problems, large language models (LLMs), known for their effectiveness in AI planning, could be ideal for this task.
However, our study reveals that LLMs struggle to follow specific instructions for UI testing and to re-plan based on newly acquired information. This limitation reduces the effectiveness of LLM-driven solutions for automated feature-based UI testing, despite the use of advanced prompting techniques.
To address the preceding limitation, we propose Guardian, a runtime system framework that improves the effectiveness of automated feature-based UI testing by offloading computational tasks from LLMs via two major strategies.
First, Guardian refines the UI action space that the LLM can plan over, enforcing the LLM's instruction following by construction.
Second, Guardian deliberately checks whether the gradually enriched information invalidates the LLM's previous planning.
Guardian removes the invalidated UI actions from the UI action space that the LLM can plan over, restores the AUT to its state before the execution of the invalidated UI actions, and prompts the LLM to re-plan over the new UI action space.
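The runtime loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the class and function names (\textit{ToyAUT}, \textit{guardian\_loop}, \textit{refine\_action\_space}) and the toy app model are all assumptions made for illustration. The sketch shows the core idea: the LLM planner only ever sees a refined (validated) action space, and when an executed action is invalidated by new information, the framework rolls the AUT back to its prior state and re-plans.

```python
class ToyAUT:
    """Hypothetical stand-in for an app under test: screens are ints,
    actions are (src, dst) transitions; any transition into screen 9
    turns out to be a dead end discovered only at runtime."""
    def __init__(self, goal=2):
        self.screen, self.goal = 0, goal
    def state(self): return self.screen
    def snapshot(self): return self.screen          # capture restorable state
    def restore(self, snap): self.screen = snap     # roll back the AUT
    def objective_reached(self): return self.screen == self.goal
    def available_actions(self):
        return [(self.screen, 9), (self.screen, self.screen + 1)]
    def execute(self, action): self.screen = action[1]
    def invalidates(self, action):
        return action[1] == 9                       # new info: dead-end screen

def refine_action_space(actions, invalidated):
    """Remove invalidated actions so the LLM cannot plan over them."""
    return [a for a in actions if a not in invalidated]

def guardian_loop(aut, llm_plan, max_steps=20):
    invalidated, trace = set(), []
    for _ in range(max_steps):
        if aut.objective_reached():
            break
        actions = refine_action_space(aut.available_actions(), invalidated)
        if not actions:
            break                                   # nothing valid left to try
        action = llm_plan(aut.state(), actions)     # planner sees refined space
        snap = aut.snapshot()
        aut.execute(action)
        if aut.invalidates(action):
            invalidated.add(action)                 # shrink the action space
            aut.restore(snap)                       # restore pre-action state
        else:
            trace.append(action)                    # keep progress, re-plan next
    return trace

# A naive stand-in planner that always picks the first available action;
# in Guardian this role is played by the LLM.
aut = ToyAUT()
trace = guardian_loop(aut, lambda state, acts: acts[0])
```

Even with a planner that repeatedly picks dead-end actions, the loop converges: each invalidated action is pruned from the space and its effect undone, so the eventual trace contains only valid steps toward the objective.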
We instantiate Guardian with ChatGPT and construct a benchmark named \textit{FestiVal} with 58 tasks from 23 highly popular apps.
Evaluation results on FestiVal show that Guardian achieves a 48.3% success rate and a 64.0% average completion proportion, outperforming state-of-the-art approaches with 154% and 132% relative improvements on the two metrics, respectively.
Wed 18 Sep (time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
10:30 - 11:50

10:30, 20m Talk | Toward the Automated Localization of Buggy Mobile App UIs from Bug Descriptions (Technical Papers). Antu Saha (William & Mary), Yang Song (William & Mary), Junayed Mahmud (University of Central Florida), Ying Zhou (George Mason University), Kevin Moran (University of Central Florida), Oscar Chaparro (William & Mary)
10:50, 20m Talk | Reproducing Timing-Dependent GUI Flaky Tests in Android Apps via a Single Event Delay (Technical Papers). Xiaobao Cai (Fudan University), Zhen Dong (Fudan University), Yongjiang Wang (Fudan University), Abhishek Tiwari (University of Passau), Xin Peng (Fudan University)
11:10, 20m Talk | Semantic Constraint Inference for Web Form Test Generation (Technical Papers). Parsa Alian (University of British Columbia), Noor Nashid (University of British Columbia), Mobina Shahbandeh (University of British Columbia), Ali Mesbah (University of British Columbia)
11:30, 20m Talk | Guardian: A Runtime Framework for LLM-Based UI Exploration (Technical Papers). Dezhi Ran (Peking University), Hao Wang (Peking University), Zihe Song (University of Texas at Dallas), Mengzhou Wu (Peking University), Yuan Cao (Peking University), Ying Zhang (Peking University), Wei Yang (University of Texas at Dallas), Tao Xie (Peking University)