
arXiv: 2405.03654
Abstract This paper examines a possible security flaw in large language models (LLMs), specifically their capacity to identify malicious intent within intricate or ambiguous queries. We find that LLMs can overlook the malicious nature of highly veiled requests, even when the malevolent text in those queries is left unaltered, exposing a significant weakness in their content analysis systems. Specifically, we identify and analyze two aspects of this vulnerability: (i) LLMs' diminished ability to perceive maliciousness when parsing extremely obscured queries, and (ii) LLMs' inability to discern malicious intent in queries deliberately rewritten to increase their ambiguity by modifying the malevolent content itself. To illustrate and address this problem, we propose a theoretical framework and analytical strategy and introduce a novel black-box jailbreak attack technique called IntentObfuscator. This technique exploits the identified vulnerability by concealing the genuine intentions behind user prompts, thereby compelling LLMs to inadvertently produce restricted content and circumvent their built-in content safety protocols. We elaborate on two specific applications within this framework, "Obscure Intention" and "Create Ambiguity," which manipulate the complexity and ambiguity of queries to evade detection of malicious intent. We empirically confirm the efficacy of the IntentObfuscator approach across various models, including ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which has 100 million weekly active users, yielded a success rate of 83.65%.
Additionally, we validate our approach across a range of sensitive content categories, including graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal techniques, further highlighting the considerable impact of our findings on refining "Red Team" tactics against LLM content security frameworks.
FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
| Indicator | Description | Value |
| --- | --- | --- |
| selected citations | Citations derived from selected sources; an alternative to the "influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 2 |
| popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
