Red Teaming Language Model Detectors with Language Models

Publication: Article, Preprint, 01 Jan 2024. Embargo end date: 01 Jan 2023. United States. English.
Publisher: MIT Press
Journal: Transactions of the Association for Computational Linguistics, volume 12, pages 174-189 (eissn: 2307-387X)

Funded by: NSF | Collaborative Research: S..., NSF | Robust and Generalizable ..., NSF | CAREER: Robustness Verifi... +1 projects

Authors: Zhouxing Shi; Yihan Wang; Fan Yin; Xiangning Chen; Kai-Wei Chang; Cho-Jui Hsieh

doi: 10.1162/tacl_a_00639, 10.48550/arxiv.2305.19713

arXiv: 2305.19713

Abstract

The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Unlike previous work, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study while keeping the generations plausible, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. Code is available at https://github.com/shizhouxing/LLM-Detector-Robustness.
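To make the first strategy concrete, the following is a minimal toy sketch of a greedy word-substitution loop that tries to lower a detector's score. It is not the authors' implementation (see the repository linked above for that). Everything here is an assumption made for illustration: `toy_detector_score` is a hypothetical stand-in for a real detector, `TOY_SYNONYMS` is a hand-written table standing in for the auxiliary LLM's context-aware suggestions, and `substitute_words` and `max_subs` are illustrative names.

```python
from typing import Callable, Dict, List


def toy_detector_score(words: List[str]) -> float:
    """Hypothetical detector: fraction of words found in a 'suspicious' vocabulary."""
    suspicious = {"utilize", "moreover", "delve", "leverage"}
    if not words:
        return 0.0
    return sum(w.lower() in suspicious for w in words) / len(words)


# Hypothetical stand-in for synonym suggestions from an auxiliary LLM.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "utilize": ["use"],
    "moreover": ["also"],
    "delve": ["dig"],
    "leverage": ["use"],
}


def substitute_words(
    words: List[str],
    score_fn: Callable[[List[str]], float],
    synonyms: Dict[str, List[str]],
    max_subs: int,
) -> List[str]:
    """Greedily apply up to max_subs single-word swaps, each lowering score_fn."""
    current = list(words)
    for _ in range(max_subs):
        base = score_fn(current)
        best_gain, best_edit = 0.0, None
        for i, word in enumerate(current):
            for candidate in synonyms.get(word.lower(), []):
                trial = current[:i] + [candidate] + current[i + 1:]
                gain = base - score_fn(trial)
                if gain > best_gain:  # keep the first swap with the largest drop
                    best_gain, best_edit = gain, (i, candidate)
        if best_edit is None:  # no remaining swap lowers the score
            break
        index, candidate = best_edit
        current[index] = candidate
    return current


original = "we utilize models and moreover delve deeper".split()
attacked = substitute_words(original, toy_detector_score, TOY_SYNONYMS, max_subs=2)
print(" ".join(attacked))
print(toy_detector_score(original), toy_detector_score(attacked))
```

The substitution budget `max_subs` reflects the trade-off the abstract alludes to: fewer replacements keep the text closer to the original generation, while more replacements push the detector score lower.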

Country

United States

Related Organizations

University of California, Los Angeles (United States)
University of California System (United States)
University of California, San Francisco (United States)

Keywords

FOS: Computer and information sciences, Artificial intelligence, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence and Image Processing, Linguistics, Machine Learning (cs.LG), Information and Computing Sciences, Computational linguistics. Natural language processing, Cognitive Sciences, Communication and Culture, P98-98.5, Computation and Language (cs.CL), Language

Impact by BIP!

- selected citations: 18. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.