
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

Authors: Shen, Xinyue; Wu, Yixin; Qu, Yiting; Backes, Michael; Zannettou, Savvas; Zhang, Yang

Abstract

HateBench

This is the official repository for the paper "HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns." In this paper, we propose HateBench, a framework designed to benchmark hate speech detectors on LLM-generated content.

Disclaimer. This repo contains examples of hateful and abusive language. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.

Overview

Our artifact repository includes:

  • HateBench, the framework designed to benchmark hate speech detectors on LLM-generated content.
  • HateBenchSet, the manually annotated dataset, comprising 7,838 samples across 34 identity groups, generated by LLMs.
  • Code for reproducing the LLM hate campaign, including both the adversarial hate campaign and the stealthy hate campaign.
  • Scripts to generate the key result tables and figures from the paper, including:
      • Table 3: Performance on LLM-generated samples.
      • Table 4: F1-score on LLM-generated and human-written samples.
      • Table 6: Performance of adversarial hate campaign.
      • Table 8: Performance of model stealing attacks.
      • Table 9: Performance of stealthy hate campaign with black-box attacks.
      • Table 10: Performance of stealthy hate campaign with white-box gradient optimization.

Environment Requirements

All our experiments are tested in a conda environment on Ubuntu 20.04.6 LTS with Python 3.9.0. Tables 3 and 4 can be reproduced directly on a local PC without requiring a GPU environment. For other experiments, we recommend using an environment with NVIDIA GeForce RTX 3090 or more powerful GPUs, such as the RTX 4090 or A100. The results presented in this paper were obtained using an NVIDIA GeForce RTX 3090.

Environment Setup

    conda create -n hatebench python=3.9.0
    conda activate hatebench
    pip install -r requirements.txt

Then

    python
    import nltk
    nltk.download('averaged_perceptron_tagger_eng')
    exit()

HateBench

HateBenchSet

HateBenchSet is provided in measurement/data/HateBenchSet.csv.
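As a quick sanity check after setup, the file can be summarized with the Python standard library alone. The sketch below is illustrative and not part of the repository: the function names `hate_rate_by` and `summarize` are hypothetical, and it assumes the CSV has a header row with the `status` and `hate_label` columns documented for HateBenchSet, with `hate_label` stored as 0 or 1.

```python
import csv
from collections import defaultdict


def hate_rate_by(rows, key):
    """Return {group: (n_samples, hate_fraction)} for the given column.

    `rows` is any iterable of dicts, e.g. a csv.DictReader over
    measurement/data/HateBenchSet.csv. Column names are assumptions
    taken from the README's column description.
    """
    totals = defaultdict(int)
    hates = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        # hate_label: 1 denotes Hate, 0 denotes Non-Hate.
        # float() first so that values such as "1.0" are also accepted.
        hates[group] += int(float(row["hate_label"]))
    return {g: (totals[g], hates[g] / totals[g]) for g in totals}


def summarize(path, key="status"):
    """Read the CSV at `path` and summarize hate_label per value of `key`."""
    with open(path, newline="", encoding="utf-8") as f:
        return hate_rate_by(csv.DictReader(f), key)


if __name__ == "__main__":
    # Tiny in-memory stand-in so the sketch runs without the dataset.
    demo = [
        {"model": "A", "status": "original", "hate_label": "0"},
        {"model": "A", "status": "jailbreak", "hate_label": "1"},
        {"model": "B", "status": "jailbreak", "hate_label": "0"},
    ]
    for group, (n, frac) in sorted(hate_rate_by(demo, "status").items()):
        print(f"{group}: n={n}, hate fraction={frac:.2f}")
```

With the real file in place, `summarize("measurement/data/HateBenchSet.csv", key="model")` would give the same breakdown per generating model, provided the column names match the assumptions above.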
Column         Description
model          Model used to generate responses.
status         Status of the model, i.e., original or jailbreak.
status_prompt  Prompt used to set the model.
main_target    The category of identity groups, e.g., race, religion, etc.
sub_target     The identity group.
target_name    The complete name of the identity group.
pid            Prompt id.
prompt         The prompt.
text           The sample generated by the model.
hate_label     1 denotes Hate, 0 refers to Non-Hate.

Besides, we also provide measurement/data/HateBenchSet_labeled.csv, which is HateBenchSet with the predictions of the six detectors evaluated in our paper. Specifically, for each detector, the predictions are recorded in the following columns:

  • {detector}: the complete record returned by the detector.
  • {detector}_score: the hate score of the sample.
  • {detector}_flagged: whether the sample is predicted as hate or not.

Reproduce Paper results

Table 3:

    python measurement/calculate_detector_performance.py

Table 4:

    python measurement/calculate_detector_LLM_performance.py

LLM-Driven Hate Campaign

Adversarial Hate Campaign (Table 6)

During our experiment, we consider three target models: Perspective, Moderation, and TweetHate. The first two are commercial models, while the last one is an open-source model. Considering the potential ethical risks and the need for API keys when attacking commercial models, we provide a script to reproduce the results of TweetHate presented in Table 6.

    cd hate_campaign
    bash scripts/run_adversarial_hate_campaign.sh

Results will be automatically stored in the ./logs/ directory. The naming convention for the log files is adv_hate_campaign_{target_model}_{attack}.log. For example, to check the results of TextFooler attack on TweetHate model, refer to the log file named adv_hate_campaign_TweetHate_TextFooler.log and the results are printed at the end of the log.
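The naming convention above lends itself to a small helper for locating a log and reading its final lines. This is a minimal sketch, not part of the repository: `log_path` and `tail` are hypothetical names, and the only details taken from the README are the ./logs/ directory and the adv_hate_campaign_{target_model}_{attack}.log pattern.

```python
from collections import deque
from pathlib import Path


def log_path(target_model, attack, log_dir="./logs"):
    """Build the log file path following the README's naming convention."""
    return Path(log_dir) / f"adv_hate_campaign_{target_model}_{attack}.log"


def tail(path, n=20):
    """Return the last `n` lines of a text file, where the results are printed."""
    with open(path, encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


if __name__ == "__main__":
    path = log_path("TweetHate", "TextFooler")
    if path.exists():
        print("\n".join(tail(path)))
    else:
        print(f"{path} not found; run the campaign script first.")
```

The final lines of such a log hold the summary produced by the attack run.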
    ----------------------------------------
    ARGUMENT         VALUE
    attack_strategy  adv_hate_campaign
    target_model     TweetHate
    attack_method    textfooler
    dataset          HateBench
    num_examples     120
    ...
    ----------------------------------------
    [Succeeded / Failed / Skipped / Total] 117 / 3 / 0 / 120: 100%|██████████████████████████████████████████████████████████████████| 120/120 [03:15 [...]

    [...] ./logs/TweetHate_roberta.log &
    nohup python model_stealing.py --target_model TweetHate --surrogate_model bert > ./logs/TweetHate_bert.log &

Check the end of the log file to view the results of model stealing attacks for each target model and surrogate model.

    -------------------- EPOCH 9 --------------------
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 262/262 [01:42 [...]

    [...] ./logs/Perspective_roberta.log &
    nohup python model_stealing.py --target_model Perspective --surrogate_model bert > ./logs/Perspective_bert.log &
    nohup python model_stealing.py --target_model Moderation --surrogate_model roberta > ./logs/Moderation_roberta.log &
    nohup python model_stealing.py --target_model Moderation --surrogate_model bert > ./logs/Moderation_bert.log &

After generating corresponding surrogate models, run the following script to conduct the stealthy hate campaign.

    bash scripts/run_stealthy_hate_campaign_Perspective.sh
    bash scripts/run_stealthy_hate_campaign_Moderation.sh

Ethics & Disclosure

Our work relies on LLMs to generate samples, and all the manual annotations are performed by the authors of this study. Therefore our study is not considered human subjects research by our Institutional Review Board (IRB). Also, by doing annotations ourselves, we ensure that no human subjects were exposed to harmful information during our study. Since our work involves the assessment of LLM-driven hate campaigns, it is inevitable to disclose how attackers can evade a hate speech detector. We have taken great care to responsibly share our findings.
We disclosed the paper and the labeled dataset to OpenAI, Google Jigsaw, and the developers of open-source detectors. In our disclosure letter, we explicitly highlighted the high attack success rates in the LLM-driven hate campaigns. We have received the acknowledgment from OpenAI and Google Jigsaw.

This repo is intended for research purposes only. Any misuse is strictly prohibited.

Citation

If you find this useful in your research, please consider citing:

    @inproceedings{SWQBZZ25,
      author = {Xinyue Shen and Yixin Wu and Yiting Qu and Michael Backes and Savvas Zannettou and Yang Zhang},
      title = {{HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns}},
      booktitle = {{USENIX Security Symposium (USENIX Security)}},
      publisher = {USENIX},
      year = {2025}
    }

  • BIP!
    Impact by BIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average