ZENODO
Conference object / Article . 2025-2026
License: CC BY
Data sources: ZENODO, Datacite
7 versions

Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Authors: Xiuming, Liu; Chaoxiang, He; Xuanran, Yu; Jichen, Chai; Feiyue, Xu; Sheng, Hang; Hanqing, Hu; +5 Authors


Abstract

Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Paper ID: #1781 (USENIX Security '26)
Title: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

📖 Overview

This repository contains the artifact of the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" in USENIX Security 2026. It provides all necessary data, code, and scripts to validate our claims and reproduce the experimental results reported in the paper.

We provide two primary modes for evaluation:

  • Instant Verification (brief.py): Instantly query and retrieve specific experimental results (Accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.
  • Full Reproduction (quick_start.ipynb): A comprehensive Jupyter Notebook that executes the entire pipeline, from model downloading and hidden state extraction to training probes and evaluating sentinels, using Qwen3-4B-Instruct as a representative example.

📂 Directory Structure

    .
    ├── data/                  # Datasets used for training probes and conducting attacks (e.g., fixed.json, adversarial prompts).
    ├── models/                # Directory where LLMs and baseline models will be downloaded.
    ├── results/               # JSON files containing the finalized experimental metrics reported in the paper.
    └── src/                   # Source code and executable scripts.
        ├── brief.py           # CLI tool to query results directly from the 'results/' folder.
        └── quick_start.ipynb  # The main reproduction notebook (End-to-End pipeline).

💻 Hardware & Performance Reference

The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B-Instruct.

  • Recommended GPU: 1x NVIDIA A100 (40GB VRAM) is fully sufficient.
  • Estimated Runtime: Approximately 3 hours for the complete pipeline (taking Qwen3-4B as an example: model download → extraction → training → evaluation).

🚀 Getting Started

1. Environment Setup

Our experiments were conducted on Ubuntu 22.04 using Python 3.10.19. Detailed environment configuration steps (Conda setup, pip installs) are provided inside the src/quick_start.ipynb notebook. The notebook contains a dedicated setup cell that allows you to configure the environment and verify dependencies immediately.

2. Mode A: Instant Result Verification (via brief.py)

If you wish to quickly verify specific numbers cited in the paper (e.g., Table 2) without running the full GPU pipeline, you can use the brief.py script. It parses the stored data in the results/ folder.

Usage:

    cd src
    python brief.py --model --method --dataset

Example: To retrieve the performance of our Multi-layer sentinel on Qwen3-4B against the Sneaky dataset:

    python brief.py --model Qwen3-4B --method Multi-layer --dataset Sneaky

Note: Run python brief.py --help to see all available models, methods, and datasets.

3. Mode B: Full Reproduction (via quick_start.ipynb)

For a complete verification of the methodology, use the provided Jupyter Notebook. This notebook integrates the entire workflow:

  • Environment Config: Sets up the lac conda environment.
  • Model Download: Automatically fetches open source models (Qwen3-4B, etc.) and baselines (Llama-Guard, etc.).
  • Extraction: Extracts hidden states from the LLM.
  • Training: Trains single-layer probes and the Multi-Layer Sentinel.
  • Evaluation: Tests the Sentinel against adversarial datasets and compares it with baselines.

How to Run:

  1. Navigate to the src/ directory.
  2. Launch Jupyter Lab or Notebook.
  3. Open quick_start.ipynb.
  4. Execute cells sequentially. The notebook contains detailed markdown descriptions linking each code block to specific sections and figures in the paper.

⚙️ Customization & Other Models

The default reproduction script is configured for Qwen3-4B to ensure execution within reasonable time and memory limits.
To reproduce results for other models cited in the paper (e.g., Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct), you need to modify two variables in the notebook:

  • Model ID: Update the Hugging Face model loading path (e.g., change Qwen/Qwen3-4B to the desired model).
  • Layer Count: Update the layer range loop (e.g., Qwen3-4B has 36 layers, while Llama-3.1-8B-Instruct has 32 layers).

🔗 Correspondence with Open Science Policy

In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the specific commitments made in the Open Science section of our paper. The mapping below relates our paper's promises to the components provided in this artifact.

🔒 Open Science Commitment and Artifact Mapping

This section maps the artifacts provided in this repository to the specific commitments made in the paper's Open Science section.

1. Source Code
  • Description in Paper: We committed to releasing the "Full codebase for our probing framework" including scripts for training, sentinel construction, and evaluation metrics.
  • Corresponding Artifact Component: src/ folder. Contains the implementation using Hugging Face Transformers and PyTorch. The quick_start.ipynb notebook embodies the full pipeline code.

2. Data Access
  • Description in Paper: We promised to provide "scripts and instructions to reconstruct" the training sets (toxic: filtered text from NSFW-56k; non-toxic: GPT-4o-generated "home scene" prompts) and evaluation benchmarks (I2P, Sneaky, MMA, Labelled).
  • Corresponding Artifact Component: data/ folder. Contains the prepared datasets (e.g., fixed.json, adversarial prompts) necessary to run the pipeline immediately without requiring external scraping.

3. Probe Training
  • Description in Paper: We stated that "Training code and hyperparameters are provided; trained probes are not released".
  • Corresponding Artifact Component: src/quick_start.ipynb (Training Step). The notebook contains the exact training logic and hyperparameters. It trains the probes from scratch locally, fulfilling the requirement to provide code rather than pre-trained binaries.

4. Hidden States
  • Description in Paper: We noted that "Precomputed hidden states are not released" but our code supports "on-the-fly extraction using publicly available LLMs".
  • Corresponding Artifact Component: src/quick_start.ipynb (Extraction Step). The notebook demonstrates on-the-fly extraction from the local LLM (downloaded to models/), verifying that our method does not rely on cached proprietary tensors.

5. Reproducibility
  • Description in Paper: We committed to providing a README to "reproduce key results (e.g., Table 2, Figure 5)".
  • Corresponding Artifact Component: src/brief.py and results/. The brief.py script allows instantaneous verification of the final data points reported in the paper (Table 2). The quick_start.ipynb notebook allows full regeneration of these results.
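The customization step described earlier (changing the model ID and the layer count together) can be sketched as a small lookup, so the two values cannot drift apart. This is only an illustration, not the notebook's actual code: `ModelSpec`, `MODEL_SPECS`, and `layer_range` are hypothetical names, the Llama repository ID is an assumption, and the 0-based layer indexing is assumed rather than taken from the artifact. The layer counts (36 and 32) are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical container for the two values the notebook asks you to edit."""
    model_id: str    # Hugging Face model loading path
    num_layers: int  # number of transformer layers to loop over


# Layer counts come from the README text; the Llama repository ID is an
# assumption and may differ from what the notebook actually loads.
MODEL_SPECS = {
    "Qwen3-4B": ModelSpec("Qwen/Qwen3-4B", 36),
    "Llama-3.1-8B-Instruct": ModelSpec("meta-llama/Llama-3.1-8B-Instruct", 32),
}


def layer_range(model_name: str) -> range:
    """Return the (assumed 0-based) layer indices to loop over for a model."""
    try:
        spec = MODEL_SPECS[model_name]
    except KeyError:
        raise ValueError(
            f"unknown model {model_name!r}; add it to MODEL_SPECS"
        ) from None
    return range(spec.num_layers)


if __name__ == "__main__":
    for name, spec in MODEL_SPECS.items():
        layers = layer_range(name)
        print(f"{name}: load {spec.model_id}, layers {layers[0]}..{layers[-1]}")
```

If the Transformers library is installed, the layer count can usually be read from the model configuration (`AutoConfig.from_pretrained(model_id).num_hidden_layers`) instead of being hard-coded; whether the notebook does this is not stated in the README.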
