Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Paper ID: #1781 (USENIX Security '26)
Title: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

📖 Overview

This repository contains the artifact for the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" (USENIX Security 2026). It provides the data, code, and scripts needed to validate our claims and reproduce the experimental results reported in the paper.

We provide two modes of evaluation:

- Instant Verification (brief.py): Query specific experimental results (Accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.
- Full Reproduction (quick_start.ipynb): A Jupyter Notebook that executes the entire pipeline, from model download and hidden-state extraction to probe training and sentinel evaluation, using Qwen3-4B-Instruct as a representative example.

📂 Directory Structure

    .
    ├── data/                  # Datasets used for training probes and conducting attacks (e.g., fixed.json, adversarial prompts).
    ├── models/                # Directory where LLMs and baseline models will be downloaded.
    ├── results/               # JSON files containing the finalized experimental metrics reported in the paper.
    └── src/                   # Source code and executable scripts.
        ├── brief.py           # CLI tool to query results directly from the 'results/' folder.
        └── quick_start.ipynb  # The main reproduction notebook (end-to-end pipeline).

💻 Hardware & Performance Reference

The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B-Instruct.

- Recommended GPU: 1x NVIDIA A100 (40GB VRAM) is sufficient.
- Estimated Runtime: Approximately 3 hours for the complete pipeline, taking Qwen3-4B as an example (model download → extraction → training → evaluation).

🚀 Getting Started

1. Environment Setup

Our experiments were conducted on Ubuntu 22.04 using Python 3.10.19.
Detailed environment configuration steps (Conda setup, pip installs) are provided inside the src/quick_start.ipynb notebook. The notebook contains a dedicated setup cell that configures the environment and verifies dependencies.

2. Mode A: Instant Result Verification (via brief.py)

To verify specific numbers cited in the paper (e.g., Table 2) without running the full GPU pipeline, use the brief.py script. It parses the stored data in the results/ folder.

Usage:

    cd src
    python brief.py --model <MODEL> --method <METHOD> --dataset <DATASET>

Example: To retrieve the performance of our Multi-layer sentinel on Qwen3-4B against the Sneaky dataset:

    python brief.py --model Qwen3-4B --method Multi-layer --dataset Sneaky

Note: Run python brief.py --help to see all available models, methods, and datasets.

3. Mode B: Full Reproduction (via quick_start.ipynb)

For a complete verification of the methodology, use the provided Jupyter Notebook. The notebook integrates the entire workflow:

- Environment Config: Sets up the lac conda environment.
- Model Download: Automatically fetches open-source models (Qwen3-4B, etc.) and baselines (Llama-Guard, etc.).
- Extraction: Extracts hidden states from the LLM.
- Training: Trains single-layer probes and the Multi-Layer Sentinel.
- Evaluation: Tests the Sentinel against adversarial datasets and compares it with baselines.

How to Run:

   1. Navigate to the src/ directory.
   2. Launch Jupyter Lab or Notebook.
   3. Open quick_start.ipynb.
   4. Execute cells sequentially.

The notebook contains detailed markdown descriptions linking each code block to specific sections and figures in the paper.

⚙️ Customization & Other Models

The default reproduction script is configured for Qwen3-4B to ensure execution within reasonable time and memory limits.
To reproduce results for other models cited in the paper (e.g., Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct), modify two variables in the notebook:

- Model ID: Update the Hugging Face model loading path (e.g., change Qwen/Qwen3-4B to the desired model).
- Layer Count: Update the layer range loop (e.g., Qwen3-4B has 36 layers, while Llama-3.1-8B-Instruct has 32 layers).

🔗 Correspondence with Open Science Policy

In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the specific commitments made in the Open Science section of our paper.

🔒 Open Science Commitment and Artifact Mapping

This section maps the artifacts provided in this repository to the specific commitments made in the paper's Open Science section.

1. Source Code
   Description in Paper: We committed to releasing the "Full codebase for our probing framework", including scripts for training, sentinel construction, and evaluation metrics.
   Corresponding Artifact Component: src/ folder. Contains the implementation using Hugging Face Transformers and PyTorch. The quick_start.ipynb notebook contains the full pipeline code.

2. Data Access
   Description in Paper: We promised to provide "scripts and instructions to reconstruct" the training sets (toxic: filtered text from NSFW-56k; non-toxic: GPT-4o-generated "home scene" prompts) and evaluation benchmarks (I2P, Sneaky, MMA, Labelled).
   Corresponding Artifact Component: data/ folder. Contains the prepared datasets (e.g., fixed.json, adversarial prompts) necessary to run the pipeline immediately without external scraping.

3. Probe Training
   Description in Paper: We stated that "Training code and hyperparameters are provided; trained probes are not released".
   Corresponding Artifact Component: src/quick_start.ipynb (training step). The notebook contains the exact training logic and hyperparameters.
   It trains the probes from scratch locally, fulfilling the commitment to provide code rather than pre-trained binaries.

4. Hidden States
   Description in Paper: We noted that "Precomputed hidden states are not released" but that our code supports "on-the-fly extraction using publicly available LLMs".
   Corresponding Artifact Component: src/quick_start.ipynb (extraction step). The notebook demonstrates on-the-fly extraction from the local LLM (downloaded to models/), showing that our method does not rely on cached proprietary tensors.

5. Reproducibility
   Description in Paper: We committed to providing a README to "reproduce key results (e.g., Table 2, Figure 5)".
   Corresponding Artifact Component: src/brief.py and results/. The brief.py script allows instant verification of the final data points reported in the paper (Table 2). The quick_start.ipynb notebook allows full regeneration of these results.
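
The two notebook changes described under Customization & Other Models (model ID and layer count) can be kept in one place so that they stay in sync. The following is a minimal sketch rather than the notebook's actual code: `MODEL_LAYERS` and `layer_indices` are hypothetical names, the layer counts are the ones stated in this README, and the Llama Hugging Face ID is an assumption about which checkpoint is meant. Whether the notebook counts layers from 0 or also includes the embedding output is not specified here, so adjust the range accordingly.

```python
# Hypothetical helper (not part of the artifact): keeps the Hugging Face
# model ID and its layer count together, mirroring the two variables the
# README asks you to change in quick_start.ipynb.

# Layer counts as stated in this README; the Llama ID is an assumption.
MODEL_LAYERS = {
    "Qwen/Qwen3-4B": 36,
    "meta-llama/Llama-3.1-8B-Instruct": 32,
}


def layer_indices(model_id: str) -> range:
    """Return the 0-based layer indices to loop over for `model_id`."""
    try:
        n_layers = MODEL_LAYERS[model_id]
    except KeyError:
        known = ", ".join(sorted(MODEL_LAYERS))
        raise ValueError(
            f"Unknown model {model_id!r}; add its layer count first. Known: {known}"
        ) from None
    return range(n_layers)


if __name__ == "__main__":
    for model_id in MODEL_LAYERS:
        layers = layer_indices(model_id)
        print(f"{model_id}: layers {layers.start}..{layers.stop - 1}")
```

Raising an error for unlisted models, instead of falling back to a default, avoids silently probing the wrong number of layers when switching models.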

Related Organizations

Microsoft Research Asia (China)
China (People's Republic of)
Shanghai Jiao Tong University
China (People's Republic of)
