STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems

Authors: Fan, Zehao;


Abstract

# STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems

This repository provides a complete workflow to reproduce the key results of **STARC**, including:

1. The implementation of STARC’s selective token access with **KV remapping** and **online clustering**.
2. Evaluation scripts to reproduce:
   - accuracy results on **LongBench** and **RULER**, and
   - perplexity results on **PG-19**.
3. The simulator setup to reproduce **system-level performance/energy** results on GPU–PIM platforms, based on the **AttAcc** simulator (Ramulator-based).

---

## Contents

- [What’s in this artifact](#whats-in-this-artifact)
- [Requirements](#requirements)
  - [Hardware](#hardware)
  - [Software](#software)
  - [Resource estimate](#resource-estimate)
- [Getting started](#getting-started)
  - [1) Get the code](#1-get-the-code)
  - [2) Create the environment](#2-create-the-environment)
  - [3) Set up the PIM system simulator](#3-set-up-the-pim-system-simulator)
  - [4) Build Ramulator2](#4-build-ramulator2)
- [Reproducing paper results](#reproducing-paper-results)
  - [E1: LongBench accuracy](#e1-longbench-accuracy)
  - [E2: PG-19 perplexity](#e2-pg-19-perplexity)
  - [E3: RULER (32K context)](#e3-ruler-32k-context)
  - [E4: GPU–PIM system simulation](#e4-gpupim-system-simulation)
- [Outputs](#outputs)

---

## What’s in this artifact

- **Algorithm:** The STARC algorithm, which enables efficient long-context LLM inference by selectively accessing and remapping KV cache entries via online clustering under a fixed KV-cache budget (a conceptual sketch follows this list).
- **Program:** The STARC artifact running public long-context benchmarks: LongBench (16 datasets) and RULER (13 datasets).
- **Models:** LongChat-7B v1.5-32K; LLaMA-3.1-8B-Instruct; Mistral-7B-Instruct-v0.3 (publicly available via Hugging Face).
- **Datasets:** LongBench (16 datasets; e.g., HotpotQA, QASPER, GovReport), PG-19, and RULER (13 datasets; e.g., NIAH Single, Multi-key NIAH, Multi-value NIAH), all publicly available via Hugging Face.
- **Metrics:** LongBench task scores; PG-19 perplexity; RULER task scores; and system metrics such as latency and energy.
- **Outputs:** LongBench/RULER scores, PG-19 perplexity traces, and system-level performance/energy metrics with breakdowns.
- **Availability:** Publicly available.
- **License:** MIT license.
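For orientation, below is a minimal conceptual sketch of clustering-based selective token access under a fixed budget, using NumPy and scikit-learn (both installed during environment setup). Everything here, including the function name `select_tokens`, the use of batch `KMeans`, and the centroid-similarity scoring, is an illustrative assumption rather than the artifact’s actual implementation:

```python
# Illustrative sketch only; NOT the STARC implementation. Cached keys are
# grouped by clustering; at each decoding step, clusters are ranked by
# centroid-query similarity and tokens are fetched from the best clusters
# until the fixed KV-cache budget is exhausted.
import numpy as np
from sklearn.cluster import KMeans

def select_tokens(keys, query, n_clusters=32, budget=256):
    """Return indices of at most `budget` cached tokens to attend to."""
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(keys)
    # Rank clusters by how well their centroids match the current query.
    ranked = np.argsort(km.cluster_centers_ @ query)[::-1]
    picked = []
    for c in ranked:
        members = np.flatnonzero(km.labels_ == c)
        picked.extend(members[: budget - len(picked)].tolist())
        if len(picked) >= budget:
            break
    return np.asarray(picked)

# Example: 4096 cached key vectors of dimension 128, 256-token budget.
keys = np.random.randn(4096, 128).astype(np.float32)
query = np.random.randn(128).astype(np.float32)
print(select_tokens(keys, query).shape)  # (256,)
```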
---

## Requirements

### Hardware

- **LLM accuracy evaluation (LongBench / PG-19 / RULER):** Compatible with commonly used NVIDIA GPUs. We recommend NVIDIA **H100** or **L40** GPUs with sufficient memory (at least **48 GB per GPU**).
- **System-level simulation:** CPU-only execution is sufficient. Experiments in the paper were conducted on a dual-socket AMD EPYC 9334 system with 64 CPU cores in total (2×32 cores).

### Software

- **Python:** 3.10
- **CUDA:** 12.8
- **Python dependencies:** see `pyproject.toml`

### Resource estimate

- **Disk space:** ~80 GB total
- **Setup time:** ~20 minutes
- **Experiment time:**
  - Model accuracy experiments: ~12 hours (excluding additional appendix results)
  - System-level performance experiments: ~24 hours

---

## Getting started

### 1) Get the code

```bash
git clone --recurse-submodules https://github.com/EPIC-RPI/STARC
cd STARC
```

### 2) Create the environment

To reproduce the results reliably and avoid dependency conflicts, we recommend Python 3.10 and CUDA 12.8, and we provide scripts for the recommended environment setup. Follow the instructions below to create the conda environment and install the STARC packages:

```bash
conda create -yn STARC python=3.10
conda activate STARC
pip install ninja==1.11.1.1 packaging
pip install -e .
pip install flash-attn==2.3.0 --no-build-isolation
conda install -c conda-forge cupy
conda install numpy scikit-learn
conda install cmake
```

### 3) Set up the PIM system simulator

This artifact builds on the AttAcc simulator:

```bash
cd simulator_starc
git submodule update --init --recursive
```

### 4) Build Ramulator2

```bash
bash set_pim_ramulator.sh
cd ramulator2
mkdir build
cd build
cmake .. -DCMAKE_POLICY_VERSION_MINIMUM=3.5
make -j
cp ramulator2 ../ramulator2
cd ../../
```

---

## Reproducing paper results

This section describes how to reproduce the key results reported in the paper.

### E1: LongBench accuracy

To reproduce the LongBench accuracy results:

```bash
cd /STARC/scripts/
sh longbench.sh
```

To evaluate additional models, the corresponding model paths are defined in:

```text
STARC/evaluation/LongBench/config/model2path.json
```

By replacing the model name in `longbench.sh`, you can evaluate STARC under the different models reported in the paper.

### E2: PG-19 perplexity

To reproduce the perplexity results on PG-19:

```bash
cd /STARC/scripts/
sh ppl_eval.sh
```

### E3: RULER (32K context)

The RULER testing data for **LLaMA-3.1-8B-Instruct** are already included in the `STARC/ruler` directory. To reproduce the RULER results under a 32K context length:

```bash
cd /STARC/scripts/
sh RULER.sh
```

### E4: GPU–PIM system simulation

The system-level simulation experiments use the AttAcc-based simulator.

#### Full attention

To reproduce the results for full attention:

```bash
python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
  --lin 2048 --lout 32000 --batch 16 --pim bank \
  --powerlimit --ffopt --pipeopt
```

#### Sparse attention configurations

To reproduce the results for configurations with sparse attention methods:

```bash
python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
  --lin 2048 --lout 32000 --batch 16 --pim bank \
  --powerlimit --ffopt --pipeopt \
  --sparsity --kv_budget_table kv_budget_Mistral_STARC.txt
```

Different sparse attention methods and models use different `.txt` files, specified via the `--kv_budget_table` option. These files are derived from the attention masks produced by each method at each decoding step in real inference tasks (e.g., LongBench), mapped to the row-level granularity of the PIM architecture, where each DRAM row activation fetches 16 key/value vectors in parallel. They define how many memory rows are activated at each decoding step and guide the simulator accordingly.
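To make the row-level granularity concrete, here is a hypothetical illustration (the function and constant names are ours, and the actual `kv_budget_*.txt` format may differ): with 16 key/value vectors fetched per row activation, a per-step token budget of `B` corresponds to `ceil(B / 16)` activated rows.

```python
# Hypothetical illustration of the row-level mapping described above;
# the actual kv_budget_*.txt format may differ.
import math

VECTORS_PER_ROW = 16  # one DRAM row activation fetches 16 K/V vectors

def rows_for_budget(kv_budget):
    """Number of row activations needed to fetch `kv_budget` tokens."""
    return math.ceil(kv_budget / VECTORS_PER_ROW)

print(rows_for_budget(256))  # 16
print(rows_for_budget(250))  # 16: a partially used row is still activated
```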
> **Note 1:** Always start simulations from the **maximum** context length (e.g., 32K). For all methods **except STARC**, the simulator cache (`ramulator.out`) generated for the longest context length can be reused for shorter context lengths; in that case, the simulator reads the cached results directly without rerunning the simulation.

> **Note 2:** When switching the evaluated method (Full attention / STARC / SparQ / Quest), delete the previously generated `ramulator.out`; otherwise, cached results from the last run may be reused.

> **Note 3:** When reproducing methods other than STARC, comment out the following two lines in `STARC/simulator_starc/src/ramulator_wrapper.py` to avoid introducing clustering overhead:
>
> ```python
> if l == l_target - 1:
>     trace_args += " --add_cluster"
> ```

> **Note 4:** When reproducing **STARC** results, the artifact does not yet report the clustering overhead separately. To isolate the clustering overhead, follow the steps below:
>
> 1. Keep the code in Note 3 **enabled**:
>    ```python
>    if l == l_target - 1:
>        trace_args += " --add_cluster"
>    ```
>    and run the corresponding command to obtain the simulation result.
> 2. Delete the generated `ramulator.out`.
> 3. Comment out the above code and rerun the **same** command.
> 4. The difference between the two results corresponds to the clustering overhead.
>
> We apologize for this inconvenience and will update the code to report the clustering overhead directly in a future release.

---

## Outputs

- **Model accuracy experiments:** For **LongBench**, the evaluation generates a `.jsonl` file for each model and each task, containing the ground-truth answers and model predictions. The final results are summarized in `result.json`. For **RULER**, evaluation results are printed directly to the terminal.
- **PG-19 perplexity:** A `log_PG19.txt` file records the evolution of perplexity during evaluation.
- **Simulation experiments:** The simulator produces an `output.csv` file that records the breakdown of end-to-end latency and energy consumption. An example of the format is shown below.

| model | dtype | xpu | cap | bw | sys_opb | hw | cores | pipe_level | is_parallel | power_constraint | gqa_size | Lin | Lout | bs | required_cap | s_flops | g_flops | s_time | s_matmul | s_fc | s_comm | s_softmax | s_act | s_lnorm | g_time (ms) | g_matmul | g_fc | g_comm | g_etc | g_qkv_time | g_prj_time | g_ff_time | g2g_comm | c2g_comm | g_softmax | g_act | g_lnorm | g_energy (nJ) | g_dram_energy | g_l2_energy | g_l1_energy | g_reg_energy | g_alu_energy | g_fc_mem_energy | g_fc_comp_energy | g_attn_mem_energy | g_attn_comp_energy | g_etc_mem_energy | g_etc_comp_energy | g_comm_energy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B | W16A16 | GPU | 1280 | 9 | 295.1670644 | BA | 8 | TRUE | TRUE | TRUE | 0 | 2048 | 16000 | 16 | 1.63247E+11 | 26440397457 | 2.74317E+11 | 137.0683392 | 27.98165465 | 62.93091672 | 11.61364915 | 24.87258191 | 2.506916933 | 7.162619808 | 0.916930348 | 0.075737878 | 0.404029795 | 0.346612651 | 0.090550025 | 0.137451554 | 0.030404956 | 0.236173285 | 0.344865024 | 0.001747627 | 0 | 0.033255129 | 0.057294896 | 551880956.3 | 412383134.3 | 51675503 | 30652644.72 | 19149650.66 | 37147608.35 | 344396314.7 | 130849869.7 | 61361497.02 | 6874799.8 | 6625322.598 | 900737.2698 | 872415.232 |

Since we focus **only on the decoding stage**, the following columns are used for analysis:

- **Latency breakdown:** `g_matmul`, `g_fc`, `g_comm`, `g_etc`, `g_time (ms)`
- **Energy breakdown:** `g_fc_mem_energy`, `g_fc_comp_energy`, `g_attn_mem_energy`, `g_attn_comp_energy`, `g_etc_mem_energy`, `g_etc_comp_energy`, `g_comm_energy`, `g_energy (nJ)`

Energy consumption is further decomposed into **compute (comp)** and **memory (mem)** components.
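As a convenience, the decoding-stage columns above can be pulled out of `output.csv` with a few lines of pandas. This is a sketch, assuming pandas is available in your environment; the column names are taken from the example row above:

```python
# Sketch: extract the decoding-stage latency/energy breakdown from output.csv.
# Assumes pandas is installed; column names follow the example table above.
import pandas as pd

df = pd.read_csv("output.csv")
row = df.iloc[0]

print("decoding latency (ms):", row["g_time (ms)"])
for col in ["g_matmul", "g_fc", "g_comm", "g_etc"]:
    print(f"  {col}: {row[col]}")

print("decoding energy (nJ):", row["g_energy (nJ)"])
for col in ["g_fc_mem_energy", "g_fc_comp_energy",
            "g_attn_mem_energy", "g_attn_comp_energy",
            "g_etc_mem_energy", "g_etc_comp_energy", "g_comm_energy"]:
    print(f"  {col}: {row[col]}")
```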
---

## License

This project is released under the **MIT License**.

---

## Acknowledgements

This repository incorporates and builds upon open-source implementations from the following projects. We thank the authors for making their code publicly available.

- **Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference**
  Codebase: https://github.com/mit-han-lab/quest
- **AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference**
  Simulator: https://github.com/scale-snu/attacc_simulator
