Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Keywords: Machine Learning, FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Artificial Intelligence, Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)

Shu, Dong; Wu, Xuansheng; Zhao, Haiyan; Du, Mengnan; Liu, Ninghao


Preprint . 2025

Data sources: arXiv.org e-Print Archive

https://doi.org/10.18653/v1/20...

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2025

License: CC BY

Data sources: Datacite


Publication: Article, Preprint
Date: 01 Jan 2025
Embargo end date: 01 Jan 2025
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing


doi: 10.18653/v1/2025.emnlp-main.87, 10.48550/arxiv.2505.08080

arXiv: 2505.08080


Abstract

Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
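The abstract does not spell out how the gradient information is combined with the activations. As a minimal sketch, assuming one common attribution heuristic (each latent's activation multiplied by the gradient of a scalar output with respect to that latent), the toy example below shows how a strongly activated latent can still rank low in influence. All weights, dimensions, and function names are hypothetical, and the linear readout is an assumption chosen so the gradient can be written by hand; this is not claimed to be the GradSAE formulation.

```python
# Toy illustration only: every weight, size, and name here is hypothetical,
# and activation-times-gradient is an assumed scoring rule, not the paper's.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def encode(x, w_enc, b_enc):
    """SAE encoder: z = ReLU(W_enc x + b_enc)."""
    pre = matvec(w_enc, x)
    return relu([p + b for p, b in zip(pre, b_enc)])

def output_gradient(w_dec, w_out):
    """Gradient of the scalar output y = w_out . (W_dec z) with respect to z.

    The toy readout is linear, so dy/dz_j = sum_i w_out[i] * W_dec[i][j].
    """
    n_latents = len(w_dec[0])
    return [sum(w_out[i] * w_dec[i][j] for i in range(len(w_dec)))
            for j in range(n_latents)]

def influence_scores(z, grad):
    """Assumed score: activation times output-side gradient, per latent."""
    return [a * g for a, g in zip(z, grad)]

def top_k(scores, k):
    """Indices of the k latents with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return order[:k]

# Hypothetical 2-dimensional input and 3-latent SAE.
x = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 latents x 2 inputs
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 0.1, 0.0]]      # 2 outputs x 3 latents
w_out = [1.0, 1.0]                              # linear readout to a scalar

z = encode(x, w_enc, b_enc)           # input-side activations
grad = output_gradient(w_dec, w_out)  # output-side gradient per latent
scores = influence_scores(z, grad)
ranked = top_k(scores, 2)
```

In this toy setup latent 1 has the largest activation, yet latent 0 receives the largest score because the decoder gives latent 1 only a small weight on the output. Ranking by activation alone would pick a different latent than ranking by the assumed gradient-weighted score.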

EMNLP 2025 Main

