Fathom Monitor: Per-Token Hallucination Detection via Coherence Divergence in Sparse Autoencoder Feature Space

Rodabaugh, Alexander

ZENODO

Other literature type . 2026

License: CC BY

Data sources: ZENODO

Data sources: Datacite

Publication . Other literature type . 02 Apr 2026 . Publisher: Zenodo

Authors: Rodabaugh, Alexander

doi: 10.5281/zenodo.19382453

Abstract

This technical disclosure describes Fathom Monitor, a system and method for detecting hallucination-risk tokens in large language model (LLM) outputs at generation time, using a mechanistic signal derived from the geometric structure of sparse autoencoder (SAE) feature activations. The core innovation is the use of C_delta, the divergence between late-layer and early-layer feature coherence, as a per-token hallucination indicator. When C_delta exceeds a calibrated threshold at a given token position, that token is flagged as uncertain or high-risk and annotated inline. In empirical validation on TruthfulQA (n=50, Gemma-2-2B), C_delta discriminates hallucination (p=0.040, Cohen's d=0.407), whereas depth (K) does not (p=0.931). This document constitutes a public technical disclosure establishing prior art. Related provisional patents: US 64/020,489 (March 29, 2026) and US 64/021,113 (March 30, 2026). It builds on Zenodo records doi:10.5281/zenodo.19326175 and doi:10.5281/zenodo.19364702.

Patent pending: US 64/020,489 and US 64/021,113. This disclosure establishes prior art for the Fathom Monitor system as of April 2, 2026 under 35 U.S.C. § 102 (AIA). Provisional patent application to be filed within 12 months.
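The flagging rule described in the abstract can be illustrated with a minimal sketch. This record does not specify how feature coherence is computed from SAE activations, so the sketch assumes per-token coherence scores for an early and a late layer are already available as plain lists. The sign convention (late minus early), the function names `c_delta`, `flag_tokens`, `annotate`, and `cohens_d`, the `[?]` marker, and all numeric values are illustrative assumptions, not the author's implementation or data. The `cohens_d` helper uses the standard pooled-standard-deviation form of the effect size named in the abstract.

```python
from math import sqrt
from statistics import mean, stdev


def c_delta(early_coherence, late_coherence):
    """Per-token divergence, assumed here to be late minus early coherence."""
    if len(early_coherence) != len(late_coherence):
        raise ValueError("need one coherence score per token at both depths")
    return [late - early for early, late in zip(early_coherence, late_coherence)]


def flag_tokens(tokens, early_coherence, late_coherence, threshold):
    """Return (token, c_delta, flagged) triples; flagged when c_delta > threshold."""
    deltas = c_delta(early_coherence, late_coherence)
    return [(tok, d, d > threshold) for tok, d in zip(tokens, deltas)]


def annotate(flagged, marker="[?]"):
    """Render tokens inline, appending a marker to each flagged token."""
    return " ".join(tok + marker if hit else tok for tok, _, hit in flagged)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled = sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled


# Toy inputs (made up for illustration; not taken from the TruthfulQA validation).
tokens = ["The", "capital", "is", "Atlantis"]
early = [0.5, 0.5, 0.5, 0.25]   # hypothetical early-layer coherence per token
late = [0.5, 0.75, 0.5, 1.0]    # hypothetical late-layer coherence per token

result = flag_tokens(tokens, early, late, threshold=0.5)
print(annotate(result))  # prints: The capital is Atlantis[?]
```

In practice the threshold would be calibrated on labeled data, as the abstract indicates; the fixed value of 0.5 above is only a placeholder.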

Keywords

mechanistic interpretability, Fathom, sparse autoencoder, C_delta, large language models, hallucination detection, TruthfulQA, coherence, per-token uncertainty
