
This technical disclosure describes Fathom Monitor, a system and method for detecting hallucination-risk tokens in large language model (LLM) outputs at generation time, using a mechanistic signal derived from the geometric structure of sparse autoencoder (SAE) feature activations. The core innovation is the use of C_delta, the divergence between late-layer and early-layer feature coherence, as a per-token hallucination indicator. When C_delta exceeds a calibrated threshold at a given token position, that token is flagged as uncertain or high-risk and annotated inline. In empirical validation on TruthfulQA (n=50, Gemma-2-2B), C_delta discriminates hallucination with p=0.040 and Cohen's d=0.407, whereas depth (K) is blind to hallucination (p=0.931). This document constitutes a public technical disclosure establishing prior art. Related provisional patents: US 64/020,489 (March 29, 2026) and US 64/021,113 (March 30, 2026). It builds on Zenodo records doi:10.5281/zenodo.19326175 and doi:10.5281/zenodo.19364702.
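The flagging step described above can be illustrated with a minimal sketch. It assumes that per-token early-layer and late-layer coherence scores have already been computed from SAE feature activations; how coherence itself is computed is not specified here, so the sketch takes those scores as inputs. It also assumes that the divergence is a simple difference (late minus early). The names `TokenScore`, `score_tokens`, and `annotate`, the `[?]` marker, the threshold value, and the example numbers are all hypothetical and chosen for illustration only; this is not the Fathom Monitor implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TokenScore:
    token: str
    c_delta: float   # assumed here to be late-layer minus early-layer coherence
    flagged: bool    # True when c_delta exceeds the threshold


def score_tokens(
    tokens: Sequence[str],
    early_coherence: Sequence[float],
    late_coherence: Sequence[float],
    threshold: float,
) -> List[TokenScore]:
    """Compute a per-token C_delta and flag tokens above the threshold.

    The coherence inputs are assumed to be precomputed, one value per token.
    """
    if not (len(tokens) == len(early_coherence) == len(late_coherence)):
        raise ValueError("tokens and coherence sequences must have equal length")
    scores = []
    for tok, early, late in zip(tokens, early_coherence, late_coherence):
        c_delta = late - early  # assumption: divergence as a plain difference
        scores.append(TokenScore(tok, c_delta, c_delta > threshold))
    return scores


def annotate(scores: Sequence[TokenScore], marker: str = "[?]") -> str:
    """Join tokens with spaces, appending the marker to each flagged token."""
    return " ".join(s.token + marker if s.flagged else s.token for s in scores)


# Illustrative values only; they are not measurements from any model.
tokens = ["The", "capital", "is", "Quito"]
early = [0.50, 0.55, 0.52, 0.40]
late = [0.52, 0.58, 0.50, 0.75]

scores = score_tokens(tokens, early, late, threshold=0.2)
annotated = annotate(scores)
print(annotated)  # The capital is Quito[?]
```

In practice the threshold would be calibrated on labeled data rather than fixed by hand, and the inline annotation format would depend on the consuming application.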
Patent pending: US 64/020,489 and US 64/021,113. This disclosure establishes prior art for the Fathom Monitor system as of April 2, 2026 under 35 U.S.C. § 102 (AIA). Provisional patent application to be filed within 12 months.
mechanistic interpretability, Fathom, sparse autoencoder, C_delta, large language models, hallucination detection, TruthfulQA, coherence, per-token uncertainty
