Epistemic Twins: Enabling a Symbolic Science of Language Model Knowledge

Large Language Models (LLMs) are impactful yet opaque artifacts. At their core, they are subsymbolic constructs defined by billions of numeric weights that interact in a largely inscrutable manner. Current analysis paradigms are either black-box benchmarks that test model performance on pre-defined tasks, or mechanistic interpretability approaches that trace back outputs to specific weights.Both analysis methods are limited by the experimenter's hypothesis space - one must know what to look for to find it. In this perspective, we argue for a third, radically different analysis paradigm: Epistemic Twins. We propose constructing large-scale symbolic approximations of LLMs in human-readable formats. This enables the comprehensive materialization of factual knowledge (or beliefs) inherent in the model without predefining hypotheses, facilitating large-scale analysis and auditing towards better understanding and explainability.

Found an issue? Give us feedback