Preprint
Data sources: ZENODO

Quantitative Analysis of Hallucination Bias in LLM Counting Tasks and Suppression Effects via Structured Protocol (KIS)

Author: Hasegawa, Hiroyasu


Abstract

This study presents an exploratory quantitative analysis of hallucinations arising when large language models (LLMs) count items in large volumes of unstructured text data, and examines the suppression effects of the Knowledge Innovation System (KIS), a proprietary structured protocol. Three models (GPT-5.3 Instant, Gemini 3 Flash, and Claude Sonnet 4.6) were evaluated on a three-label (Yes / No / Pending) text dataset ranging from 200 to 2,000 items under four conditions: Baseline (no protocol), KIS Level 4 / Logic: Strict, Chain-of-Thought (CoT) prompting, and a KIS + CoT hybrid. Results showed that Gemini overcounted the Pending category by +38 items at 1,000 entries under Baseline, exhibiting what we term harmonic hallucination, yet achieved 100% accuracy across all scales with KIS applied. Claude maintained perfect accuracy up to 2,000 items without any protocol. ChatGPT abandoned the task beyond 800 items under Baseline but recovered to 100% accuracy at 1,000 items under the KIS + CoT hybrid. Notably, applying CoT alone to ChatGPT induced distribution fabrication even at 200 items, demonstrating a counter-productive effect. Based on these findings, we propose a three-type taxonomy of LLM hallucination: Confabulation Type (Gemini), Avoidance Type (ChatGPT), and Process-Opaque Type (Claude). We further demonstrate that KIS functions as an external scaffold, structurally separating the counting, verification, and reporting phases via its log: full output, thereby leveling inter-model performance gaps and providing the audit trails required in practical deployments.

Keywords: LLM, Hallucination, Counting Task, Prompt Engineering, KIS, Chain-of-Thought, Model Comparison, Audit Trail
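The evaluation described above compares each model's reported Yes / No / Pending counts against ground-truth labels and treats a run as accurate only when every label's count is exact. A minimal sketch of that scoring step is shown below; KIS itself is proprietary and not reproduced here, and all function and variable names are illustrative, not taken from the paper.

```python
from collections import Counter

# Illustrative sketch (not the authors' code): score a model's reported
# label counts against ground truth. A signed per-label deviation captures
# overcounting, e.g. the +38 Pending overcount reported for Gemini.

LABELS = ("Yes", "No", "Pending")

def count_deviation(ground_truth, reported):
    """Signed per-label error: reported count minus true count."""
    truth = Counter(ground_truth)
    return {label: reported.get(label, 0) - truth[label] for label in LABELS}

def is_exact(deviation):
    """A run counts as 100% accurate only if every deviation is zero."""
    return all(v == 0 for v in deviation.values())

# Hypothetical example: 1,000 items, model shifts 2 items from No to Pending.
truth_labels = ["Yes"] * 500 + ["No"] * 300 + ["Pending"] * 200
reported_counts = {"Yes": 500, "No": 298, "Pending": 202}
dev = count_deviation(truth_labels, reported_counts)
print(dev)            # {'Yes': 0, 'No': -2, 'Pending': 2}
print(is_exact(dev))  # False
```

Scoring on signed deviations rather than a single accuracy number distinguishes the overcounting (confabulation) failure mode from outright task abandonment, matching the taxonomy the abstract proposes.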
