Preprint
Data sources: ZENODO

Quantitative Analysis of Hallucination Bias in LLM Counting Tasks and Suppression Effects via Structured Protocol (KIS)

Author: Hasegawa, Hiroyasu


Abstract

This study presents an exploratory quantitative analysis of hallucinations arising when large language models (LLMs) count items in large volumes of unstructured text data, and examines the suppression effects of the Knowledge Innovation System (KIS), a proprietary structured protocol. Three models (GPT-5.3 Instant, Gemini 3 Flash, and Claude Sonnet 4.6) were evaluated on a three-label (Yes / No / Pending) text dataset ranging from 200 to 2,000 items under four conditions: Baseline (no protocol), KIS Level 4 / Logic: Strict, Chain-of-Thought (CoT) prompting, and a KIS + CoT hybrid. Results showed that Gemini overcounted the Pending category by +38 items at 1,000 entries under Baseline, exhibiting what we term harmonic hallucination, yet achieved 100% accuracy across all scales with KIS applied. Claude maintained perfect accuracy up to 2,000 items without any protocol. ChatGPT abandoned the task beyond 800 items under Baseline but recovered to 100% accuracy at 1,000 items under the KIS + CoT hybrid. Notably, applying CoT alone to ChatGPT induced distribution fabrication even at 200 items, demonstrating a counter-productive effect. Based on these findings, we propose a three-type taxonomy of LLM hallucination: Confabulation Type (Gemini), Avoidance Type (ChatGPT), and Process-Opaque Type (Claude). We further demonstrate that KIS functions as an external scaffold, structurally separating the counting, verification, and reporting phases via its log: full output, thereby leveling inter-model performance gaps and providing the audit trails required in practical deployments.

Keywords: LLM, Hallucination, Counting Task, Prompt Engineering, KIS, Chain-of-Thought, Model Comparison, Audit Trail
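The evaluation described above compares each model's reported Yes / No / Pending counts against ground-truth labels and treats a run as accurate only when every label's count is exact. A minimal sketch of that scoring step is shown below; KIS itself is proprietary and not reproduced here, and all function and variable names are illustrative, not taken from the paper.

```python
from collections import Counter

# Illustrative sketch (not the authors' code): score a model's reported
# label counts against ground truth. A signed per-label deviation captures
# overcounting, e.g. the +38 Pending overcount reported for Gemini.

LABELS = ("Yes", "No", "Pending")

def count_deviation(ground_truth, reported):
    """Signed per-label error: reported count minus true count."""
    truth = Counter(ground_truth)
    return {label: reported.get(label, 0) - truth[label] for label in LABELS}

def is_exact(deviation):
    """A run counts as 100% accurate only if every deviation is zero."""
    return all(v == 0 for v in deviation.values())

# Hypothetical example: 1,000 items, model shifts 2 items from No to Pending.
truth_labels = ["Yes"] * 500 + ["No"] * 300 + ["Pending"] * 200
reported_counts = {"Yes": 500, "No": 298, "Pending": 202}
dev = count_deviation(truth_labels, reported_counts)
print(dev)            # {'Yes': 0, 'No': -2, 'Pending': 2}
print(is_exact(dev))  # False
```

Scoring on signed deviations rather than a single accuracy number distinguishes the overcounting (confabulation) failure mode from outright task abandonment, matching the taxonomy the abstract proposes.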
