Medical LLM Metacognition Is Multidimensional: A MetaMedQA Reanalysis of Confidence, Missing-Answer Recognition, and Unknown-Answer Detection

Recent work using MetaMedQA argued that large language models (LLMs) lack essential metacognition for reliable medical reasoning. However, metacognition is not a single construct: confidence–correctness discrimination, missing-answer recognition, unknown-answer detection, and abstention behavior may dissociate. Here, we reanalyzed MetaMedQA using a confidence-centered evaluation framework previously developed for a controlled clinical-evidence benchmark. Two GPT-family models, gpt-4.1-nano and gpt-5.5, were evaluated on 1373 MetaMedQA items using structured outputs containing an answer, numerical confidence, and a more-information-needed judgment. gpt-4.1-nano achieved 56.4% accuracy, mean confidence of 79.7%, Brier score of 0.318, expected calibration error of 0.276, and AUROC2 of 0.582. Missing-answer recall was 19.1%, and unknown/unanswerable recall was 25.9%. gpt-5.5 improved substantially, achieving 84.9% accuracy, mean confidence of 91.2%, Brier score of 0.112, expected calibration error of 0.062, and AUROC2 of 0.819. Missing-answer recall increased to 67.8%, and unknown/unanswerable recall to 56.2%. Nevertheless, incorrect responses from gpt-5.5 still received high mean confidence. These results suggest that medical-LLM metacognition is better understood as a set of dissociable behavioral capacities rather than as a single absent-or-present property. Stronger models can show improved confidence–correctness discrimination and calibration, while still retaining clinically relevant failures in missing-answer and unknown-answer recognition.
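The confidence metrics reported above (Brier score, expected calibration error, and AUROC2) can all be computed from per-item pairs of stated confidence and correctness. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: confidences are rescaled to [0, 1], correctness is coded 0/1, ECE uses ten equal-width bins (the abstract does not state the binning), and AUROC2 is taken as the probability that a correct response receives higher confidence than an incorrect one. All function names are hypothetical.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between confidence and accuracy per bin.

    Assumption: equal-width bins over [0, 1]; the source does not say
    which binning scheme was used.
    """
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_conf - accuracy)
    return ece


def auroc2(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Type-2 AUROC: how well confidence separates correct from incorrect answers.

    Computed by pairwise comparison, with ties counted as half a win.
    """
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data for illustration only; these are not MetaMedQA results.
toy_conf = [0.9, 0.8, 0.7, 0.6]
toy_correct = [1, 1, 0, 1]
print(brier_score(toy_conf, toy_correct))
print(expected_calibration_error(toy_conf, toy_correct))
print(auroc2(toy_conf, toy_correct))
```

The pairwise AUROC2 is quadratic in the number of items, which is adequate for a benchmark of this size; a rank-based implementation would give the same value more efficiently.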

Fields of Science

social sciences

psychology and cognitive sciences
