Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

Medical LLM Metacognition Is Multidimensional: A MetaMedQA Reanalysis of Confidence, Missing-Answer Recognition, and Unknown-Answer Detection

Authors: Nazzal, Ahmad

Abstract

Recent work using MetaMedQA argued that large language models (LLMs) lack essential metacognition for reliable medical reasoning. However, metacognition is not a single construct: confidence–correctness discrimination, missing-answer recognition, unknown-answer detection, and abstention behavior may dissociate. Here, we reanalyzed MetaMedQA using a confidence-centered evaluation framework previously developed for a controlled clinical-evidence benchmark. Two GPT-family models, gpt-4.1-nano and gpt-5.5, were evaluated on 1373 MetaMedQA items using structured outputs containing an answer, numerical confidence, and a more-information-needed judgment. gpt-4.1-nano achieved 56.4% accuracy, mean confidence of 79.7%, Brier score of 0.318, expected calibration error of 0.276, and AUROC2 of 0.582. Missing-answer recall was 19.1%, and unknown/unanswerable recall was 25.9%. gpt-5.5 improved substantially, achieving 84.9% accuracy, mean confidence of 91.2%, Brier score of 0.112, expected calibration error of 0.062, and AUROC2 of 0.819. Missing-answer recall increased to 67.8%, and unknown/unanswerable recall to 56.2%. Nevertheless, incorrect responses from gpt-5.5 still received high mean confidence. These results suggest that medical-LLM metacognition is better understood as a set of dissociable behavioral capacities rather than as a single absent-or-present property. Stronger models can show improved confidence–correctness discrimination and calibration, while still retaining clinically relevant failures in missing-answer and unknown-answer recognition.
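For readers unfamiliar with the confidence metrics named in the abstract, the following is a minimal sketch of how a Brier score, an expected calibration error, and a confidence-based AUROC are commonly computed from per-item confidence and correctness. It is not the author's code: the function names, the 0-to-1 confidence scale, the ten equal-width bins, and the tie handling are all assumptions made for illustration, and the toy inputs are not data from the study.

```python
from typing import List, Sequence, Tuple


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1)."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE (assumed variant): size-weighted mean of
    |accuracy - mean confidence| over non-empty bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / len(conf)) * abs(accuracy - mean_conf)
    return ece


def confidence_auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer carries higher confidence than an
    incorrect one, with ties counted as 0.5 (pairwise AUROC definition)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy items, not results from the preprint.
    toy_conf = [0.95, 0.85, 0.75, 0.65]
    toy_correct = [1, 0, 1, 0]
    print(brier_score(toy_conf, toy_correct))
    print(expected_calibration_error(toy_conf, toy_correct))
    print(confidence_auroc(toy_conf, toy_correct))
```

Under these definitions, lower Brier score and lower calibration error indicate better-calibrated confidence, while an AUROC near 0.5 means confidence barely separates correct from incorrect answers. Reported values depend on choices such as the number of bins, so the sketch should not be expected to reproduce the figures above exactly.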
