ZENODO
Article . 2026
License: CC BY
Data sources: Datacite

Artificial Intelligence Large Language Models for Pulmonary Nodule Surgical Decision-Making: A Comparative Accuracy Study

Authors: ÇAVUŞOĞLU YALÇIN, NİLAY;

Abstract

BACKGROUND: Artificial intelligence (AI) large language models show promise in medical decision-making, but their reliability in determining surgical indications for pulmonary nodules remains unexplored. We evaluated the diagnostic accuracy and consistency of three leading AI models compared with expert thoracic surgeon consensus.

METHODS: This cross-sectional diagnostic accuracy study evaluated ChatGPT-4, Claude 3.5 Sonnet, and Google Gemini Pro using 45 standardized clinical vignettes representing diverse pulmonary nodule presentations. Six thoracic surgeons with ≥5 years of experience independently reviewed all vignettes to establish consensus. Each AI model was tested three times per vignette to assess test-retest reliability. The primary outcome was overall diagnostic accuracy; secondary outcomes included inter-model agreement and performance across nodule categories and complexity levels.

RESULTS: The expert panel achieved 91.4% mean inter-rater agreement (range: 60-100%), with unanimous consensus in 46.7% of cases. Overall AI-expert agreement was 82.2% (95% CI: 71.1-93.4%). Claude and Gemini both achieved 82.2% accuracy with perfect test-retest reliability (100% consistency across three trials), while GPT-4 demonstrated 80.0% accuracy with 86.8% consistency. Inter-model agreement was highest between Claude and Gemini (100%), versus 62.2% for GPT-4 comparisons with either model. Performance varied significantly by nodule category: 100% agreement in complex scenarios (mixed pattern, multiple nodules, high-risk comorbidities, post-treatment) versus 20% in intermediate-sized solid nodules (21-30 mm).

CONCLUSIONS: Leading AI large language models demonstrate substantial agreement with expert consensus in pulmonary nodule management, with Claude and Gemini showing superior consistency. However, performance varies markedly by clinical context, particularly for intermediate-sized solid nodules, where guideline ambiguity is greatest. Current AI capabilities may complement but cannot replace expert thoracic surgical judgment.

KEY WORDS: Artificial intelligence; large language models; pulmonary nodule; surgical indication; diagnostic accuracy
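As a reading aid, the headline figure can be checked with simple arithmetic. The abstract does not state how the confidence interval was computed; the sketch below assumes a normal-approximation (Wald) interval and assumes that 82.2% corresponds to 37 of the 45 vignettes. Both are inferences from the reported numbers, and the function name `wald_ci` is illustrative.

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (proportion, lower, upper) for a normal-approximation interval.

    Assumption: the abstract does not name its interval method; the Wald
    formula p +/- z * sqrt(p * (1 - p) / n) is used here only because it
    reproduces the reported bounds.
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# Assumed counts: 37 of 45 vignettes in agreement with expert consensus.
p, lower, upper = wald_ci(37, 45)
print(f"{p:.1%} (95% CI: {lower:.1%}-{upper:.1%})")
# prints "82.2% (95% CI: 71.1%-93.4%)"
```

With only 45 vignettes the interval is about 22 percentage points wide. On the same assumption about counts, the 80.0% reported for GPT-4 would correspond to 36 of 45 vignettes, which falls well inside that interval.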

Keywords

Artificial intelligence, large language models, pulmonary nodule, surgical indication, diagnostic accuracy

  • Impact indicators (by BIP!)
    Selected citations: 0. These citations are derived from selected sources; this is an alternative to the "Influence" indicator.
    Popularity: Average. Reflects the "current" impact or attention of the article in the research community at large, based on the underlying citation network.
    Influence: Average. Reflects the overall impact of the article in the research community at large, based on the underlying citation network (diachronically).
    Impulse: Average. Reflects the initial momentum of the article directly after its publication, based on the underlying citation network.