Artificial Intelligence Large Language Models for Pulmonary Nodule Surgical Decision-Making: A Comparative Accuracy Study

BACKGROUND Artificial intelligence (AI) large language models show promise in medical decision-making, but their reliability in determining surgical indications for pulmonary nodules remains unexplored. We evaluated the diagnostic accuracy and consistency of three leading AI models compared with expert thoracic surgeon consensus.

METHODS This cross-sectional diagnostic accuracy study evaluated ChatGPT-4, Claude 3.5 Sonnet, and Google Gemini Pro using 45 standardized clinical vignettes representing diverse pulmonary nodule presentations. Six thoracic surgeons, each with ≥5 years of experience, independently reviewed all vignettes to establish consensus. Each AI model was tested three times per vignette to assess test-retest reliability. The primary outcome was overall diagnostic accuracy; secondary outcomes included inter-model agreement and performance across nodule categories and complexity levels.

RESULTS The expert panel achieved 91.4% mean inter-rater agreement (range: 60-100%), with unanimous consensus in 46.7% of cases. Overall AI-expert agreement was 82.2% (95% CI: 71.1-93.4%). Claude and Gemini both achieved 82.2% accuracy with perfect test-retest reliability (100% consistency across three trials), while GPT-4 achieved 80.0% accuracy with 86.8% consistency. Inter-model agreement was highest between Claude and Gemini (100%), versus 62.2% between GPT-4 and either of the other two models. Performance varied significantly by nodule category: 100% agreement in complex scenarios (mixed pattern, multiple nodules, high-risk comorbidities, post-treatment) versus 20% in intermediate-sized solid nodules (21-30 mm).

CONCLUSIONS Leading AI large language models demonstrate substantial agreement with expert consensus in pulmonary nodule management, with Claude and Gemini showing superior consistency. However, performance varies markedly by clinical context, particularly for intermediate-sized solid nodules, where guideline ambiguity is greatest. Current AI capabilities may complement but cannot replace expert thoracic surgical judgment.

KEY WORDS: Artificial intelligence; large language models; pulmonary nodule; surgical indication; diagnostic accuracy
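
The reported 95% confidence interval can be checked with a short calculation. The abstract does not state how the interval was computed, so the sketch below assumes a normal-approximation (Wald) interval and assumes that 82.2% corresponds to 37 of the 45 vignettes; both are inferences from the reported figures, not details confirmed by the study. The function names are illustrative only.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval.

    Assumption: the study's interval method is not stated in the abstract;
    Wald is used here only because it is the simplest common choice.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because the Wald interval can overshoot the bounds.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def percent_agreement(labels_a, labels_b):
    """Share of items on which two raters (or models) give the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


if __name__ == "__main__":
    # Assumed count: 37 agreeing vignettes out of 45 (37/45 is about 82.2%).
    p, low, high = wald_interval(37, 45)
    print(f"agreement {p:.1%}, 95% CI {low:.1%} to {high:.1%}")

    # Hypothetical labels for illustration; not data from the study.
    model_a = ["surgery", "follow-up", "surgery", "follow-up"]
    model_b = ["surgery", "follow-up", "follow-up", "follow-up"]
    print(f"pairwise agreement {percent_agreement(model_a, model_b):.0%}")
```

Under these assumptions the interval comes out at roughly 71.1% to 93.4%, which matches the reported figures. With only 45 vignettes, a Wilson or exact interval would give somewhat different bounds, so the match suggests but does not confirm the method used.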

