https://doi.org/10.21203/rs.3....
Article . 2026 . Peer-reviewed
License: CC BY
Data sources: Crossref, Datacite (ZENODO)
3 versions available

Evaluation of ChatGPT's Performance in Residency Training Progress Exams and Competency Exams in Orthopedics and Traumatology

Authors: DİNÇEL, Yaşar Mahsut; KUTLUAY, Gündüz Ercan; SASANİ, Hadi; ŞİMŞEK, Mehmet Ali; EREM, Murat


Abstract

Background: Artificial intelligence (AI) technologies have rapidly expanded into the field of medical education, offering innovative tools for training and assessment. This study aimed to evaluate the performance of the ChatGPT-3.5 language model on the "Residency Training Progress Examination" (UEGS) and the "Competency Examination" administered by the Turkish Society of Orthopedics and Traumatology (TOTBID). The objective was to determine whether ChatGPT performs comparably to orthopedic residents and whether it can achieve a passing score on the Competency Exam.

Methods: A total of 2,000 UEGS and 1,000 Competency Exam questions (2012–2023, excluding 2020) were presented to ChatGPT-3.5 using standardized prompts designed within the Role–Goals–Context (RGC) framework. The model's responses were statistically compared with those of orthopedic residents and specialists using the Mann–Whitney U and Kruskal–Wallis tests (p < 0.05).

Results: ChatGPT achieved the highest accuracy in the General Orthopedics category (62%) and the lowest in Adult Reconstructive Surgery (40%). It outperformed residents only in the Spine Surgery category (p < 0.05). In the Competency Exams, ChatGPT passed four of ten exams.

Conclusion: ChatGPT-3.5 demonstrated limited reliability and accuracy on orthopedic examinations and should be used cautiously as an educational support tool. Future studies involving newer multimodal versions of large language models may clarify their potential role in medical education and assessment.
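The Methods describe comparing ChatGPT's scores with residents' using the Mann–Whitney U test. As a rough illustration of what that comparison computes, here is a minimal pure-Python sketch of the U statistic; the function name and the score lists are hypothetical, not data from the study, and a real analysis would normally use a statistics package:

```python
# Minimal pure-Python sketch of the Mann-Whitney U statistic, the two-group
# rank test the abstract reports for comparing ChatGPT with residents.
# The score lists below are made-up illustrations, not data from the study.

def mann_whitney_u(a, b):
    """Mann-Whitney U for samples a and b (midranks used for ties)."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1                            # j-i tied values share one midrank
        midrank = (i + 1 + j) / 2.0           # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    n1, n2 = len(a), len(b)
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0             # U for sample a
    return min(u1, n1 * n2 - u1)              # conventional U = min(U1, U2)

# Illustrative per-category accuracies (percent correct), NOT study data:
chatgpt = [62, 40, 55, 48, 51]
residents = [58, 60, 57, 63, 59]
print("U =", mann_whitney_u(chatgpt, residents))
```

To reach a p-value, U would then be compared against the U distribution (or a normal approximation for larger samples); the study's three-way comparison of ChatGPT, residents, and specialists would additionally use the Kruskal–Wallis test, which generalizes this rank-based idea to more than two groups.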

Related Organizations

Impact indicators (provided by BIP!):
  • Selected citations: 0 (citations derived from selected sources; an alternative to the "Influence" indicator)
  • Popularity: Average (the "current" impact/attention, the "hype", of an article in the research community at large, based on the underlying citation network)
  • Influence: Average (the overall/total impact of an article in the research community at large, based on the underlying citation network, diachronically)
  • Impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)