Abd-Alrazaq, A. et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med. Educ. 9, e48291 (2023).
Elzayyat, M., Mohammad, J. N. & Zaqout, S. Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education. Med. Educ. Online. 30 (1), 2554678 (2025).
Ling, Q. et al. Assessing the possibility of using large language models in ocular surface diseases. Int. J. Ophthalmol. 18 (1), 1–8 (2025).
Gotta, J. et al. Large language models (LLMs) in radiology exams for medical students: performance and consequences. Rofo 197 (9), 1057–1067 (2025).
Bahir, D. et al. Gemini AI vs. ChatGPT: a comprehensive examination alongside ophthalmology residents in medical knowledge. Graefes Arch. Clin. Exp. Ophthalmol. 263 (2), 527–536 (2025).
Al-Thani, S. N. et al. Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education. Int. J. Emerg. Med. 18 (1), 146 (2025).
Fattah, F. H. et al. Comparative analysis of ChatGPT and Gemini (Bard) in medical inquiry: a scoping review. Front. Digit. Health. 7, 1482712 (2025).
Wu, X. et al. A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios. BMC Oral Health. 25 (1), 1272 (2025).
Singal, A. & Goyal, S. Comparative evaluation of AI platforms Google Gemini 2.5 Flash, Google Gemini 2.0 Flash, DeepSeek V3 and ChatGPT-4o in solving multiple-choice questions from different subtopics of anatomy. Surg. Radiol. Anat. 47 (1), 193 (2025).
Masalkhi, M., Ong, J., Waisberg, E. & Lee, A. G. Google DeepMind's Gemini AI versus ChatGPT: a comparative analysis in ophthalmology. Eye (Lond). 38 (8), 1412–1417 (2024).
Marey, A. et al. Evaluating the accuracy and reliability of AI chatbots in patient education on cardiovascular imaging: a comparative study of ChatGPT, Gemini, and Copilot. Egypt. J. Radiol. Nucl. Med. 56 (1), 37 (2025).
Bayala, Y. L. T. et al. Performance of the large language models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence. BMC Rheumatol. 9 (1), 54 (2025).
Madrid-García, A. et al. Harnessing ChatGPT and GPT-4 for evaluating the rheumatology questions of the Spanish access exam to specialized medical training. Sci. Rep. 13 (1), 22129 (2023).
Meral, G., Ateş, S., Günay, S., Öztürk, A. & Kuşdoğan, M. Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment. Am. J. Emerg. Med. 81, 146–150 (2024).
Lee, J. T. et al. Evaluation of performance of generative large language models for stroke care. Npj Digit. Med. 8 (1), 481 (2025).
Sabaner, M. C. & Yozgat, Z. Performance of ChatGPT-4 Omni and Gemini 1.5 Pro on ophthalmology-related questions in the Turkish medical specialty exam. Turk. J. Ophthalmol. 55 (4), 177–185 (2025).
Hernández-Flores, L. A. et al. Assessment of challenging oncologic cases: a comparative analysis between ChatGPT, Gemini, and a multidisciplinary tumor board. J. Surg. Oncol. 131 (8), 1562–1570 (2025).
Oh, N., Choi, G. S. & Lee, W. Y. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann. Surg. Treat. Res. 104 (5), 269–273 (2023).
Wang, T. et al. Evaluating the performance of state-of-the-art artificial intelligence chatbots based on the WHO global guidelines for the prevention of surgical site infection: cross-sectional study. J. Med. Internet Res. 27, e75567 (2025).
Li, H. et al. ChatGPT-4o outperforms Gemini Advanced in assisting multidisciplinary decision-making for advanced gastric cancer. Eur. J. Surg. Oncol. 51 (8), 110096 (2025).
Lin, C. R. et al. Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. Arch. Osteoporos. 20 (1), 120 (2025).
Liu, R., Liu, J., Yang, J., Sun, Z. & Yan, H. Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis. BMC Musculoskelet. Disord. 26 (1), 369 (2025).
Muluk, E. A comparative analysis of artificial intelligence platforms: ChatGPT-4o and Google Gemini in answering questions about birth control methods. Cureus 17 (1), e76745 (2025).
McNulty, A. M. et al. Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: a comparative analysis. J. Clin. Neurosci. 134, 111097 (2025).
Sau, S. et al. Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions. Neurosurg. Rev. 48 (1), 320 (2025).
Ali, R. et al. Performance of ChatGPT and GPT-4 on neurosurgery written board examinations. Neurosurgery 93 (6), 1353–1365 (2023).
Huang, K. A., Choudhary, H. K., Hardin, W. M. & Prakash, N. Comparative analysis of ChatGPT-4o and Gemini Advanced performance on diagnostic radiology in-training exams. Cureus 17 (3), e80874 (2025).
Clark, K. R. Comparative analysis of LLMs' performance on a practice radiography certification exam. Radiol. Technol. 96 (5), 334–342 (2025).
Suh, P. S. et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from Diagnosis Please cases. Radiology 312 (1), e240273 (2024).
Jain, S., Chakraborty, B., Agarwal, A. & Sharma, R. Performance of large language models (ChatGPT and Gemini Advanced) in gastrointestinal pathology and clinical review of applications in gastroenterology. Cureus 17 (4), e81618 (2025).
Khan, A. A. et al. Artificial intelligence for anesthesiology board-style examination questions: role of large language models. J. Cardiothorac. Vasc. Anesth. 38 (5), 1251–1259 (2024).
Passby, L., Jenko, N. & Wernham, A. Performance of ChatGPT on specialty certificate examination in dermatology multiple-choice questions. Clin. Exp. Dermatol. 49 (7), 722–727 (2024).
Dhanvijay, A. K. D. et al. Performance of large language models (ChatGPT, Bing Search, and Google Bard) in solving case vignettes in physiology. Cureus 15 (8), e42972 (2023).
Kumari, A. et al. Large language models in hematology case solving: a comparative study of ChatGPT-3.5, Google Bard, and Microsoft Bing. Cureus 15 (8), e43861 (2023).
Cheong, R. C. T. et al. Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard. Eur. Arch. Otorhinolaryngol. 281 (4), 2137–2143 (2024).
Makrygiannakis, M. A., Giannakopoulos, K. & Kaklamanos, E. G. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing. Eur. J. Orthod. (2024).
Yamaguchi, S. et al. Evaluating the efficacy of leading large language models in the Japanese national dental hygienist examination: a comparative analysis of ChatGPT, Bard, and Bing Chat. J. Dent. Sci. 19 (4), 2262–2267 (2024).
Fukuda, H. et al. Evaluating the image recognition capabilities of GPT-4V and Gemini Pro in the Japanese national dental examination. J. Dent. Sci. 20 (1), 368–372 (2025).
Khan, M. P. & O'Sullivan, E. D. A comparison of the diagnostic ability of large language models in challenging clinical cases. Front. Artif. Intell. 7, 1379297 (2024).
Khan, A. A. et al. Assessing the performance of ChatGPT in medical ethical decision-making: a comparative study with USMLE-based scenarios. J. Med. Ethics (2025).
Lee, Y. et al. Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg. Obes. Relat. Dis. 20 (7), 603–608 (2024).
Lee, Y. et al. Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery Textbook of Bariatric Surgery questions. Surg. Obes. Relat. Dis. 20 (7), 609–613 (2024).
Anvari, S., Lee, Y., Jin, D. S., Malone, S. & Collins, M. Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions. J. Can. Assoc. Gastroenterol. 8 (2), 58–62 (2025).
Zare, S. et al. Comparing the performance of ChatGPT-3.5-Turbo, ChatGPT-4, and Google Bard with Iranian students in pre-internship comprehensive exams. Sci. Rep. 14 (1), 28456 (2024).
Roos, J., Martin, R. & Kaczmarczyk, R. Evaluating Bard Gemini Pro and GPT-4 Vision against student performance in medical visual question answering: comparative case study. JMIR Form. Res. 8, e57592 (2024).
Meo, S. A., Abukhalaf, F. A., ElToukhy, R. A. & Sattar, K. Exploring the role of DeepSeek-R1, ChatGPT-4, and Google Gemini in medical education: how valid and reliable are they? Pak. J. Med. Sci. 41 (7), 1887–1892 (2025).
Omar, M. et al. Generating credible referenced medical research: a comparative study of OpenAI's GPT-4 and Google's Gemini. Comput. Biol. Med. 185, 109545 (2025).
World Health Organization. Ethics and Governance of Artificial Intelligence for Health: WHO Guidance (World Health Organization, 2021).
Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit. Med. 8 (1), 274 (2025).
