Abd-Alrazaq, A. et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med. Educ. 9, e48291 (2023).
Elzayyat, M., Mohammad, J. N. & Zaqout, S. Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education. Med. Educ. Online. 30 (1), 2554678 (2025).
Ling, Q. et al. Assessing the possibility of using large language models in ocular surface diseases. Int. J. Ophthalmol. 18 (1), 1–8 (2025).
Gotta, J. et al. Large language models (LLMs) in radiology exams for medical students: performance and consequences. Rofo 197 (9), 1057–1067 (2025).
Bahir, D. et al. Gemini AI vs. ChatGPT: a comprehensive examination alongside ophthalmology residents in medical knowledge. Graefes Arch. Clin. Exp. Ophthalmol. 263 (2), 527–536 (2025).
Al-Thani, S. N. et al. Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education. Int. J. Emerg. Med. 18 (1), 146 (2025).
Fattah, F. H. et al. Comparative analysis of ChatGPT and Gemini (Bard) in medical inquiry: a scoping review. Front. Digit. Health. 7, 1482712 (2025).
Wu, X. et al. A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios. BMC Oral Health. 25 (1), 1272 (2025).
Singal, A. & Goyal, S. Comparative evaluation of AI platforms Google Gemini 2.5 Flash, Google Gemini 2.0 Flash, DeepSeek V3 and ChatGPT-4o in solving multiple-choice questions from different subtopics of anatomy. Surg. Radiol. Anat. 47 (1), 193 (2025).
Masalkhi, M., Ong, J., Waisberg, E. & Lee, A. G. Google DeepMind's Gemini AI versus ChatGPT: a comparative analysis in ophthalmology. Eye (Lond). 38 (8), 1412–1417 (2024).
Marey, A. et al. Evaluating the accuracy and reliability of AI chatbots in patient education on cardiovascular imaging: a comparative study of ChatGPT, Gemini, and Copilot. Egypt. J. Radiol. Nucl. Med. 56 (1), 37 (2025).
Bayala, Y. L. T. et al. Performance of the large language models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence. BMC Rheumatol. 9 (1), 54 (2025).
Madrid-García, A. et al. Harnessing ChatGPT and GPT-4 for evaluating the rheumatology questions of the Spanish access exam to specialized medical training. Sci. Rep. 13 (1), 22129 (2023).
Meral, G., Ateş, S., Günay, S., Öztürk, A. & Kuşdoğan, M. Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment. Am. J. Emerg. Med. 81, 146–150 (2024).
Lee, J. T. et al. Evaluation of performance of generative large language models for stroke care. Npj Digit. Med. 8 (1), 481 (2025).
Sabaner, M. C. & Yozgat, Z. Performance of ChatGPT-4 Omni and Gemini 1.5 Pro on ophthalmology-related questions in the Turkish medical specialty exam. Turk. J. Ophthalmol. 55 (4), 177–185 (2025).
Hernández-Flores, L. A. et al. Assessment of challenging oncologic cases: a comparative analysis between ChatGPT, Gemini, and a multidisciplinary tumor board. J. Surg. Oncol. 131 (8), 1562–1570 (2025).
Oh, N., Choi, G. S. & Lee, W. Y. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann. Surg. Treat. Res. 104 (5), 269–273 (2023).
Wang, T. et al. Evaluating the performance of state-of-the-art artificial intelligence chatbots based on the WHO global guidelines for the prevention of surgical site infection: cross-sectional study. J. Med. Internet Res. 27, e75567 (2025).
Li, H. et al. ChatGPT-4o outperforms Gemini Advanced in assisting multidisciplinary decision-making for advanced gastric cancer. Eur. J. Surg. Oncol. 51 (8), 110096 (2025).
Lin, C. R. et al. Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. Arch. Osteoporos. 20 (1), 120 (2025).
Liu, R., Liu, J., Yang, J., Sun, Z. & Yan, H. Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis. BMC Musculoskelet. Disord. 26 (1), 369 (2025).
Muluk, E. A comparative analysis of artificial intelligence platforms: ChatGPT-4o and Google Gemini in answering questions about birth control methods. Cureus 17 (1), e76745 (2025).
McNulty, A. M. et al. Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: a comparative analysis. J. Clin. Neurosci. 134, 111097 (2025).
Sau, S. et al. Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions. Neurosurg. Rev. 48 (1), 320 (2025).
Ali, R. et al. Performance of ChatGPT and GPT-4 on neurosurgery written board examinations. Neurosurgery 93 (6), 1353–1365 (2023).
Huang, K. A., Choudhary, H. K., Hardin, W. M. & Prakash, N. Comparative analysis of ChatGPT-4o and Gemini Advanced performance on diagnostic radiology in-training exams. Cureus 17 (3), e80874 (2025).
Clark, K. R. Comparative analysis of LLMs' performance on a practice radiography certification exam. Radiol. Technol. 96 (5), 334–342 (2025).
Suh, P. S. et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from Diagnosis Please cases. Radiology 312 (1), e240273 (2024).
Jain, S., Chakraborty, B., Agarwal, A. & Sharma, R. Performance of large language models (ChatGPT and Gemini Advanced) in gastrointestinal pathology and clinical review of applications in gastroenterology. Cureus 17 (4), e81618 (2025).
Khan, A. A. et al. Artificial intelligence for anesthesiology board-style examination questions: role of large language models. J. Cardiothorac. Vasc. Anesth. 38 (5), 1251–1259 (2024).
Passby, L., Jenko, N. & Wernham, A. Performance of ChatGPT on specialty certificate examination in dermatology multiple-choice questions. Clin. Exp. Dermatol. 49 (7), 722–727 (2024).
Dhanvijay, A. K. D. et al. Performance of large language models (ChatGPT, Bing Search, and Google Bard) in solving case vignettes in physiology. Cureus 15 (8), e42972 (2023).
Kumari, A. et al. Large language models in hematology case solving: a comparative study of ChatGPT-3.5, Google Bard, and Microsoft Bing. Cureus 15 (8), e43861 (2023).
Cheong, R. C. T. et al. Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard. Eur. Arch. Otorhinolaryngol. 281 (4), 2137–2143 (2024).
Makrygiannakis, M. A., Giannakopoulos, K. & Kaklamanos, E. G. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing. Eur. J. Orthod. (2024).
Yamaguchi, S. et al. Evaluating the efficacy of leading large language models in the Japanese national dental hygienist examination: a comparative analysis of ChatGPT, Bard, and Bing Chat. J. Dent. Sci. 19 (4), 2262–2267 (2024).
Fukuda, H. et al. Evaluating the image recognition capabilities of GPT-4V and Gemini Pro in the Japanese national dental examination. J. Dent. Sci. 20 (1), 368–372 (2025).
Khan, M. P. & O'Sullivan, E. D. A comparison of the diagnostic ability of large language models in challenging clinical cases. Front. Artif. Intell. 7, 1379297 (2024).
Khan, A. A. et al. Assessing the performance of ChatGPT in medical ethical decision-making: a comparative study with USMLE-based scenarios. J. Med. Ethics (2025).
Lee, Y. et al. Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg. Obes. Relat. Dis. 20 (7), 603–608 (2024).
Lee, Y. et al. Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery Textbook of Bariatric Surgery questions. Surg. Obes. Relat. Dis. 20 (7), 609–613 (2024).
Anvari, S., Lee, Y., Jin, D. S., Malone, S. & Collins, M. Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions. J. Can. Assoc. Gastroenterol. 8 (2), 58–62 (2025).
Zare, S. et al. Comparing the performance of ChatGPT-3.5-Turbo, ChatGPT-4, and Google Bard with Iranian students in pre-internship comprehensive exams. Sci. Rep. 14 (1), 28456 (2024).
Roos, J., Martin, R. & Kaczmarczyk, R. Evaluating Bard Gemini Pro and GPT-4 Vision against student performance in medical visual question answering: comparative case study. JMIR Form. Res. 8, e57592 (2024).
Meo, S. A., Abukhalaf, F. A., ElToukhy, R. A. & Sattar, K. Exploring the role of DeepSeek-R1, ChatGPT-4, and Google Gemini in medical education: how valid and reliable are they? Pak. J. Med. Sci. 41 (7), 1887–1892 (2025).
Omar, M. et al. Generating credible referenced medical research: a comparative study of OpenAI's GPT-4 and Google's Gemini. Comput. Biol. Med. 185, 109545 (2025).
World Health Organization. Ethics and Governance of Artificial Intelligence for Health: WHO Guidance (World Health Organization, 2021).
Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit. Med. 8 (1), 274 (2025).
