Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions

Newton, Phil; Summers, Chris; Zaheer, Muhammad; Xiromeriti, Maria; Stokes, Jemima; Bhangu, Jaskaran; Roome, Elis; Roberts-Phillips, Alanna; Mazaheri-Asadi, Darius; Jones, Cameron; Hughes, Stuart; Gilbert, Dominic; Jones, Ewan; Essex, Keioni; Ellis, Emily; Davey, Ross; Cox, Adrienne; Bassett, Jessica

doi:10.1007/s40670-025-02293-z

Journal article

Medical Science Educator, Volume: 35, Issue: 2, Pages: 721 - 729

Swansea University Authors: Phil Newton, Chris Summers, MUHAMMAD ZAHEER, MARIA XIROMERITI, JEMIMA STOKES, JASKARAN BHANGU, ELIS ROOME, ALANNA ROBERTS-PHILLIPS, DARIUS MAZAHERI-ASADI, CAMERON JONES, STUART HUGHES, DOMINIC GILBERT, EWAN JONES, KEIONI ESSEX, EMILY ELLIS, ROSS DAVEY, ADRIENNE COX, JESSICA BASSETT

PDF | Version of Record

© The Author(s) 2025. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).

DOI (Published version): 10.1007/s40670-025-02293-z

Published in:	Medical Science Educator
ISSN:	2156-8650
Published:	Springer Nature 2025
URI:	https://cronfa.swan.ac.uk/Record/cronfa67970

Abstract:	ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT’s performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.
Keywords:	Assessment validity; Academic integrity; Cheating; Evidence-based education; MCQs; Pragmatism
College:	Faculty of Medicine, Health and Life Sciences
Funders:	Swansea University
Issue:	2
Start Page:	721
End Page:	729
