This is a pretty interesting paper published in the April 30 edition of Science (Peter G. Brodeur et al., Performance of a large language model on the reasoning tasks of a physician. Science 392, 524-527 (2026). DOI: 10.1126/science.adz4433).
Open access available here:
https://www.science.org/doi/10.1126/science.adz4433
It discusses the performance of some earlier OpenAI models (e.g., o1-preview and GPT-4) on generating differential diagnoses, and then examines how o1 and GPT-4o performed on real-world ED and ICU admissions compared with two internal medicine physicians at Beth Israel Deaconess in Boston.
Excerpts from the article:
“The o1 model identified the exact or very close diagnosis (Bond scores of 4 to 5) in 67.1% of cases during the initial ER triage, 72.4% during the ER physician encounter, and 81.6% at admission to the medical floor or ICU—surpassing the two physicians (55.3, 61.8, and 78.9% for Physician 1; 50.0, 52.6, and 69.7% for Physician 2) at each stage.”
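The headline percentages are simply the share of cases whose Bond score falls in the 4-to-5 "exact or very close" band. A minimal sketch of that calculation (the scale and cutoff are from the paper; the scores and the `share_close_diagnoses` helper below are invented for illustration):

```python
def share_close_diagnoses(bond_scores, cutoff=4):
    """Percentage of cases whose Bond score meets or exceeds the cutoff."""
    if not bond_scores:
        return 0.0
    hits = sum(1 for s in bond_scores if s >= cutoff)
    return 100.0 * hits / len(bond_scores)

# Invented Bond scores for 10 cases at a single stage of care:
triage_scores = [5, 4, 2, 5, 3, 4, 5, 1, 4, 5]
print(f"{share_close_diagnoses(triage_scores):.1f}% scored 4-5")  # 70.0% scored 4-5
```

In the study, this share was computed separately at each stage (triage, ER physician encounter, admission) for the model and for each physician, which is what makes the stage-by-stage comparison possible.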
and
“We emphasize that our study addresses only text-based performance for both humans and machines; clinical medicine is multifaceted and awash with nontext inputs, including auditory (such as the patient’s level of distress) and visual information (for example, interpretation of medical imaging studies) that clinicians routinely use. Existing studies suggest that current foundation models are more limited in reasoning over nontext inputs (26, 27); future work is needed to assess how humans and machines may effectively collaborate (28) in use of nontext signals.”
Progress apparently continues apace; as these are now “older” models, I would agree with the authors that “Although we expect performance to be sustained or improved with newer models (27, 29), further studies should be done to elucidate how performance varies across models and to study how humans and LLMs may collaborate.”