May 3, 2026

AI outperforms doctors in diagnosis tests — but don’t expect it to replace your doctor

A new study that examined how well artificial intelligence performed in an emergency room setting found that it outperformed doctors at diagnosing patients. 

The study, published in Science on Thursday, evaluated OpenAI’s o1 model, which the company released in 2024. The model is a reasoning-focused AI specifically designed to excel at complex, structured problems. This makes it fairly different from chatbots like ChatGPT, which OpenAI designed as a generalist.

Despite the positive results, the researchers emphasized the study's limitations and raised concerns that their findings may be used by others to suggest AI should replace doctors. Dr. Adam Rodman, a general internist and medical educator at Beth Israel Deaconess Medical Center and a co-author of the study, said he gets "a little bit queasy about how some of these results might be used."

What did the study find?

The research team tested the model in two ways. First, they tested it against curated medical training cases, which are specifically designed to probe doctors' diagnostic thinking, much like the exams physicians take in school. 

In these tests, the AI model consistently outperformed the doctors. But the test where the AI really excelled involved historical real-world ER cases. 

During this test, researchers pulled patient cases from Beth Israel Deaconess Medical Center to get as close to real conditions as possible. The researchers noted that they gave the AI raw electronic health record data, which they described as “messy” and similar to what actual doctors encounter. 

The team tested the model at two points in each case: when the patient arrived at triage and later when the patient was ready for admission. At triage, the AI got the correct diagnosis 67% of the time, compared to 50% and 55% for the two human doctors it was measured against. By the time the patient was ready for admission, the AI's accuracy jumped to 81%, compared to 70% and 79% for the human doctors. 

Although the tests were not conducted in real time during the hospital visits, the researchers found that the AI was effective at making diagnoses. 

“We can definitively say … reasoning models can meet that criteria for making diagnostic reasoning at the highest levels of human performance,” Rodman said, according to Vox. 

ChatGPT is still no doctor

While the model performed well in these tests, that doesn't mean ChatGPT or other large language models are a replacement for a physician, something the researchers specifically noted. 

"No one should look at this and say we do not need doctors," Rodman said.

A previous study published in February in Nature Medicine found that ChatGPT underestimated the severity of a patient’s condition in 52% of cases.

To test this, researchers gave the chatbot a range of medical scenarios, from non-urgent issues to medical emergencies, and evaluated its assessments. In one example of the AI failing to recognize severity, the bot told a patient on the verge of diabetic shock or respiratory failure to simply monitor themselves instead of seeking immediate care. It also repeatedly failed to pick up on clear signs of suicidal ideation, a topic Straight Arrow has previously reported on.

The authors of the study published Thursday asked that their research inform clinical trials testing how AI performs under real-world conditions before hospitals deploy more of the technology. 

“Our findings suggest the urgent need for prospective trials to evaluate these technologies in real-world patient care settings,” the authors wrote. 
