May 6, 2026
AI in the ER: What New Study Reveals 

There has been much speculation and anticipation about the use of artificial intelligence (AI) in healthcare. Now, a study from Harvard University has shed light on just how helpful the technology can be in the hospital, and more specifically, in the emergency room (ER).

In an April 30 article, Harvard Magazine reported that the study revealed that "an advanced AI agent has outperformed human physicians on a series of demanding tests that assess the ability to correctly diagnose patient illnesses in clinical settings." According to the study, OpenAI's o1-preview, the company's first model capable of step-by-step reasoning, proved that it could conduct real-world triage in ERs, recommend appropriate diagnostic tests, and perform case-management tasks at a level that matched or surpassed that of human physicians.

A Promising Future

The study, led by Harvard researchers with collaborators at Stanford and published in Science, suggests an urgent need for controlled trials of the technology to determine how it can be most effectively deployed.

According to the article, researchers put the o1 preview system through the following paces:

  • They asked the large language model (LLM) to arrive at a patient diagnosis and develop a testing plan.
  • They evaluated its skill in clinical reasoning compared to both experts and generalist physicians.
  • They assessed the LLM’s performance on 76 emergency room cases in a Boston hospital at three stages: initial triage at arrival, first contact with a physician, and upon admission to the medical floor or intensive care unit.

The doctors who reviewed these tests found that the system “matched or exceeded expert human performance across each stage.” The AI was particularly good at making assessments at the initial triage stage, when there was the least information available, according to the article.

In other tests, the AI model proved “especially adept at diagnoses involving rare diseases and complex cases.” It was noted that “AI excelled in an evaluation that involved real scenarios from Massachusetts General Hospital that have been published in The New England Journal of Medicine.”

Delving into the Details

Thomas Buckley, a doctoral student at Harvard Medical School who worked on the study, stated that “the results suggest that o1 preview is achieving nearly optimal diagnosis on this set of challenging cases that have been used as benchmarks for assessing the diagnostic ability of computers since 1959.”

The study further found that the system significantly outperformed previous AI models, as well as humans using conventional aids such as Google search, on tasks involving what doctors refer to as "management reasoning." These tasks range from recommendations for antibiotic use to setting care goals and even end-of-life conversations.

Peter Brodeur, a clinical fellow at Beth Israel Deaconess Medical Center, offered these observations:

Management reasoning is likely a more complex task than diagnostic reasoning. It requires many considerations of not only the objective features of a case, but also subjective factors: what context and situations you’re in, and therefore, it probably doesn’t come as a surprise that a reasoning model performs significantly better at such tasks than humans and ChatGPT-4.

Send in the Robots?

Despite all this potential for improved care, the researchers are not suggesting that AI can replace doctors. After all, AI comes with some notable caveats. The study was based entirely on text inputs, a domain in which language models excel. But the researchers note that practicing physicians evaluate many other forms of information: listening to the patient, reviewing X-rays and other imaging studies, and taking note of physiological signals such as EKGs.

Rather than replacing providers, the research team "envisions AI models working in partnership with physicians, to help them make better decisions." One researcher cautioned that "AI models can get things wrong" and that they "can be sycophantic."

Summing It Up

The study's senior co-author Adam Rodman, an assistant professor at Harvard Medical School who leads the school's task force for integrating AI into the curriculum, said "the study definitively shows that reasoning models of AI can meet the criteria for making diagnoses at the highest levels of human performance." The results suggest at least two instances in which such models could be especially useful to physicians. One is performing triage in ERs. The other is providing a second opinion. With the rapid advancement of this technology, there is hope that hospital care, particularly in the ER, will improve in the near future.