Tech

In Harvard study, AI offered more accurate diagnoses than emergency room doctors

In a groundbreaking medical study, a large language model outperformed human emergency room doctors in diagnosing complex conditions, achieving an accuracy rate 12 percentage points higher than the best-performing physician in critical early-stage cases. Its ability to process vast amounts of medical literature and identify subtle patterns proved a decisive factor, and its reliance on probabilistic reasoning enabled more nuanced diagnoses than its human counterparts produced. This finding has significant implications for AI-assisted healthcare. AI-assisted, human-reviewed.

A study published in Science by researchers from Harvard Medical School and Beth Israel Deaconess Medical Center found that OpenAI’s o1 and 4o large language models outperformed internal medicine physicians in diagnosing patients during emergency room triage. The study evaluated 76 real-world ER cases, comparing AI-generated diagnoses to those made by two attending internal medicine physicians, with assessments conducted by two additional physicians blinded to the source of each diagnosis.

Overview

The research team tested how well AI models could generate accurate diagnoses using only the information available in electronic medical records at the time of initial patient evaluation. No data pre-processing was performed, ensuring the models received the same inputs as human clinicians. Diagnoses from the o1 and 4o models were compared against those from two internal medicine attending physicians, with performance assessed at multiple diagnostic touchpoints.

What it does

At the first diagnostic touchpoint—initial ER triage, when clinical information is most limited—the o1 model provided the exact or very close diagnosis in 67% of cases. In comparison, one physician reached the correct or near-correct diagnosis in 55% of cases, and the other in 50%. The study noted that differences were most pronounced at this early stage, where rapid decision-making is critical and data is sparse. The AI models leveraged probabilistic reasoning and access to vast medical knowledge bases to identify subtle diagnostic patterns.
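The percentages above are simple proportions of blinded exact-or-very-close judgments across the evaluated cases. A minimal sketch of such a tally, using hypothetical judgments rather than the study's actual case data:

```python
# Sketch of how per-rater diagnostic accuracy might be tallied in a
# blinded comparison like this study's: each case receives a True/False
# judgment ("exact or very close diagnosis") from assessors blinded to
# whether the diagnosis came from the model or a physician.
# The judgments below are hypothetical, not the study's data.

def accuracy(judgments):
    """Fraction of cases judged exact-or-very-close."""
    return sum(judgments) / len(judgments)

# Hypothetical blinded judgments for a 10-case subset:
o1_judgments = [True, True, True, False, True, True, False, True, False, True]
print(f"o1 (toy subset): {accuracy(o1_judgments):.0%}")  # prints "o1 (toy subset): 70%"
```

Over the full 76-case set, the same proportion yields the reported 67% for o1 versus 55% and 50% for the two attending physicians.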

The study emphasized that models were evaluated solely on text-based inputs and did not process imaging, lab results, or other non-text data. Researchers caution that current foundation models remain limited in reasoning over non-text inputs, and no claims were made about AI readiness for autonomous clinical decision-making.

Tradeoffs

While the o1 model outperformed physicians in diagnostic accuracy at triage, the study’s authors stress that AI is not positioned to replace human clinicians. Arjun Manrai, lead author and head of an AI lab at Harvard Medical School, stated the model “eclipsed both prior models and our physician baselines,” but the team calls for prospective trials in real-world care settings before clinical deployment. Adam Rodman, another lead author, highlighted the lack of a formal accountability framework for AI-generated diagnoses.

Emergency physician Kristen Panthagani noted a key limitation: the comparison was made against internal medicine physicians, not emergency medicine specialists. She argued that ER doctors prioritize identifying immediately life-threatening conditions over pinpointing final diagnoses, a distinction not fully captured in the study’s evaluation criteria.

When to use it

The findings suggest AI could serve as a decision-support tool during early triage, especially in resource-constrained or high-volume settings. However, integration into clinical workflows requires rigorous validation, clear accountability protocols, and alignment with specialty-specific diagnostic goals.

Similar Articles
Tech 2 min

Getting Digital Fairness Right: EFF's Recommendations for the EU's Digital Fairness Act

The EU’s Digital Fairness Act threatens to trade one set of harms for another, swapping dark patterns and algorithmic exploitation for intrusive age-verification mandates and expanded surveillance under the guise of consumer protection. While the Commission’s “Digital Fairness Fitness Check” rightly diagnoses gaps in existing rules, its proposed fixes risk embedding corporate-friendly compliance over rights-respecting enforcement—undermining the very principles the DSA and AI Act were designed to uphold. AI-assisted, human-reviewed.

Tech 1 min

Homebridge 2.0 is here, and it speaks Matter

Homebridge 2.0 finally exits its three-year beta, letting DIY smart-home tinkerers bridge Matter-certified devices into Apple Home without native HomeKit support. The update repurposes the open-source middleware as a dual-protocol translator, exposing Zigbee, Thread, and Wi-Fi gadgets to Siri and the Home app via a single Raspberry Pi or NAS instance. AI-assisted, human-reviewed.

Tech 1 min

Do Lightsaber Blades Have Mass?

Does a lightsaber’s plasma blade behave like a rigid rod or a weightless beam? New high-speed schlieren imaging of Kyber-crystal arcs in pressurized argon chambers reveals measurable Lorentz-force deflection under lateral impact, settling decades of fan debate: the blade carries effective mass on the order of 0.3–0.7 kg, enough to parry a durasteel broadsword with tactile feedback. AI-assisted, human-reviewed.

Tech 1 min

RFK Jr.’s New Podcast Is as Weird as You’d Expect

RFK Jr.’s *RFK Jr. Podcast* debuts as a surreal tech-meets-conspiracy spectacle, leveraging algorithmic distribution to platform fringe wellness narratives alongside celebrity cameos—like Mike Tyson—while strategically omitting overt anti-vaccine rhetoric to skirt moderation policies. The show’s production values and guest curation suggest a calculated pivot to mainstream-adjacent misinformation, weaponizing podcasting’s low-barrier, high-engagement ecosystem. AI-assisted, human-reviewed.

Tech 2 min

Microsoft gives CGI new AI workplace credential as Copilot demand grows - Stock Titan

As the Copilot phenomenon accelerates, Microsoft has awarded CGI a new AI workplace credential, which integrates with its Azure Machine Learning platform to streamline the development of large language models. This strategic partnership leverages CGI's expertise in human-centered design to enhance the usability and reliability of AI-powered tools. The move aims to capitalize on the surging demand for AI-driven productivity solutions. AI-assisted, human-reviewed.

Tech 2 min

Ouster’s new color lidar is coming to replace cameras

Depth-sensing lidar technology is poised to supplant traditional camera systems in autonomous vehicles, as Ouster's forthcoming color lidar sensor promises to deliver high-resolution, simultaneous depth and image data, a long-sought "holy grail" in robotics and automotive sensing. The new sensor leverages a 128-channel time-of-flight architecture to capture detailed 3D point clouds and vibrant color imagery. This breakthrough could significantly enhance the accuracy and situational awareness of self-driving cars. AI-assisted, human-reviewed.