A team of researchers from Ben-Gurion University of the Negev has developed a new database to test the ability of AI language models to diagnose complex medical cases.
At the Association for the Advancement of Artificial Intelligence (AAAI) conference in Philadelphia, the researchers presented findings that challenge conventional approaches to AI in healthcare. Their study found that general-purpose AI models, such as GPT-4o, can outperform specialized medical models at diagnosing complex medical cases. This finding could reshape how AI is applied in healthcare, pointing toward faster and more accurate diagnostic tools.
Traditionally, AI language models have been tested on straightforward medical scenarios, such as exam-style questions or common diseases. Real-world cases, however, are often far more intricate and demand nuanced reasoning. To address this gap, the team built the CUPCase database, comprising 3,562 detailed case reports drawn from BMC's Journal of Medical Case Reports. These cases, many featuring rare and unusual conditions, were formatted into multiple-choice and open-ended questions to simulate the diagnostic challenges doctors face in practice.
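To make the two question formats concrete, here is a minimal sketch of how a single case report might be turned into a multiple-choice item and an open-ended item. The field names, the distractor list, and the shuffling scheme are illustrative assumptions for this article, not the CUPCase authors' actual pipeline.

```python
import random

def make_items(case_text: str, true_diagnosis: str, distractors: list, seed: int = 0):
    """Turn one case report into a multiple-choice item and an open-ended item.

    Illustrative only: the real CUPCase construction details are not
    described here, so this just shows the shape of the two formats.
    """
    rng = random.Random(seed)
    options = list(distractors) + [true_diagnosis]
    rng.shuffle(options)  # so the correct answer's position is unpredictable
    mc_item = {
        "question": case_text + "\nWhich diagnosis best fits this presentation?",
        "options": options,
        "answer_index": options.index(true_diagnosis),
    }
    open_item = {
        "question": case_text + "\nWhat is the most likely diagnosis?",
        "answer": true_diagnosis,
    }
    return mc_item, open_item

# Hypothetical example case, for illustration only.
mc, open_q = make_items(
    "A 34-year-old presents with episodic fever and a migrating rash...",
    true_diagnosis="Adult-onset Still's disease",
    distractors=["Systemic lupus erythematosus", "Lyme disease", "Drug reaction"],
)
print(mc["options"][mc["answer_index"]])  # the correct choice, wherever it was shuffled
```

A multiple-choice item can be scored by exact match on the chosen index, while open-ended answers require comparing free text against the reference diagnosis, which is part of what makes that format harder for models.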
The results were striking. GPT-4o, a general-purpose model, achieved 87.9% accuracy on multiple-choice questions and 76.4% on open-ended ones, surpassing specialized models such as Meditron-70B and MedLM-Large. "We were surprised that general models like GPT-4o outperformed those tailored for medicine," said researcher Ofir Ben-Shoham. The finding suggests that broad, versatile AI models may be better equipped to handle the complexity of real-world medical diagnostics.