MedAsk

Towards More Realistic Evaluation of Medical AI Systems

July 8th, 2024 | Klemen Vodopivec

Existing Medical LLM Benchmarks: Assessing Knowledge but Not Practical Use

While medical students are assessed through a variety of methods, including practical exams and clinical rotations, the predominant approach for evaluating LLMs in the medical field involves medical exam-type questions, with a strong emphasis on multiple-choice formats. Some of the most common medical benchmarks are (1):

  • MedQA: A dataset of medical question-answering pairs sourced from Medical Licensing Exams in the US, Mainland China, and Taiwan. Each question is accompanied by 4-5 multiple-choice options, with one correct answer and supporting explanations or references.
  • PubMedQA: A dataset of biomedical research questions derived from PubMed abstracts, where each question is answered with yes, no, or maybe based on the corresponding abstract.
  • MedMCQA: A collection of medical multiple-choice questions covering various topics, including anatomy, physiology, and clinical medicine.
  • MMLU clinical topics: A subset of the Massive Multitask Language Understanding (MMLU) benchmark, focusing on clinical topics and utilizing a multiple-choice format.
  • MultiMedQA: A benchmark suite introduced alongside Med-PaLM that combines several existing datasets (including MedQA, MedMCQA, PubMedQA, and the MMLU clinical topics) with consumer health questions, spanning professional medical exams, biomedical research, and consumer medical queries.

These benchmarks provide a standardized way to assess medical knowledge and reasoning skills, but they fail to capture the complex and dynamic nature of real-world clinical work (2). In practice, doctors engage in sequential decision-making, handling uncertainty with limited resources while compassionately caring for patients and gathering relevant information from them. These facets of clinical practice are not reflected in the static multiple-choice evaluations that currently dominate the literature (3).

As compound AI systems—architectures that integrate multiple AI models to perform complex tasks—emerge, there is a growing need for new benchmarks that capture agentic behaviors. To support the safe and effective deployment of AI in medicine, we must design benchmarks that better simulate real-world clinical workflows by shifting our focus from static datasets to dynamic simulation environments.

Envisioning a Benchmark For Medical Agents

While developing MedAsk, our symptom assessment tool, we recognized the need for a simulation-based benchmark to evaluate its performance. As no existing benchmark suited our needs, we decided to design our own, drawing inspiration from the initial work in this area (4, 5, 6).

Our goal is to create an Objective Structured Clinical Examination (OSCE)-style benchmark that evaluates AI agents’ proficiency in predicting patient diagnoses through diagnostic dialogue and other sources of health data. This benchmark, which will be open-sourced, will consist of three tasks, each designed to mirror the envisioned progression of MedAsk’s development:

  1. Text-based dialogue: The agent will engage in a text-only conversation with a simulated patient to gather relevant information and attempt to reach a diagnosis (a minimal sketch of such a dialogue loop follows this list).
  2. Multimodal conversation: The agent will interact with a simulated patient using text, images (e.g., skin lesions), and home medical data (e.g., blood pressure, glucose levels, wearables) to achieve a more accurate diagnosis.
  3. Comprehensive data access: The agent will have access to the patient’s complete EHR (lab test results, medical history, genetic testing, prior diagnoses), consider medical staff opinions, and engage in patient dialogue to gather additional insights.
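To make the first task concrete, here is a minimal sketch of how a text-only encounter between a diagnostic agent and a simulated patient could be driven. The `chat()` helper, the prompts, and the turn limit are illustrative assumptions for this post, not MedAsk’s actual implementation or the final benchmark harness.

```python
# Minimal sketch of a text-only diagnostic encounter: an LLM "doctor" agent
# interviews an LLM-simulated patient grounded in a case vignette.
# chat(), the prompts, and MAX_TURNS are assumptions, not MedAsk's code.

MAX_TURNS = 10

def chat(system_prompt: str, transcript: list[dict]) -> str:
    """Call whichever chat-completion API you use (OpenAI, Anthropic, ...)."""
    raise NotImplementedError

def run_encounter(vignette: str) -> list[str]:
    doctor_sys = (
        "You are a physician taking a history. Ask one question per turn. "
        "When confident, reply 'FINAL DIAGNOSES:' followed by your top 5, "
        "separated by semicolons."
    )
    patient_sys = (
        "You are a patient. Answer the doctor's questions using only this "
        "case description and never reveal the diagnosis:\n" + vignette
    )
    transcript: list[dict] = []
    for _ in range(MAX_TURNS):
        doctor_turn = chat(doctor_sys, transcript)
        transcript.append({"role": "doctor", "content": doctor_turn})
        if doctor_turn.startswith("FINAL DIAGNOSES:"):
            return [d.strip() for d in doctor_turn.split(":", 1)[1].split(";")]
        patient_turn = chat(patient_sys, transcript)
        transcript.append({"role": "patient", "content": patient_turn})
    return []  # the agent never committed to a differential within the limit
```

The later tasks would keep this basic interaction loop but feed the agent additional inputs: images and home medical data in task 2, and the full EHR plus medical staff opinions in task 3.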

The primary outcome metric for evaluating the AI agents will be top-5 diagnostic accuracy: the proportion of cases in which the correct diagnosis appears among the agent’s top five predictions. Secondary metrics, such as history-taking quality, triage accuracy, and explainability, will be added in the future.
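For clarity, here is a small sketch of how top-5 diagnostic accuracy can be computed once the agent’s ranked differentials have been collected. The naive string normalization below is a placeholder assumption; a real evaluation would need synonym or ICD-code mapping, or clinician adjudication of diagnostic equivalence.

```python
def normalize(dx: str) -> str:
    # Placeholder normalization; a real pipeline would map synonyms / ICD
    # codes or use clinician (or LLM) adjudication of diagnostic equivalence.
    return dx.strip().lower()

def top_k_accuracy(cases: list[tuple[list[str], str]], k: int = 5) -> float:
    """Fraction of cases whose reference diagnosis appears in the agent's
    top-k ranked differential."""
    hits = sum(
        normalize(reference) in {normalize(d) for d in ranked[:k]}
        for ranked, reference in cases
    )
    return hits / len(cases)

# Toy example: one of two cases is a top-5 hit, so accuracy is 0.5.
cases = [
    (["Migraine", "Tension headache", "Cluster headache"], "migraine"),
    (["GERD", "Gastritis"], "Peptic ulcer disease"),
]
print(top_k_accuracy(cases))  # 0.5
```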

We are in the final stages of development for the first task of the benchmark and will release it in the near future, along with the results of a comparative evaluation of MedAsk and other state-of-the-art models such as GPT-4o and Claude 3.5 Sonnet. Stay tuned!

  1. https://huggingface.co/blog/leaderboard-medicalllm (Accessed June 15, 2024)
  2. https://sergeiai.substack.com/p/googles-med-gemini-im-excited-and (Accessed June 15, 2024)
  3. https://www.linkedin.com/pulse/compass-broken-how-current-medllm-benchmarks-worsen-ai-walker-md-yqzec/?trackingId=moLt0%2FOUst8S0GgF6LlwYA%3D%3D (Accessed June 15, 2024)
  4. Schmidgall, Samuel, et al. “AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments.” arXiv preprint arXiv:2405.07960 (2024).
  5. Tu, Tao, et al. “Towards Conversational Diagnostic AI.” arXiv preprint arXiv:2401.05654 (2024).
  6. Johri, S., et al. “Guidelines for Rigorous Evaluation of Clinical LLMs for Conversational Reasoning.” medRxiv preprint (2023). https://doi.org/10.1101/2023.09.12.23295399