MedAsk

Introducing SymptomCheck Bench

November 2nd, 2024 | Klemen Vodopivec, Rok Vodopivec, Nejc Završan

Overview

In our previous blog post, we highlighted the need for more realistic, agentic-like benchmarks in medical AI evaluation. Today, we present SymptomCheck Bench, our first step towards addressing this challenge.

SymptomCheck Bench is an OSCE-style (1) benchmark designed to test the diagnostic accuracy of Large Language Model (LLM) based agents that engage in a text-only conversation with the patient – like our MedAsk app. We chose this name because the tasks we evaluate closely mirror the functionality of applications known as symptom checkers (2) – extracting symptom information from users and generating possible medical conditions based on those symptoms.

In this article, we will:

  • Explain how SymptomCheck Bench works and describe its key components.
  • Present comparative results of MedAsk versus traditional symptom checkers.
  • Discuss the benchmark’s limitations and explore future improvements.

We believe transparency is paramount in the medical domain. As such, we have made the entire benchmark, including code and data, freely accessible on GitHub (3). We encourage the medical AI community to engage with SymptomCheck Bench and contribute to its ongoing development.

SymptomCheck Benchmark Description

SymptomCheck Bench follows a structured, four-step process, as illustrated in Figure 1.

  1. Initialization: A clinical vignette is selected from a predefined set, providing the clinical details to the patient simulator.
  2. Dialogue: The symptom checker agent and patient simulator engage in a text-based dialogue, simulating a real symptom assessment conversation, with the agent asking questions and the patient providing responses based on the clinical vignette data.
  3. Diagnosis: Once the symptom checker agent has gathered sufficient information, it produces a list of the top five differential diagnoses (DDx).
  4. Evaluation: The evaluator agent compares the DDx list against the ground truth diagnosis from the selected clinical vignette to assess the symptom checker agent’s performance.
Figure 1: SymptomCheck Bench Workflow (Created in BioRender.com)
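To make the workflow concrete, the sketch below condenses steps 1-3 into a single harness loop (step 4, evaluation, is sketched separately in the Evaluator Agent section). It is an illustrative approximation rather than the repository's actual code: the vignette fields, the prompts, and the chat helper are simplified stand-ins, and it assumes the OpenAI Python SDK with GPT-4o as the backing model.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completion API would do

client = OpenAI()

def chat(system_prompt: str, transcript: list[str]) -> str:
    """One LLM turn: a role-defining system prompt plus the dialogue so far."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n".join(transcript)},
        ],
    )
    return response.choices[0].message.content

def run_case(vignette: dict, max_questions: int = 12) -> str:
    """Steps 1-3: initialize from a vignette, run the dialogue, return the agent's final DDx line."""
    # Step 1: initialization - the patient simulator sees the vignette but not the diagnosis.
    patient_system = f"You are a patient with this background: {vignette['background']}"
    doctor_system = (
        f"You are a doctor diagnosing a patient ({vignette['demographics']}) over chat. "
        "When ready, answer with 'DIAGNOSIS READY: [d1, d2, d3, d4, d5]'."
    )
    transcript = [f"Patient: {vignette['chief_complaint']}"]  # Step 2: dialogue opens with the chief complaint
    for _ in range(max_questions):
        doctor_turn = chat(doctor_system, transcript)
        transcript.append(f"Doctor: {doctor_turn}")
        if "DIAGNOSIS READY" in doctor_turn:  # Step 3: the agent commits to a differential
            return doctor_turn
        transcript.append(f"Patient: {chat(patient_system, transcript)}")
    # Question budget exhausted: ask the agent to commit to its differential now.
    return chat(doctor_system, transcript + ["Provide your DIAGNOSIS READY line now."])
```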

Symptom Checker Agent

At the heart of SymptomCheck Bench is the Symptom Checker Agent, the AI system under evaluation. Its main tasks are:

  • Engage in a text-based conversation with the simulated patient.
  • Ask relevant questions to gather symptoms, medical history, and other clinical details.
  • Generate a differential diagnosis (DDx) based on the data collected. 

The agent is limited to asking 12 questions before the conversation concludes, since longer conversations appear to decrease diagnostic accuracy (4). Various large language models are supported as the agent’s underlying engine, including the GPT series, Mistral, Claude, DeepSeek, and others. Appendix A shows the prompt instructions for the symptom checker agent.

In line with our commitment to transparency, we have made an older version of MedAsk available on GitHub (3) as the symptom checker agent for SymptomCheck Bench. This version was used for comparative testing against other symptom checkers and serves as a baseline for evaluating future iterations of MedAsk. For future versions, an API will be provided to allow anyone to test MedAsk using the SymptomCheck Bench framework.
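Because the agent signals completion with a “DIAGNOSIS READY: [...]” line (see Appendix A), the harness has to extract the five diagnoses from free-form model output. A minimal parser along these lines would work; the actual repository code may handle edge cases differently.

```python
import re

def parse_ddx(agent_reply: str, max_items: int = 5) -> list[str]:
    """Pull the differential diagnosis list out of a 'DIAGNOSIS READY: [...]' reply."""
    match = re.search(r"DIAGNOSIS READY:\s*\[(.*?)\]", agent_reply, re.DOTALL)
    if match is None:
        return []  # the agent never produced a final answer
    # Split on commas and strip surrounding whitespace and quotes from each diagnosis.
    items = [d.strip(" '\"\n") for d in match.group(1).split(",")]
    return [d for d in items if d][:max_items]

reply = ("Thank you, I have enough information. DIAGNOSIS READY: "
         "[Migraine, Tension headache, Cluster headache, Sinusitis, Temporal arteritis]")
print(parse_ddx(reply))
# ['Migraine', 'Tension headache', 'Cluster headache', 'Sinusitis', 'Temporal arteritis']
```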

Patient Agent

The patient agent, which uses GPT-4o as its base language model, simulates a patient based on the scenario described in a clinical vignette. Its tasks are as follows:

  • Start the conversation with the chief complaint from the vignette.
  • Answer follow-up questions using data from the vignette.
  • If asked about information not present in the vignette, respond with “I don’t know.”

Importantly, the patient agent is not provided with its ground truth diagnosis to prevent information leakage into the conversation. Appendix B shows the prompt instructions for the patient agent.
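As a concrete illustration of how vignette fields feed the simulator, the snippet below fills a template modeled on the Appendix B prompt. The toy vignette is invented for illustration and is not one of the benchmark cases.

```python
# Template modeled on the Appendix B prompt; field names mirror its placeholders.
PATIENT_TEMPLATE = """You are a patient with the following background:
DEMOGRAPHICS: {demographics}
HISTORY: {history_of_illness}
PRIMARY COMPLAINTS: {primary_complaints}
ADDITIONAL DETAILS: {additional_details}

You are visiting a doctor because of your PRIMARY COMPLAINTS. Provide concise answers of
1-3 sentences, sharing only relevant information from your background. If the doctor asks
about something not mentioned in the background, simply reply 'I don't know.'"""

# Toy vignette for illustration only (not one of the benchmark's 400 cases).
vignette = {
    "demographics": "34-year-old woman",
    "history_of_illness": "Two days of burning sensation when urinating and urinary frequency.",
    "primary_complaints": "Pain when urinating",
    "additional_details": "No fever, no flank pain; no relevant family or social history.",
}

patient_system_prompt = PATIENT_TEMPLATE.format(**vignette)
print(patient_system_prompt)
```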

We sourced our clinical vignettes from research conducted by Hammoud et al., which compared the diagnostic accuracy of six different symptom checkers. Their work provided 400 open-source vignettes representing a diverse range of clinical scenarios. A more detailed description of the vignettes and their creation process can be found in the original publication (5). 

Figure 2 illustrates an example vignette and the simulated dialogue between a patient agent and symptom checker agent (MedAsk) based on that vignette.

Figure 2: Example Clinical Vignette (left) and Simulated Patient-MedAsk Dialogue (right).

Evaluator Agent

At the conclusion of the dialogue, the symptom checker agent returns a list of the top five differential diagnoses. This list is then automatically compared to the ground truth diagnosis from the vignette by the evaluator agent, which uses GPT-4o as its base model. The evaluator determines whether the ground truth diagnosis appears in the DDx list and, if so, at what position; aggregating these positions yields top-k diagnostic accuracy.

One of the key challenges in developing the evaluator agent was selecting an evaluation prompt that could account for the diversity in disease naming. We adopted the disease-matching definitions from the study by Tu et al. (6), which are outlined in Table 1. A diagnosis is considered a match only if it meets the “Exact Match” or “Extremely Relevant” definitions, which we believe strikes the best balance between strictness and practicality. The full prompt used for evaluation can be found in Appendix C.

Degree of Matching – Description
Unrelated – Nothing in the DDx is related to the probable diagnosis.
Somewhat Related – DDx contains something that is related, but unlikely to be helpful in determining the probable diagnosis.
Relevant – DDx contains something that is closely related and might have been helpful in determining the probable diagnosis.
Extremely Relevant – DDx contains something that is very close, but not an exact match to the probable diagnosis.
Exact Match – DDx includes the probable diagnosis.

Table 1: Degrees of Matching Between DDx and Ground Truth Diagnosis, Taken from (6).
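Combining the Appendix C prompt with the Table 1 criteria, the evaluator reduces to a single LLM call plus light parsing of its answer. The sketch below again assumes the OpenAI Python SDK; the exact prompt wording and response handling in the repository may differ.

```python
import re
from openai import OpenAI

client = OpenAI()

# Condensed from the Appendix C evaluator instructions.
EVALUATOR_PROMPT = """Given a list of differential diagnoses and the correct diagnosis,
determine if any of the diagnoses in the list are either an exact match, or very close,
but not an exact match to the correct diagnosis.
If any diagnosis meets these criteria, specify its position, starting from 1.
If none of the diagnoses meet these criteria, write -1.
Respond in the following format: Correct diagnosis position: [number]
OBTAINED DIAGNOSES: {ddx_list}
CORRECT DIAGNOSIS: {ground_truth}"""

def evaluate_ddx(ddx_list: list[str], ground_truth: str) -> int:
    """Return the 1-based position of the matching diagnosis, or -1 if there is no match."""
    prompt = EVALUATOR_PROMPT.format(ddx_list=ddx_list, ground_truth=ground_truth)
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"Correct diagnosis position:\s*\[?(-?\d+)\]?", reply)
    return int(match.group(1)) if match else -1
```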

The auto-evaluation method we used for this benchmark might be perceived as a potential weakness. However, recent studies (6, 7) have shown that such methods align well with human ratings in similar contexts.

To ensure the robustness of our approach, we brought human experts into the loop. Three different medical experts manually evaluated the matches between ground truth diagnosis and MedAsk outputs for 100 clinical vignettes. We compared their assessments with our evaluator agent’s results using the matching criteria described above. As shown in Table 2, we found substantial agreement between the human experts and the LLM-based evaluator.

Comparison – Kendall’s Tau-b (p-value)
Overall Agreement – 0.9199 (< 0.001)
Pairwise Agreements:
Evaluator Agent vs Expert 1 – 0.9927 (< 0.001)
Evaluator Agent vs Expert 2 – 0.9544 (< 0.001)
Evaluator Agent vs Expert 3 – 0.9081 (< 0.001)
Expert 1 vs Expert 2 – 0.9617 (< 0.001)
Expert 1 vs Expert 3 – 0.8252 (< 0.001)
Expert 2 vs Expert 3 – 0.8771 (< 0.001)

Table 2: Inter-Rater Agreement for Human Expert Evaluation and Auto-evaluation.
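For readers who want to reproduce the agreement analysis, Kendall’s Tau-b is available directly in SciPy. The ratings below are invented toy numbers, not the actual expert annotations; the snippet only shows the call that produces the kind of tau and p-value reported in Table 2.

```python
from scipy.stats import kendalltau

# Toy ratings for illustration: matched position (1-5) or -1 for "no match" per vignette.
evaluator_ratings = [1, 2, -1, 1, 3, 1, -1, 2]
expert_ratings    = [1, 2, -1, 1, 3, 2, -1, 2]

# scipy's kendalltau computes the tau-b variant by default, which accounts for ties.
tau, p_value = kendalltau(evaluator_ratings, expert_ratings)
print(f"Kendall's tau-b = {tau:.4f}, p = {p_value:.4g}")
```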

MedAsk Performance Compared to Existing Symptom Checkers

We evaluated MedAsk, using GPT-4o as its base model, by running the benchmark five times across all 400 clinical vignettes (results can be found on GitHub). The averaged diagnostic accuracy results were:

  • Top 1 (correct diagnosis as first choice): 58.3%
  • Top 3 (correct diagnosis within first three choices): 78.7%
  • Top 5 (correct diagnosis within first five choices): 82.0%
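The top-1/3/5 figures above follow mechanically from the evaluator’s position outputs. As a rough sketch, assuming one list of positions per benchmark run (with -1 meaning the ground truth never appeared in the DDx), the averaging looks like this:

```python
def topk_accuracy(positions: list[int], k: int) -> float:
    """Fraction of vignettes whose ground truth diagnosis appears within the top k."""
    return sum(1 for p in positions if 0 < p <= k) / len(positions)

# One list of evaluator positions per benchmark run (toy numbers, not the reported results).
runs = [
    [1, 2, -1, 1, 3, 5, -1, 1],
    [1, -1, 2, 1, 4, 5, -1, 1],
]

for k in (1, 3, 5):
    mean_accuracy = sum(topk_accuracy(run, k) for run in runs) / len(runs)
    print(f"Top {k}: {mean_accuracy:.1%}")
```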

One significant advantage of using the vignettes from (5) is that their study already established baseline performance metrics for six different commercial symptom checkers using the same clinical scenarios. While methodological differences exist between our evaluation approach (automated patient simulation and evaluation) and theirs (human-based interactions and assessments; see their study’s methods section for details), we believe the comparison provides valuable insights into MedAsk’s relative performance. Figure 3 shows the comparative results of the five best performing symptom checkers from the study (Avey, Ada, WebMD, K Health and Buoy) and MedAsk.

Figure 3: Diagnostic Accuracy Comparison Between MedAsk and Existing Symptom Checkers Across Top 1, Top 3, and Top 5 Diagnoses.

As shown in Figure 3, MedAsk demonstrated impressive performance compared to existing symptom checkers: ranking second behind Avey in accuracy across all metrics, outperforming Ada, and doubling the accuracy of WebMD, K Health, and Buoy.

This performance is especially noteworthy given MedAsk’s development timeline. While established symptom checkers have benefited from at least seven years of development and refinement, MedAsk achieved these results after just six months. This rapid progress demonstrates the significant potential of LLM-based approaches in symptom assessment. Based on ongoing improvements to our latest version, we anticipate even stronger performance in future evaluations.

Limitations and Future Work

The current implementation of SymptomCheck Bench has several important limitations that should be addressed in future work. First, our reliance on an LLM-based patient simulator, while practical, introduces potential biases that have not been fully characterized. We have yet to conduct systematic comparisons between our simulated patient responses and those from patient actors commonly used in similar studies. Additionally, the impact of different base models for the patient simulator remains unexplored, though a recent study (4) suggests that the choice of language model can significantly affect the nature and quality of simulated interactions.

Second, while MedAsk and similar systems often provide triage recommendations and treatment plans, SymptomCheck Bench does not evaluate these capabilities. Furthermore, the benchmark doesn’t assess important qualitative aspects of the consultation process (8), such as the efficiency of information gathering, demonstration of empathy, or the rate of hallucinations and factual errors in the agent’s responses.

Third, the use of publicly available clinical vignettes raises concerns about data contamination, as some of these scenarios may have been included in the training data of LLMs, potentially leading to inflated performance metrics. The current set of 400 vignettes, while valuable, also provides limited coverage of possible medical conditions (9). Future iterations of the benchmark would benefit from both expanding the vignette database and ensuring the use of novel clinical scenarios that are verifiably absent from LLM training data.

Finally, during the initial stages of the experiment, we performed benchmarking tests of smaller scope on Mixtral, Mistral Large, DeepSeek V2.5, Claude 3 Opus, Claude 3.5 Sonnet, GPT-3.5 and GPT-4o. Given GPT-4o’s superior performance and our financial constraints, we concentrated our extensive evaluation on this model alone. Extensively benchmarking the other models, or newer releases such as o1, could provide valuable insights into the relationship between underlying model capabilities and diagnostic accuracy.

  1. OSCE Definition – https://geekymedics.com/what-is-an-osce/ (Accessed October 20, 2024)
  2. WebMD Symptom Checker – https://symptoms.webmd.com/ (Accessed October 20, 2024)
  3. SymptomCheck Bench GitHub – https://github.com/medaks/symptomcheck-bench
  4. Schmidgall, Samuel, et al. “AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments.” arXiv preprint arXiv:2405.07960 (2024).
  5. Hammoud M, Douglas S, Darmach M, Alawneh S, Sanyal S, Kanbour Y. “Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study.” JMIR AI (2024). https://ai.jmir.org/2024/1/e46875
  6. Tu, Tao, et al. “Towards conversational diagnostic AI.” arXiv preprint arXiv:2401.05654 (2024).
  7. Lan, Tian, et al. “CriticBench: Evaluating large language models as critic.” arXiv preprint arXiv:2402.13764 (2024).
  8. Yao, Zonghai, et al. “MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework.” arXiv preprint arXiv:2410.01553 (2024).
  9. Kopka, Marvin, et al. “Evaluating self-triage accuracy of laypeople, symptom-assessment apps, and large language models: A framework for case vignette development using a representative design approach (RepVig).” medRxiv (2024): 2024-04.

A) Symptom Checker Agent Prompt Instructions

You are a doctor diagnosing through an online chat platform a patient with the following characteristics {DEMOGRAPHICS}.

You will ask the patient concise questions (1-3 sentences at a time) in order to understand their disease. After gathering sufficient information, finish the conversation by writing chosen diagnoses in this format:

DIAGNOSIS READY: [diagnosis1, diagnosis2, diagnosis3, diagnosis4, diagnosis5]

Below is the dialogue history. Provide the doctor’s response.

{DIALOGUE_HISTORY}

 

B) Patient Agent Prompt Instructions

You are a patient with the following background:

DEMOGRAPHICS: {DEMOGRAPHICS}

HISTORY: {HISTORY_OF_ILLNESS}

PRIMARY COMPLAINTS: {PRIMARY_COMPLAINTS}

ADDITIONAL DETAILS: {ABSENT_FINDINGS, PHYSICAL_HISTORY, FAMILY_HISTORY, SOCIAL_HISTORY}

You are visiting a doctor because of your PRIMARY COMPLAINTS. A doctor will ask you questions to diagnose your condition. Provide concise answers of 1-3 sentences, sharing only the relevant information based on your background. If the doctor asks about something not mentioned in the background, simply reply ‘I don’t know.’

Below is the dialogue history. Provide the patient’s response.

{DIALOGUE_HISTORY}

 

C) Evaluator Agent Prompt Instructions

Given a list of differential diagnoses and the correct diagnosis, determine if any of the diagnoses in the list are either an exact match, or very close, but not an exact match to the correct diagnosis.

If any diagnosis meets these criteria, specify its position, starting from 1. If none of the diagnoses meet these criteria, write -1. 

Respond in the following format: Correct diagnosis position: [number]

OBTAINED DIAGNOSES: {DDX_LIST}

CORRECT DIAGNOSIS: {GROUND_TRUTH_DIAGNOSIS}