MedAsk

SymptomCheck Bench Update: Record Results and Disease-Specific Insights

January 22nd, 2025 | Klemen Vodopivec

Introduction

When we introduced SymptomCheck Bench, our goal was to create a transparent and rigorous framework for evaluating LLM-based medical agents, including our LLM-powered symptom checker MedAsk. At that time, we expressed optimism about MedAsk’s potential for future improvements. Today, we are excited to reveal the record-breaking advancements achieved in just three months, along with our first comprehensive analysis of its performance across various disease categories.

Achieving State-of-the-Art Results on SymptomCheck Bench

Through continued development of MedAsk’s custom cognitive architecture, we’ve achieved state-of-the-art results in diagnostic accuracy, surpassing Avey’s previously leading performance in both Top-1 and Top-5 accuracy while also improving our Top-3 accuracy by 6.3 percentage points. Following our established methodology, we ran the benchmark five times across all 400 clinical vignettes. The averaged diagnostic accuracy results were (a minimal scoring sketch follows the list below):

  • Top-1 (correct diagnosis as first choice): 68.0% (up from 58.3%)
  • Top-3 (correct diagnosis within first three choices): 85.0% (up from 78.7%)
  • Top-5 (correct diagnosis within first five choices): 90.6% (up from 82.0%)
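
Below is a minimal sketch of how such Top-k scoring can be computed. The data layout (fields such as "gold" and "predictions") is hypothetical rather than the benchmark’s actual schema, and in practice matching a predicted diagnosis against the reference diagnosis requires fuzzy or LLM-based judging rather than exact string comparison.

```python
from statistics import mean

def top_k_accuracy(results, k):
    """Fraction of vignettes whose reference diagnosis appears in the first k predictions."""
    hits = sum(1 for r in results if r["gold"] in r["predictions"][:k])
    return hits / len(results)

def averaged_accuracy(runs, ks=(1, 3, 5)):
    """Average Top-k accuracy over repeated benchmark runs."""
    return {k: mean(top_k_accuracy(run, k) for run in runs) for k in ks}

# Toy example with two runs of two vignettes each (illustrative data only):
runs = [
    [{"gold": "migraine", "predictions": ["migraine", "tension headache"]},
     {"gold": "appendicitis", "predictions": ["gastroenteritis", "appendicitis"]}],
    [{"gold": "migraine", "predictions": ["cluster headache", "migraine"]},
     {"gold": "appendicitis", "predictions": ["appendicitis"]}],
]
print(averaged_accuracy(runs))  # {1: 0.5, 3: 1.0, 5: 1.0}
```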

Figure 1 presents the updated comparative results between MedAsk and the five leading symptom checkers from our previous blog post (Avey, Ada, WebMD, K Health, and Buoy). The complete results and evaluation data are available on our GitHub repository.

Figure 1: Updated Diagnostic Accuracy Comparison Between MedAsk and Leading Symptom Checkers Across Top-1, Top-3, and Top-5 Diagnoses.

If you’re interested in testing MedAsk, reach out via email at info@medask.tech and we will provide API access for direct evaluation and experimentation.

Performance Breakdown by Disease Category

In this section, we present a detailed analysis of MedAsk’s performance across various disease categories. This type of granular assessment is an integral part of our internal development process, helping us identify both strengths and areas for improvement.

For this analysis, we used the 14 disease categories defined in the original study by Hammoud et al., from which the clinical vignettes were sourced. While their study provided overall category distributions, to our knowledge the source data didn’t include category labels for individual vignettes. To enable this analysis, our team manually categorized each vignette. Our resulting distribution shows slight variations from the original breakdown, which we’ve documented in the Appendix.

Using data from the five benchmark runs discussed earlier, we analyzed MedAsk’s performance across these disease categories. Figure 2 presents the breakdown of Top-5 diagnostic accuracy by category; a brief sketch of how such a breakdown can be computed follows the figure.

Figure 2: Average Top-5 Diagnostic Accuracy by Disease Category for MedAsk.
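
For readers who want to reproduce this kind of breakdown, the sketch below shows one way to compute per-category Top-5 accuracy averaged over runs, assuming each vignette record carries a manually assigned "category" label; the field names are illustrative, not the benchmark’s actual format.

```python
from collections import defaultdict
from statistics import mean

def top5_by_category(runs):
    """Average Top-5 accuracy per disease category across repeated benchmark runs."""
    per_run = []
    for run in runs:
        hits, totals = defaultdict(int), defaultdict(int)
        for r in run:
            totals[r["category"]] += 1
            hits[r["category"]] += r["gold"] in r["predictions"][:5]  # bool counts as 0/1
        per_run.append({c: hits[c] / totals[c] for c in totals})
    # All runs cover the same vignette set, so every category appears in every run.
    categories = {c for scores in per_run for c in scores}
    return {c: mean(s[c] for s in per_run if c in s) for c in categories}
```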

MedAsk demonstrated exceptional performance across most disease categories, with Dermatology achieving a perfect Top-5 diagnostic accuracy of 100%. This result highlights the model’s ability to excel in conditions with distinctive symptom patterns. Similarly, categories such as Ophthalmology (97.8%), Respiratory (97.1%), Neurology (96.5%), and Infectious Diseases (95.7%) were standout performers, indicating strong diagnostic reliability in areas where systematic symptom evaluation is crucial.

Cardiovascular (88.7%), Gastrointestinal (89.8%), and Obstetrics and Gynecology (88.5%) also performed well, though there remains room for optimization in these important domains. Categories such as Hematology (74.8%) and Nephrology (83.2%) showed comparatively lower accuracy, suggesting areas for focused improvement in future iterations.

Looking Ahead

Accurate measurement is the foundation of improvement, and our benchmarking efforts continue to evolve. With MedAsk now setting new performance standards on SymptomCheck Bench, we’re shifting our focus to more challenging comparisons against foundation models such as GPT-4 and Gemini. One of our key upcoming initiatives is a new benchmark to evaluate the accuracy of ICD code assignments for predicted diagnoses. This benchmark will let us measure MedAsk’s precision in mapping diseases to the correct coding standards (a critical aspect for real-world medical applications) and compare it against foundation models.
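
As a hedged illustration of what such a check could look like (the actual benchmark design may differ), one simple scheme is to score a predicted ICD-10 code against a reference code at both the full-code and the three-character category level:

```python
def icd10_match(predicted: str, reference: str) -> str:
    """Toy ICD-10 comparison: exact code, shared category, or mismatch."""
    pred, ref = predicted.strip().upper(), reference.strip().upper()
    if pred == ref:
        return "exact"      # e.g. J45.9 vs J45.9
    if pred[:3] == ref[:3]:
        return "category"   # e.g. J45.0 vs J45.9 both fall under J45 (asthma)
    return "mismatch"

print(icd10_match("J45.0", "J45.9"))  # category
```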

As our results approach saturation on the current vignette set, we recognize the need for more challenging and diverse scenarios. To address this, we are actively exploring synthetic data generation to create a richer and more comprehensive evaluation framework that better mirrors the complexity of real-world medical diagnosis.

The application of large language models in medicine continues to offer profound potential, and every new insight fuels our excitement for the future. If you’re interested in collaborating with us or contributing to this journey, reach out to us at info@medask.tech. We’re eager to connect!

Appendix

Table 1 presents our manual categorization of clinical vignettes compared with the original distribution reported by Hammoud et al. Most categories align closely with the original distribution, with notable differences in Urology (31 vs 14 vignettes) and Nephrology (19 vs 32 vignettes). These discrepancies likely arise from differences in interpreting cases that could be classified under multiple disease categories. Despite these variations, the total number of vignettes remains 400, preserving the overall scope of the benchmark.

Disease Category                 Number of Vignettes (Original)
Hematology                       23 (23)
Cardiovascular                   46 (46)
Neurology                        23 (22)
Endocrine                        19 (20)
Otorhinolaryngology              23 (23)
Gastrointestinal                 43 (44)
Obstetrics and Gynecology        54 (54)
Infectious                       23 (23)
Respiratory                      35 (37)
Orthopedics and Rheumatology     32 (32)
Ophthalmology                    18 (18)
Dermatology                      11 (12)
Urology                          31 (14)
Nephrology                       19 (32)

Table 1: Distribution of Clinical Vignettes Across Disease Categories