MedAsk Outperforms Leading LLMs and Symptom Checkers in Triage Accuracy
March 26th, 2025 | Klemen Vodopivec
Introduction
The primary goal of symptom checker applications like MedAsk is to empower individuals to make informed healthcare decisions and guide them to the most appropriate care settings. To achieve this, it’s essential that the triage advice provided is accurate, clearly indicating whether a user should seek emergency care, contact their general practitioner, or manage with self-care.
On a systemic level, effective digital triage can help alleviate pressure on overcrowded healthcare systems. The potential financial benefit is substantial, with estimates suggesting that redirecting patients to more appropriate care settings could save more than $4 billion annually in the US alone.
At MedAsk, we’re committed to rigorous benchmarking as the foundation for developing safe and reliable medical AI systems. Having previously demonstrated MedAsk’s diagnostic and medical coding accuracy, this article introduces the results of our triage accuracy benchmark.
Methodology
Datasets
We evaluated triage accuracy using two distinct clinical vignette datasets:
Semigran Dataset
Developed by Semigran et al., this dataset is the most commonly used standard for assessing symptom checker triage accuracy. It includes a total of 45 clinical vignettes evenly distributed across three triage categories:
- 15 emergency cases
- 15 non-emergency cases
- 15 self-care cases
These vignettes are open-source and are also available on our GitHub.
Kopka Dataset
This more recent dataset, created by Kopka et al., comprises 45 vignettes drawn from real-world cases sourced from the medical advice subreddit (r/AskDocs). It includes:
- 2 emergency cases
- 30 non-emergency cases
- 13 self-care cases
For copyright reasons, these vignettes are not publicly available; however, the Kopka team kindly shared them with us specifically for the purposes of this benchmark.
The primary difference between the two datasets lies in their construction methods. The Semigran dataset comprises artificial clinical vignettes developed by medical professionals, while the Kopka dataset contains authentic user scenarios derived from actual patient inquiries, potentially providing greater real-world applicability and ecological validity.
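Regardless of how they were constructed, a single benchmark case in either dataset boils down to a clinical scenario plus a ground-truth triage level. As a hypothetical sketch (the field names are ours, not taken from either dataset's release), one record might look like this:

```python
# Hypothetical record for one clinical vignette; field names are
# illustrative and not taken from either dataset's release.
from dataclasses import dataclass

@dataclass
class Vignette:
    text: str         # the clinical scenario presented to the system
    gold_triage: str  # ground truth: "em", "ne", or "sc"
    source: str       # "semigran" or "kopka"
```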
Experiment Overview
We assessed the triage accuracy of three large language models (GPT-4o, o1, and o3-mini) on both clinical vignette datasets using the evaluation method described by Kopka et al. Each vignette was provided to the language models in a one-shot format using a standardized prompt (available in the Appendix). The triage recommendation generated by each model was then compared to the vignette’s ground-truth triage level.
We applied the same evaluation procedure to MedAsk, providing each vignette in a one-shot format and leveraging its custom cognitive architecture to generate a triage recommendation. This recommendation was then compared to the established ground truth.
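As a rough illustration, the snippet below sketches this one-shot evaluation loop against the OpenAI API. The prompt template is the one from the Appendix; the helper names and data layout are our own hypothetical additions, not MedAsk's production code.

```python
# Minimal sketch of the one-shot triage evaluation loop (illustrative).
# Assumes the official OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

PROMPT = "Please give a triage classification for the following vignette.\n\n{vignette}\n\n..."  # full template in the Appendix

def triage(vignette_text: str, model: str = "gpt-4o") -> str:
    """Return the model's one-shot triage label for a single vignette."""
    response = client.chat.completions.create(
        model=model,  # also run with "o1" and "o3-mini"
        messages=[{"role": "user", "content": PROMPT.format(vignette=vignette_text)}],
    )
    return response.choices[0].message.content.strip().lower()

def run_benchmark(cases, model: str = "gpt-4o") -> float:
    """cases: list of (vignette_text, gold_label) pairs, gold in {em, ne, sc}."""
    predictions = [(triage(text, model), gold) for text, gold in cases]
    return sum(pred == gold for pred, gold in predictions) / len(predictions)
```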
Evaluation Criteria
Our evaluation criteria align with established reporting standards for symptom checker triage accuracy research. Specifically, we assessed the following four metrics (a short code sketch of how they are computed follows the list):
- Overall accuracy: The percentage of vignettes correctly triaged across all categories.
- Accuracy by triage level: Evaluated separately for emergency, non-emergency, and self-care categories.
- Safety of advice: Calculated as the proportion of recommendations at or above the correct urgency level.
- Overtriage rate: Measured as the percentage of errors resulting from recommending a higher urgency level than necessary.
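To make these definitions concrete, here is a small illustrative sketch of how the four metrics can be computed from (predicted, ground-truth) label pairs; this is not MedAsk's actual evaluation code, and the urgency ordering sc < ne < em simply follows the triage levels defined above.

```python
# Illustrative computation of the four reported metrics from
# (predicted, ground-truth) triage label pairs.
URGENCY = {"sc": 0, "ne": 1, "em": 2}

def triage_metrics(pairs):
    """pairs: list of (predicted_label, gold_label) tuples."""
    n = len(pairs)
    correct = sum(pred == gold for pred, gold in pairs)
    errors = [(p, g) for p, g in pairs if p != g]

    # Accuracy by triage level: restrict to vignettes with that gold label.
    by_level = {}
    for level in URGENCY:
        subset = [(p, g) for p, g in pairs if g == level]
        by_level[level] = sum(p == g for p, g in subset) / len(subset) if subset else None

    # Safety of advice: recommendations at or above the correct urgency.
    safe = sum(URGENCY[p] >= URGENCY[g] for p, g in pairs)

    # Overtriage rate: share of *errors* that recommended too high an urgency.
    overtriage = sum(URGENCY[p] > URGENCY[g] for p, g in errors)

    return {
        "overall_accuracy": correct / n,
        "accuracy_by_level": by_level,
        "safety_of_advice": safe / n,
        "overtriage_rate": overtriage / len(errors) if errors else 0.0,
    }
```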
Results
We conducted each experiment five times across all models and both vignette sets to ensure statistical reliability. For each metric, we report the mean and standard deviation. Full experimental results are available on our GitHub.
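As a simple illustration of that aggregation (with made-up per-run values):

```python
# Aggregate one metric over the five repeated runs (values are made up).
from statistics import mean, stdev

run_accuracies = [0.82, 0.84, 0.79, 0.86, 0.81]  # hypothetical per-run overall accuracy
print(f"{mean(run_accuracies):.1%} (±{stdev(run_accuracies):.1%})")
```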
By adopting the same methodology as Kopka et al., we are able to directly compare our results with their findings on symptom checkers and layperson performance. For both vignette sets, we compare MedAsk’s results against the three LLMs we tested, as well as the three best-performing symptom checkers and the average layperson scores reported by Kopka et al.
Semigran Vignettes
Figure 1 presents a comparison of the average overall accuracy across MedAsk, the tested LLMs, symptom checkers, and layperson performance on the Semigran vignette dataset. Standard deviation was not available for symptom checker results.

Table 1 provides detailed results for all evaluated metrics; however, for symptom checkers, only overall accuracy data was available.

MedAsk outperformed all tested LLMs, symptom checkers, and human participants on the Semigran dataset, achieving an overall accuracy of 82.7% (±4.0%). It demonstrated exceptional accuracy (96%) in emergency detection and provided the safest triage recommendations (94.2% safety of advice). Notably, MedAsk also achieved the highest self-care accuracy (72%), at the cost of marginally lower non-emergency accuracy (80%).
LLMs tended to default to non-emergency recommendations, resulting in lower emergency accuracy (80%-86.7%) and substantially lower self-care accuracy (33.3%-44%), though they achieved higher non-emergency accuracy (88.2%-94.7%).
Laypeople showed the lowest overall accuracy (60.9%) and lower safety of advice (84.2%), suggesting increased risk of missed emergency cases compared to AI-based methods.
Traditional symptom checkers (NHS111, Ada, and Symptomate) performed notably worse (48%-64% accuracy) than MedAsk and LLMs. Their relatively lower accuracy underscores existing limitations of current rule-based symptom-checker technologies and highlights the potential of LLM-based symptom checkers like MedAsk to deliver more accurate, safer digital triage at scale.
Kopka Vignettes
Figure 2 presents the comparative average accuracy results of MedAsk, the tested LLMs, symptom checkers, and layperson performance on the Kopka vignette dataset. As with the Semigran results, standard deviation data was not available for the symptom checkers.

Table 2 provides detailed performance metrics across all evaluated categories.

MedAsk again outperformed all tested LLMs, symptom checkers, and laypeople, achieving the highest overall accuracy (80.4% ±1.9%). It delivered perfect (100%) accuracy on emergency cases and maintained exceptional safety (96.4% safety of advice). Compared to the standalone LLMs, MedAsk had markedly higher self-care accuracy (60%) while accepting a modest trade-off in non-emergency accuracy (88%).
The language models showed notably lower overall accuracy (61.3%-71.1%) due to extremely low self-care accuracy (1.5%-12.7%), demonstrating an excessive tendency towards caution and substantial overtriage (89.8%-96.3%).
Laypeople again demonstrated lower accuracy (62.4%) and reduced safety (90.7%), underscoring limitations in untrained human judgment.
Interestingly, traditional symptom checkers performed relatively well on this dataset, with accuracies between 73.3% and 80%. However, MedAsk’s consistently balanced accuracy across all triage levels, along with its superior safety profile, underscores its advantage in practical digital triage applications.
Conclusion
Digital triage holds tremendous potential to improve healthcare by reducing costs and easing pressure on medical professionals. However, realizing these benefits requires a careful balance across triage levels. Systems must remain vigilant to avoid missing emergency cases while preventing excessive overtriage, which can overwhelm emergency services and cause unnecessary anxiety. Equally important is accurately identifying self-care cases to avoid burdening healthcare providers when home care would suffice.
At MedAsk, achieving this delicate balance is central to our mission. Our benchmark results demonstrate MedAsk’s superior performance compared to existing LLMs and symptom checkers, particularly in emergency and self-care detection. Despite MedAsk’s exceptional safety profile, we see opportunities for further improvement, especially in accurately recognizing self-care scenarios, which remains a key focus of our ongoing development efforts.
Appendix
The standardized prompt used for all LLM evaluations is reproduced below:
Please give a triage classification for the following vignette.
{vignette}
The criteria for triage classifications are presented below:
- Emergency (em): Call 911 or go directly to the emergency room.
- Non-Emergency (ne): Try to see a doctor, but the symptoms do not require immediate attention at an emergency room. For example, you can see a doctor in the next week.
- Self-care (sc): Let the health issue get better on its own and review the situation again in a few days.
RESPONSE FORMAT:
Provide only one of the following responses:
- `em`
- `ne`
- `sc`
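A model reply in this format can then be normalized before scoring. The following is an illustrative sketch, not MedAsk's actual post-processing:

```python
# Illustrative normalization of a model reply into one of the three labels.
VALID_LABELS = {"em", "ne", "sc"}

def parse_reply(reply: str) -> str | None:
    label = reply.strip().strip("`").lower()
    return label if label in VALID_LABELS else None  # None -> flag for manual review
```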