Medical AI Triage Accuracy 2025: MedAsk Beats OpenAI's o3 & GPT-4.5
July 5th, 2025 | Klemen Vodopivec, Rok Vodopivec
Introduction
A few months ago, we shared the first results from our triage accuracy benchmark. While the initial performance was state-of-the-art, we wanted to push it further.
Today, we’re announcing a major step forward: a five-percentage-point improvement in MedAsk’s triage accuracy (from 82% to 87%), the result of three months of focused development. Alongside this, we’ve enhanced our benchmarking methodology to increase statistical power and reliability.
We’re also excited to open-source our triage benchmark framework, enabling the broader community to evaluate and compare triage systems with transparency.
Methodology
We followed the same methodology as our previous article, with two key additions: enhanced statistical testing for greater power and open-sourcing our benchmark. The benchmark is now available on our GitHub in the TriageBench folder.
Improving Statistical Power: McNemar’s Test
One limitation of working with small datasets (45 vignettes per benchmark) is that it’s difficult to detect statistically significant performance differences between models, especially when improvements are modest but meaningful.
To address this, we implemented a two-sided paired McNemar’s test, which compares the classification outputs of two models on the same set of vignettes. This test focuses specifically on disagreements between model predictions, effectively controlling for vignette-level variability and increasing the sensitivity of comparisons.
By using McNemar’s test, we achieved a 3–4× increase in statistical power compared to our original evaluation approach, which enabled us to detect differences in performance with higher confidence. This added rigor strengthens the validity of our claim that MedAsk significantly outperforms state-of-the-art LLMs in digital triage.
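To make the procedure concrete, here is a minimal sketch of such a paired comparison in Python using statsmodels; the variable names are ours for illustration and are not necessarily the TriageBench API:

```python
# Paired McNemar's test over a shared vignette set (illustrative sketch).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Per-vignette correctness for two models on the same vignettes
# (toy values; in practice these come from the benchmark runs).
model_a_correct = np.array([True, True, False, True, False, True])
model_b_correct = np.array([True, False, False, True, True, False])

# 2x2 table of paired outcomes; only the off-diagonal (discordant)
# cells, where exactly one model is correct, drive the test statistic.
table = np.array([
    [np.sum(model_a_correct & model_b_correct),  np.sum(model_a_correct & ~model_b_correct)],
    [np.sum(~model_a_correct & model_b_correct), np.sum(~model_a_correct & ~model_b_correct)],
])

result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```

Because the concordant cells (both models right or both wrong) cancel out, the test isolates exactly the cases where one system outperformed the other on the same vignette.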
Results
We compared MedAsk’s performance against the best-performing models from the recent Kopka et al. paper. For the Kopka vignette set, we used their published results directly. For the Semigran dataset, we benchmarked the top 5 models ourselves, running each model (and MedAsk) 5 times to ensure reliability.
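In outline, the evaluation loop looks like the following sketch; `model.triage` and the vignette fields are hypothetical stand-ins for the actual TriageBench interfaces:

```python
# Repeated-run evaluation: run each model several times over the
# vignette set and report mean accuracy with its standard deviation.
# `model.triage` and the vignette fields are hypothetical stand-ins.
import statistics

N_RUNS = 5

def evaluate(model, vignettes):
    run_accuracies = []
    for _ in range(N_RUNS):
        correct = sum(
            model.triage(v["presentation"]) == v["gold_triage_level"]
            for v in vignettes
        )
        run_accuracies.append(correct / len(vignettes))
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)
```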
We also applied our enhanced statistical analysis to compare MedAsk against the best-performing model on the Semigran dataset (where we have the complete data) to determine if our improvements are statistically significant.
Semigran Vignettes
Figure 1 presents the comparative average accuracy results of MedAsk and the 5 LLMs we tested on the Semigran vignette dataset.

Table 1 provides detailed performance metrics across all evaluated categories.

MedAsk achieved the highest overall accuracy at 87.6%, a 4.9-percentage-point improvement over our previous benchmarking results. This gain stems primarily from a substantial 16-percentage-point improvement in self-care triage accuracy. The next best-performing model, o4-mini, scored 7.2 percentage points lower at 80.4%.
Statistical Significance Testing
We applied the paired McNemar’s test to compare MedAsk against o4-mini, the second-best performer:
- Discordant pairs: MedAsk correct/o4-mini wrong = 25; MedAsk wrong/o4-mini correct = 9
- P-value: 0.0101
- Odds ratio: 2.78 (MedAsk vs o4-mini)
- Accuracy difference: 7.11 percentage points in favor of MedAsk
These results confirm that MedAsk’s performance advantage is statistically significant (p < 0.05) and unlikely to be due to chance. When the two models disagree on a vignette, MedAsk is nearly three times as likely as o4-mini to be the one giving the correct triage classification.
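For readers who want to verify these figures, they can be reproduced from the discordant counts alone. The reported p-value is consistent with the continuity-corrected chi-square form of McNemar’s test, and the accuracy difference works out if we assume 225 paired classifications (45 vignettes × 5 runs):

```python
# Reproducing the reported statistics from the discordant pair counts.
from scipy.stats import chi2

b, c = 25, 9   # MedAsk-only correct, o4-mini-only correct
n = 225        # assumed total paired classifications (45 vignettes x 5 runs)

stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected McNemar statistic
print(f"chi2 = {stat:.3f}, p = {chi2.sf(stat, df=1):.4f}")   # ~6.618, ~0.0101
print(f"odds ratio = {b / c:.2f}")                           # 25/9 = 2.78
print(f"accuracy difference = {(b - c) / n:.2%}")            # 16/225 = 7.11%
```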
Kopka Vignettes
Figure 2 presents a comparison of overall accuracy between MedAsk and the 5 LLMs tested in the Kopka paper. Standard deviation data is only available for MedAsk results.

Table 2 provides detailed performance metrics across all evaluated categories. However, for the other LLMs, standard deviation, safety of advice, and overtriage metrics were not available.

MedAsk also outperformed all other models on the Kopka dataset, achieving 81.8% overall accuracy. Compared to our previous benchmarking, MedAsk improved by 1.4 percentage points, with this gain coming primarily from a 3.3-percentage-point increase in non-emergency case accuracy. The next best-performing model, o1-mini, scored 8.2 percentage points lower at 73.6%.
Conclusion
At MedAsk, we believe large language models will increasingly serve as the first point of contact for health concerns—a digital front door replacing the instinct to “just Google it.” In this role, accurate triage becomes essential. It’s not just a benchmark metric; it’s about ensuring users are directed to the right level of care, whether that’s self-care at home or immediate emergency attention.
Our latest results show continued, meaningful progress. MedAsk is delivering state-of-the-art triage accuracy, with significant gains in self-care detection while maintaining excellent safety. We’re now setting our sights on crossing the 90% accuracy threshold, a milestone we believe is within reach.
That said, the biggest bottleneck we face isn’t modeling—it’s test data. The current benchmark relies on just 90 vignettes across two datasets. For meaningful progress, the field needs larger, more diverse, and representative clinical vignette sets.
We welcome collaborators who are interested in contributing to or co-developing expanded benchmark datasets. If you’d like to work with us, please reach out.