How MedAsk’s Cognitive Architecture Improves ICD-10 Coding Accuracy
February 14th, 2025 | Klemen Vodopivec, Rok Vodopivec
Introduction
Accurate medical coding is crucial for healthcare operations, directly impacting patient care, billing, and clinical research. While large language models like GPT-4o have shown promise in this area, they still fall short of the precision that medical coding demands.
MedAsk addresses these challenges through our custom cognitive architecture, which we recently introduced in a previous blog post. Built as a specialized second layer on top of foundation models, this architecture enhances the base model’s capabilities while incorporating specific medical coding knowledge and safety measures. In this article, you’ll see our cognitive architecture in action as it improves MedAsk’s ICD-10 coding accuracy by more than 16 percentage points compared to GPT-4o.
Methodology
Dataset
To evaluate ICD-10 coding accuracy, we analyzed all the conditions generated by MedAsk during the first week of January 2025, totaling approximately 1,000 cases. Each condition had an ICD-10 code assigned by the base model (GPT-4o), which served as the baseline for comparison.
Next, we reprocessed the same set of conditions using MedAsk’s custom cognitive architecture, generating a new set of ICD-10 codes. Our final dataset included 948 cases, each containing the following (a minimal sketch of one such record follows the list):
- The condition generated by MedAsk.
- The ICD-10 code assigned by the base model (GPT-4o).
- The ICD-10 code assigned by MedAsk’s cognitive architecture.
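For illustration, one record in this dataset could be represented roughly as follows; the field names are our own for this post, not MedAsk’s internal schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedCondition:
    """One of the 948 evaluation cases (illustrative field names)."""
    condition: str              # condition text generated by MedAsk
    gpt4o_code: Optional[str]   # ICD-10 code assigned by the base model (GPT-4o)
    medask_code: Optional[str]  # ICD-10 code assigned by MedAsk's cognitive architecture
```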
Experiment Overview
We compared the ICD-10 code assignments from both methods against official ICD-10 database entries (we used ICD-10-AM, 11th edition, which is the standard in our country) to determine which approach provides more accurate coding. The experiment followed a three-step process (a condensed code sketch follows the list):
- Take a condition and its assigned ICD-10 code from our dataset.
- Search for the ICD-10 code in the official ICD database.
- Compare the search results with the original condition using the evaluation criteria outlined below.
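A condensed sketch of this loop, assuming two hypothetical helpers: `lookup_icd10(code)`, which returns the official ICD-10-AM entry for a code (or `None` if the code does not exist), and `categorize(condition, entry)`, which applies the evaluation criteria described in the next section:

```python
from collections import Counter

def run_experiment(cases, code_field):
    """Tally match categories for one coding method ("gpt4o_code" or "medask_code")."""
    tallies = Counter()
    for case in cases:
        code = getattr(case, code_field)                 # step 1: condition + assigned code
        entry = lookup_icd10(code) if code else None     # step 2: search the official ICD-10-AM index
        tallies[categorize(case.condition, entry)] += 1  # step 3: compare using the criteria below
    return tallies

# e.g. run_experiment(dataset, "gpt4o_code") and run_experiment(dataset, "medask_code")
```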
Evaluation Criteria
We measured coding accuracy by evaluating how well each condition matched its corresponding ICD-10 database entry. The matching assessment was performed automatically using GPT-4o as an evaluator model (prompt details in the Appendix). We categorized matches into four types (the decision rule is sketched in code after the list):
- Invalid/Non-existent ICD Codes: Codes that don’t exist in the official ICD-10 database.
- Exact Match / Same Condition: Perfect alignment between the condition and ICD code, including accepted terminology variants.
- Partially Related Conditions: Cases where the code represents a related but distinct condition.
- Complete Mismatch: Cases where the assigned code is entirely unrelated to the condition.
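To make the decision rule explicit, here is one way the four categories could be encoded; this is a sketch under the assumptions above, with `classify_match` wrapping the GPT-4o evaluator call shown in the Appendix:

```python
from enum import Enum

class MatchType(Enum):
    INVALID = "invalid"    # null code, or code not found in the official ICD-10 database
    EXACT = "exact"        # same condition, including accepted terminology variants
    PARTIAL = "partially"  # clinically related but distinct condition
    MISMATCH = "mismatch"  # no meaningful clinical relationship

def categorize(condition, entry):
    """Map one condition and its looked-up ICD-10 entry onto the four match types."""
    if entry is None:
        return MatchType.INVALID
    label = classify_match(condition, entry)  # GPT-4o evaluator; prompt and call in the Appendix
    return {"EXACT": MatchType.EXACT,
            "PARTIALLY": MatchType.PARTIAL,
            "MISMATCH": MatchType.MISMATCH}[label]
```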
Results
We conducted the experiment five times across all 948 conditions to ensure statistical reliability and account for any variability in the evaluation process. Figure 1 presents the comparative results between the base model (GPT-4o) and MedAsk’s custom cognitive architecture.

Looking at the figure, we can see that MedAsk achieves substantial improvements across all key metrics (a short sketch of how the per-run results are aggregated follows the list):
- Exact Matches increased from 68.5% (± 0.2%) to 84.7% (± 1.1%), a 16.2 percentage point improvement.
- Partial Matches decreased from 20.4% (± 0.4%) to 9.6% (± 1.4%), indicating fewer imprecise assignments.
- Complete Mismatches were reduced from 7.5% (± 0.5%) to 2.0% (± 0.4%), a significant reduction in serious coding errors.
- Invalid Codes remained relatively stable (3.6% vs 3.7%).
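Figures of this form can be produced from the five per-run tallies with basic descriptive statistics; a minimal sketch follows (whether the ± values above are standard deviations or another spread measure is an assumption on our part here):

```python
from statistics import mean, stdev

def summarize(runs, total=948):
    """runs: the five Counters produced by run_experiment; returns (mean, spread) in percent."""
    summary = {}
    for category in MatchType:
        rates = [100 * run[category] / total for run in runs]
        summary[category.value] = (round(mean(rates), 1), round(stdev(rates), 1))
    return summary
```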
These results show that MedAsk’s cognitive architecture not only improves the rate of exact matches but also significantly reduces the occurrence of serious coding errors. The almost 4x reduction in complete mismatches is particularly noteworthy, as these represent the most problematic type of coding error in clinical settings.
Importantly, there’s a key distinction in how invalid codes are handled: MedAsk’s 3.7% invalid codes were all null values returned when the system wasn’t confident about the correct code, while GPT-4o’s 3.6% invalid codes were hallucinated codes that don’t exist in the ICD-10 database. This reflects our architecture’s built-in safety measures against hallucinations.
The complete results and evaluation dataset are available in our GitHub repository.
Conclusion
Large language models alone are not well suited to medical coding. That is why we have built, and continue to develop, a specialized second layer that builds on the base model’s strengths in diagnostic dialogue while addressing its weaknesses in structured tasks like ICD-10 coding.
The results in this article demonstrate the effectiveness of this approach, with our method achieving 84.7% exact matches compared to the base GPT-4o model’s 68.5%. While this is a major improvement, there is still room for refinement.
The medical field demands exceptional accuracy in coding for patient safety, proper treatment, and accurate billing. We will continue to develop, test, and optimize MedAsk’s cognitive architecture, with the goal of pushing accuracy toward 100% and ensuring that MedAsk becomes an increasingly reliable tool for healthcare professionals.
Appendix
ICD-10 Match Classification Prompt
You are a medical coding assistant tasked with evaluating the relationship between an AI-generated diagnosis and an official ICD-10 diagnosis.
EXAMPLE DIAGNOSIS: {example_diagnosis}
OFFICIAL ICD-10 DIAGNOSIS: {official_diagnosis}
Determine how closely the AI-generated diagnosis (EXAMPLE DIAGNOSIS) aligns with the OFFICIAL ICD-10 DIAGNOSIS. Choose one of the following categories:
- Exact match
– The example diagnosis describes the same medical condition as the official diagnosis
– Terms may vary but clearly refer to the same condition (e.g., “Heart Attack” vs “Myocardial Infarction”)
– Includes cases where specificity levels differ but core condition matches
- Partially related condition
– Conditions are clinically related but not the same (e.g., complications, common comorbidities)
– Includes broader/narrower categories of the same condition family
– Example: “Chronic Sinusitis” vs “Nasal Polyps”
- Complete mismatch
– There is no meaningful clinical relationship between the two terms.
– Example: “Migraine” does not match “Pneumonia.”
RESPONSE FORMAT:
Provide only one of the following responses:
– `EXACT`
– `PARTIALLY`
– `MISMATCH`
Do not provide explanations or additional text—only return the classification.
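For completeness, here is a minimal sketch of how this prompt could be sent to GPT-4o with the OpenAI Python SDK and the label returned to the `categorize` helper shown earlier. `CLASSIFICATION_PROMPT` is a hypothetical constant holding the template above; our actual evaluation pipeline may be configured differently.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# CLASSIFICATION_PROMPT is assumed to hold the template above, with
# {example_diagnosis} and {official_diagnosis} placeholders.
def classify_match(condition: str, official_entry: str) -> str:
    """Return "EXACT", "PARTIALLY", or "MISMATCH" for one condition / ICD-10 entry pair."""
    prompt = CLASSIFICATION_PROMPT.format(
        example_diagnosis=condition,
        official_diagnosis=official_entry,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the classification as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```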