November 23, 2024

AI Falls Short: Large Language Models Struggle With Medical Coding, Study Shows

A study from the Icahn School of Medicine at Mount Sinai shows that current large language models are not yet effective for medical coding, requiring further development and rigorous testing before clinical application. Credit: SciTechDaily.com

Research reveals AI's limitations in medical coding.

Researchers at the Icahn School of Medicine at Mount Sinai have found that state-of-the-art artificial intelligence systems, specifically large language models (LLMs), are poor at medical coding. Their study, recently published in NEJM AI, emphasizes the need for refinement and validation of these technologies before considering clinical implementation.

The study extracted a list of more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, while excluding identifiable patient data. Using the description for each code, the researchers prompted models from OpenAI, Google, and Meta to output the most accurate medical codes. The generated codes were compared with the original codes, and errors were analyzed for any patterns.

Analysis of Model Performance

The investigators reported that all of the studied large language models, including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, showed limited accuracy (below 50 percent) in reproducing the original medical codes, highlighting a significant gap in their usefulness for medical coding. GPT-4 showed the best performance, with the highest exact match rates for ICD-9-CM (45.9 percent), ICD-10-CM (33.9 percent), and CPT codes (49.8 percent).

GPT-4 also produced the highest proportion of incorrectly generated codes that still conveyed the correct meaning. For example, when given the ICD-9-CM description “nodular prostate without urinary obstruction,” GPT-4 generated a code for “nodular prostate,” showcasing its comparatively nuanced understanding of medical terminology.
Even considering these technically correct codes, an unacceptably large number of errors remained.

The next best-performing model, GPT-3.5, had the greatest tendency toward being vague. It had the highest proportion of incorrectly generated codes that were accurate but more general in nature compared to the precise codes. For example, when given the ICD-9-CM description “unspecified adverse effect of anesthesia,” GPT-3.5 generated a code for “other specified adverse effects, not elsewhere classified.”

Importance of Rigorous AI Evaluation

“Our findings underscore the critical need for rigorous evaluation and refinement before deploying AI technologies in sensitive operational areas like medical coding,” says study corresponding author Ali Soroush, MD, MS, Assistant Professor of Data-Driven and Digital Medicine (D3M), and Medicine (Gastroenterology), at Icahn Mount Sinai. “While AI holds great potential, it must be approached with caution and ongoing development to ensure its reliability and efficacy in health care.”

One potential application for these models in the health care industry, say the investigators, is automating the assignment of medical codes for reimbursement and research purposes based on clinical text.

“Previous studies indicate that newer large language models struggle with numerical tasks. However, the extent of their accuracy in assigning medical codes from clinical text had not been thoroughly investigated across different models,” says co-senior author Eyal Klang, MD, Director of the D3M's Generative AI Research Program.
“Therefore, our aim was to assess whether these models could effectively perform the fundamental task of matching a medical code to its corresponding official text description.”

The study authors proposed that integrating LLMs with expert knowledge could automate medical code extraction, potentially improving billing accuracy and reducing administrative costs in health care.

Conclusion and Next Steps

“This study sheds light on the current capabilities and challenges of AI in health care, emphasizing the need for careful consideration and additional refinement prior to widespread adoption,” says co-senior author Girish Nadkarni, MD, MPH, Irene and Dr. Arthur M. Fishberg Professor of Medicine at Icahn Mount Sinai, Director of The Charles Bronfman Institute of Personalized Medicine, and System Chief of D3M.

The researchers caution that the study's artificial task may not fully represent real-world scenarios, where LLM performance could be worse.

Next, the research team plans to develop tailored LLM tools for accurate medical data extraction and billing code assignment, aiming to improve quality and efficiency in health care operations.

Reference: “Large Language Models Are Poor Medical Coders – Benchmarking of Medical Code Querying” by Ali Soroush, Benjamin S. Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman, Alexander W. Charney, Girish N. Nadkarni and Eyal Klang, 19 April 2024, NEJM AI.
DOI: 10.1056/AIdbp2300040

This research was supported by the AGA Research Foundation's 2023 AGA-Amgen Fellowship-to-Faculty Transition Award AGA2023-32-06 and an NIH UL1TR004419 award.
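
For readers who want a concrete picture of the "exact match rate" reported above, the comparison of generated codes against original codes can be sketched in a few lines of Python. This is only an illustration, not the study's actual pipeline: the record layout, the `normalize` and `exact_match_rates` helpers, and the toy records are all assumptions made here, and the article does not say whether the authors normalized case or whitespace before comparing.

```python
from collections import defaultdict


def normalize(code: str) -> str:
    """Trim and uppercase a code so trivial formatting differences are ignored.

    Assumption for this sketch; the study's own comparison rules are not
    described in the article.
    """
    return code.strip().upper()


def exact_match_rates(records):
    """Return the fraction of exact matches per code system.

    `records` is an iterable of (code_system, original_code, generated_code)
    tuples. This tuple layout is hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for system, original, generated in records:
        totals[system] += 1
        if normalize(original) == normalize(generated):
            hits[system] += 1
    return {system: hits[system] / totals[system] for system in totals}


# Hypothetical toy records, not data from the study.
sample = [
    ("ICD-10-CM", "E11.9", "E11.9"),     # exact match
    ("ICD-10-CM", "I10", "I10 "),        # matches after trimming whitespace
    ("ICD-10-CM", "J45.909", "J45.90"),  # near miss, counted as an error
    ("CPT", "99213", "99213"),           # exact match
    ("CPT", "99214", "99215"),           # wrong code
]

rates = exact_match_rates(sample)
for system, rate in sorted(rates.items()):
    print(f"{system}: {rate:.1%} exact match")
```

A metric like this counts only strict agreement. Near misses, such as the more general or equivalent-meaning codes described for GPT-3.5 and GPT-4, would need a separate error analysis.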