Abstract
Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.

Methods: We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified the clinical specialties and tasks addressed in each included article and summarized the evaluation methods used.

Results: Of the 18 735 articles retrieved, 196 met our inclusion criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Among clinical tasks, clinical decision support accounted for the largest proportion of studies (62.2%), while summarization and patient communication accounted for the smallest, at 5.6% and 5.1%, respectively. GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.

Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.