A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Hao Guan; Peter C Hou; Pengyu Hong; Liqin Wang; Wenyu Zhang; Xinsong Du; Zhengyang Zhou; Li Zhou

Journal article

Peer reviewed

AMIA ... Annual Symposium proceedings, Vol.2024, p.383

2024

PMID: 41726532

Subjects: Diagnostic Errors - classification; Humans; Natural Language Processing; Radiology Information Systems

Abstract

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.
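
The abstract describes CREST as a metric that weights errors by clinical risk level, but it does not state the formula. The sketch below only illustrates the general idea of a risk-weighted error score and is not the published CREST definition. The numeric weights, the per-report normalization, and the function name `risk_weighted_error_score` are all assumptions made for illustration; the authors' actual computation is in the linked repository.

```python
from collections import Counter

# Hypothetical weights: the abstract names three risk levels (low, medium, high)
# but does not give numeric values for them.
RISK_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}


def risk_weighted_error_score(error_risk_levels, num_reports, weights=RISK_WEIGHTS):
    """Return the average risk-weighted error count per report.

    Illustrative only: this is an assumed formulation, not the published CREST formula.
    `error_risk_levels` is an iterable with one risk label per annotated error.
    """
    if num_reports <= 0:
        raise ValueError("num_reports must be positive")
    counts = Counter(error_risk_levels)
    unknown = set(counts) - set(weights)
    if unknown:
        raise ValueError(f"unknown risk levels: {sorted(unknown)}")
    # Each error contributes the weight of its risk level.
    total = sum(weights[level] * n for level, n in counts.items())
    return total / num_reports


# Four annotated errors found across two generated reports.
score = risk_weighted_error_score(["low", "high", "high", "medium"], num_reports=2)
print(score)
```

Under these assumed weights, a model that makes a few high-risk errors can score worse than one that makes more low-risk errors, which is the kind of safety-oriented distinction the abstract attributes to a risk-aware metric.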

Details

Title: A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric
Creators: Hao Guan - Harvard Medical School
Peter C Hou - Harvard Medical School
Pengyu Hong - Brandeis University
Liqin Wang - Harvard Medical School
Wenyu Zhang - Geisel School of Medicine at Dartmouth, Lebanon, NH
Xinsong Du - Harvard Medical School
Zhengyang Zhou - Brandeis University
Li Zhou - Harvard Medical School
Publication Details: AMIA ... Annual Symposium proceedings, Vol.2024, p.383
Identifiers: 9924588746701921
Academic Unit: Michtom School of Computer Science; Benjamin and Mae Volen National Center for Complex Systems
Language: English
Resource Type: Journal article
