Interactive active learning for literature screening: finetuning GPT with DeepSeek reasoning for cross-domain generalization
Journal article   Peer reviewed


Yiming Li, Joseph M Plasek, Xinsong Du, Yifei Wang, Zhengyang Zhou, John Lian, Ya-Wen Chuang, Pengyu Hong, Peter C Hou and Li Zhou
Journal of the American Medical Informatics Association : JAMIA
03/09/2026
Handle:
https://hdl.handle.net/10192/79244
PMID: 41801982

Abstract

Keywords: generative pre-trained transformer (GPT); active learning; DeepSeek; supervised learning; reasoning; literature screening
Automated literature screening in biomedical research is often hindered by domain shifts and scarce labeled data, which limit model accuracy and generalizability. While large language models (LLMs) perform well in zero-shot settings, they often fail to capture complex, domain-specific reasoning patterns. To address this limitation, this study investigates whether an interactive, weakly supervised learning framework that combines the fine-tuning adaptability of GPT (generative pre-trained transformer) models with DeepSeek's reasoning capabilities can improve literature screening performance across biomedical domains. We developed an active learning framework that leverages model disagreement between GPT-4o and DeepSeek to improve literature screening performance. The process began with a labeled corpus of 6331 articles on large language models, from which a model disagreement analysis identified cases where GPT-4o misclassified an article and DeepSeek produced the correct prediction. Three GPT variants (GPT-4o, GPT-4o-mini, and GPT-4.1-nano) were fine-tuned under standard supervised learning settings using these disagreement-based samples. Fine-tuning prompts incorporated classification labels and, when available, rationale traces generated by DeepSeek to provide reasoning-augmented weak supervision. Model performance was evaluated on an independent benchmark set of 291 annotated articles across 10 topic queries in cancer immunotherapy and LLMs in medicine, using standard evaluation metrics with recall as the primary measure. Fine-tuning GPT models on disagreement-based examples significantly improved performance. GPT-4o-mini achieved the best overall results after fine-tuning, with the highest F1 score (0.93, P < .001) and recall (0.95, P < .001). Across the biomedical topics, fine-tuned models consistently outperformed their zero-shot counterparts without increasing reviewer workload.
These findings demonstrate the effectiveness of disagreement-driven active learning in enhancing GPT-based biomedical literature screening. Lightweight models such as GPT-4o-mini benefit most from targeted, reasoning-enriched training, highlighting their suitability for scalable deployment. This study introduces an interactive active learning framework that leverages fine-tuned LLMs with reasoning capabilities to enhance literature screening. The approach offers a scalable path to more efficient and reliable information retrieval in systematic reviews.
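The disagreement-based selection step described in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Prediction` record, field names, and output format are assumptions; it only shows the stated idea of keeping cases where GPT-4o's zero-shot label was wrong but DeepSeek's was right, and attaching DeepSeek's rationale (when available) as reasoning-augmented weak supervision for fine-tuning.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    article_id: str
    gold_label: int          # human-annotated relevance label
    gpt_label: int           # zero-shot GPT-4o prediction
    deepseek_label: int      # DeepSeek prediction
    deepseek_rationale: str  # reasoning trace; empty string if unavailable

def select_disagreement_samples(preds):
    """Keep cases where GPT-4o was wrong and DeepSeek was right,
    and build fine-tuning targets from the label plus, when present,
    DeepSeek's rationale as weak supervision (hypothetical format)."""
    selected = []
    for p in preds:
        if p.gpt_label != p.gold_label and p.deepseek_label == p.gold_label:
            target = str(p.gold_label)
            if p.deepseek_rationale:
                target = f"{p.deepseek_rationale}\nLabel: {p.gold_label}"
            selected.append({"article_id": p.article_id, "completion": target})
    return selected
```

Under this sketch, agreement cases and cases where GPT-4o was already correct contribute nothing, so the fine-tuning set concentrates on exactly the examples where DeepSeek's reasoning adds information.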

