Scholarship list
Journal article
OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive
Published 03/14/2026
Proceedings of the ... AAAI Conference on Artificial Intelligence, 40, 46, 39249 - 39258
The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents call for more advanced methods and models, tailored to specific data types and detailed annotations, to ensure precise and professional analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k testing documents. From each document, we extract rich multimodal information, including textual content, visual elements, and layout structures, to capture a comprehensive range of features. Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs. Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries. Additionally, we incorporate page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided. Preliminary results indicate that our AI assistant improves performance on document information extraction and question-answering tasks.
Journal article
Revisit, Extend, and Enhance Hessian-Free Influence Functions
Published 2026
Transactions on Machine Learning Research, 2026-
Preprint
Published 10/04/2025
arXiv (Cornell University)
Data-centric learning seeks to improve model performance from the perspective of data quality, and has been drawing increasing attention in the machine learning community. Among its key tools, influence functions provide a powerful framework to quantify the impact of individual training samples on model predictions, enabling practitioners to identify detrimental samples and retrain models on a cleaner dataset for improved performance. However, most existing work focuses on the question: "what data benefits the learning model?" In this paper, we take a step further and investigate a more fundamental question: "what is the performance ceiling of the learning model?" Unlike prior studies that primarily measure improvement through overall accuracy, we emphasize category-wise accuracy and aim for Pareto improvements, ensuring that every class benefits rather than allowing tradeoffs where some classes improve at the expense of others. To address this challenge, we propose category-wise influence functions and introduce an influence vector that quantifies the impact of each training sample across all categories. Leveraging these influence vectors, we develop a principled criterion to determine whether a model can still be improved, and further design a linear programming-based sample reweighting framework to achieve Pareto performance improvements. Through extensive experiments on synthetic datasets and on vision and text benchmarks, we demonstrate the effectiveness of our approach in estimating and achieving a model's performance improvement across multiple categories of interest.
Conference proceeding
Data-Centric Physics-Informed Graph Neural Networks for Ultra-Fast Power Flow Analysis
Published 07/27/2025
IEEE Power & Energy Society General Meeting, 1 - 5
As the planning and operation of modern power systems typically require repetitive power flow analysis on numerous scenarios, deep learning (DL) applications to power flow analysis have gained momentum. However, DL models require a large volume of training data to achieve satisfactory performance. In power systems, it is very challenging to collect enough training samples from the real world, so numerical simulations must be performed to generate them. While much effort has been devoted to improving DL model design, little attention has been paid to the curation of training data. To tackle this challenge and reduce data curation cost, we propose a data-centric approach that selects a small number of beneficial samples to boost DL performance. It uses and improves influence functions to estimate the influence of training samples on DL performance with high computational efficiency, then uses this information to guide the generation of beneficial data samples. The proposed data-centric learning approach is implemented on a physics-informed graph neural network (GNN) model for power flow analysis. Simulation results on the IEEE 300-bus test power system demonstrate the effectiveness of our proposed method over the conventional approach.
Journal article
Using consumption emotional features to predict web-show viewership
Published 04/30/2025
Journal of the Academy of Marketing Science
Today an increasing number of TV shows and movies are released on online video streaming platforms. This study proposes a forecasting modeling framework that uses measures of a show's consumption emotional features, or viewer sentiments triggered by the show's production emotional features such as plot, as predictors to forecast a web show's viewership. Our forecasting modeling framework has three components: feature construction, feature selection through in-sample prediction, and out-of-sample forecasting. In feature construction, we take advantage of the increasingly popular live commenting function in video streaming, which allows viewers to post spontaneous, visceral comments while watching. We utilize machine learning techniques to process the voluminous, unstructured live comment data to form "emotion waves," which depict the evolution in viewers' moment-to-moment sentiments throughout the show. We characterize emotion waves to form measures of consumption emotional features. We separately characterize positive and negative emotion waves, as well as their relative positions, and also separately characterize emotion waves in different narrative segments of a show. In feature selection, we use an in-sample prediction model to verify our proposed measures and use only key measures with significant impacts to build the forecasting model. Lastly, in out-of-sample forecasting, we show that a small number of key measures formed over a small sample of live comments available shortly after a show's release can effectively forecast the show's viewership accumulated in an extended period after its release.
Journal article
Interpretable Novel Target Discovery Through Open-Set Domain Adaptation
Published 03/17/2025
ACM Transactions on Multimedia Computing, Communications, and Applications
Open-set domain adaptation (OSDA) considers a special domain adaptation problem in which the target domain contains novel categories that never appear in the well-labeled source domain. Unfortunately, prior efforts on OSDA simply detect and recognize all novel categories as one “unknown” group without further exploration. The demand for exploring these novel categories prompts us to consider the underlying multi-class structure and semantic description of those unknown categories in more detail. In this paper, we propose a novel interpretable framework to accurately identify the seen categories in the target domain and effectively recover the semantic knowledge of the unseen categories with attributes and visual interpretations, which is referred to as Semantic Recovery Open-Set Domain Adaptation (SR-OSDA). Specifically, the proposed framework includes an explicit attribute explainable module and an implicit semantic interpretable module, which provide insight into the process of domain adaptation and the discovery of new categories. Furthermore, structure-preserving partial alignment is developed as a method of recognizing and aligning the visible categories across domains with the aid of domain-invariant feature learning. The visual-structural semantic attributes propagation is designed to provide smooth transitions from seen categories to unseen categories via visual-semantic mapping. Three new cross-domain SR-OSDA benchmarks are constructed in order to evaluate the proposed framework in novel and practical challenges. Experimental results and empirical analysis of our proposed solution to open-set recognition and semantic recovery demonstrate its superiority over other state-of-the-art solutions. Our source code is available at https://github.com/scottjingtt/XSROSDA.
Conference proceeding
Published 2025
Proceedings of Machine Learning Research - International Conference on Machine Learning, ICML 2025, 267, 10334 - 10353
Journal article
Graph-learning-assisted state estimation using sparse heterogeneous measurements
Published 10/2024
Electric Power Systems Research, 235, 110644
Unlike transmission systems, distribution systems historically lack enough measurements, making their real-time monitoring almost impossible. Recent deployment of diverse types of devices such as phasor measurement units (PMUs), smart meters, solar inverters and weather information sensors opens up new ways of monitoring these systems, with the assistance of customized machine learning (ML) applications. The paper describes a grid-model-informed ML tool that integrates heterogeneous data streams and creates synchronous measurement snapshots for use by a hybrid robust state estimator (SE), which provides not only accurate state estimates but also real-time feedback for ML model refinement. Improved monitoring performance from the developed computational framework is observed in simulated scenarios on an electric utility’s distribution system.
• A new framework for monitoring distribution systems is proposed.
• A robust and scalable state estimator is developed.
• A new method is proposed that can pinpoint line outages in distribution systems based on sparse PMU measurements.
Preprint
Salutary Labeling with Zero Human Annotation
Posted to a preprint site 05/27/2024
Active learning strategically selects informative unlabeled data points and queries their ground truth labels for model training. The prevailing assumption underlying this machine learning paradigm is that acquiring these ground truth labels will optimally enhance model performance. However, this assumption may not always hold true or maximize learning capacity, particularly considering the costly labor annotations required for ground truth labels. In contrast to traditional ground truth labeling, this paper proposes salutary labeling, which automatically assigns the most beneficial labels to the most informative samples without human annotation. Specifically, we utilize the influence function, a tool for estimating sample influence, to select newly added samples and assign their salutary labels by choosing the category that maximizes their positive influence. This process eliminates the need for human annotation. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our salutary labeling approach over traditional active learning strategies. Additionally, we provide several in-depth explorations and practical applications of large language model (LLM) fine-tuning.
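The label-selection idea in the abstract above can be illustrated with a small sketch. This is not the paper's estimator: it assumes a binary logistic regression model and replaces the influence function with a first-order stand-in (the inverse Hessian is treated as the identity), so a candidate label is scored by how well its loss gradient aligns with the mean validation gradient. All function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad(w, x, y):
    """Gradient of the logistic loss w.r.t. w for one sample (x, y), y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def mean_validation_grad(w, val_x, val_y):
    """Average logistic-loss gradient over a small validation set."""
    grads = [logistic_grad(w, x, y) for x, y in zip(val_x, val_y)]
    return [sum(col) / len(grads) for col in zip(*grads)]


def pick_label_by_gradient_alignment(w, x, val_x, val_y, labels=(0, 1)):
    """Pick the label whose gradient step is predicted to lower validation loss most.

    Assumption: a gradient step on (x, label) changes the validation loss by
    roughly -lr * <g_val, g_sample>, so a larger dot product means more benefit.
    This identity-Hessian shortcut only stands in for a full influence estimate.
    """
    g_val = mean_validation_grad(w, val_x, val_y)

    def score(label):
        g = logistic_grad(w, x, label)
        return sum(a * b for a, b in zip(g_val, g))

    return max(labels, key=score)


# Toy usage: one validation point with label 1 on the positive side of axis 0.
w = [0.0, 0.0]
val_x, val_y = [[1.0, 0.0]], [1]
chosen_same_side = pick_label_by_gradient_alignment(w, [1.0, 0.0], val_x, val_y)
chosen_other_side = pick_label_by_gradient_alignment(w, [-1.0, 0.0], val_x, val_y)
```

In this toy setup, an unlabeled point on the same side as the validation example is assigned label 1 and a point on the opposite side is assigned label 0, with no ground-truth label consulted for either.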
Preprint
On the Inflation of KNN-Shapley Value
Posted to a preprint site 05/24/2024
Shapley value-based data valuation methods, originating from cooperative game theory, quantify the usefulness of each individual sample by considering its contribution to all possible training subsets. Despite their extensive applications, these methods encounter the challenge of value inflation: while samples with negative Shapley values are detrimental, some with positive values can also be harmful. This challenge prompts two fundamental questions: whether zero is a suitable threshold for distinguishing detrimental from beneficial samples, and how to determine an appropriate threshold. To address these questions, we focus on KNN-Shapley and propose Calibrated KNN-Shapley (CKNN-Shapley), which calibrates zero as the threshold to distinguish detrimental samples from beneficial ones by mitigating the negative effects of small-sized training subsets. Through extensive experiments, we demonstrate the effectiveness of CKNN-Shapley in alleviating data valuation inflation, detecting detrimental samples, and assessing data quality. We also extend our approach beyond conventional classification settings, applying it to diverse and practical scenarios such as learning with mislabeled data, online learning with stream data, and active learning for label annotation.
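For context on the baseline that CKNN-Shapley calibrates, the following is a minimal sketch of the standard closed-form KNN-Shapley recursion for an unweighted KNN classifier and a single test point. It does not implement the calibration described in the abstract above; the function name `knn_shapley_single` and the toy data are illustrative assumptions.

```python
def knn_shapley_single(train_x, train_y, test_x, test_y, k):
    """Exact Shapley values of training points for one test point under the
    unweighted KNN utility (fraction of the k nearest neighbors whose label
    matches the test label). Returns values in the original training order."""
    n = len(train_x)

    def sq_dist(i):
        # Squared Euclidean distance from training point i to the test point.
        return sum((a - b) ** 2 for a, b in zip(train_x[i], test_x))

    # Training indices sorted from nearest to farthest.
    order = sorted(range(n), key=sq_dist)
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Base case: the farthest point.
    values[order[n - 1]] = match[n - 1] / n
    # Recurse from the second-farthest point toward the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-indexed rank of the point at this position
        values[order[pos]] = (
            values[order[pos + 1]]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )
    return values


# Toy usage: three 1-D training points, test point at the origin with label 1.
train_x = [[0.3], [0.1], [0.2]]
train_y = [1, 1, 0]
values = knn_shapley_single(train_x, train_y, [0.0], 1, k=2)
```

In this toy example the middle neighbor carries the wrong label and receives a negative value, while the values sum to the KNN utility of the full training set. Zero as a cut-off is exactly what the abstract questions, since a positive value under this baseline does not guarantee that a sample is beneficial.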