Scholarship list
Journal article
OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive
Published 03/14/2026
Proceedings of the ... AAAI Conference on Artificial Intelligence, 40, 46, 39249 - 39258
The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents call for more advanced methods and models, tailored to specific data types and detailed annotations, to ensure precise and professional analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k for testing. From each document, we extract rich multimodal information, including textual content, visual elements, and layout structures, to capture a comprehensive range of features. Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs. Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries. Additionally, we incorporate page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided. Preliminary results indicate that our AI assistant improves performance on document information extraction and question-answering tasks.
Journal article
Revisit, Extend, and Enhance Hessian-Free Influence Functions
Published 2026
Transactions on Machine Learning Research, 2026-
Conference proceeding
Data-Centric Physics-Informed Graph Neural Networks for Ultra-Fast Power Flow Analysis
Published 07/27/2025
IEEE Power & Energy Society General Meeting, 1 - 5
As the planning and operation of modern power systems typically require repetitive power flow analysis on numerous scenarios, deep learning (DL) applications to power flow analysis have gained momentum. However, DL models require a large volume of training data to achieve satisfactory performance. In power systems, it is very challenging to collect enough training samples from the real world, so numerical simulations must be performed to generate enough training samples. While much effort has been devoted to the improvement of DL model design, little attention has been paid to the curation of training data. To tackle this challenge and save data curation cost, we propose a data-centric approach that selects a small number of beneficial samples to boost DL performance. It applies and improves the influence function to estimate, with high computational efficiency, the influence of training data samples on DL performance, then uses this information to guide the generation of beneficial data samples. The proposed data-centric learning approach is implemented on a physics-informed graph neural network (GNN) model for power flow analysis. Simulation results on the IEEE 300-bus test power system demonstrate the effectiveness of our proposed method over the traditional approach.
Journal article
Using consumption emotional features to predict web-show viewership
Published 04/30/2025
Journal of the Academy of Marketing Science
Today an increasing number of TV shows and movies are released on online video streaming platforms. This study proposes a forecasting modeling framework that uses measures of a show's consumption emotional features, or viewer sentiments triggered by the show's production emotional features such as plot, as predictors to forecast a web show's viewership. Our forecasting modeling framework has three components: feature construction, feature selection through in-sample prediction, and out-of-sample forecasting. In feature construction, we take advantage of the increasingly popular live commenting function in video streaming, which allows viewers to post spontaneous, visceral comments while watching. We utilize machine learning techniques to process the voluminous, unstructured live comment data to form "emotion waves," which depict the evolution in viewers' moment-to-moment sentiments throughout the show. We characterize emotion waves to form measures of consumption emotional features. We separately characterize positive and negative emotion waves, as well as their relative positions, and also separately characterize emotion waves in different narrative segments of a show. In feature selection, we use an in-sample prediction model to verify our proposed measures and use only key measures with significant impacts to build the forecasting model. Lastly, in out-of-sample forecasting, we show that a small number of key measures formed over a small sample of live comments available shortly after a show's release can effectively forecast the show's viewership accumulated in an extended period after its release.
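The "emotion wave" idea in the abstract above can be illustrated with a minimal sketch. It assumes each live comment has already been given a sentiment score in [-1, 1] by some upstream model; the abstract does not specify the scoring method, the bin width, or the averaging rule, so the function name `emotion_waves` and all of its parameters are hypothetical.

```python
def emotion_waves(comments, bin_seconds, duration_seconds):
    """Aggregate timestamped sentiment scores into positive and negative waves.

    comments: iterable of (timestamp_seconds, score) pairs, score in [-1, 1].
    The scores are assumed to come from a separate sentiment model.
    Returns two lists (positive, negative), one value per time bin.
    """
    # Ceiling division so a partial final bin is still counted.
    n_bins = max(1, -(-duration_seconds // bin_seconds))
    pos_sum = [0.0] * n_bins
    neg_sum = [0.0] * n_bins
    counts = [0] * n_bins

    for t, score in comments:
        # Clamp late timestamps into the last bin.
        b = min(int(t // bin_seconds), n_bins - 1)
        counts[b] += 1
        if score > 0:
            pos_sum[b] += score
        elif score < 0:
            neg_sum[b] += -score  # store negative intensity as a positive number

    # Average intensity per comment in each bin; empty bins are reported as 0.0.
    positive = [s / c if c else 0.0 for s, c in zip(pos_sum, counts)]
    negative = [s / c if c else 0.0 for s, c in zip(neg_sum, counts)]
    return positive, negative


# Hypothetical comments for a 90-second clip, binned every 30 seconds.
sample = [(5, 0.8), (20, -0.4), (25, 0.2), (70, -1.0)]
positive_wave, negative_wave = emotion_waves(sample, 30, 90)
```

Keeping the positive and negative waves separate mirrors the abstract's statement that the two are characterized separately; summary measures (peaks, slopes, relative positions) would be computed from these lists in a later step.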
Journal article
Interpretable Novel Target Discovery Through Open-Set Domain Adaptation
Published 03/17/2025
ACM transactions on multimedia computing, communications, and applications
Open-set domain adaptation (OSDA) considers a special domain adaptation problem in which the target domain contains novel categories that never appear in the well-labeled source domain. Unfortunately, prior efforts on OSDA simply detect and recognize all novel categories as one “unknown” group without further exploration. The demand for exploring these novel categories prompts us to consider the underlying multi-class structure and semantic description of those unknown categories in more detail. In this paper, we propose a novel interpretable framework to accurately identify the seen categories in the target domain and effectively recover the semantic knowledge of the unseen categories with attributes and visual interpretations, which is referred to as Semantic Recovery Open-Set Domain Adaptation (SR-OSDA). Specifically, the proposed framework includes an explicit attribute explainable module and an implicit semantic interpretable module, which provide insight into the process of domain adaptation and the discovery of new categories. Furthermore, structure-preserving partial alignment is developed as a method of recognizing and aligning the visible categories across domains with the aid of domain-invariant feature learning. The visual-structural semantic attributes propagation is designed to provide smooth transitions from seen categories to unseen categories via visual-semantic mapping. Three new cross-domain SR-OSDA benchmarks are constructed in order to evaluate the proposed framework in novel and practical challenges. Experimental results and empirical analysis of our proposed solution to open-set recognition and semantic recovery demonstrate its superiority over other state-of-the-art solutions. Our source code is available at https://github.com/scottjingtt/XSROSDA.
Conference proceeding
Published 2025
Proceedings of Machine Learning Research - International Conference on Machine Learning, ICML 2025, 267, 10334 - 10353
Journal article
Graph-learning-assisted state estimation using sparse heterogeneous measurements
Published 10/2024
Electric power systems research, 235, 110644
Unlike transmission systems, distribution systems historically lack enough measurements, making their real-time monitoring almost impossible. Recent deployment of diverse types of devices such as phasor measurement units (PMUs), smart meters, solar inverters and weather information sensors opens up new ways of monitoring these systems, with the assistance of customized machine learning (ML) applications. The paper describes a grid-model-informed ML tool that integrates heterogeneous data streams and creates synchronous measurement snapshots to be used by a hybrid robust state estimator (SE), which provides not only accurate state estimates but also real-time feedback for ML model refinement. Improved monitoring performance due to the developed computational framework is observed experimentally in simulated scenarios on an electric utility’s distribution system.
• A new framework for monitoring distribution systems is proposed.
• A robust and scalable state estimator is developed.
• A new method that can pinpoint line outages in distribution systems based on sparse PMU measurements is presented.
Journal article
Ultra-Short-Term Forecasting of Large Distributed Solar PV Fleets Using Sparse Smart Inverter Data
Published 04/16/2024
IEEE transactions on sustainable energy, 1 - 13
Ultra-short-term power forecasting for distributed solar photovoltaic (PV) generation is a largely unaddressed, highly challenging problem due to the prohibitive real-time data collection and processing requirements for the sheer number of distributed PV units. In this paper, we propose an innovative idea of forecasting the power output of a large fleet of distributed PV units using limited real-time data of a sparsely selected set of PV units, referred to as pilot units. We develop a two-stage method to address this problem. In the planning stage, we use the K-medoids clustering algorithm to select pilot units for the installation of real-time remote monitoring infrastructure. In the operation stage, we devise a deep learning framework integrating Long Short-Term Memory, Graph Convolutional Network, and Multilayer Perceptron components to capture the spatio-temporal power generation patterns between pilot units and other units, and forecast the power outputs of all units in a large PV fleet using the real-time data from the few selected pilot units only. Case study results show that our proposed method outperforms all baseline methods in forecasting the power outputs of individual PV units as well as the whole PV fleet, and the forecasting time resolution is not dependent on that of weather data.
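The planning stage described above, choosing pilot units with K-medoids, can be sketched as follows. This is only an illustration under stated assumptions: each PV unit is represented by a hypothetical feature vector (plain 2-D coordinates here), Euclidean distance is used, and the medoids are initialized with the first k units. The abstract does not say which features, distance, or initialization the authors use, and `select_pilot_units` is a made-up name.

```python
from math import dist


def select_pilot_units(features, k, max_iter=100):
    """Alternating K-medoids; returns the indices of the k medoid units.

    features: list of equal-length numeric tuples, one per PV unit
              (hypothetical stand-in for whatever describes a unit).
    """
    medoids = list(range(k))  # assumption: deterministic start with the first k units
    for _ in range(max_iter):
        # Assignment step: attach every unit to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(features):
            nearest = min(medoids, key=lambda m: dist(p, features[m]))
            clusters[nearest].append(i)

        # Update step: within each cluster, pick the member with the
        # smallest total distance to all other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can only happen with duplicate points
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dist(features[c], features[j]) for j in members),
            )
            new_medoids.append(best)

        if sorted(new_medoids) == sorted(medoids):
            break  # converged: medoids did not change
        medoids = new_medoids
    return sorted(medoids)


# Six hypothetical units forming two well-separated groups.
units = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
pilots = select_pilot_units(units, k=2)
```

Unlike K-means centroids, medoids are always actual members of the data set, which is what makes the method a natural fit for choosing real units to instrument.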
Journal article
Global-Local Consistency Constrained Deep Embedded Clustering for Hyperspectral Band Selection
Published 10/18/2023
IEEE access, 1 - 1
Hyperspectral band selection plays a key role in overcoming the curse of dimensionality in the classification of hyperspectral remote sensing images (HSIs). Recently, clustering-based band selection methods have demonstrated great potential to select informative and representative bands for hyperspectral classification tasks. However, most clustering-based methods perform clustering directly on the original high-dimensional data, which reduces their performance. To address this problem, a novel band selection method called global-local consistency constrained deep embedded clustering (GLC-DEC) is proposed in this paper. In GLC-DEC, to simultaneously learn the low-dimensional embedded representation and cluster assignments of all bands in an HSI, the stacked autoencoder is integrated with the K-means method. In addition, to reduce the adverse impact of the limited number of training samples available in HSIs, local and global consistency constraints are imposed on the embedded representation so that a discriminatively consistent representation of all bands is learned. Specifically, local graph regularization and global graph regularization are introduced into the GLC-DEC model, by which the strong correlation between neighboring bands and the manifold structure of all bands are fully exploited. Based on the clustering results provided by GLC-DEC, a group of representative bands is selected by using the minimum noise method. Experimental results on two real datasets demonstrate that the proposed GLC-DEC outperforms several state-of-the-art methods.
Journal article
Second-Order Unsupervised Feature Selection via Knowledge Contrastive Distillation
Published 09/01/2023
IEEE transactions on pattern analysis and machine intelligence, 1 - 11
Unsupervised feature selection aims to select a subset from the original features that are most useful for the downstream tasks without external guidance information. While most unsupervised feature selection methods focus on ranking features based on the intrinsic properties of data, most of them do not pay much attention to the relationships between features, which often leads to redundancy among the selected features. In this paper, we propose a two-stage Second-Order unsupervised Feature selection via knowledge contrastive disTillation (SOFT) model that incorporates the second-order covariance matrix with the first-order data matrix for unsupervised feature selection. In the first stage, we learn a sparse attention matrix that can represent second-order relations between features by contrastively distilling the intrinsic structure. In the second stage, we build a relational graph based on the learned attention matrix and perform graph segmentation. To this end, we conduct feature selection by only selecting one feature from each cluster to decrease the feature redundancy. Experimental results on 12 public datasets show that SOFT outperforms classical and recent state-of-the-art methods, which demonstrates the effectiveness of our proposed method. Moreover, we also provide rich in-depth experiments to further explore several key factors of SOFT.