Scholarship list
Journal article
Can Machine Learning Target Health Care Fraud? Evidence From Medicare Hospitalizations
Published 2026
Journal of policy analysis and management, 45, 1, n/a
The United States spends more than $4 trillion per year on health care, largely conducted by private providers and reimbursed by insurers. A major concern in this system is overbilling and fraud by hospitals, who face incentives to misreport their claims to receive higher payments. In this work, we develop novel machine learning tools to identify hospitals that overbill insurers, which can be used to guide investigations and auditing of suspicious hospitals for both public and private health insurance systems. Using large‐scale claims data from Medicare, the US federal health insurance program for the elderly and disabled, we identify patterns consistent with fraud among inpatient hospitalizations. Our proposed approach for fraud detection is fully unsupervised, not relying on any labeled training data, and is explainable to end users, providing interpretations for which diagnosis, procedure, and billing codes lead to hospitals being labeled suspicious. Using newly collected data from the Department of Justice on hospitals facing anti‐fraud lawsuits, and case studies of suspicious hospitals, we validate our approach and findings. Our method provides a nearly fivefold lift over random targeting of hospitals. We also perform a postanalysis to understand which hospital characteristics, not used for detection, are associated with suspiciousness.
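As a hedged illustration of the "lift over random targeting" figure quoted in the abstract above, the sketch below computes lift at k under one common definition: the share of true positives among the k highest-scored items, divided by the share of true positives overall. The function name `lift_at_k`, the scores, and the labels are hypothetical and are not taken from the paper, which may define or estimate lift differently.

```python
def lift_at_k(scores, labels, k):
    """Lift of the k highest-scored items over random targeting.

    scores: suspiciousness scores, higher means more suspicious (hypothetical).
    labels: 1 if the item is a known positive (e.g., faced a lawsuit), else 0.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of items")

    base_rate = sum(labels) / len(labels)  # hit rate of random targeting
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")

    # Rank items from most to least suspicious and keep the top k.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hit_rate = sum(label for _, label in ranked[:k]) / k

    return hit_rate / base_rate


# Toy data: 10 items, 2 known positives (base rate 0.2).
toy_scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_lift = lift_at_k(toy_scores, toy_labels, k=2)
```

With this toy data, one of the two top-ranked items is a positive, so the top-2 hit rate is 0.5 against a base rate of 0.2.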
Dataset
Published 2025
Replication code and data for Can Machine Learning Target Health Care Fraud? Evidence from Medicare Hospitalizations
Preprint
Macroeconomic Forecasting with Large Language Models
Published 06/30/2024
This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. Recently, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an open question. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using the FRED-MD database as common ground. Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios.
Conference paper
DiffFind: Discovering Differential Equations from Time Series
Date presented 05/10/2024
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2024, 05/07/2024–05/10/2024, Taipei, Taiwan
Given one or more time sequences, how can we extract their governing equations? Single and co-evolving time sequences appear in numerous settings, including medicine (neuroscience - EEG signals, cardiology - EKG), epidemiology (covid/flu spreading over time), physics (astrophysics, material science), marketing (sales and competition modeling; market penetration), and many more. Linear differential equations will fail, since the underlying equations are often non-linear (SIR model for virus/product spread; Lotka-Volterra for product/species competition; Van der Pol for heartbeat modeling). We propose DiffFind, which uses genetic algorithms to find suitable, parsimonious differential equations. Thanks to our careful design decisions, DiffFind has the following properties - it is: (a) Effective, discovering the correct model when applied to real and synthetic nonlinear dynamical systems, (b) Explainable, giving succinct differential equations, and (c) Hands-off, requiring no manual hyperparameter specification. DiffFind outperforms traditional methods (like auto-regression), includes a recent baseline (‘SINDy’) as a special case and thus outperforms it, and wins first or second place on all 5 real and synthetic datasets we tried, often achieving excellent, zero or near-zero RMSE of 0.005.
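To make the point about non-linearity concrete, here is a minimal sketch of the SIR model named in the abstract above, integrated with a simple forward-Euler step. This is not DiffFind or any part of it; it only illustrates the kind of nonlinear system (the product term S·I) that such a method would need to recover. The function name `simulate_sir` and all parameter values are assumptions chosen for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt=0.1, steps=1000):
    """Forward-Euler simulation of the standard SIR equations.

        dS/dt = -beta * S * I / N
        dI/dt =  beta * S * I / N - gamma * I
        dR/dt =  gamma * I

    The S * I product is what makes the system nonlinear.
    Returns a list of (S, I, R) tuples, one per time step (including t = 0).
    """
    n = s0 + i0 + r0  # total population, conserved by the equations
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameters: 1000 people, 10 initially infected.
sir_run = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0)
```

Forward Euler is used only to keep the sketch short; a higher-order integrator would be a more careful choice for generating data to fit.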
Report
Unsupervised Machine Learning for Explainable Health Care Fraud Detection
Published 2023
The US spends more than 4 trillion dollars per year on health care, largely conducted by private providers and reimbursed by insurers. A major concern in this system is overbilling, waste, and fraud by providers, who face incentives to misreport on their claims in order to receive higher payments. In this work, we develop novel machine learning tools to identify providers that overbill insurers. Using large-scale claims data from Medicare, the US federal health insurance program for elderly adults and the disabled, we identify patterns consistent with fraud or overbilling among inpatient hospitalizations. Our proposed approach for fraud detection is fully unsupervised, not relying on any labeled training data, and is explainable to end users, providing reasoning and interpretable insights into the potentially suspicious behavior of the flagged providers. Data from the Department of Justice on providers facing anti-fraud lawsuits and case studies of suspicious providers validate our approach and findings. We also perform a post-analysis to understand which hospital characteristics, not used for detection, are associated with a high suspiciousness score. Our method provides an 8-fold lift over random targeting, and can be used to guide investigations and auditing of suspicious providers for both public and private health insurance systems.
Conference proceeding
Less is more: Slimg for accurate, robust, and interpretable graph mining
Published 2023
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 08/2023
How can we solve semi-supervised node classification in various graphs possibly with noisy features and structures? Graph neural networks (GNNs) have succeeded in many graph mining tasks, but their generalizability to various graph scenarios is limited due to the difficulty of training, hyperparameter tuning, and the selection of a model itself. Einstein said that we should "make everything as simple as possible, but not simpler." We rephrase it into the careful simplicity principle: a carefully-designed simple model can surpass sophisticated ones in real-world graphs. Based on the principle, we propose SlimG for semi-supervised node classification, which exhibits four desirable properties: It is (a) accurate, winning or tying on 10 out of 13 real-world datasets; (b) robust, being the only one that handles all scenarios of graph data (homophily, heterophily, random structure, noisy features, etc.); (c) fast and scalable, showing up to 18 times faster training in million-scale graphs; and (d) interpretable, thanks to the linearity and sparsity. We explain the success of SlimG through a systematic study of the designs of existing GNNs, sanity checks, and comprehensive ablation studies.
Dissertation
Data-driven Decisions—An Anomaly Detection Perspective
Published 2023
Anomaly detection (AD) algorithms are widely used for data-driven decision support in domains where quantifying risk is critical, such as identifying fraudulent healthcare providers in public health insurance, assessing risk in consumer lending, and detecting aberrant patterns in human electroencephalography (EEG) records. However, AD in decision support is challenging due to the multitude of data modalities (e.g., time series or structural data) and data scale, the unavailability of ground truth labels for learning and evaluation, and the difficulty of yielding human-interpretable results for domain-specific problems. This thesis proposes to address these challenges and build intelligent detection systems with the following desirable properties: unsupervised, explainable, scalable, and equitable. Throughout, we propose novel AD algorithms that enable better decision support by addressing domain-specific key challenges such as including domain or expert knowledge, mitigating bias that may adversely affect minority groups, and handling aberrant behavior involving a group of actors. We present applications in public healthcare fraud and health monitoring.
Journal article
Benefit-aware early prediction of health outcomes on multivariate EEG time series
Published 2023
Journal of biomedical informatics, 139, 104296
Given a cardiac-arrest patient being monitored in the ICU (intensive care unit) for brain activity, how can we predict their health outcomes as early as possible? Early decision-making is critical in many applications, e.g. monitoring patients may assist in early intervention and improved care. On the other hand, early prediction on EEG data poses several challenges: (i) earliness-accuracy trade-off; observing more data often increases accuracy but sacrifices earliness, (ii) large-scale (for training) and streaming (online decision-making) data processing, and (iii) multi-variate (due to multiple electrodes) and multi-length (due to varying length of stay of patients) time series. Motivated by this real-world application, we present BENEFITTER that infuses the incurred savings from an early prediction as well as the cost from misclassification into a unified domain-specific target called benefit. Unifying these two quantities allows us to directly estimate a single target (i.e. benefit), and importantly, (a) is efficient and fast, with training time linear in the number of input sequences, and can operate in real-time for decision-making, (b) can handle multi-variate and variable-length time-series, suitable for patient data, and (c) is effective, providing up to 2× time-savings with equal or better accuracy as compared to competitors.
Preprint
UltraProp: Principled and Explainable Propagation on Large Graphs
Published 12/31/2022
Given a large graph with few node labels, how can we (a) identify whether there are generalized network effects (GNE) or not, (b) estimate GNE to explain the interrelations among node classes, and (c) exploit GNE efficiently to improve the performance on downstream tasks? The knowledge of GNE is valuable for various tasks like node classification and targeted advertising. However, identifying GNE such as homophily, heterophily, or their combination is challenging in real-world graphs due to the limited availability of node labels and noisy edges. We propose NetEffect, a graph mining approach to address the above issues, enjoying the following properties: (i) Principled: a statistical test to determine the presence of GNE in a graph with few node labels; (ii) General and Explainable: a closed-form solution to estimate the specific type of GNE observed; and (iii) Accurate and Scalable: the integration of GNE for accurate and fast node classification. Applied to real-world graphs, NetEffect discovers the unexpected absence of GNE in numerous graphs that were previously recognized to exhibit heterophily. Further, we show that incorporating GNE is effective for node classification. On a million-scale real-world graph, NetEffect achieves over 7 times speedup (14 minutes vs. 2 hours) compared to most competitors.
Patent
Machine learning based on post-transaction data
Published 2022
May 3
11,321,632