Scholarship list
Conference proceeding
Less is more: SlimG for accurate, robust, and interpretable graph mining
Published 2023
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 08/2023
How can we solve semi-supervised node classification in various graphs possibly with noisy features and structures? Graph neural networks (GNNs) have succeeded in many graph mining tasks, but their generalizability to various graph scenarios is limited due to the difficulty of training, hyperparameter tuning, and the selection of a model itself. Einstein said that we should "make everything as simple as possible, but not simpler." We rephrase it into the careful simplicity principle: a carefully-designed simple model can surpass sophisticated ones in real-world graphs. Based on the principle, we propose SlimG for semi-supervised node classification, which exhibits four desirable properties: It is (a) accurate, winning or tying on 10 out of 13 real-world datasets; (b) robust, being the only one that handles all scenarios of graph data (homophily, heterophily, random structure, noisy features, etc.); (c) fast and scalable, showing up to 18 times faster training in million-scale graphs; and (d) interpretable, thanks to the linearity and sparsity. We explain the success of SlimG through a systematic study of the designs of existing GNNs, sanity checks, and comprehensive ablation studies.
Conference proceeding
FairOD: Fairness-aware outlier detection
Published 2021
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society
AAAI/ACM Conference on AI, Ethics, and Society, 07/2021
Fairness and Outlier Detection (OD) are closely related, as it is exactly the goal of OD to spot rare, minority samples in a given population. However, when being a minority (as defined by protected variables, such as race/ethnicity/sex/age) does not reflect positive-class membership (such as criminal/fraud), OD produces unjust outcomes. Surprisingly, fairness-aware OD has been almost untouched in prior work, as fair machine learning literature mainly focuses on supervised settings. Our work aims to bridge this gap. Specifically, we develop desiderata capturing well-motivated fairness criteria for OD, and systematically formalize the fair OD problem. Further, guided by our desiderata, we propose FairOD, a fairness-aware outlier detector that has the following desirable properties: FairOD (1) exhibits treatment parity at test time, (2) aims to flag equal proportions of samples from all groups (i.e., obtain group fairness via statistical parity), and (3) strives to flag truly high-risk samples within each group. Extensive experiments on a diverse set of synthetic and real-world datasets show that FairOD produces outcomes that are fair with respect to protected variables, while performing comparably to (and in some cases, even better than) fairness-agnostic detectors in terms of detection performance.
Conference proceeding
Gen2Out: Detecting and ranking generalized anomalies
Published 2021
2021 IEEE International Conference on Big Data (Big Data)
IEEE International Conference on Big Data, 2021
In a cloud of m-dimensional data points, how would we spot, as well as rank, both single-point and group anomalies? We are the first to generalize anomaly detection in two dimensions: The first dimension is that we handle both point-anomalies and group-anomalies under a unified view; we shall refer to them as generalized anomalies. The second dimension is that Gen2Out not only detects, but also ranks, anomalies in suspiciousness order. Detection and ranking of anomalies have numerous applications: For example, in EEG recordings of an epileptic patient, an anomaly may indicate a seizure; in computer network traffic data, it may signify a power failure, or a DoS/DDoS attack. We start by setting some reasonable axioms; surprisingly, none of the earlier methods pass all the axioms. Our main contribution is the Gen2Out algorithm, which has the following desirable properties: (a) Principled and Sound: anomaly scoring that obeys the axioms for detectors; (b) Doubly-general: it detects, as well as ranks, generalized anomalies, both point- and group-anomalies; (c) Scalable: it is fast and scalable, linear in the input size; (d) Effective: experiments on real-world epileptic recordings (200GB) demonstrate the effectiveness of Gen2Out, as confirmed by clinicians. Experiments on 27 real-world benchmark datasets show that Gen2Out detects ground truth groups, matches or outperforms point-anomaly baseline algorithms on accuracy, with no competition for group-anomalies, and requires about 2 minutes for 1 million data points on a stock machine.
Conference proceeding
Entity resolution in dynamic heterogeneous networks
Published 2020
Companion Proceedings of the Web Conference 2020
WWW '20
Networks evolve continuously over time, not only with the addition and deletion of links and nodes but also with changes in the importance of edges. Even though many networks contain this type of temporal weighting, the vast majority of research in network representation learning and classification has focused on static snapshots of the graph, while largely ignoring the temporal dynamics. In this work, we describe two approaches for incorporating weighted temporal information into network embedding methods such as Graph Convolutional Networks (GCNs). While the first approach aggregates time-weighted edges and nodes, the second approach uses temporal random walks to find relevant convolution nodes. With experiments on public and proprietary datasets, we demonstrate the effectiveness of the proposed TimeSage for link prediction tasks. By applying these predictions, we show improvements in our task of identifying fraudulent actors on a large e-commerce website selling software as subscriptions.
Conference proceeding
Incorporating privileged information to unsupervised anomaly detection
Published 2018
Machine Learning and Knowledge Discovery in Databases: European Conference Part I
Machine Learning and Knowledge Discovery in Databases European Conference, 10/10/2018–10/14/2018, Dublin, Ireland
We introduce a new unsupervised anomaly detection ensemble called SPI, which can harness privileged information: data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [19,17], which we extend to unsupervised learning and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through "imitation" functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain knowledge-augmented detection using expert annotations as PI, (ii) fast detection using computationally-demanding data as PI, and (iii) early detection using "historical future" data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting anomaly detection with privileged information significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (i-iii), with PI capturing expert knowledge, computationally expensive features, and future data on three real-world detection tasks.
Conference proceeding
Spreading Activation Way of Knowledge Integration
Published 2015
Mining Intelligence and Knowledge Exploration: Third International Conference, MIKE 2015, Hyderabad, India, December 9-11, 2015, Proceedings 3
MIKE 2015: Mining Intelligence and Knowledge Exploration, 2015
Search and recommender systems benefit from effective integration of two different kinds of knowledge. The first is introspective knowledge, typically available in feature-theoretic representations of objects. The second is external knowledge, which could be obtained from how users rate (or annotate) items, or collaborate over a social network. This paper presents a spreading activation model that is aimed at a principled integration of these two sources of knowledge. In order to empirically evaluate our approach, we restrict the scope to text classification tasks, where we use the category knowledge of the labeled set of examples as an external knowledge source. Our experiments show a significantly improved classification effectiveness on hard datasets, where feature value representations, on their own, are inadequate in discriminating between classes.
Conference proceeding
Linking cases up: An extension to the case retrieval network
Published 2014
Case-Based Reasoning Research and Development: 22nd International Conference, ICCBR 2014, Cork, Ireland, September 29, 2014-October 1, 2014. Proceedings 22
ICCBR 2014: Case-Based Reasoning Research and Development, 2014
In many domains, cases are associated with each other, though this is not easily explained by the set of features they share. It is hard, for example, to explicitly enumerate features that make a movie romantic. We present an extension to the Case Retrieval Network architecture, a spreading activation model initially proposed by Burkhard and Lenz, by allowing cases to influence each other independently of the features. We show that the architecture holds promise in improving the effectiveness of retrieval in two distinct experimental domains.
Conference proceeding
Published 2014
Proceedings of the 2014 Recommender Systems Challenge
Eighth ACM Conference on Recommender Systems, 10/06/2014–10/10/2014, Foster City, Silicon Valley, California, USA
Evaluation is a key factor in reflecting the quality of a recommender system algorithm. Traditional recommenders pose the problem as an optimization task where they seek to minimize the error in the predicted rating for an item or the predicted top-n items of interest with respect to a user. However, these predictions often do not translate into a well-perceived recommendation. In this work, instead of the typical rating prediction task, we predict the amount of interaction an item would receive through a social network. In particular, we propose a simple and efficient model to generate a ranked list of a user's tweets in the order of expected user interaction that they would receive on Twitter, which is expressed in terms of retweets and favorites. We evaluate our proposed model on an extended version of the MovieTweetings dataset, which contains tweets that are generated when users rate movies on IMDb (using the IMDb iOS app), and show that the proposed model performs better than the baselines.