Shubhranshu Shekhar

Assistant Professor of Data Science in the Brandeis International Business School

Data Mining

Machine Learning

Public Policy

Conference proceeding

Less is more: SlimG for accurate, robust, and interpretable graph mining

by Jaemin Yoo, Meng-Chieh Lee, Shubhranshu Shekhar and Christos Faloutsos

Published 2023

Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 08/2023

How can we solve semi-supervised node classification in various graphs possibly with noisy features and structures? Graph neural networks (GNNs) have succeeded in many graph mining tasks, but their generalizability to various graph scenarios is limited due to the difficulty of training, hyperparameter tuning, and the selection of a model itself. Einstein said that we should "make everything as simple as possible, but not simpler." We rephrase it into the careful simplicity principle: a carefully-designed simple model can surpass sophisticated ones in real-world graphs. Based on the principle, we propose SlimG for semi-supervised node classification, which exhibits four desirable properties: It is (a) accurate, winning or tying on 10 out of 13 real-world datasets; (b) robust, being the only one that handles all scenarios of graph data (homophily, heterophily, random structure, noisy features, etc.); (c) fast and scalable, showing up to 18 times faster training in million-scale graphs; and (d) interpretable, thanks to the linearity and sparsity. We explain the success of SlimG through a systematic study of the designs of existing GNNs, sanity checks, and comprehensive ablation studies.

Conference proceeding

FairOD: Fairness-aware outlier detection

by Shubhranshu Shekhar, Neil Shah and Leman Akoglu

Published 2021

Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society

AAAI/ACM Conference on AI, Ethics, and Society, 07/2021

Fairness and Outlier Detection (OD) are closely related, as it is exactly the goal of OD to spot rare, minority samples in a given population. However, when being a minority (as defined by protected variables, such as race/ethnicity/sex/age) does not reflect positive-class membership (such as criminal/fraud), OD produces unjust outcomes. Surprisingly, fairness-aware OD has been almost untouched in prior work, as fair machine learning literature mainly focuses on supervised settings. Our work aims to bridge this gap. Specifically, we develop desiderata capturing well-motivated fairness criteria for OD, and systematically formalize the fair OD problem. Further, guided by our desiderata, we propose FairOD, a fairness-aware outlier detector that has the following desirable properties: FairOD (1) exhibits treatment parity at test time, (2) aims to flag equal proportions of samples from all groups (i.e., obtain group fairness via statistical parity), and (3) strives to flag truly high-risk samples within each group. Extensive experiments on a diverse set of synthetic and real-world datasets show that FairOD produces outcomes that are fair with respect to protected variables, while performing comparably to (and in some cases, even better than) fairness-agnostic detectors in terms of detection performance.

Conference proceeding

Gen2Out: Detecting and ranking generalized anomalies

by Meng-Chieh Lee, Shubhranshu Shekhar, Christos Faloutsos, T Noah Hutson and Leon Iasemidis

Published 2021

2021 IEEE International Conference on Big Data (Big Data)

IEEE International Conference on Big Data, 2021

In a cloud of m-dimensional data points, how would we spot, as well as rank, both single-point and group anomalies? We are the first to generalize anomaly detection in two dimensions: The first dimension is that we handle both point-anomalies and group-anomalies under a unified view; we shall refer to them as generalized anomalies. The second dimension is that Gen2Out not only detects, but also ranks, anomalies in suspiciousness order. Detection and ranking of anomalies has numerous applications: for example, in EEG recordings of an epileptic patient, an anomaly may indicate a seizure; in computer network traffic data, it may signify a power failure or a DoS/DDoS attack. We start by setting some reasonable axioms; surprisingly, none of the earlier methods pass all the axioms. Our main contribution is the Gen2Out algorithm, which has the following desirable properties: (a) Principled and sound: anomaly scoring that obeys the axioms for detectors; (b) Doubly-general: it detects, as well as ranks, generalized anomalies, both point- and group-anomalies; (c) Scalable: it is fast and linear in input size; (d) Effective: experiments on real-world epileptic recordings (200GB) demonstrate the effectiveness of Gen2Out, as confirmed by clinicians. Experiments on 27 real-world benchmark datasets show that Gen2Out detects ground-truth groups, matches or outperforms point-anomaly baseline algorithms on accuracy, has no competition for group-anomalies, and requires about 2 minutes for 1 million data points on a stock machine.

Conference proceeding

Entity resolution in dynamic heterogeneous networks

by Shubhranshu Shekhar, Deepak Pai and Sriram Ravindran

Published 2020

Companion Proceedings of the Web Conference 2020

WWW '20

Networks evolve continuously over time, not only with the addition and deletion of links and nodes but also with changes in the importance of edges. Even though many networks contain this type of temporal weighting, the vast majority of research in network representation learning and classification has focused on static snapshots of the graph, largely ignoring the temporal dynamics. In this work, we describe two approaches for incorporating weighted temporal information into network embedding methods such as Graph Convolutional Networks (GCNs). While the first approach aggregates time-weighted edges and nodes, the second approach uses temporal random walks to find relevant convolution nodes. With experiments on public and proprietary datasets, we demonstrate the effectiveness of the proposed TimeSage for link prediction tasks. By applying these predictions, we show improvements in our task of identifying fraudulent actors on a large e-commerce website selling software as subscriptions.

Conference proceeding

Incorporating privileged information to unsupervised anomaly detection

by Shubhranshu Shekhar and Leman Akoglu

Published 2018

Machine Learning and Knowledge Discovery in Databases: European Conference Part I

Machine Learning and Knowledge Discovery in Databases European Conference, 10/10/2018–10/14/2018, Dublin, Ireland

We introduce a new unsupervised anomaly detection ensemble called SPI which can harness privileged information - data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [19,17], which we extend to unsupervised learning and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through "imitation" functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain knowledge-augmented detection using expert annotations as PI, (ii) fast detection using computationally-demanding data as PI, and (iii) early detection using "historical future" data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting anomaly detection with privileged information significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (i-iii), with PI capturing expert knowledge, computationally expensive features, and future data on three real-world detection tasks.

Conference proceeding

Spreading Activation Way of Knowledge Integration

by Shubhranshu Shekhar, Sutanu Chakraborti and Deepak Khemani

Published 2015

Mining Intelligence and Knowledge Exploration: Third International Conference, MIKE 2015, Hyderabad, India, December 9-11, 2015, Proceedings 3

MIKE 2015: Mining Intelligence and Knowledge Exploration, 2015

Search and recommender systems benefit from effective integration of two different kinds of knowledge. The first is introspective knowledge, typically available in feature-theoretic representations of objects. The second is external knowledge, which could be obtained from how users rate (or annotate) items, or collaborate over a social network. This paper presents a spreading activation model that is aimed at a principled integration of these two sources of knowledge. In order to empirically evaluate our approach, we restrict the scope to text classification tasks, where we use the category knowledge of the labeled set of examples as an external knowledge source. Our experiments show a significantly improved classification effectiveness on hard datasets, where feature value representations, on their own, are inadequate in discriminating between classes.

Conference proceeding

Linking cases up: An extension to the case retrieval network

by Shubhranshu Shekhar, Sutanu Chakraborti and Deepak Khemani

Published 2014

Case-Based Reasoning Research and Development: 22nd International Conference, ICCBR 2014, Cork, Ireland, September 29, 2014-October 1, 2014. Proceedings 22

ICCBR 2014: Case-Based Reasoning Research and Development, 2014

In many domains, cases are associated with each other, though this is not easily explained by the set of features they share. It is hard, for example, to explicitly enumerate the features that make a movie romantic. We present an extension to the Case Retrieval Network architecture, a spreading activation model initially proposed by Burkhard and Lenz, by allowing cases to influence each other independently of the features. We show that the architecture holds promise in improving the effectiveness of retrieval in two distinct experimental domains.

Conference proceeding

How popular are your tweets?

by Avijit Saha, Janarthanan Rajendran, Shubhranshu Shekhar and Balaraman Ravindran

Published 2014

Proceedings of the 2014 Recommender Systems Challenge

Eighth ACM Conference on Recommender Systems, 10/06/2014–10/10/2014, Foster City, Silicon Valley, California, USA

Evaluation is a key factor in reflecting the quality of a recommender system algorithm. Traditional recommenders pose the problem as an optimization task where they seek to minimize the error in the predicted rating for an item or in the predicted top-n items of interest with respect to a user. However, these predictions often do not translate to a well-perceived recommendation. In this work, instead of the typical rating prediction task, we predict the amount of interaction an item would receive through a social network. In particular, we propose a simple and efficient model to generate a ranked list of a user's tweets in the order of the expected user interaction that they would receive on Twitter, which is expressed in terms of retweets and favorites. We evaluate our proposed model on an extended version of the MovieTweetings dataset, which contains tweets that are generated when users rate movies on IMDb (using the IMDb iOS app), and show that the proposed model performs better than the baselines.
