Scholarship list
Conference proceeding
NeuralCubes: Deep Representations for Visual Data Exploration
Published 12/15/2021
2021 IEEE International Conference on Big Data (Big Data), 550 - 561
Visual exploration of large multi-dimensional datasets has seen tremendous progress in recent years, allowing users to express rich data queries that produce informative visual summaries, all in real time. Techniques based on data cubes are some of the most promising approaches. However, these techniques usually require a large memory footprint for large datasets. To tackle this problem, we present NeuralCubes: neural networks that predict results for aggregate queries, similar to data cubes. NeuralCubes learns a function that takes as input a given query, for instance, a geographic region and temporal interval, and outputs the result of the query. The learned function serves as a real-time, low-memory approximator for aggregation queries. Our models are small enough to be sent to the client side (e.g. the web browser for a web-based application) for evaluation, enabling data exploration of large datasets without database/network connection. We demonstrate the effectiveness of NeuralCubes through extensive experiments on a variety of datasets and discuss how NeuralCubes opens up opportunities for new types of visualization and interaction.
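The idea of replacing a stored data cube with a small learned function can be illustrated with a toy sketch. This is not the NeuralCubes architecture: the abstract describes neural networks, whereas the example below swaps in a two-parameter least-squares line purely to keep the code short and dependency-free. The synthetic 1-D dataset, the range-count query, and the names `exact_count`, `fit_line`, and `approx_count` are all assumptions made for illustration.

```python
import random

def exact_count(data, lo, hi):
    """Exact aggregate: number of points in [lo, hi). Needs the full dataset."""
    return sum(1 for x in data if lo <= x < hi)

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)

# Hypothetical dataset: 10,000 values spread uniformly over [0, 1).
data = [rng.random() for _ in range(10_000)]

# Training set: random range queries paired with their exact answers.
queries = [tuple(sorted((rng.random(), rng.random()))) for _ in range(200)]
widths = [hi - lo for lo, hi in queries]
counts = [exact_count(data, lo, hi) for lo, hi in queries]

# The "model" is just two numbers, standing in for a small network's weights.
slope, intercept = fit_line(widths, counts)

def approx_count(lo, hi):
    """Approximate aggregate: answers the query without touching the data."""
    return slope * (hi - lo) + intercept

print(exact_count(data, 0.2, 0.7), round(approx_count(0.2, 0.7)))
```

Once fitted, only `slope` and `intercept` need to be shipped to a client, which mirrors the low-memory, client-side evaluation the abstract describes. A linear fit works here only because the toy data is uniform; real multi-dimensional queries are what motivate a neural approximator.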
Conference proceeding
Communicating Performance of Regression Models Using Visualization in Pharmacovigilance
Published 10/2021
2021 IEEE Workshop on Visual Analytics in Healthcare (VAHC), 6 - 13
Statistical regression methods can help pharmaceutical organizations improve the quality of their pharmacovigilance by predicting the expected quantity of adverse events during a trial. However, the use of statistical techniques also changes the risk profile of any downstream tasks, due to bias and noise in the model's predictions. That risk profile must be clearly understood, documented, and communicated across many different stakeholders in a highly regulated environment. Aggregated performance metrics such as explained variance or mean average error fail to tell the whole story, making it difficult for subject matter experts to feel confident in deciding to use a model. In this work, we describe guidelines for communicating regression model performance for models deployed in predicting adverse events. First, we describe an interview study in which both data scientists and subject matter experts within a pharmaceutical organization describe their challenges in communicating and understanding regression performance. Based on the responses in this study, we develop guidelines for which visualizations to use to communicate performance, and use a publicly available trial safety database to demonstrate their use.
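The claim that aggregated metrics fail to tell the whole story can be made concrete with a small sketch. The numbers below are invented for illustration and do not come from the paper; the example assumes the abstract's "mean average error" refers to the usual mean absolute error, and the helper names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def max_absolute_error(y_true, y_pred):
    """Largest single absolute prediction error."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))

# Invented adverse-event counts for four hypothetical trials.
observed = [10, 10, 10, 10]

# Model A is off by a small amount on every trial.
model_a = [12, 8, 12, 8]
# Model B is exact on three trials and badly wrong on one.
model_b = [10, 10, 10, 18]

mae_a = mean_absolute_error(observed, model_a)
mae_b = mean_absolute_error(observed, model_b)
worst_a = max_absolute_error(observed, model_a)
worst_b = max_absolute_error(observed, model_b)

print(f"MAE:         A={mae_a}, B={mae_b}")
print(f"Worst error: A={worst_a}, B={worst_b}")
```

Both models share the same aggregate error, yet their risk profiles differ: one spreads small errors evenly while the other concentrates a large miss in a single trial. A per-trial error plot would reveal that difference where the single summary number does not.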
Conference proceeding
Efficient Bayesian Detection of Disease Onset in Truncated Medical Data
Published 08/2017
2017 IEEE International Conference on Healthcare Informatics (ICHI), 208 - 213
This paper describes a principled statistical method of preprocessing incidentally collected electronic medical records to facilitate short-term predictions of disease onset without explicit interaction with patients (e.g., medical tests, questionnaires). The model is also applicable to detection of remission. In incidentally collected data, records are possibly left and right truncated - the first time an event of interest is seen in a patient's data may not be the first time in the patient's history that it happened. It is therefore difficult to know if a disease onset happens in a given history. If we are unable to determine if and when the onset occurs, supervised learning and regression approaches cannot be applied. Our method determines if an onset occurs in a set of sparse and incomplete patient records, calculates the time of this onset and provides a principled measure of confidence. It combines individual patient history with expectations computed from a reference population. We compare the proposed method against standard change detection algorithms on generated data with realistic event sparsity and show that it can reliably detect onsets where traditional methods fail. We then go on to apply the algorithm to a large corpus of U.S. Medicare data and show that the algorithm scales to large datasets efficiently. The algorithm is currently in trials at a large medical informatics company.
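As a rough illustration of combining a patient's history with a reference-population expectation, the sketch below computes a posterior over onset time for a sequence of event counts. It is not the paper's method, and it ignores truncation entirely. It assumes Poisson-distributed counts per period, a known baseline rate taken from a reference population, a known post-onset rate, and a uniform prior over onset times. The function names, rates, and counts are all hypothetical.

```python
import math

def poisson_logpmf(k, lam):
    """Log-probability of observing k events under a Poisson(lam) model."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def onset_posterior(counts, base_rate, onset_rate):
    """Posterior over onset time under a uniform prior.

    Index t in 0..T-1 means the onset happens at the start of period t;
    index T means no onset occurs in the observed history.
    """
    n_periods = len(counts)
    log_like = []
    for t in range(n_periods + 1):
        before = sum(poisson_logpmf(c, base_rate) for c in counts[:t])
        after = sum(poisson_logpmf(c, onset_rate) for c in counts[t:])
        log_like.append(before + after)
    # Normalise in log space to avoid underflow.
    top = max(log_like)
    weights = [math.exp(ll - top) for ll in log_like]
    total = sum(weights)
    return [w / total for w in weights]

# Invented per-period event counts for one hypothetical patient.
counts = [0, 0, 1, 0, 0, 3, 4, 2, 5]
posterior = onset_posterior(counts, base_rate=0.3, onset_rate=3.0)

map_onset = max(range(len(posterior)), key=posterior.__getitem__)
print("most probable onset period:", map_onset)
print("posterior probability of no onset:", posterior[-1])
```

The posterior mass at the most probable period plays the role of a confidence measure, and the final entry gives the probability that no onset occurred at all, loosely echoing the "if and when" framing in the abstract.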