Abstract
In this paper, we investigate Chronic Obstructive Pulmonary Disease (COPD) in the United States from 2012 to 2017. We integrate data from
multiple sources and use them to analyze COPD at the level of the core-based statistical area. We include cigarette smoking and
race/ethnicity categories because of well-known health disparities in the United States. We develop a baseline model with
multiple linear regression and then attempt to improve upon it with machine learning methods, including Lasso Regression,
Ridge Regression, Generalized Additive Model, Support Vector Machines, Artificial Neural Network, Random Forest, and
Gradient Boosted Tree. The best machine learning model, a Support Vector Machine, explains an additional 6% of the variance, yielding
a strongly predictive model. Overall, cigarette smoking and household income are the strongest predictors. Future
directions for research and practice are discussed.