# Variable selection and classification with large-scale presence only data

STSW01 - Theoretical and algorithmic underpinnings of Big Data

Co-author: Hyebin Song (University of Wisconsin-Madison)

In many real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the *PUlasso* algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. Our theoretical guarantee has two parts: we first show that our algorithm converges to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of order $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate $p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.
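To make the majorization-minimization idea concrete, the following is a minimal sketch of a generic MM iteration for $\ell_1$-penalized logistic regression on fully labelled data. It is not the PUlasso algorithm and does not model the positive-unlabelled sampling scheme or the two speed-ups mentioned above. The function names (`soft_threshold`, `objective`, `mm_l1_logistic`), the penalty level `lam`, and the use of a global quadratic majorizer are all assumptions made for illustration. Each step minimizes a quadratic upper bound on the logistic loss plus the $\ell_1$ penalty, which reduces to a soft-thresholding update and guarantees that the objective never increases.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding: the minimizer of 0.5*(b - v)^2 + t*|b|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def objective(X, y, beta, lam):
    """Mean logistic negative log-likelihood (y in {0, 1}) plus an l1 penalty."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z) + lam * np.sum(np.abs(beta))


def mm_l1_logistic(X, y, lam, n_iter=200):
    """Illustrative MM iteration for l1-penalized logistic regression.

    The Hessian of the mean logistic loss is bounded by X^T X / (4n), so
    L = sigma_max(X)^2 / (4n) yields a quadratic majorizer at every iterate.
    Minimizing that majorizer plus the l1 penalty is a soft-thresholding step.
    """
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / (4.0 * n)  # curvature bound (spectral norm)
    beta = np.zeros(p)
    history = [objective(X, y, beta, lam)]
    for _ in range(n_iter):
        prob = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))  # numerically stable sigmoid
        grad = X.T @ (prob - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
        history.append(objective(X, y, beta, lam))
    return beta, history


# Hypothetical toy data: n = 200 samples, p = 50 covariates, s = 3 active ones.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta_true = np.zeros(50)
beta_true[:3] = 2.0
y = (rng.random(200) < 0.5 * (1.0 + np.tanh(0.5 * (X @ beta_true)))).astype(float)

beta_hat, history = mm_l1_logistic(X, y, lam=0.05)
```

The recorded `history` can be inspected to check the descent property that MM schemes share with EM: the objective value is non-increasing across iterations.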