In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the \emph{PUlasso} algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee in which we first show that our algorithm converges to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate $p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.
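
To illustrate the majorization-minimization idea the abstract refers to, here is a minimal sketch. It is not the PUlasso algorithm: it assumes every label is observed (ordinary logistic regression with a lasso penalty) and uses a simple quadratic majorizer of the logistic loss. The function names (`mm_l1_logistic`, `soft_threshold`, and so on) and the curvature bound `L` are choices made for this illustration only. Each step minimizes the surrogate in closed form, so the penalized objective never increases.

```python
import math


def soft_threshold(z, t):
    """Closed-form minimizer of 0.5*(b - z)**2 + t*|b| over b."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def sigmoid(z):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(X, y, beta, lam):
    """Mean logistic loss plus lam * ||beta||_1."""
    n = len(X)
    loss = sum(log1pexp(dot(xi, beta)) - yi * dot(xi, beta)
               for xi, yi in zip(X, y))
    return loss / n + lam * sum(abs(b) for b in beta)


def mm_l1_logistic(X, y, lam, n_iter=100):
    """MM iterations for l1-penalized logistic regression (illustrative only).

    Assumption: the Hessian of the mean logistic loss is bounded by
    X^T X / (4n), whose largest eigenvalue is at most its trace. Using
    L = trace(X^T X) / (4n) gives a quadratic surrogate that lies above
    the loss everywhere and touches it at the current iterate.
    """
    n, p = len(X), len(X[0])
    L = sum(a * a for xi in X for a in xi) / (4.0 * n)
    beta = [0.0] * p
    history = [objective(X, y, beta, lam)]
    for _ in range(n_iter):
        # Gradient of the smooth part at the current iterate.
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(dot(xi, beta)) - yi
            for j in range(p):
                grad[j] += r * xi[j] / n
        # Minimize the surrogate: a gradient step followed by soft-thresholding.
        beta = [soft_threshold(b - g / L, lam / L) for b, g in zip(beta, grad)]
        history.append(objective(X, y, beta, lam))
    return beta, history
```

Calling `mm_l1_logistic(X, y, lam)` with `X` as a list of rows and `y` as a list of 0/1 labels returns the fitted coefficients and the objective value after each iteration; the monotone decrease of that sequence is the descent property that MM (and EM as a special case) guarantees.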

Related Links

- https://arxiv.org/abs/1711.08129 - Link to arXiv paper