Variable selection and classification with large-scale presence-only data

STSW01 - Theoretical and algorithmic underpinnings of Big Data

Co-author: Hyebin Song (University of Wisconsin-Madison)

In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the PUlasso algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee in two parts: we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate $p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.
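The abstract does not spell out the PUlasso updates, so the following is not the authors' algorithm. It is only a minimal sketch of the MM idea on a simpler, fully labelled problem: $\ell_1$-penalized logistic regression. It assumes labels $y \in \{0, 1\}$ and a quadratic majorizer built from the standard bound of $1/4$ on the logistic curvature. The function names (`mm_l1_logistic`, `soft_threshold`, `objective`) and the penalty level `lam` are hypothetical and chosen for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding, the minimizer of the l1-penalized quadratic."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def objective(X, y, beta, lam):
    """Mean logistic negative log-likelihood plus an l1 penalty (y in {0, 1})."""
    eta = X @ beta
    nll = np.mean(np.logaddexp(0.0, eta) - y * eta)
    return nll + lam * np.sum(np.abs(beta))


def mm_l1_logistic(X, y, lam, n_iter=200):
    """Illustrative MM iteration (not PUlasso) for l1-penalized logistic regression.

    Each step minimizes a quadratic upper bound (majorizer) of the smooth loss
    around the current iterate, plus the l1 penalty. The bound uses the fact
    that the logistic Hessian is dominated by X^T X / (4 n).
    """
    n, p = X.shape
    # Curvature bound: largest eigenvalue of X^T X / (4 n).
    L = np.linalg.norm(X, 2) ** 2 / (4.0 * n)
    beta = np.zeros(p)
    history = [objective(X, y, beta, lam)]
    for _ in range(n_iter):
        eta = X @ beta
        prob = 0.5 * (1.0 + np.tanh(0.5 * eta))  # numerically stable sigmoid
        grad = X.T @ (prob - y) / n
        # Exact minimizer of the majorizer: a soft-thresholded gradient step.
        beta = soft_threshold(beta - grad / L, lam / L)
        history.append(objective(X, y, beta, lam))
    return beta, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = [2.0, -2.0, 1.5]  # sparse truth, made up for the demo
    y = (rng.random(200) < 0.5 * (1.0 + np.tanh(0.5 * (X @ beta_true)))).astype(float)
    beta_hat, history = mm_l1_logistic(X, y, lam=0.05)
    print("nonzero coefficients:", np.flatnonzero(beta_hat))
    print("objective: first %.4f, last %.4f" % (history[0], history[-1]))
```

Because each update minimizes an upper bound that touches the objective at the current iterate, the objective cannot increase from one iteration to the next. This descent property is what makes convergence to a stationary point plausible for MM schemes in general.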

This talk is part of the Isaac Newton Institute Seminar Series.
