An introduction to counts-of-counts data

Meetings are planned to take place in person. Seminars are principally for MPhil students. Please email the administrator should you wish to attend as a guest.

Counts-of-counts data arise in many areas of biology and medicine, and have been studied by statisticians since the 1940s. One of the first examples, discussed by R. A. Fisher and collaborators in 1943 [1], concerns estimation of the number of unobserved species based on summary counts of the number of species observed once, twice, … in a sample of specimens. The data are summarized by the numbers C1, C2, … of species represented once, twice, … in a sample of size N = C1 + 2 C2 + 3 C3 + … containing S = C1 + C2 + … species; the vector C = (C1, C2, …) gives the counts-of-counts. Other examples include the frequencies of the distinct alleles in a human genetics sample, the counts of distinct variants of the SARS-CoV-2 S protein obtained from consensus sequencing experiments, counts of sizes of components in certain combinatorial structures [2], and counts of the numbers of SNVs arising in one cell, two cells, … in a cancer sequencing experiment.
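
As a small illustration of the notation above, the following Python sketch computes the counts-of-counts for a made-up sample of specimens and checks the identities for N and S. The function name counts_of_counts and the sample labels are hypothetical, chosen only for this example.

```python
from collections import Counter


def counts_of_counts(specimens):
    """Return {j: C_j}, where C_j is the number of species seen exactly j times."""
    per_species = Counter(specimens)             # species -> number of specimens
    return dict(Counter(per_species.values()))   # j -> C_j


# Hypothetical sample of 8 specimens, each labelled by its species.
sample = ["a", "b", "b", "c", "d", "d", "d", "e"]
c = counts_of_counts(sample)

n_specimens = sum(j * cj for j, cj in c.items())  # N = C1 + 2 C2 + 3 C3 + ...
n_species = sum(c.values())                       # S = C1 + C2 + ...

print(c, n_specimens, n_species)
```

Here species a, c and e are seen once, b twice and d three times, so C1 = 3, C2 = 1, C3 = 1, giving N = 8 and S = 5.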

In this talk I will outline some of the stochastic models used to describe the distribution of C, and some of the inferential issues that arise in estimating the parameters of these models. I will touch on the celebrated Ewens Sampling Formula [3] and Fisher’s multiple sampling problem concerning the variance expected between values of S in samples taken from the same population [4]. Variants of birth-death-immigration processes can be used, for example when different variants grow at different rates. The classical Yule process with immigration can be used to derive some of the combinatorial results in a simple way, through a probabilistic trick known as embedding.

References

[1] Fisher RA, Corbet AS & Williams CB. J Animal Ecology, 12, 1943
[2] Arratia R, Barbour AD & Tavaré S. Logarithmic Combinatorial Structures, EMS, 2002
[3] Ewens WJ. Theoret Popul Biol, 3, 1972
[4] Da Silva P, Jamshidpey A, McCullagh P & Tavaré S. Bernoulli, in press, 2022

This talk is part of the Computational and Systems Biology Seminar Series 2023 - 24 series.