Clarifying confusion in scRNA-seq analysis
The high proportion of zeros in typical scRNA-seq datasets has led to widespread but inconsistent use of terminology such as “dropout” and “missing data”. Here, we argue that much of this terminology is unhelpful and confusing, and outline simple ideas to help reduce confusion. These include: (1) observed scRNA-seq counts reflect both true gene expression levels and measurement error, and carefully distinguishing these contributions helps clarify thinking; and (2) method development should start with a Poisson measurement model, rather than more complex models, because it is simple and generally consistent with existing data. We outline how several existing methods can be viewed within this framework and highlight how these methods differ in their assumptions about expression variation. We also illustrate how our perspective helps address questions of biological interest, such as whether mRNA expression levels are multimodal among cells.
Sarkar, A. and Stephens, M. “Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis.” Nature Genetics (2021).
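As a minimal sketch of point (2) in the abstract, the simulation below draws counts from a Poisson measurement model layered on an expression model, and checks the fraction of zeros against the value implied by the resulting marginal distribution. The Gamma expression model, the function names, and all parameter values (size factor, mean, dispersion) are illustrative assumptions rather than anything taken from the analyses listed on this page. Under those assumptions, a large proportion of zeros arises without adding any separate zero-generating mechanism.

```python
import numpy as np


def simulate_counts(size_factors, mean, dispersion, rng):
    """Simulate observed counts for one gene across cells.

    Expression model (assumed here): lambda_i ~ Gamma with mean `mean`
    and variance `mean**2 * dispersion`.
    Measurement model: x_i | lambda_i ~ Poisson(s_i * lambda_i).
    """
    lam = rng.gamma(shape=1.0 / dispersion, scale=mean * dispersion,
                    size=size_factors.shape)
    return rng.poisson(size_factors * lam)


def zero_probability(size_factors, mean, dispersion):
    """P(x_i = 0) under the Poisson-Gamma (negative binomial) marginal."""
    return (1.0 + size_factors * mean * dispersion) ** (-1.0 / dispersion)


rng = np.random.default_rng(0)
s = np.full(100_000, 5_000.0)  # hypothetical total molecules observed per cell
mu = 1e-4                      # hypothetical mean relative expression
phi = 0.5                      # hypothetical dispersion of the Gamma model

x = simulate_counts(s, mu, phi, rng)
observed_zero_fraction = float((x == 0).mean())
expected_zero_fraction = float(zero_probability(s, mu, phi).mean())
```

With these hypothetical values the analytic zero probability is (1 + 0.25)^(-2) = 0.64, and `observed_zero_fraction` should land close to it, so most of the observed zeros here are explained by sampling variation and expression variation together.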
Empirical results
Additional analyses
- Comparison of distribution deconvolution methods
- Distribution deconvolution examples
- Goodness of fit of deconvolved distributions (interactive browser)
- Marker genes in in silico mixtures (interactive browser)
- Expression variation in Census of Immune Cells
- Expression variation in 10X v3 PBMC data
- Expression variation in human brain cells
- Expression variation in human retina cells
- Expression variation in human liver
- Imputation of count matrices
Technical details
- Technical zero-generating mechanism in scRNA-seq data
- Estimating the number of modes
- Interactive simulations
- Effect of capture rate on sampling variation
- Speed up ash mode estimation
- Mixture of negative binomials
- Convergence of Gamma deconvolution
- Deconvolution of near-Poisson data
- Validation set log likelihood comparison
- NPMLE on C1 spike-in data
- Differential expression from distribution deconvolution
- Posterior approximation for scRNA-seq data
- Link functions in Poisson LRA
- Distributional assumptions in Poisson LRA
- Poisson-truncated normal data
- Poisson-unimodal Gamma mixture model
- Poisson-Gamma log likelihood surface
- Variational autoencoders for scRNA-seq data
- Weighted Negative Binomial Matrix Factorization
- Log link in Poisson ash
- Transformations of deconvolved gene expression distributions
- Gaussian methods
- Randomized quantiles
- Relaxing the independence assumption on expression models
- EM algorithm for point-Gamma expression model
- Iterative refinement of NPMLE grid in ashr
- Uniform vs half-uniform mixture prior