Grand challenges in single-cell data science
Introduction
Lähnemann et al. 2020 list 11 “grand challenges” in single-cell data science. On first reading, the list struck me as reviewing areas where researchers have already made the fundamental contributions, rather than areas where fundamental contributions remain to be made. Here, I revisit the paper more carefully to try to identify areas where fundamental contributions could still be made.
Recurring challenges
Varying levels of resolution. The discussion in this section is initially motivated by the problem of unsupervised learning of clusters in the context of scRNA-seq data, although the overarching questions are more general. The overarching questions (which remain unstated explicitly, but implicitly motivate which considerations are deemed “important”) are: (1) how can we identify subsets of cells that make sense to treat as coherent units, (2) how can we choose the “resolution” (degree of coherence on the axis of interest versus degree of heterogeneity elsewhere) for a particular scientific question of interest, and (3) how do we build representations of the data that facilitate extracting coherent subsets of cells at the desired resolution?
The highlighted example of PAGA (Wolf et al. 2019) is illustrative of the assumptions that went into the framing of the problem. The example suggests that the challenge should be addressed by building monolithic, hierarchical representations that capture multiple levels of resolution at once, such that exploratory analysis becomes trivial. However, other methods such as HSNE (Pezzotti et al. 2016) or topic models (Dey et al. 2017, González-Blas et al. 2019) take a different approach, in which obtaining higher-resolution representations requires both a human in the loop and potentially substantial extra computation.
Building monolithic, hierarchical representations requires developing new statistical models with (likely) complicated inference algorithms. It is not immediately obvious whether such representations would be useful outside of biology, and if so in what way. In contrast, building high quality, efficient software implementations of relatively simple methods could be a much more feasible strategy to make exploratory data analysis at multiple levels of resolution easier. There are two potential objections to the latter strategy: (1) it allows researcher degrees of freedom (Gelman and Loken 2013), and (2) it is difficult to assess uncertainty in the estimated hierarchy. These can be countered by the fact that the methods are used for exploratory data analysis, not confirmatory data analysis (Tukey 1977), and that in many cases the estimated representations and hierarchies can be confirmed by external, independent data.
Quantifying uncertainty of measurements and analysis results. There are two issues: (1) the measurement noise in single cell assays is increased compared to bulk assays, making certain analyses harder/impossible; (2) common analyses have uncertainties that are often not propagated through typical analysis pipelines. The first issue is not properly appreciated in the field, but unfortunately, likely cannot be addressed through clever computation. For example, accurately estimating gene expression variance from scRNA-seq data appears generally not possible.
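To make the first issue concrete, here is a sketch of why variance estimation is hard, assuming a Poisson measurement model in the spirit of Sarkar and Stephens 2020. The symbols are my own assumptions, not notation from the paper: $x_{ij}$ is the observed count of gene $j$ in cell $i$, $s_i$ is the total number of molecules observed in cell $i$, and $\lambda_{ij}$ is the true relative expression level, with mean $\mu_j$ and variance $\sigma_j^2$ across cells.

```latex
\begin{align}
x_{ij} \mid \lambda_{ij} &\sim \operatorname{Poisson}(s_i \lambda_{ij}),
  \qquad \mathbb{E}[\lambda_{ij}] = \mu_j, \quad \operatorname{Var}[\lambda_{ij}] = \sigma_j^2 \\
\operatorname{Var}[x_{ij}] &= \underbrace{s_i \mu_j}_{\text{measurement noise}}
  + \underbrace{s_i^2 \sigma_j^2}_{\text{expression variation}} \\
\frac{s_i^2 \sigma_j^2}{s_i \mu_j + s_i^2 \sigma_j^2} &= \frac{s_i \sigma_j^2}{\mu_j + s_i \sigma_j^2}
\end{align}
```

Under these assumptions, the last line is the fraction of observed variance attributable to true expression variation. When $s_i \sigma_j^2$ is small relative to $\mu_j$ (shallow sequencing, or genes with little true variation), that fraction is close to zero, and any estimate of $\sigma_j^2$ is a small difference between two noisy quantities.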
The second issue initially appears more difficult. What does a typical analysis pipeline actually do? (1) Filter out, e.g., low-quality cells and uninformative genes; (2) estimate a low-rank representation, which induces a visualization and a clustering; (3) identify genes that are informative about the clusters.
The robustness of downstream results to filtering choices is not systematically studied. However, one obvious area that does not appear to be adequately explored is using existing statistical methods for black box function optimization (e.g., Snoek et al. 2012, Li et al. 2017) in this setting, using metrics such as the Adjusted Rand index of the cluster labels, the hold-out prediction accuracy of cluster labels from gene expression, or the correlation of cluster “centroids” to known/previously identified cell types as the optimization criterion. (It is possible that some groups have tried these approaches and found they did not work well.)
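As a minimal sketch of what this could look like, the following treats a filtering threshold as the input to a black-box objective and scores it by the Adjusted Rand index against reference labels. Everything here is an illustrative assumption: plain random search stands in for the Bayesian optimization or bandit-based methods cited above, and the six-cell data set, the `min_genes` parameter, and the one-gene "clustering" rule are hypothetical stand-ins for a real pipeline.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case (e.g., every item is its own cluster)
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical toy data: (genes detected, expression of one marker, reference label).
# The two low-quality cells carry misleading marker expression.
CELLS = [
    (2000, 0.1, "A"), (2100, 0.2, "A"), (1900, 0.9, "B"), (2200, 0.8, "B"),
    (300, 0.7, "A"), (250, 0.3, "B"),
]


def objective(min_genes):
    """Toy pipeline: filter cells, 'cluster' on the marker, score by ARI."""
    kept = [cell for cell in CELLS if cell[0] >= min_genes]
    if len(kept) < 2:
        return float("-inf")  # filtering removed (almost) everything
    predicted = [int(expr > 0.5) for _, expr, _ in kept]
    reference = [label for _, _, label in kept]
    return adjusted_rand_index(predicted, reference)


def random_search(objective, low, high, n_trials=50, seed=0):
    """Simplest black-box optimizer: sample uniformly, keep the best score."""
    rng = random.Random(seed)
    best_x, best_score = None, float("-inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        score = objective(x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score


best_threshold, best_ari = random_search(objective, 0, 3000, n_trials=200)
print(best_threshold, best_ari)
```

One caveat this toy makes visible: scoring only the retained cells rewards aggressive filtering, so a real criterion would likely need to penalize discarded cells or use one of the other metrics listed above.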
Commonly used representations such as PCA have statistical models underlying them (e.g., Tipping and Bishop 1999), which could in principle be used to quantify uncertainty in the representation. It is clear that propagating the uncertainty of certain methods is very difficult, especially when those methods do not have an explicit statistical model underlying them. However, propagating uncertainty is easy in generative models that simultaneously estimate low-dimensional representations and high-dimensional parameters of interest (e.g., differential expression analysis in a variational auto-encoder, as in Lopez et al. 2018). The drawback of such approaches is that they require more complex inference algorithms, harder optimization problems, or more work to interpret the results.
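For reference, the probabilistic PCA model of Tipping and Bishop 1999 makes the uncertainty in the low-dimensional representation explicit. The notation is my own choice: $x_i \in \mathbb{R}^p$ is the observation for cell $i$, $z_i \in \mathbb{R}^k$ is its latent representation, $W$ is the $p \times k$ loading matrix, and $\mu$ is the mean.

```latex
\begin{align}
x_i &= W z_i + \mu + \epsilon_i,
  \qquad z_i \sim \mathcal{N}(0, I_k), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I_p) \\
z_i \mid x_i &\sim \mathcal{N}\!\left(M^{-1} W^\top (x_i - \mu),\; \sigma^2 M^{-1}\right),
  \qquad M = W^\top W + \sigma^2 I_k
\end{align}
```

The posterior covariance $\sigma^2 M^{-1}$ is the quantity a typical pipeline discards when it passes only the point estimate of $z_i$ on to clustering and visualization. Note that this Gaussian model is not a count model, so it illustrates the principle rather than a recommended model for scRNA-seq data.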
Identifying genes that are informative about cluster labels while quantifying uncertainty in the identified genes is a variable selection problem, which is well-studied (e.g., Wang et al. 2020).
Scaling to higher dimensionalities: more cells, more features, and broader coverage. There are two fundamental questions: (1) do the software implementations of common models/methods run within reasonable time and space budgets, and (2) do the models/methods support multiple types of measurements (either simultaneously on the same samples or not)?
The first issue requires concentrated effort in software engineering, as opposed to effort in new experimental protocols, data generation, statistical models, or likely even in inference algorithms. This area is one in which there are few major players (Seurat, scanpy), but they have the majority of resources, attention, adoption, and traction. The second issue requires the development of novel, linked statistical models in the case of simultaneous measurements, which in turn likely requires specialized software implementations. Further, there may be fundamental statistical problems in the case of “integrating” non-simultaneous measurements, in the form of un-correctable biases due to systematic technical differences between experiments.
Challenges in single-cell transcriptomics
Handling sparsity in single-cell RNA sequencing. This discussion is moot (Sarkar and Stephens 2020). “Dropouts” are not a problem, “imputation” does not make sense, and sparsity is a problem only for (space and time) efficient, robust software implementations. The only open research question here is whether denoising is just an interesting side-effect of fitting generative models for statistical tasks of true interest, or whether denoised estimates of true gene expression can be substituted for the data matrix in a modular approach for those tasks. There is some indication that for visualization, at least, there is interest in the denoised gene expression levels (e.g., Alquicira-Hernandez and Powell 2021).
Defining flexible statistical frameworks for discovering complex differential patterns in gene expression. In what sense are current differential expression (DE) frameworks “inflexible”? The paper suggests two senses: (1) DE needs fixed assignments of samples to conditions, and (2) DE only considers mean shifts. The first issue matters only when the assignments come from some clustering algorithm, which generally has an uncertainty that is not computed. This issue could be easily resolved by a model-based clustering algorithm that assigns a posterior probability of cluster assignment. (It could also be replaced by a different problem, of assessing DE in a topic model.) The second issue relates to the general problem of detecting whether two distributions are different from each other in a statistically meaningful way, which seems to be a much harder problem, and one which has an unclear biological interpretation. There has been some work in linking the distribution of true gene expression levels to genetic (Yvert 2013) and environmental (Korthauer et al. 2016) perturbations. However, there are still fundamental disconnects between these ideas and e.g., theory of the kinetics of transcriptional regulation.
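To illustrate the first point, here is a minimal sketch of how posterior cluster probabilities could replace fixed assignments when comparing mean expression between clusters. The function names and the toy numbers are hypothetical, and the sketch computes only a point estimate of the mean shift, not a test statistic or its uncertainty.

```python
def weighted_cluster_means(expression, responsibilities):
    """Posterior-weighted mean expression of one gene in each cluster.

    expression: one value per cell.
    responsibilities: one row per cell; entry k is the posterior probability
        that the cell belongs to cluster k (assumed to come from some
        model-based clustering fit that is not shown here).
    """
    n_clusters = len(responsibilities[0])
    means = []
    for k in range(n_clusters):
        weight = sum(row[k] for row in responsibilities)
        if weight == 0:
            raise ValueError(f"cluster {k} has zero total posterior weight")
        total = sum(row[k] * x for row, x in zip(responsibilities, expression))
        means.append(total / weight)
    return means


def mean_shift(expression, responsibilities, k1=0, k2=1):
    """Difference in posterior-weighted means between clusters k2 and k1."""
    means = weighted_cluster_means(expression, responsibilities)
    return means[k2] - means[k1]


expression = [1.0, 3.0, 10.0, 14.0]
hard = [[1, 0], [1, 0], [0, 1], [0, 1]]  # fixed assignments as a special case
soft = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
print(mean_shift(expression, hard), mean_shift(expression, soft))
```

Fixed assignments are the special case in which every row is an indicator vector, so the usual analysis is recovered when the clustering is certain, and the estimated shift shrinks as the assignments become more ambiguous.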
Mapping single cells to a reference atlas. The problems in this section boil down to the statistical problem of semi-supervised learning of cell clusters in new data sets, using reference data as labeled auxiliary data. This problem seems to run into a fundamental limitation, that there will be systematic differences in measurement error between new data sets and reference data sets. This limitation suggests that what is needed is not just labeled reference data, but a full generative model for the reference data which can be updated with the new data set. Existing methods have already begun to approach this idea (Wang et al. 2019, Lotfollahi et al. 2020).
On the one hand, distributing a full generative model could be easier than distributing a high-dimensional cell atlas (cf. pre-trained Inception v3). On the other hand, it is unclear how to make this sort of framework widely accessible to non-experts, or even extensible by other expert researchers.
Generalizing trajectory inference. What is “trajectory inference”? The paper suggests a model in which cells undergoing some biological process move in a state space parameterized by transcriptome, proteome, epigenome, etc. Then, what would it mean to “generalize” trajectory inference? The paper never addresses this question, but instead discusses the problems of: (1) performing trajectory inference from different types of measurements, (2) performing (parallel) trajectory inference from different initial conditions, and (3) making rigorous statistical comparisons between estimated trajectories.
One goal of trajectory inference is to learn about critical points, characterized by specific changes in e.g., transcriptomes, in biological processes of interest. Such critical points might correspond to bifurcations of a relatively homogeneous population of cells responding to some stimulus (Lönnberg et al. 2017), or more general branching from an initial pluripotent state (Farrell et al. 2017).
It is unclear whether trajectory inference from single cell epigenomic or proteomic data actually requires new statistical methods. However, this is in part because the generative models for these processes (in particular, the measurement error) have not been rigorously laid out. It is clear that trajectory inference from simultaneous measurements of transcriptome, proteome, etc. in each cell (e.g., CITE-seq, Stoeckius et al. 2017) is principled, and would require (relatively) simple development of new, linked statistical models (in which different measurement models depend on a common latent variable model). The situation of integrating independent trajectories learned from transcriptome, etc. seems much more difficult (due to systematic differences between measurements), and potentially unsolvable.
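As one hypothetical instance of such a linked model for simultaneous RNA and protein measurements, both measurement models below depend on a shared latent variable $z_i$ for cell $i$. The specific likelihoods, link functions, and symbols are my assumptions for illustration: $x_{ij}$ is the RNA count of gene $j$, $y_{im}$ is the count for protein $m$, $s_i$ and $t_i$ are per-cell size factors, and $\phi_m$ is a protein-specific dispersion.

```latex
\begin{align}
z_i &\sim \mathcal{N}(0, I_k) \\
x_{ij} \mid z_i &\sim \operatorname{Poisson}\!\left(s_i \exp(w_j^\top z_i + b_j)\right) \\
y_{im} \mid z_i &\sim \operatorname{NegBin}\!\left(\text{mean} = t_i \exp(v_m^\top z_i + c_m),\ \text{dispersion} = \phi_m\right)
\end{align}
```

A trajectory would then be estimated in the space of $z_i$, which both data types inform. The independent-experiment case has no shared $z_i$ per cell, which is one way to see why it is harder.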
One glaring omission in this discussion is the use of known time points of collection as prior information in trajectory inference. In general, cells do not proceed lock-step through responses to stimuli, and the use of multiple collections seems like the obvious choice to characterize temporal processes at single cell resolution. One could argue that such experimental designs are still unlikely to discover new biology, since the time points for collections are typically chosen based on prior knowledge. Nevertheless, the statistical problems here seem much more tractable.
Finding patterns in spatially resolved measurements. The paper indicates that the fundamental question is assigning samples to cell types/compartments given gene expression and spatial coordinates. Why would this be an interesting thing to do? In the case of slide-based approaches, one issue is that measurements are not made at single-cell resolution, but at few-cell resolution, and so (unwanted) spatial variation in cell type abundance may lead to false positive discoveries of spatial patterns in gene expression (e.g., Cable et al. 2020, Kleshchevnikov et al. 2021).
My own list of fundamental questions in spatial measurements is very different: (1) what are the predominant patterns of spatial gene expression, (2) which spatial structures define (or are defined by) those patterns, and (3) which genes co-vary spatially? In my view, answering these questions is the first step towards understanding how (perturbations to) cell-level (molecular) phenotypes lead to (changes in) tissue-level phenotypes. I argue that understanding how cell-level phenotypes lead to tissue-level phenotypes is the next major hurdle in connecting disease-associated genetic variation to disease susceptibility/outcomes.
The paper also indicates that there is a problem of “resolution” in the spatial coherence of cells. My idea to use convolutional neural nets to parameterize spatial expression models naturally handles this case, since stacked convolutions encode spatial coherence at different scales.
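The claim about scales can be made concrete by computing the receptive field of each layer in a stack of convolutions, i.e., how many neighboring spots (along one spatial axis) influence a single output unit. This is a small sketch using the standard receptive field recurrence; the layer configurations are arbitrary examples, not a proposed architecture.

```python
def receptive_fields(layers):
    """Receptive field size (along one axis) after each convolutional layer.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    field = 1  # a single input position sees only itself
    jump = 1   # spacing, in input positions, between adjacent units of a layer
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


# Three stacked 3-wide convolutions with stride 1: fields of 3, 5, 7 positions.
print(receptive_fields([(3, 1), (3, 1), (3, 1)]))
# The same kernels with stride 2 grow much faster: 3, 7, 15 positions.
print(receptive_fields([(3, 2), (3, 2), (3, 2)]))
```

Each layer therefore summarizes expression over a progressively larger neighborhood, which is the sense in which a single model represents spatial coherence at multiple resolutions.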
Challenges in single-cell genomics
Dealing with errors and missing data in the identification of variation from single-cell DNA sequencing data.
Challenges in single-cell phylogenomics
Scaling phylogenetic models to many cells and many sites.
Integrating multiple types of variation into phylogenetic models.
Inferring population genetic parameters of tumor heterogeneity by model integration.
Overarching challenges
Integration of single-cell data across samples, experiments, and types of measurement.
Validating and benchmarking analysis tools for single-cell measurements.