An improved voom transform for scRNA-seq data

Introduction

The key idea of limma-voom (Law et al. 2014) is to transform a count matrix generated by bulk RNA-seq into two matrices, representing the mean and variance of true (log) gene expression, which can then be analyzed using (heteroscedastic) Gaussian methods. However, limma-voom was developed before scRNA-seq, and therefore before it was possible to measure the variance of gene expression between cells from a single donor. To address this limitation, Law et al. instead proposed to pool information across both donors and genes, estimating a LOESS trend relating the mean and variance of true gene expression across donors. \( \DeclareMathOperator\E{E} \DeclareMathOperator\Gam{Gamma} \DeclareMathOperator\N{\mathcal{N}} \DeclareMathOperator\Poi{Poisson} \DeclareMathOperator\V{V} \DeclareMathOperator\digamma{\psi} \DeclareMathOperator\trigamma{\psi^{(1)}} \newcommand\vb{\mathbf{b}} \newcommand\vc{\mathbf{c}} \newcommand\xiplus{x_{i+}} \)

Now suppose we have observed scRNA-seq data \(x_{ij}\), where \(x_{ij}\) denotes the number of molecules from gene \(j\) observed in cell \(i\). Then, there are two possible DE analyses we might be interested in. First, we might divide (some subset of) cells into two groups, and ask whether genes are differentially expressed between groups. Second, we might divide donors into groups, and ask whether genes (in some subset of cells per donor) are differentially expressed between groups. The key distinction is that single cells are the units in the first case, and donors are the units in the second case.

We can apply limma-voom without modification to the first case, because the typical log transformation corresponds to an MLE

\begin{align} x_{ij} \mid \xiplus, \theta_{ij} &\sim \Poi(\xiplus \exp(\theta_{ij}))\\ \ell \triangleq \ln p(x_{ij} \mid \xiplus, \theta_{ij}) &= x_{ij} (\ln \xiplus + \theta_{ij}) - \xiplus \exp(\theta_{ij}) + \mathrm{const}\\ \frac{\partial \ell}{\partial \theta_{ij}} &= x_{ij} - \xiplus \exp(\theta_{ij})\\ \hat\theta_{ij} &= \ln\left(\frac{x_{ij}}{\xiplus}\right). \end{align}
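
In other words, the transformed value for each cell is just the log relative abundance \(\ln(x_{ij}/\xiplus)\). A minimal numpy sketch of this transform (the count matrix below is simulated purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=5, size=(3, 4)).astype(float)  # cells x genes (illustrative)
xiplus = x.sum(axis=1, keepdims=True)              # total molecules per cell
with np.errstate(divide='ignore'):                 # zeros map to -inf
  thetahat = np.log(x) - np.log(xiplus)            # per cell/gene MLE of log expression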

This simple theoretical argument, together with empirical studies (Soneson and Robinson 2018, Hsiao 2019), suggests that applying limma-voom to scRNA-seq data can work. However, unlike in the bulk RNA-seq case, the notion of “variance of gene expression” within a single unit no longer makes sense. Therefore, it is unclear what precisely limma-voom is fitting in this case.

Applying limma-voom to the second case also works, because we can estimate a point mass expression model for the cells from each donor \(k\)

\begin{align} x_{ij} \mid \xiplus, \theta_j &\sim \Poi(\xiplus \exp(\theta_j))\\ \ell \triangleq \sum_i \ln p(x_{ij} \mid \xiplus, \theta_j) &= \sum_i x_{ij} (\ln \xiplus + \theta_j) - \xiplus \exp(\theta_j) + \mathrm{const}\\ \frac{\partial \ell}{\partial \theta_j} &= \sum_i x_{ij} - \xiplus \exp(\theta_j)\\ \hat\theta_j &= \ln\left(\frac{\sum_i x_{ij}}{\sum_i \xiplus}\right) \end{align}

where \(\xiplus \triangleq \sum_j x_{ij}\) (Sarkar and Stephens 2020). This approach is equivalent to constructing pseudobulk data \(y_{kj} \triangleq \sum_i x_{ij} z_{ik}\), where \(z_{ik}\) indicates whether cell \(i\) came from donor \(k\), and using \(\ln(y_{kj} / y_{k+})\) as the estimated mean of true log gene expression, where \(y_{k+} \triangleq \sum_j y_{kj}\). However, the relationship between the voom-estimated variance and the true gene expression variance is unclear because the variance used by voom is between individuals, not within an individual (although there is some evidence that the two are highly correlated). Further, it is unlikely that a point mass expression model will be supported by the data.
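
As a concrete illustration of the pseudobulk construction, here is a minimal numpy sketch (the counts and donor assignments are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=2, size=(6, 5)).astype(float)   # cells x genes (illustrative)
donor = np.array([0, 0, 1, 1, 2, 2])                # donor label per cell
z = (donor[:, None] == np.arange(3)).astype(float)  # one-hot cells x donors
y = z.T @ x                                         # pseudobulk counts y_kj
with np.errstate(divide='ignore'):                  # genes with zero pseudobulk counts map to -inf
  log_mean = np.log(y) - np.log(y.sum(axis=1, keepdims=True))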

We previously developed a method to efficiently estimate more complex expression models in large-scale scRNA-seq data sets (Sarkar et al. 2019). Here, we use that method to investigate two new possibilities for a precision weight derived from fitted expression models: (1) the inverse squared standard error of a point mass model, or (2) the inverse variance of the log true expression under a Gamma model. Specifically, we ask whether these alternative approaches improve the power or robustness of DE analysis in scRNA-seq data.

Setup

import anndata
import numpy as np
import mpebpm
import pandas as pd
import scanpy as sc
import rpy2.robjects.packages
import rpy2.robjects.pandas2ri
import scipy.special as sp
import scipy.sparse as ss
import scipy.stats as st
import sqlite3
import scqtl
import torch

ashr = rpy2.robjects.packages.importr('ashr')
limma = rpy2.robjects.packages.importr('limma')
rpy2.robjects.pandas2ri.activate()
%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'

Methods

Standard error of point mass expression model

The standard error of \(\hat\theta_j\) is analytic

\begin{align} \frac{\partial^2 \ell}{\partial \theta_j^2} &= -\sum_i \xiplus \exp(\theta_j)\\ \mathcal{I}(\theta_j) &= -\E\left[\frac{\partial^2 \ell}{\partial \theta_j^2}\right] = \sum_i \xiplus \exp(\theta_j)\\ s_j^2 &= \frac{1}{\sum_i \xiplus \exp(\theta_j)}, \end{align}

where we have treated \(\xiplus\) as fixed. This treatment is justified by the fact that the Poisson measurement model for each gene arises from a Multinomial measurement model for all genes jointly, in which the total number of molecules observed is fixed rather than a sum of random variables. As an illustrative example, plot the bootstrap distribution of the \(\hat\theta_j\) against a normal density with mean \(\theta_j\) and variance \(s_j^2\) for a simple simulation.

rng = np.random.default_rng(1)
n_trials = 1000
n = 100
s = 1e4
theta = -11
thetahat = []
for i in range(n_trials):
  x = rng.poisson(s * np.exp(theta), size=n)
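  # MLE from the pooled counts: ln(sum_i x_i / (n s))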
  thetahat.append(np.log(x.sum()) - np.log(n) - np.log(s))
thetahat = np.array(thetahat)
plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.hist(thetahat, bins=16, density=True, color='0.7')
grid = np.linspace(thetahat.min(), thetahat.max(), 1000)
plt.plot(grid, st.norm(loc=theta, scale=np.sqrt(1 / (np.exp(theta) * n * s))).pdf(grid), lw=1, c='k')
plt.xlabel('Est ln mean gene expression')
plt.ylabel('Density')
plt.tight_layout()

analytic-se-log-link.png

After introducing multiplicative effects \(\vb_j\) for observed technical covariates \(\vc_i\) into the measurement model

\begin{equation} x_{ij} \mid \xiplus, \vc_i, \vb_j, \theta_j \sim \Poi(\xiplus \exp(\vc_i' \vb_j + \theta_j)), \end{equation}

the standard error of \(\hat\theta_j\) also depends on \(\vc_i'\vb_j\). In contrast, if we assume the identity link

\begin{align} x_{ij} \mid \xiplus, \mu_j &\sim \Poi(\xiplus \mu_j)\\ \ell \triangleq \sum_i \ln p(x_{ij} \mid \xiplus, \mu_j) &= \sum_i x_{ij} \ln(\xiplus \mu_j) - \xiplus \mu_j + \mathrm{const}\\ \frac{\partial \ell}{\partial \mu_j} &= \sum_i \frac{x_{ij}}{\mu_j} - \xiplus\\ \hat\mu_j &= \frac{\sum_i x_{ij}}{\sum_i \xiplus}\\ \frac{\partial^2 \ell}{\partial \mu_j^2} &= -\sum_i \frac{x_{ij}}{\mu_j^2}\\ \mathcal{I}(\mu_j) &= -\E\left[\frac{\partial^2 \ell}{\partial \mu_j^2}\right] = \frac{\E[\sum_i x_{ij}]}{\mu_j^2} = \frac{\sum_i \xiplus}{\mu_j}\\ s_j^2 &= \frac{\mu_j}{\sum_i \xiplus}, \end{align}

where we have used the fact that \(\sum_i x_{ij} \sim \Poi(\mu_j \sum_i \xiplus)\). Surprisingly, \(\ln \hat\mu_j = \hat\theta_j\), the standard error of \(\hat\mu_j\) increases as \(\mu_j\) increases, and the standard error does not depend on technical covariates or their effects. As a sanity check, plot the bootstrap distribution of \(\hat\mu_j\) against a normal density with mean \(\mu_j\) and variance \(s_j^2\) for a simple simulation.

rng = np.random.default_rng(2)
n_trials = 1000
n = 100
s = 1e4
log_mu = -10
muhat = []
for i in range(n_trials):
  x = rng.poisson(s * np.exp(log_mu), size=n)
  muhat.append(x.sum() / (n * s))
muhat = np.array(muhat)
plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.hist(muhat, bins=14, density=True, color='0.7')
grid = np.linspace(muhat.min(), muhat.max(), 1000)
plt.plot(grid, st.norm(loc=np.exp(log_mu), scale=np.sqrt(np.exp(log_mu) / (n * s))).pdf(grid), lw=1, c='k')
plt.xlabel('Est mean gene expression')
plt.ylabel('Density')
plt.tight_layout()

analytic-se-identity-link.png

With an analytic expression for \(s_j^2\), we can also derive a \(z\)-test for DE, where

\begin{align} \frac{\hat\theta_{1j} - \hat\theta_{0j}}{\sqrt{s_{1j}^2 + s_{0j}^2}} \sim \N(0, 1) \end{align}

under the null. However, one clear issue is that the estimator \(\hat\theta_j\) is biased, because

\begin{equation} \E[\ln \textstyle\sum_i x_{ij}] = \sum_{t=0}^{\infty} \ln(t) \Poi(t; \textstyle\sum_i \xiplus \exp(\theta_j)) = -\infty. \end{equation}

This issue was previously noted (e.g., Lun 2018) via Taylor expansion

\begin{equation} \E[\ln x] \approx \ln \E[x] - \frac{\V[x]}{2 \E[x]^2}, \end{equation}

which, for \(\sum_i x_{ij} \sim \Poi(m_j)\) with \(m_j = \sum_i \xiplus \exp(\theta_j)\), gives a bias of approximately \(-1/(2 m_j)\), demonstrating that the bias is worse for lower-expressed genes. To illustrate the issue, plot the theoretical quantiles of the \(\N(\theta_j, s^2_j)\) distribution against the empirical quantiles of the (parametric, oracle) bootstrap distribution of \(\hat\theta_j\).

grid = np.linspace(0, 1, thetahat.shape[0] + 1)[1:]
plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.xscale('log')
plt.yscale('log')
plt.plot(grid, st.norm(loc=theta, scale=np.sqrt(1 / (np.exp(theta) * n * s))).cdf(np.sort(thetahat)), lw=1, c='k')
lim = [5e-7, 1]
plt.plot(lim, lim, lw=1, ls=':', c='r')
plt.xlim(lim)
plt.ylim(lim)
plt.xlabel('Theoretical quantile')
plt.ylabel('Empirical quantile')
plt.tight_layout()

analytic-se-log-link-qq.png

Variance of Gamma expression model

Assuming a Gamma expression model

\begin{align} \lambda_{ij} &\sim \Gam(\phi_j^{-1}, \mu_j^{-1} \phi_j^{-1})\\ \E[\ln \lambda_{ij}] &= \digamma(\phi_j^{-1}) + \ln(\mu_j \phi_j)\\ \V[\ln \lambda_{ij}] &= \trigamma(\phi_j^{-1}), \end{align}

where the Gamma distribution is parameterized by shape and rate, \(\digamma(\cdot)\) denotes the digamma function, and \(\trigamma(\cdot)\) denotes the trigamma function. We previously noted that robustly estimating \(\phi_j\) is difficult, even from hundreds of cells per condition; despite this difficulty, our method can still accurately estimate the variance of true gene expression.
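
For reference, the moments above can be computed from fitted Gamma parameters as in the sketch below, assuming the parameterization used by mpebpm's ebpm_gamma elsewhere in this analysis (log_mean \(= \ln\mu_j\), log_inv_disp \(= -\ln\phi_j\)); the parameter values are illustrative:

import numpy as np
import scipy.special as sp

log_mean = np.array([-9., -7.5])      # ln mu per gene (illustrative values)
log_inv_disp = np.array([1., -0.5])   # ln(1 / phi) per gene (illustrative values)
mean_log = sp.digamma(np.exp(log_inv_disp)) + log_mean - log_inv_disp  # E[ln lambda]
var_log = sp.polygamma(1, np.exp(log_inv_disp))                        # V[ln lambda]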

Controlling FDR

Given transformed data and standard errors, DE analysis is performed in three steps:

  1. Estimate the effect of the covariate of interest by GLS
  2. Estimate moderated test statistics and \(p\)-values by EB treatment of the standard errors from (1)
  3. Control FDR by applying e.g., the BH procedure to the \(p\)-values from (2)

Lu and Stephens 2019 describe a more powerful approach to solve (3).
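
For reference, step (3) using the BH procedure amounts to the following minimal numpy sketch (equivalent in intent to R's p.adjust with method='BH', which is what is called below):

import numpy as np

def bh_adjust(p):
  """Return BH-adjusted p-values, in the same order as the input"""
  p = np.asarray(p, dtype=float)
  m = p.shape[0]
  order = np.argsort(p)
  adj = p[order] * m / np.arange(1, m + 1)
  # Enforce monotonicity of the adjusted p-values
  adj = np.minimum.accumulate(adj[::-1])[::-1]
  adj = np.clip(adj, 0, 1)
  result = np.empty_like(adj)
  result[order] = adj
  return result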

Simulation

Implement a simplified DSC (dynamic statistical comparison).

def simulate_null(dat, b=0, n_donors=1, n_cells=100, min_cells=1, seed=0):
  """Return AnnData object with simulated counts, and onehot matrix of labels

  Simulate data under the null by randomly assigning cells to donors. Simulate
  data with unequal sequencing depths (unequal total number of molecules
  observed) by subsequently applying binomial thinning (Gerard 2019) to *every*
  gene.

  dat - AnnData object
  b - log fold change in total number of molecules observed between groups
  n_donors - number of donors per group
  n_cells - number of cells per donor
  min_cells - minimum number of non-zero observations to retain genes

  """
  assert 2 * n_donors * n_cells <= dat.shape[0]
  if b > 0:
    b = -b
  query = sc.pp.subsample(dat, n_obs=2 * n_donors * n_cells, random_state=seed, copy=True)
  sc.pp.filter_genes(query, min_cells=min_cells)
  onehot = ss.coo_matrix((np.ones(query.shape[0]), (np.arange(query.shape[0]), np.repeat(np.arange(2 * n_donors), n_cells)))).tocsr()
  if np.isclose(b, 0):
    return query, onehot
  else:
    if ss.isspmatrix(query.X):
      # Important: keep this sparse to make downsampling non-zeros easier
      temp = query.X[:n_donors * n_cells].astype(int).tocsr()
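      # Binomially thin the non-zero counts of the first group of cells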
      y = ss.csr_matrix((st.binom(n=temp.data, p=np.exp(b)).rvs().astype(float), temp.indices, temp.indptr), shape=temp.shape)
      x = ss.vstack((y, query.X[n_donors * n_cells:]), format='csr')
    else:
      # TODO: this breaks for n=0, but treating e.g. bulk RNA-seq counts as
      # sparse is *very* slow
      x = st.binom(n=query.X.astype(int), p=np.exp(b)).rvs().astype(float)
    return anndata.AnnData(x, obs=query.obs, var=query.var), onehot

def simulate_effects_law_2014(size, pi0=0.9):
  """Return effect sizes assumed by Law et al. 2014"""
  b = np.zeros(size)
  z = np.random.uniform(size=size) > pi0
  s = np.random.uniform(size=size) > 0.5
  b[z & s] = 2
  b[z & ~s] = -2
  return b

def simulate_effects_pn(size, pi0=0.9, scale=1):
  """Return effect sizes assuming a point-Normal distribution"""
  b = np.zeros(size)
  z = np.random.uniform(size=size) > pi0
  b[z] = st.norm(scale=scale).rvs(size=z.sum())
  return b

def simulate_study(dat, g, b=0, n_donors=1, n_cells=100, min_cells=10, seed=0):
  """Return AnnData object with simulated counts, onehot matrix, design matrix,
and true effect size vector

  Simulate non-null data assuming effect size distribution g by first
  simulating null data, then applying binomial thinning (Gerard 2019) to effect
  genes.

  dat - AnnData object
  g - function which returns array of effect sizes
  b - log fold change in total number of molecules observed between groups
  n_donors - number of donors per group
  n_cells - number of cells per donor
  min_cells - minimum number of non-zero observations to retain genes

  """
  x, onehot = simulate_null(dat, b=b, n_donors=n_donors, n_cells=n_cells, min_cells=min_cells, seed=seed)
  lfc = g(size=x.shape[1])
  if ss.isspmatrix(x.X):
    # Convert to CSC to deal with half the cells at a time
    y = x.X.tocsc()
    data = y.data.copy()
    # Iterate over genes (columns of the CSC matrix)
    for j in range(y.shape[1]):
      if lfc[j] > 0:
        # Thin cells in the second group
        idx = y.indptr[j] + np.where(y.indices[y.indptr[j]:y.indptr[j+1]] >= n_donors * n_cells)[0]
        data[idx] = st.binom(n=y.data[idx].astype(int), p=np.exp(-lfc[j])).rvs(size=len(idx)).astype(float)
      elif lfc[j] < 0:
        # Thin cells in the first group
        idx = y.indptr[j] + np.where(y.indices[y.indptr[j]:y.indptr[j+1]] < n_donors * n_cells)[0]
        data[idx] = st.binom(n=y.data[idx].astype(int), p=np.exp(lfc[j])).rvs(size=len(idx)).astype(float)
    x = anndata.AnnData(ss.csc_matrix((data, y.indices, y.indptr), shape=y.shape), var=x.var, obs=x.obs)
  else:
    y = x.X.copy()
    query = lfc < 0
    # Thin the first group for negative effects and the second group for
    # positive effects, matching the sparse branch above
    y[:n_donors * n_cells,query] = st.binom(n=y[:n_donors * n_cells,query].astype(int), p=np.exp(lfc[query])).rvs()
    y[n_donors * n_cells:,~query] = st.binom(n=y[n_donors * n_cells:,~query].astype(int), p=np.exp(-lfc[~query])).rvs()
    x = anndata.AnnData(y, var=x.var, obs=x.obs)
  if n_donors > 1:
    design = np.ones((2 * n_donors, 2))
    design[:n_donors,0] = 0
  else:
    design = None
  return x, onehot, design, lfc

def estimate_limma_voom(x, onehot, design=None, **kwargs):
  """Return MarrayLM object

  If design is None, assume rows of x are statistical units, and onehot[:,0]
  gives assignment of units to groups. Otherwise, assume onehot gives
  assignment of cells to donors, donors are statistical units, and design [:,0]
  gives the assignment of units to groups.

  x - count matrix (dense)
  onehot - mapping of cells to groups/donors (dense)
  design - design matrix for limma (dense)

  """
  # Important: limma expects genes x samples
  if design is None:
    design = np.vstack([onehot[:,0], np.ones(x.shape[0])]).T
    y = limma.voom(x.T, design)
  else:
    y = limma.voom(x.T @ onehot, design)
  fit = limma.lmFit(y, design)
  return fit

def estimate_point(x, onehot, **kwargs):
  """Return DataFrame of (bhat, se, p)

  Assume cells are units, and for each group estimate θ_j = log μ_j and its SE
  under a point mass expression model.

  x - count matrix (dense)
  onehot - mapping of cells to groups (dense)

  """
  # Allow -inf
  theta = np.log(onehot.T @ x) - np.log(onehot.T @ x.sum(axis=1, keepdims=True))
  if ss.isspmatrix(theta):
    theta = theta.A
  bhat = theta[0] - theta[1]
  V = 1 / (onehot.T @ x)
  se = np.sqrt(V.sum(axis=0))
  z = bhat / se
  p = st.norm().sf(abs(z))
  return pd.DataFrame({'bhat': bhat, 'se': se, 'p': p})

def estimate_rqr(x, onehot, **kwargs):
  """Return DataFrame of (bhat, se, p)

  Assume cells are units, estimate an expression model for each group, and then
  estimate randomized quantile residuals for each observation (Dunn and Smyth
  1996). Perform a z-test using the residuals as the data.

  This method is related to the sctransform method (Hafemeister and Satija
  2019), which uses Pearson residuals of a GLM. The advantage of using RQRs is
  that they can be computed for expression models which do not give rise to
  observation models which can be fit using GLMs (e.g., Gamma expression models
  with different variances per group).

  Dunn and Smyth, "Randomized Quantile Residuals". J Comput Graph Stat 1996.
  http://www.jstor.org/stable/1390802

  Hafemeister and Satija, "Normalization and variance stabilization of
  single-cell RNA-seq data using regularized negative binomial
  regression". Genome Biol 2019.
  https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1

  x - count matrix (dense)
  onehot - mapping of cells to groups (dense)

  """
  n, p = x.shape
  _, m = onehot.shape
  s = x.sum(axis=1, keepdims=True)
  # TODO: don't fix expression model?
  # Heuristic: fix total number of updates
  num_epochs = 6000 * 32 // x.shape[0]
  log_mean, log_inv_disp, logodds = mpebpm.ebpm_point_gamma(x, s, onehot=onehot, batch_size=32, num_epochs=num_epochs, shuffle=True)

  n = onehot @ np.exp(log_inv_disp)
  p = 1 / (1 + (s * (onehot @ np.exp(log_mean - log_inv_disp))))
  pi0 = onehot @ sp.expit(logodds)
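  # Randomized quantile residuals: draw u ~ Uniform(F(x - 1), F(x)) under the
  # fitted point-Gamma (ZINB) model and map through the standard normal quantile function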
  F = st.nbinom(n, p).cdf(x - 1)
  F = np.where(x - 1 >= 0, pi0 + (1 - pi0) * F, F)
  f = st.nbinom(n, p).pmf(x)
  f *= (1 - pi0)
  f[x == 0] += pi0[x == 0]
  u = np.random.uniform(size=F.shape)
  rpp = F + u * f
  resid = st.norm().ppf(rpp)

  bhat = np.diff((onehot.T @ resid) / (onehot.sum(axis=0).reshape(-1, 1)), axis=0).ravel()
  V = 1 / (onehot.T @ x)
  se = np.sqrt(V.sum(axis=0))
  z = bhat / se
  p = st.norm().sf(abs(z))
  return pd.DataFrame({'bhat': bhat, 'se': se, 'p': p})

def estimate_wls_point(x, onehot, design=None, **kwargs):
  """Return MarrayLM object

  Instead of voom, estimate θ_j = log μ_j under a point mass expression model
  and its sampling variance, and use those as input to WLS.

  If design is None, assume rows of x are statistical units, and onehot[:,0]
  gives assignment of units to groups. Otherwise, assume onehot gives
  assignment of cells to donors, donors are statistical units, and design [:,0]
  gives the assignment of units to groups.

  x - count matrix (dense)
  onehot - mapping of cells to groups/donors (dense)
  design - design matrix for limma (dense)

  """
  s = x.sum(axis=1, keepdims=True)
  if design is None:
    log_mean = np.log(x + 1) - np.log(s)
    # Important: this only reduces when there are no technical covariates in
    # the measurement model
    w = x
    design = np.vstack([onehot[:,0], np.ones(x.shape[0])]).T
  else:
    log_mean = np.log(onehot.T @ x + 1) - np.log(onehot.T @ s)
    w = onehot.T @ x
  fit = limma.lm_series(log_mean.T, design=design, weights=w.T)
  return fit

def estimate_wls_gamma(x, onehot, design, lr=1e-2, num_epochs=40, batch_size=64, shuffle=True, log_dir=None, **kwargs):
  """Return DataFrame of bhat, se

  Instead of voom, estimate E[log λ_{ij}] and V[log λ_{ij}] under a Gamma
  model, and use those as input to WLS.

  Assume donors are statistical units, onehot gives assignment of cells to
  donors, and design [:,0] gives the assignment of units to groups.

  """
  # Important: this will be too slow on CPU
  assert torch.cuda.is_available()
  s = x.sum(axis=1, keepdims=True)
  log_mean, log_inv_disp = mpebpm.ebpm_gamma(
    x,
    s=s,
    onehot=onehot,
    lr=lr,
    num_epochs=num_epochs,
    batch_size=batch_size,
    shuffle=shuffle,
    log_dir=log_dir)
  # [n_donors, n_genes]
  m = sp.digamma(np.exp(log_inv_disp)) + log_mean - log_inv_disp
  w = 1 / sp.polygamma(1, np.exp(log_inv_disp))
  fit = limma.lm_series(m.T, design=design, weights=w.T)
  return fit

def estimate_moderated_t(fit):
  fit = limma.eBayes(fit)
  return fit.rx2('p.value')[:,0]

def estimate_z(fit):
  sf = st.chi2(1).sf
  # Important: this has estimates for the intercept also
  stat = fit.rx2('coefficients') / fit.rx2('stdev.unscaled')
  pval = sf(np.square(stat))
  return pval[:,0]

def evaluate_type1(dat, b=0, alpha=0.01, n_cells=100, n_trials=1, min_cells=10):
  """Estimate Type 1 error to detect DE genes, treating cells as units.

  Use binomial thinning to simulate scenarios with unequal sequencing depth
  (total number of molecules observed).

  dat - AnnData object
  b - log fold change in total number of molecules observed between groups
  alpha - significance level
  n_cells - number of cells per group
  n_trials - number of simulation trials
  min_cells - minimum number of non-zero observations to retain a gene

  """
  result = []
  for i in range(n_trials):
    x, onehot = simulate_null(dat, b=b, n_donors=1, n_cells=n_cells, min_cells=min_cells, seed=i)
    if ss.isspmatrix(x.X):
      x = x.X.A
    else:
      x = x.X
    onehot = onehot.A
    # Heuristic: fix total number of updates
    num_epochs = 6000 * 64 // x.shape[0]
    for method in ('limma_voom', 'wls_point'):
      fit = globals()[f'estimate_{method}'](x, onehot, batch_size=64, num_epochs=num_epochs)
      for test in ('moderated_t', 'z'):
        pval = globals()[f'estimate_{test}'](fit)
        result.append((b, n_cells, method, test, i, (pval < alpha).mean()))
    for method in ('point', 'rqr'):
      fit = globals()[f'estimate_{method}'](x, onehot, batch_size=64, num_epochs=num_epochs)
      result.append((b, n_cells, method, 'z', i, (fit['p'] < alpha).mean()))
  result = pd.DataFrame(result, columns=['b', 'n_cells', 'method', 'test', 'trial', 'fdp'])
  return result

def evaluate_type1_by_donor(dat, n_donors, b=0, alpha=0.01, n_cells=100, n_trials=1, min_cells=10):
  """Estimate Type 1 error to detect DE genes, treating donors as units.

  Use binomial thinning to simulate scenarios with unequal sequencing depth
  (total number of molecules observed).

  dat - AnnData object
  b - log fold change in total number of molecules observed between groups
  n_donors - number of donors per group
  alpha - significance level
  n_cells - number of cells per donor
  n_trials - number of simulation trials
  min_cells - minimum number of non-zero observations to retain a gene

  """
  result = []
  for i in range(n_trials):
    x, onehot = simulate_null(dat, b=b, n_donors=n_donors, n_cells=n_cells, min_cells=min_cells, seed=i)
    if ss.isspmatrix(x.X):
      x = x.X.A
    else:
      x = x.X
    onehot = onehot.A
    design = np.ones((2 * n_donors, 2))
    design[:n_donors,0] = 0
    # Heuristic: fix total number of updates
    num_epochs = 6000 * 64 // x.shape[0]
    for method in ('limma_voom', 'wls_gamma', 'wls_point'):
      fit = globals()[f'estimate_{method}'](x, onehot, design=design, batch_size=64, num_epochs=num_epochs)
      for test in ('moderated_t', 'z'):
        pval = globals()[f'estimate_{test}'](fit)
        result.append((b, n_cells, method, test, i, (pval < alpha).mean()))
  result = pd.DataFrame(result, columns=['b', 'n_cells', 'method', 'test', 'trial', 'fdp'])
  return result

def estimate_fdr_bh(fit):
  """Return vector of BH adjusted p-values"""
  return rpy2.robjects.r['p.adjust'](fit.rx2('p.value')[:,0], method='BH')

def estimate_lfsr(fit):
  """Return vector of lfsr

  Implement the pipeline of Lu and Stephens 2019

  https://arxiv.org/pdf/1901.10679

  """
  fit1 = ashr.ash(
    betahat=pd.Series(fit.rx2('coefficients')[:,0]),
    sebetahat=pd.Series(np.sqrt(((fit.rx2('df.total') - fit.rx2('df.residual')) * fit.rx2('s2.prior') + fit.rx2('df.residual') * fit.rx2('stdev.unscaled')[:,0]) / fit.rx2('df.total'))),
    df=fit.rx2('df.total')[0],
    mixcompdist='halfuniform')
  return ashr.get_lfsr(fit1)

def evaluate_power(dat, g, b=0, fdr=0.1, n_trials=1, n_cells=100, min_cells=10):
  """Evaluate power to detect DE genes, assuming effect size distribution g, at
specified FDR.

  dat - AnnData object
  g - function which returns array of effect sizes
  b - log fold change in total number of molecules observed between groups
  fdr - nominal false discovery rate
  n_trials - number of simulation trials
  n_cells - number of cells per group
  min_cells - minimum number of non-zero observations to retain genes

  """
  result = []
  for i in range(n_trials):
    x, onehot, design, lfc = simulate_study(dat, g, b=b, n_donors=1, n_cells=n_cells, min_cells=min_cells, seed=i)
    if ss.isspmatrix(x.X):
      x = x.X.A
    else:
      x = x.X
    onehot = onehot.A
    # Heuristic: fix total number of updates
    num_epochs = 6000 * 64 // x.shape[0]
    for method in ('limma_voom', 'wls_point'):
      fit = globals()[f'estimate_{method}'](x, onehot, design=design, batch_size=64, num_epochs=num_epochs)
      # Important: always use moderated t test
      fit = limma.eBayes(fit)
      for fdr_method in ('fdr_bh', 'lfsr'):
        score = globals()[f'estimate_{fdr_method}'](fit)
        fdp = ((score < fdr) & np.isclose(lfc, 0)).mean()
        power = (score[~np.isclose(lfc, 0)] < fdr).mean()
        result.append((fdr, n_cells, method, fdr_method, i, fdp, power))
    for method in ('point', 'rqr',):
      fit = globals()[f'estimate_{method}'](x, onehot, design=design, batch_size=64, num_epochs=num_epochs)
      score = rpy2.robjects.r['p.adjust'](fit['p'], method='BH')
      fdp = ((score < fdr) & np.isclose(lfc, 0)).mean()
      power = (score[~np.isclose(lfc, 0)] < fdr).mean()
      result.append((fdr, n_cells, method, 'fdr_bh', i, fdp, power))
  result = pd.DataFrame(result, columns=['fdr', 'n_cells', 'method', 'fdr_method', 'trial', 'fdp', 'power'])
  return result

def evaluate_power_by_donor(dat, g, n_donors, b=0, fdr=0.1, n_trials=1, n_cells=100, min_cells=10):
  """Evaluate power to detect DE genes, assuming effect size distribution g, at
specified FDR.

  dat - AnnData object
  g - function which returns array of effect sizes
  n_donors - number of donors per group
  b - log fold change in total number of molecules observed between groups
  fdr - False Discovery Rate
  n_trials - number of simulation trials
  n_cells - number of cells per donor
  min_cells - minimum number of non-zero observations to retain genes

  """
  assert n_donors > 1
  result = []
  for i in range(n_trials):
    x, onehot, design, lfc = simulate_study(dat, g, b=b, n_donors=n_donors, n_cells=n_cells, min_cells=min_cells, seed=i)
    if ss.isspmatrix(x.X):
      x = x.X.A
    else:
      x = x.X
    onehot = onehot.A
    # Heuristic: fix total number of updates
    num_epochs = 6000 * 64 // x.shape[0]
    for method in ('limma_voom', 'wls_gamma', 'wls_point'):
      fit = globals()[f'estimate_{method}'](x, onehot, design=design, batch_size=64, num_epochs=num_epochs)
      # Important: always use moderated t test
      fit = limma.eBayes(fit)
      for fdr_method in ('fdr_bh', 'lfsr'):
        score = globals()[f'estimate_{fdr_method}'](fit)
        fdp = ((score < fdr) & np.isclose(lfc, 0)).mean()
        power = (score[~np.isclose(lfc, 0)] < fdr).mean()
        result.append((fdr, n_donors, n_cells, method, fdr_method, i, fdp, power))
  result = pd.DataFrame(result, columns=['fdr', 'n_donors', 'n_cells', 'method', 'fdr_method', 'trial', 'fdp', 'power'])
  return result

To generate null data, randomly sample cells from a homogeneous population, and randomly assign labels. Use 10X v3 PBMC data, which has more molecules observed per cell on average.

dat = anndata.read_h5ad('/scratch/midway2/aksarkar/modes/10k_pbmc_v3.h5ad')

Results

Accuracy of estimation

We previously evaluated mpebpm by simulating from a point-Gamma expression model. Now, simulate from a Gamma model, and evaluate the accuracy of estimating \(\E[\ln\lambda_{ij}]\) and \(\V[\ln\lambda_{ij}]\).

def evaluate(num_samples, num_mols, num_trials=10, **kwargs):
  # Important: generate all of the samples for each trial in one shot, and use
  # one-hot encoding to get separate estimates
  args = [(num_samples * num_trials, num_mols, log_mu, log_phi, -1000, None, None, None)
          for log_mu in np.linspace(-12, -6, 7)
          for log_phi in np.linspace(-4, 0, 5)]
  x = np.concatenate([scqtl.simulation.simulate(*a)[0][:,:1] for a in args], axis=1)
  x = ss.csr_matrix(x)
  s = num_mols * np.ones((x.shape[0], 1))
  onehot = np.zeros((num_samples * num_trials, num_trials))
  onehot[np.arange(onehot.shape[0]), np.arange(onehot.shape[0]) // num_samples] = 1
  onehot = ss.csr_matrix(onehot)

  # Important: this is analytic
  theta = mpebpm.sgd.ebpm_point(x.A, s=s, onehot=onehot.A)
  log_mu, neg_log_phi = mpebpm.sgd.ebpm_gamma(x, s=s, onehot=onehot, **kwargs)
  result = pd.DataFrame(
    [(a[0] // num_trials, int(a[1]), int(a[2]), int(a[3]), int(a[4]), a[-1], trial)
     for a in args
     for trial in range(num_trials)],
    columns=['num_samples', 'num_mols', 'log_mu', 'log_phi', 'logodds', 'fold', 'trial'])
  result['theta_hat'] = theta.ravel(order='F')
  result['log_mu_hat'] = log_mu.ravel(order='F')
  result['log_phi_hat'] = -neg_log_phi.ravel(order='F')
  # Important: sign flipped in scqtl.simulation.simulate
  result['mean_log'] = sp.digamma(np.exp(-result['log_phi'])) + result['log_mu'] + result['log_phi']
  result['var_log'] = sp.polygamma(1, np.exp(-result['log_phi']))
  result['mean_log_hat'] = sp.digamma(np.exp(-result['log_phi_hat'])) + result['log_mu_hat'] + result['log_phi_hat']
  result['var_log_hat'] = sp.polygamma(1, np.exp(-result['log_phi_hat']))
  return result

Run the simulation.

result = [evaluate(num_samples=num_samples,
                   num_mols=num_mols,
                   batch_size=32,
                   num_epochs=num_epochs,
                   log_dir=f'runs/mpebpm/gamma-sim-{num_mols}-{num_samples}/')
          for num_mols in (10000, 100000)
          # Important: for fixed batch size, having more samples means more
          # updates to each parameter per epoch
          for num_samples, num_epochs in zip((100, 1000), (400, 40))]
pd.concat(result).to_csv('/scratch/midway2/aksarkar/ideas/mpebpm-gamma-sim.txt.gz', sep='\t')

Read the results.

result = pd.read_csv('/scratch/midway2/aksarkar/ideas/mpebpm-gamma-sim.txt.gz', sep='\t', index_col=0)

Compare the estimated \(\E[\ln\lambda_{ij}]\) to the ground truth.

plt.clf()
fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
fig.set_size_inches(4.5, 4.5)
lim = [-20, -5]
for a, (k, g) in zip(ax.ravel(), result.groupby(['num_samples', 'num_mols'])):
  a.scatter(sp.digamma(np.exp(-g['log_phi'])) + g['log_mu'] + g['log_phi'],
            sp.digamma(np.exp(-g['log_phi_hat'])) + g['log_mu_hat'] + g['log_phi_hat'], s=1, c='k', alpha=0.3)
  a.plot(lim, lim, lw=1, ls=':', c='r')
  a.set_xlim(lim)
  a.set_ylim(lim)
  a.set_title(f'$n$={k[0]}, $s$={k[1]}')
  a.set_xlabel('True $\mathrm{E}[\ln\ \lambda]$')
for a in ax:
  a[0].set_ylabel('Estimated $\mathrm{E}[\ln\ \lambda]$')
fig.tight_layout()

sim-mean-log.png

Estimate \(\E[\ln\lambda_{ij}]\) under a point mass expression model, and compare to the ground truth.

plt.clf()
fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
fig.set_size_inches(4.5, 4.5)
lim = [-15, -5]
for a, (k, g) in zip(ax.ravel(), result.groupby(['num_samples', 'num_mols'])):
  a.scatter(sp.digamma(np.exp(-g['log_phi'])) + g['log_mu'] + g['log_phi'], g['theta_hat'], s=1, c='k', alpha=0.3)
  a.plot(lim, lim, lw=1, ls=':', c='r')
  a.set_xlim(lim)
  a.set_ylim(lim)
  a.set_title(f'$n$={k[0]}, $s$={k[1]}')
  a.set_xlabel('True $\mathrm{E}[\ln\ \lambda]$')
for a in ax:
  a[0].set_ylabel('Estimated $\mathrm{E}[\ln\ \lambda]$')
fig.tight_layout()

sim-gam-vs-point.png

Compare the estimated \(\V[\ln\lambda_{ij}]\) to the ground truth.

plt.clf()
fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
fig.set_size_inches(4.5, 4.5)
lim = [1e-4, 1e2]
for a, (k, g) in zip(ax.ravel(), result.groupby(['num_samples', 'num_mols'])):
  query = g
  a.set_xscale('log')
  a.set_yscale('log')
  a.scatter(query['var_log'], query['var_log_hat'], s=1, c='k', alpha=0.2)
  a.plot(lim, lim, lw=1, ls=':', c='r')
  a.set_xlim(lim)
  a.set_ylim(lim)
  a.set_title(f'$n$={k[0]}, $s$={k[1]}')
  a.set_xlabel('True $\mathrm{V}[\ln\ \lambda]$')
for a in ax:
  a[0].set_ylabel('Estimated $\mathrm{V}[\ln\ \lambda]$')
fig.tight_layout()

sim-var-log.png

Compare the estimated \(\V[\ln\lambda_{ij}]\) to the ground truth, restricting to genes with \(\ln\mu > -10\).

plt.clf()
fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
fig.set_size_inches(4.5, 4.5)
lim = [1e-4, 1e2]
for a, (k, g) in zip(ax.ravel(), result.groupby(['num_samples', 'num_mols'])):
  query = g[g['log_mu'] > -10]
  a.set_xscale('log')
  a.set_yscale('log')
  a.scatter(query['var_log'], query['var_log_hat'], s=1, c='k', alpha=0.2)
  a.plot(lim, lim, lw=1, ls=':', c='r')
  a.set_xlim(lim)
  a.set_ylim(lim)
  a.set_title(f'$n$={k[0]}, $s$={k[1]}')
  a.set_xlabel('True $\mathrm{V}[\ln\ \lambda]$')
for a in ax:
  a[0].set_ylabel('Estimated $\mathrm{V}[\ln\ \lambda]$')
fig.tight_layout()

sim-var-log-mu-pass.png

Estimation of true effect size distribution

To estimate power while controlling FDR, we need to make an assumption about the true distribution of effect sizes. We make a data-driven choice by using ashr to estimate the distribution of true effects from the observed effect sizes and standard errors output by limma-voom on a constructed problem.

dat = anndata.read_h5ad('/scratch/midway2/aksarkar/ideas/zheng-10-way.h5ad')

Construct a problem comparing B cells to cytotoxic T cells. Randomly assign 50 sorted cells to each of 128 donors within each cell type.

x1, onehot1 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=128, n_cells=50, min_cells=1)
x2, onehot2 = simulate_null(dat[dat.obs['cell_type'] == 'cytotoxic_t'], n_donors=128, n_cells=50, min_cells=1)
mix = x1.concatenate(x2)
sc.pp.filter_genes(mix, min_counts=10)
onehot = ss.block_diag([onehot1, onehot2], format='csr')
design = np.ones((256, 2))
design[:128,0] = 0

Fit limma-voom, followed by EB shrinkage of standard errors.

fit0 = limma.eBayes(estimate_limma_voom(mix.X.A, onehot, design))

Fit ashr, using the moderated standard errors and degrees of freedom.

ashr = rpy2.robjects.packages.importr('ashr')
fit1 = ashr.ash(
  betahat=pd.Series(fit0.rx2('coefficients')[:,0]),
  sebetahat=pd.Series(np.sqrt((fit0.rx2("df.prior") * fit0.rx2("s2.prior") + fit0.rx2("df.residual") * fit0.rx2("stdev.unscaled")[:,0]) / fit0.rx2("df.total"))),
  df=fit0.rx2('df.total')[0],
  mixcompdist='halfuniform')

Look at the fitted prior distribution of true effects. Find an analytic distribution whose tail behavior is close enough.

cm = plt.get_cmap('Paired')
grid = np.linspace(-3, 3, 1000)
F = ashr.cdf_ash(fit1, grid).rx2('y').ravel()
F2 = st.t(scale=0.1, df=1).cdf(grid)
plt.clf()
plt.gcf().set_size_inches(4.5, 2.5)
plt.plot(grid, F, lw=1, c=cm(0), label='ashr')
plt.plot(grid, F2, lw=1, c=cm(1), label=f'$t_1(0, 0.1^2)$')
plt.axvline(x=0, lw=1, ls=':', c='0.5')
plt.legend(frameon=False)
plt.xlabel('Prior effect size')
plt.ylabel('CDF')
plt.tight_layout()

b_cells-cytotoxic_t-fitted_g.png

Simulation sanity check

As a sanity check, use bulk RNA-seq data from GTEx v6 lung.

dat = anndata.AnnData(rpy2.robjects.r['readRDS']('/project2/gilad/joycehsiao/dsc-log-fold-change/dsc/data/gtex_lung.rds').T)
n_donors = 4
x, onehot = simulate_null(dat, n_donors=2 * n_donors, n_cells=1, to_dense=True, min_counts=2000, seed=1)
design = np.ones((2 * n_donors, 2))
design[:n_donors,0] = 0
(estimate_moderated_t(estimate_limma_voom(x, onehot, design)) < 0.01).mean()
0.005917682388270624

np.random.seed(3)
pi0 = 0.9
z = np.random.uniform(size=x.shape[1]) > pi0
s = np.random.uniform(size=x.shape[1]) > 0.5
x1 = x[:n_donors].copy()
x1[:,z & s] = st.binom(n=x1[:,z & s].astype(int), p=0.5).rvs()
x2 = x[n_donors:].copy()
x2[:,z & ~s] = st.binom(n=x2[:,z & ~s].astype(int), p=0.5).rvs()
y = np.vstack([x1, x2])
fit = estimate_limma_voom(y, onehot, design)
fit = limma.eBayes(fit)
score = estimate_fdr_bh(fit)
(score[z] < 0.1).mean(), (score[~z] < 0.1).mean()
(0.33662477558348297, 0.0025470219435736676)

Type 1 error rate (single cells as units)

Simulate null data, varying the number of cells per group.

result = pd.concat(
  evaluate_type1(dat, n_cells=n_cells, min_cells=n_cells, n_trials=50)
  for n_cells in (10, 100, 1000))

Plot the distribution of Type 1 errors in each trial.

plt.clf()
fig, ax = plt.subplots(1, 3)
fig.set_size_inches(5.5, 2.5)
labels = ['LV-EB', 'LV-Z', 'PM-Z', 'SE-EB', 'SE-Z']
for a, (t, g) in zip(ax, result.groupby('n_cells')):
  a.boxplot(g.pivot_table(columns='trial', index=['method', 'test'], values='fdp'), widths=0.35, medianprops={'color': 'k'}, flierprops={'marker': '.', 'markersize': 2})
  a.axhline(y=0.01, c='r', lw=1, ls=':')
  a.set_xticklabels(labels, rotation=90)
  a.set_xlabel('Method')
  a.set_title(f'$n$={t}')
ax[0].set_ylabel('Type 1 error rate')
fig.tight_layout()

fpr.png

To generate null data with systematic differences in sequencing depth, follow the scheme above, then apply binomial thinning with \(p = 0.5\) to every gene.
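
As a quick check of the thinning property relied on here, binomially thinning Poisson counts with \(p = 0.5\) is equivalent to sampling with half the size factor. A minimal sketch with illustrative parameter values:

import numpy as np

rng = np.random.default_rng(0)
s, lam, n = 1e4, 1e-3, 100000          # size factor, true expression, cells (illustrative)
x = rng.poisson(s * lam, size=n)
x_thin = rng.binomial(x, 0.5)          # binomial thinning of every count
x_half = rng.poisson(0.5 * s * lam, size=n)
print(x_thin.mean(), x_half.mean())    # both are close to 0.5 * s * lam = 5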

result = dict()
for n_cells in (25, 50, 100):
  result[n_cells] = evaluate_type1(dat, b=np.log(2), n_cells=n_cells, min_cells=n_cells, n_trials=50)
result = pd.concat(result).reset_index(drop=True)
plt.clf()
fig, ax = plt.subplots(1, 3)
fig.set_size_inches(5.5, 2.5)
labels = ['LV-EB', 'LV-Z', 'PM-Z', 'SE-EB', 'SE-Z']
for a, (t, g) in zip(ax, result.groupby('n_cells')):
  a.boxplot(g.pivot_table(columns='trial', index=['method', 'test'], values='fdp'), widths=0.35, medianprops={'color': 'k'}, flierprops={'marker': '.', 'markersize': 2})
  a.axhline(y=0.01, c='r', lw=1, ls=':')
  a.set_xticklabels(labels, rotation=90)
  a.set_xlabel('Method')
  a.set_title(f'$n$={t}')
ax[0].set_ylabel('Type 1 error rate')
fig.tight_layout()

fpr-unequal-s.png

As an alternative view, fix the number of cells \(n=100\) and vary the difference in sequencing depth.

result = pd.concat(
  evaluate_type1(dat, b=b, n_cells=100, min_cells=100, n_trials=50)
  for b in np.log(np.linspace(1, 2, 4))
)
plt.clf()
fig, ax = plt.subplots(1, 4)
fig.set_size_inches(6.5, 2.5)
labels = ['LV-EB', 'LV-Z', 'PM-Z', 'SE-EB', 'SE-Z']
for a, (b, g) in zip(ax, result.groupby('b')):
  a.boxplot(g.pivot_table(columns='trial', index=['method', 'test'], values='fdp'), widths=0.35, medianprops={'color': 'k'}, flierprops={'marker': '.', 'markersize': 2})
  a.axhline(y=0.01, c='r', lw=1, ls=':')
  a.set_xticklabels(labels, rotation=90)
  a.set_xlabel('Method')
  a.set_title(f'$b$={np.exp(b):.2g}')
ax[0].set_ylabel('Type 1 error rate')
fig.tight_layout()

fpr-vs-b.png

Unequal size factors

To generate single cells with unequal size factors, include simulated doublets (sums of pairs of sorted cells) in one group.

dat = anndata.read_h5ad('/scratch/midway2/aksarkar/ideas/zheng-10-way.h5ad')
n_cells = 100
x1, onehot1 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=1, n_cells=n_cells, min_cells=0)
x2, onehot2 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=n_cells, n_cells=2, min_cells=0)
mix = ss.vstack([x1.X, onehot2.T @ x2.X], format='csr')
mix = anndata.AnnData(mix)
# Estimate the global mean for filtering genes
thetahat = np.log(mix.X.A.sum(axis=0) + 1) - np.log(mix.X.sum())
onehot = ss.block_diag([onehot1, np.ones((100, 1))], format='csr')


Fit limma-voom, followed by EB shrinkage of standard errors. Report the fraction of genes falsely called DE (at \(\alpha = 0.01\)) and the number of genes tested.

query = thetahat > -11
(estimate_moderated_t(estimate_limma_voom(mix[:,query].X.A, onehot.A)) < 0.01).mean(), query.sum()
(0.9590070598952403, 4391)

Repeat the analysis for a larger number of cells.

n_cells = 1000
x1, onehot1 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=1, n_cells=n_cells, min_cells=0, seed=1)
x2, onehot2 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=n_cells, n_cells=2, min_cells=0, seed=2)
mix = ss.vstack([x1.X, onehot2.T @ x2.X], format='csr')
mix = anndata.AnnData(mix)
# Estimate the global mean for filtering genes
thetahat = np.log(mix.X.A.sum(axis=0) + 1) - np.log(mix.X.sum())
onehot = ss.block_diag([onehot1, np.ones((n_cells, 1))], format='csr')
query = thetahat > -11
(estimate_moderated_t(estimate_limma_voom(mix[:,query].X.A, onehot.A)) < 0.01).mean(), query.sum()
(0.9706529713866471, 4089)

Fit a point mass expression model to each group.

theta = np.log(onehot.T @ mix.X[:,query].A + 1) - np.log(onehot.T @ mix.X.sum(axis=1)).A

Fit a Gamma expression model to each group.

log_mu, neg_log_phi = mpebpm.sgd.ebpm_gamma(
  mix.X[:,query].A,
  s=mix.X.sum(axis=1).A,
  onehot=onehot.A,
  batch_size=32,
  num_epochs=1000,
  log_dir='runs/mpebpm/b_cells-sim-unequal-s')

Compare the estimated mean log expression for each gene under the Gamma model to the estimate under the point mass model.

plt.clf()
fig, ax = plt.subplots(1, 2, sharey=True)
fig.set_size_inches(4, 2.5)
lim = [-12, -2]
for a, y, t in zip(ax, [theta.A, log_mu], ['Point mass', 'Gamma']):
  a.scatter(y[0], y[1], c='k', s=1, alpha=0.1)
  a.plot(lim, lim, lw=1, ls=':', c='r')
  a.set_xlim(lim)
  a.set_ylim(lim)
  a.set_title(t)
  a.set_xlabel('$\mathrm{E}[\ln\ \lambda]$ (size $s$)')
ax[0].set_ylabel('$\mathrm{E}[\ln\ \lambda]$ (size $2s$)')
fig.tight_layout()

b_cells-sim-unequal-s.png

The problem seems to be that in limma, observations \(x_{ij} = 0\) are treated as having \(\ln\lambda_{ij} = -\ln\xiplus\). This treatment makes some sense, because observing a zero is strong information that true gene expression is not much larger than \(1 / \xiplus\); however, with twice the \(\xiplus\), observing a zero bounds true gene expression more tightly. Therefore, limma is expected to identify all genes as “differentially expressed” in this scenario. This problem cannot be solved by instead using the posterior mean of gene expression under e.g., Gamma models for each group, because the posterior mean is

\begin{equation} \E[\ln\lambda_{ij} \mid x_{ij}, \xiplus, g_{jk}] = \psi(x_{ij} + \phi_{jk}^{-1}) - \ln(\xiplus + \mu_{jk}^{-1}\phi_{jk}^{-1}), \end{equation}

and we have shown empirically that the prior means are equal. Therefore, it will still be the case that the posterior means for observed zeros will be systematically different between two groups with systematically different size factors. Suppose instead we use the estimate \(\hat\theta_{jk}\) under a point mass model and its estimated standard error to compute a \(z\)-score.

# Important: this only reduces when there is no design matrix
s2 = 1 / (onehot.T @ mix.X[:,query].A)
z = (theta[0].ravel() - theta[1].ravel()) / np.sqrt(s2.sum(axis=0))
(st.norm().sf(abs(z)) < 0.01).mean()
0.016385424309122036

Plot the distribution of \(z\)-scores in this simulation against a standard normal density (to make sure we don’t need to use a \(t\) distribution instead).

grid = np.linspace(-4, 4, 1000)
plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.hist(z, bins=11, color='0.7', density=True)
plt.plot(grid, st.norm().pdf(grid), lw=1, c='k')
plt.xlabel('$z$-score')
plt.ylabel('Density')
plt.tight_layout()

b_cells-sim-unequal-s-dist.png

Anti-conservative \(z\)-test

x, onehot = simulate_null(dat, n_donors=1, n_cells=100, min_cells=100, seed=3)
y = x.X.A.copy()
# for j in range(y.shape[1]):
#   temp = y[:,j].copy()
#   np.random.shuffle(temp)
#   y[:,j] = temp
theta = np.log(onehot.T @ y) - np.log(onehot.T @ y.sum(axis=1, keepdims=True))
bhat = theta[0] - theta[1]
V = 1 / (onehot.T @ y)
se = np.sqrt(V.sum(axis=0))
z = bhat / se
p = st.norm().sf(abs(z))
grid = np.linspace(-5, 5, 1000)
plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(bhat / se, bins=21, density=True, color='0.7')
plt.plot(grid, st.norm().pdf(grid), lw=1, c='k')
plt.xlabel('$z$-score')
plt.ylabel('Density')
plt.tight_layout()

point-example-hist.png

lim = [1e-4, 1]
cm = plt.get_cmap('Dark2')
plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.xscale('log')
plt.yscale('log')
for i, (F, label) in enumerate(zip([st.norm()], ['N(0, 1)'])):
  plt.plot(np.linspace(0, 1, z.shape[0] + 1)[1:], np.sort(F.cdf(z)), lw=1, c=cm(i), label=label)
plt.plot(lim, lim, lw=1, ls=':', c='k')
plt.xlim(lim)
plt.ylim(lim)
plt.legend(frameon=False)
plt.xlabel('Theoretical quantile')
plt.ylabel('Observed quantile')
plt.tight_layout()

point-example.png

Check whether the SE is calibrated.

n_trials = 50
n_bootstraps = 400
n_cells = 100
s = 1e4
result = []
for trial in range(n_trials):
  for theta in np.arange(-12, -2, 1):
    x = st.poisson(s * np.exp(theta)).rvs(size=n_cells)
    thetahat = np.log(x.sum()) - np.log(n_cells * s)
    se = np.sqrt(1 / x.sum())
    bootstraps = []
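    # Oracle parametric bootstrap: re-simulate from the true model and recompute the MLE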
    for b in range(n_bootstraps):
      x = st.poisson(s * np.exp(theta)).rvs(size=n_cells)
      bootstraps.append(np.log(x.sum()) - np.log(n_cells * s))
    se_b = np.array(bootstraps).std()
    result.append([theta, trial, thetahat, se, se_b])
result = pd.DataFrame(result, columns=['theta', 'trial', 'thetahat', 'se', 'se_b'])
plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.xscale('log')
plt.yscale('log')
plt.scatter(result['se'], result['se_b'], s=1, c='k')
lim = [1e-3, 1.5]
plt.plot(lim, lim, lw=1, ls=':', c='r')
plt.xlim(lim)
plt.ylim(lim)
plt.xlabel('Analytic SE')
plt.ylabel('Bootstrap SE')
plt.tight_layout()

se.png

Power (single cells as units)

To generate true positives treating cells as units, generate null data as above, and then use binomial thinning (Gerard 2019) to introduce effects of a given magnitude at effect genes. Assume effects are drawn from a scaled \(t\) distribution \(g = 0.1 t_1\).

result = dict()
for n_cells in (25, 50, 100):
  result[n_cells] = evaluate_power(dat, g=st.t(scale=0.1, df=1).rvs, n_cells=n_cells, min_cells=n_cells, n_trials=1)
result = pd.concat(result).reset_index(drop=True)
cm = plt.get_cmap('Paired')
plt.clf()
fig, ax = plt.subplots(1, 3, sharey=True)
fig.set_size_inches(5, 2.5)
labels = ['LV-EB', 'PM-EB', 'PM-Z']
for a, (t, g) in zip(ax, result[result['fdr_method'] == 'fdr_bh'].groupby('n_cells')):
  a.boxplot(g.pivot_table(columns='trial', index='method', values='power'), widths=0.35, medianprops={'color': 'k'}, flierprops={'marker': '.', 'markersize': 2})
  a.set_xticklabels(labels, rotation=90)
  a.set_xlabel('Method')
  a.set_title(f'$n$={t}')
ax[0].set_ylabel(r'Power (FDR 10%)')
fig.tight_layout()

power-2-0.01.png

Type 1 error rate (donors as units)

To generate null data, first randomly assign cells from a homogeneous sample to donors, then randomly assign donors to groups.

result = pd.concat(
  evaluate_type1_by_donor(dat, b=b, n_donors=4, n_cells=100, min_cells=100, n_trials=10)
  for b in np.log(np.linspace(1, 2, 4)))
plt.clf()
fig, ax = plt.subplots(1, 4, sharey=True)
fig.set_size_inches(5.5, 2.5)
labels = ['LV-EB', 'LV-Z', 'GAM-EB', 'GAM-Z', 'PM-EB', 'PM-Z']
for a, (k, g) in zip(ax, result.groupby('b')):
  a.boxplot(g.pivot_table(index=['method', 'test'], columns='trial', values='fdp'), widths=0.35, medianprops={'color': 'k'}, flierprops={'marker': '.', 'markersize': 2})
  a.axhline(y=0.01, c='r', lw=1, ls=':')
  a.set_xticklabels(labels, rotation=90)
  a.set_title(f'$b$={np.exp(k):.3g}')
  a.set_xlabel('Method')
ax[0].set_ylabel('Type 1 error rate')
ax[0].set_ylim(0, 0.1)
fig.tight_layout()

fpr-by-donor.png

Unequal size factors

Generate null data with different size factors by assigning different numbers of cells to different donors.

x1, onehot1 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=4, n_cells=50, min_cells=1)
x2, onehot2 = simulate_null(dat[dat.obs['cell_type'] == 'b_cells'], n_donors=4, n_cells=100, min_cells=1)
mix = x1.concatenate(x2)
sc.pp.filter_genes(mix, min_counts=10)
onehot = ss.block_diag([onehot1, onehot2], format='csr')
design = np.ones((8, 2))
design[:4,0] = 0

Look at the distribution of size factors.

plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.boxplot((onehot.T @ mix.X.sum(axis=1)).A.reshape(2, -1).T, widths=0.5, medianprops={'color': 'k'})
plt.xlabel('Group')
plt.ylabel('Size factor')
plt.tight_layout()

sim-ex-unequal-s-hist.png

Fit limma-voom, followed by EB shrinkage of standard errors.

(estimate_moderated_t(estimate_limma_voom(mix.X.A, onehot, design)) < 0.01).mean()
0.0

Look at the point mass estimates.

thetahat = np.log((onehot.T @ mix.X).A + 1) - np.log(onehot.T @ mix.X.sum(axis=1).A)
plt.clf()
plt.gcf().set_size_inches(2.5, 2.5)
plt.scatter(thetahat[:4].ravel(), thetahat[4:].ravel(), s=1, c='k', alpha=0.1)
lim = [-15, 3]
plt.plot(lim, lim, c='r', lw=1, ls=':')
plt.xlim(lim)
plt.ylim(lim)
plt.xlabel('Est mean (50 cells)')
plt.ylabel('Est mean (100 cells)')
plt.tight_layout()

sim-ex-unequal-s-point-mass.png

Power (donors as units)

To generate true positives treating donors as units, randomly sample cells from a homogeneous population, randomly assign cells to donors, randomly assign donors to groups, and then thin effect genes.

result = pd.concat(
  evaluate_power_by_donor(dat, g=st.t(scale=0.1, df=1).rvs, n_donors=n_donors, n_cells=50, n_trials=10)
  for n_donors in (4, 16, 64))
cm = plt.get_cmap('Paired')
plt.clf()
fig, ax = plt.subplots(1, 3, sharey=True)
fig.set_size_inches(5, 2.5)
labels = ['LV-EB', 'LV-Z', 'GAM-EB', 'GAM-Z', 'PM-EB', 'PM-Z']
for a, (k, g) in zip(ax, result[result['fdr_method'] == 'fdr_bh'].groupby('n_donors')):
  a.boxplot(g.pivot_table(columns='trial', index='method', values='power'), widths=0.4, medianprops={'color': 'k'}, flierprops={'marker': '.', 'markersize': 2})
  a.set_xticklabels(labels, rotation=90)
  a.set_xlabel('Method')
  a.set_title(f'$n$={k}')
ax[0].set_ylabel(r'Power (FDR 10%)')
fig.tight_layout()

power-by-donor.png

Author: Abhishek Sarkar

Created: 2020-07-09 Thu 23:44
