Parallel computation of randomized quantiles
Introduction
A simple idea to test goodness of fit is to use the fact that if \(x_i \sim F\), \(i = 1, \ldots, n\), then \(F(x_i) \sim \operatorname{Uniform}(0, 1)\). However, this idea breaks down for discontinuous \(F\), as arises from discrete data. Randomized quantiles (Dunn and Smyth 1996, Feng et al. 2017) can be used to side-step this issue. For count data, randomized quantiles are generated as
\begin{align}
q_i &= F(x_i - 1) + u_i f(x_i)\\
u_i &\sim \operatorname{Uniform}(0, 1),
\end{align}

where \(f\) denotes the probability mass function associated with \(F\).

Setup
import numpy as np
import scipy.sparse as ss
import scipy.stats as st
import sys
import time
%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'
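As a quick sanity check on the definition above (a sketch added for illustration, not part of the original analysis), randomized quantiles of draws from the true model should be approximately \(\operatorname{Uniform}(0, 1)\):

# Sketch: simulate Poisson data from a known model, compute randomized
# quantiles, and test them against Uniform(0, 1)
x_sim = np.random.poisson(lam=5, size=10000)
u = np.random.uniform(size=x_sim.shape[0])
F = st.poisson(5)
q = F.cdf(x_sim - 1) + u * F.pmf(x_sim)
st.kstest(q, 'uniform')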
Implementation
Time evaluation of the Poisson CDF for 15,000 elements.
p = 15000
x = np.ones(p)
%timeit st.poisson(x).cdf(x)
2.47 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
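For reference, the same computation can be written without constructing a frozen distribution, by passing the rate directly to the method (an aside; timing not reproduced here):

# Equivalent call via scipy.stats.poisson.cdf(k, mu)
st.poisson.cdf(x, x)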
For comparison, do the same in R.
module load R
srun --partition=mstephens R --quiet --no-save <<EOF
p <- 15000
x <- rep(1, p)
system.time(ppois(x, x))
EOF
> p <- 15000
> x <- rep(1, p)
> system.time(ppois(x, x))
   user  system elapsed
  0.003   0.001   0.003
>
The Census of Immune Cells is part of the Human Cell Atlas. Currently, it comprises scRNA-seq data of 593,844 cells from 16 donors. Read the sparse data and take 150,000 cells.
n = 150000
x_csr = ss.load_npz('/scratch/midway2/aksarkar/modes/immune-cell-census.npz')
x_csc = x_csr[:n].tocsc()
x_csc
<150000x16002 sparse matrix of type '<class 'numpy.int32'>' with 139419510 stored elements in Compressed Sparse Column format>
Compute the sparsity of \([x_{ij}]\).
1 - x_csc.nnz / np.prod(x_csc.shape)
0.9419157980252468
Compute how many GiB are used to store \([x_{ij}]\).
# int32 data, indices, and indptr: 4 bytes per stored value
4 * (x_csc.data.shape[0] + x_csc.indptr.shape[0] + x_csc.indices.shape[0]) / (2 ** 30)
1.038815911859274
Under the assumption \(x_{ij} \sim \operatorname{Pois}(\lambda_{ij})\), compute how many GiB would be required to store \(\lambda_{ij}\). This will be much larger than storing \(x_{ij}\), because \([\lambda_{ij}]\) is dense.
n, p = x_csc.shape
# Dense n x p matrix of rates, assuming 4-byte (32-bit) floats
(n * p * 4) / (2 ** 30)
8.941814303398132
Generate some random non-negative loadings and factors.
rank = 2
scale = np.sqrt(x_csc.mean() / rank)
l = np.random.uniform(1e-8, scale, size=(n, rank))
f = np.random.uniform(1e-8, scale, size=(p, rank))
lam = l @ f.T
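Note that lam = l @ f.T materializes the full dense \(n \times p\) matrix whose size was computed above. If that is a concern, one column of \(\lambda\) can be formed inside the loop instead (a sketch; the timing below uses the precomputed matrix):

# Column j of l @ f.T, without materializing the full n x p matrix
lam_j = l @ f[j]  # l is (n, rank) and f[j] is (rank,), so lam_j is (n,)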
Draw one randomized quantile for each element \(x_{ij}\). Report how many minutes elapsed.
start = time.time()
for j in range(x_csc.shape[1]):
    F = st.poisson(lam[:,j])
    u = np.random.uniform(size=n)
    # Convert sparse to dense, because scipy.stats does not support sparse
    # arguments
    query = x_csc[:,j].A.ravel()
    rpp = F.cdf(query - 1) + u * F.pmf(query)
elapsed = time.time() - start
elapsed / 60
7.754513065020244
This problem is embarrassingly parallel, because each iteration of the loop is independent (up to collecting the results).
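For illustration, here is a minimal sketch of one way to exploit this with multiprocessing; the helper rq_column and the per-column seeding are assumptions of the sketch, not necessarily the approach used for the analysis:

import multiprocessing as mp

def rq_column(j):
    """Draw randomized quantiles for column j (hypothetical helper)."""
    # Seed per column so that forked workers do not share an RNG stream
    u = np.random.default_rng(j).uniform(size=n)
    F = st.poisson(lam[:,j])
    query = x_csc[:,j].A.ravel()
    return F.cdf(query - 1) + u * F.pmf(query)

# On Linux, forked workers inherit x_csc, lam, and n from the parent process
with mp.Pool(processes=8) as pool:
    # rpp is a list of length p, one array of quantiles per column
    rpp = pool.map(rq_column, range(x_csc.shape[1]))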