Expression variation in Census of Immune Cells

Introduction
Setup
Data
- Pre-processing
- Read the data
Results

Introduction

The Census of Immune Cells is part of the Human Cell Atlas. Currently, it comprises scRNA-seq of 593,844 cells from 16 donors.

Setup

import anndata
import loompy
import functools as ft
import multiprocessing as mp
import numpy as np
import pandas as pd
import scanpy
import scipy.io as sio
import scipy.sparse as ss
import scipy.special as sp
import scmodes
import scmodes.benchmark.gof
import scmodes.ebpm.sgd
import sys
import torch

%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])

import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'

Data

Pre-processing

The data is available in loom and Matrix Market format. The loom format allows out-of-memory processing, but the entire sparse data should take 6GB of memory. loom stores the data as dense, indexed hdf5, making it unsuitable to load the entire data. However, out-of-memory scanning of the metadata seems to be faster than loading everything into pandas.

The Matrix Market data is stored using format real, which makes it twice as big as necessary. Convert the Matrix Market data to integer on disk.

sbatch --partition=mstephens
#!/bin/bash
zcat /project2/mstephens/aksarkar/projects/singlecell-ideas/data/human-cell-atlas/immune-cell-census/matrix.mtx.gz | \
    awk 'NR == 1 {sub("real", "integer", $4)} NR > 1 {$3 = int($3)} {print}' >immune-cell-census.mtx

Read the sparse data. (16 minutes; reading compressed takes 26 minutes)

x = sio.mmread('/scratch/midway2/aksarkar/modes/immune-cell-census.mtx')

Get its shape.

x.shape

(58347, 782859)

As loaded, rows are genes, and columns are samples. Get its size in memory.

pd.Series({k: sys.getsizeof(getattr(x, k)) for k in ('row', 'col', 'data')}) / (2 ** 30)

row     2.307106
col     2.307106
data    4.614212
dtype: float64

The data still got read in as int64 in COO format, which defeated the purpose of our processing on disk.

x.data.dtype

dtype('int64')

To actually do anything with the data, we need to convert to CSR/CSC. The current implementation converting COO to CSR is broken. One possible explanation is that the implementation of coalescing COO entries is the culprit, which isn't needed for this data (there should be no duplicate row/column entries).

Figure out whether the COO data is row-major or column-major.

(x.row[:5], x.col[:5])

(array([ 76, 198, 231, 326, 589], dtype=int32),
array([0, 0, 0, 0, 0], dtype=int32))

Columns are samples, which means we can compress the indices and transpose in one shot.

indices = x.row
indptr = np.hstack((0, 1 + np.where(np.diff(x.col))[0], x.nnz))

Make sure the compression was correct.

for j in range(10):
  assert (x.data[x.col == j] == x.data[indptr[j]:indptr[j+1]]).all()

We could actually get away with using np.int16, but we need torch.int32 on the GPU.

x.data.max()

y = ss.csr_matrix((x.data.astype(np.int32), indices, indptr), shape=tuple(reversed(x.shape)))

Get the droplets which are predicted to contain mRNA from an intact cell.

with loompy.connect('/project2/mstephens/aksarkar/projects/singlecell-ideas/data/human-cell-atlas/immune-cell-census.loom') as d:
  keep_samples = d.ca['emptydrops_is_cell'] == 't'
  donor = pd.Categorical(d.ca['donor_organism.provenance.document_id'][keep_samples])
  pd.DataFrame(donor).to_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-samples.txt.gz', sep='\t')
keep_samples.sum()

Count how many cells came from each donor.

donor.value_counts()

085e737d-adb5-4597-bd54-5ebeda170038    37262
0a6c46dd-0905-4581-95eb-d89eef8a7213    36368
0b91cb1f-e2a8-413a-836c-1d38e7af3f2d    36672
31f89559-2682-4bbc-84c6-826dfe4a4e39    26079
4a404c91-0dbf-4246-bc23-d13aff961ba7    37493
4e98f612-15ec-44ab-b5f9-39787f92b01a    44776
509c507c-4759-452f-994e-d134d90329fd    32926
53af872d-b838-44d6-ae1b-25b56405483c    46690
6072d1f5-aa0c-4ab1-a8a6-a00ab479a1ba    45513
9aaf8a07-924f-456c-86dc-82f5da718246    35527
af7fe7a6-7d7e-4cdf-9799-909680fa9a3f    33235
cf514c66-88b2-45e4-a397-7fb362ae9950    31551
d23515a7-e182-4bc6-89e2-b1635885c0ec    42175
e4b5115d-3a0d-4c50-aba4-04b5f76810da    32834
eb8fb36b-6e02-41c4-8760-3eabbde6bacb    35376
fb30bb83-0278-4117-bd42-e2e8dddfedfe    39367
dtype: int64

Only keep genes with non-zero observations in 1000 cells.

gene_detect = (y[keep_samples] > 0).tocsc().sum(axis=0).A.ravel()
keep_genes = gene_detect > 1000
keep_genes.sum()

Write the gene metadata, only for the genes analyzed.

with loompy.connect('/project2/mstephens/aksarkar/projects/singlecell-ideas/data/human-cell-atlas/immune-cell-census.loom') as d:
  genes = pd.DataFrame({k: d.ra[k] for k in d.ra}).loc[keep_genes]
  genes.to_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-genes.txt.gz', sep='\t')

We need to use CSC to subset genes, and then CSR to efficiently get minibatches (subset samples).

y_csr = y[keep_samples].tocsc()[:,keep_genes].tocsr()
y_csr.shape

(593844, 16002)

Write out the sparse data as npz. (5 minutes)

ss.save_npz('/scratch/midway2/aksarkar/modes/immune-cell-census.npz', y_csr)

Get its size on disk.

ls -lh /scratch/midway2/aksarkar/modes/immune-cell-census.npz

-rw-rw-r-- 1 aksarkar aksarkar 1.1G Dec 21 10:44 /scratch/midway2/aksarkar/modes/immune-cell-census.npz

Read the data

Read the sparse data. (20 seconds)

y_csr = ss.load_npz('/scratch/midway2/aksarkar/modes/immune-cell-census.npz')

Read the metadata.

genes = pd.read_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-genes.txt.gz', sep='\t', index_col=0)
donor = pd.Categorical(pd.read_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-samples.txt.gz', sep='\t', index_col=0)['0'])

Results

Gamma assumption

Fit a Gamma distribution to expression variation within each donor at each gene. (18 s/epoch)

gamma_res = dict()
for i, k in enumerate(donor.categories):
  print(f'Fitting donor {i}')
  gamma_res[k] = scmodes.ebpm.sgd.ebpm_gamma(y_csr[donor.codes == i], batch_size=128, lr=5e-2, max_epochs=10, verbose=True)
gamma_res = pd.concat({
  k: pd.DataFrame(np.vstack(gamma_res[k][:-1]).T, columns=['log_mu', 'neg_log_phi'], index=genes['Gene'])
  for k in gamma_res}, axis=1)

Write out the results.

gamma_res.to_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-gamma.txt.gz', sep='\t')

Test GOF at each gene.

gamma_gof = []
for k, g in gamma_res.groupby(level=0, axis=1):
  query = y_csr[donor.codes == k].tocsc()
  s = query.sum(axis=1)
  for j, (gene, (log_mu, neg_log_phi)) in enumerate(g[k].iterrows()):
    d, p = scmodes.benchmark.gof._gof(
      x=query[:,j].A.ravel(),
      cdf=scmodes.benchmark.gof._zig_cdf,
      pmf=scmodes.benchmark.gof._zig_pmf,
      size=s,
      log_mu=log_mu,
      log_phi=-neg_log_phi)
    gamma_gof.append([gene, k, d, p])
gamma_gof = pd.DataFrame(gamma_gof, columns=['gene', 'donor', 'stat', 'p'])

Write the GOF results.

gamma_gof.to_csv('/project2/mstephens/aksarkar/projects/singlecell-modes/data/gof/immune-census-gamma.txt.gz', sep='\t')

Read the GOF results.

gamma_gof = pd.read_csv('/project2/mstephens/aksarkar/projects/singlecell-modes/data/gof/immune-census-gamma.txt.gz', sep='\t', index_col=0)

Plot the histogram of GOF \(p\)-values.

plt.clf()
plt.gcf().set_size_inches(2, 2)
plt.hist(gamma_gof.pivot_table(index='gene', columns='donor', values='p').values.ravel(),
         bins=np.linspace(0, 1, 11), color='0.7', density=True)
plt.axhline(y=1, lw=1, ls=':', c='k')
plt.xlim(0, 1)
plt.xlabel('$p$-value')
plt.ylabel('Density')
plt.tight_layout()

Plot the GOF by donor.

plt.clf()
fig, ax = plt.subplots(4, 4, sharex=True)
fig.set_size_inches(6, 6)
for a, (k, g) in zip(ax.ravel(), gamma_gof.groupby(['donor'])):
  a.hist(g['p'].values.ravel(), bins=np.linspace(0, 1, 11), color='0.7', density=True)
  a.axhline(y=1, lw=1, ls=':', c='k')
  a.set_xlim(0, 1)
for a in ax:
  a[0].set_ylabel('Density')
for a in ax.T:
  a[-1].set_xlabel('$p$-value')
fig.tight_layout()

Report the number of genes which significantly depart from a Gamma assumption on expression variation (Bonferroni-corrected \(p < 0.05\)).

gamma_gof.groupby('donor').apply(lambda x: (x['p'] < 0.05 / x.shape[0]).sum())

donor
085e737d-adb5-4597-bd54-5ebeda170038    0
0a6c46dd-0905-4581-95eb-d89eef8a7213    0
0b91cb1f-e2a8-413a-836c-1d38e7af3f2d    1
31f89559-2682-4bbc-84c6-826dfe4a4e39    0
4a404c91-0dbf-4246-bc23-d13aff961ba7    1
4e98f612-15ec-44ab-b5f9-39787f92b01a    0
509c507c-4759-452f-994e-d134d90329fd    0
53af872d-b838-44d6-ae1b-25b56405483c    1
6072d1f5-aa0c-4ab1-a8a6-a00ab479a1ba    0
9aaf8a07-924f-456c-86dc-82f5da718246    0
af7fe7a6-7d7e-4cdf-9799-909680fa9a3f    0
cf514c66-88b2-45e4-a397-7fb362ae9950    0
d23515a7-e182-4bc6-89e2-b1635885c0ec    0
e4b5115d-3a0d-4c50-aba4-04b5f76810da    1
eb8fb36b-6e02-41c4-8760-3eabbde6bacb    0
fb30bb83-0278-4117-bd42-e2e8dddfedfe    0
dtype: int64

Find the gene which departed from a Gamma assumption.

query = gamma_gof[gamma_gof['donor'] == '0b91cb1f-e2a8-413a-836c-1d38e7af3f2d']
query[query['p'] < .05 / query.shape[0]]

gene                                 donor      stat         p
43477  MT-ND3  0b91cb1f-e2a8-413a-836c-1d38e7af3f2d  0.999999  0.000003

Point-Gamma assumption

Fit a point-Gamma assumption distribution to expression variation.

gamma_res = pd.read_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-gamma.txt.gz', sep='\t', header=[0, 1], index_col=0)
point_gamma_res = dict()
for i, k in enumerate(donor.categories):
  print(f'Fitting donor {i}')
  point_gamma_res[k] = scmodes.ebpm.sgd.ebpm_point_gamma(
    y_csr[donor.codes == i],
    init=(gamma_res[k, 'log_mu'].values.reshape(1, -1), 
          gamma_res[k, 'neg_log_phi'].values.reshape(1, -1)),
    batch_size=128,
    lr=2e-2,
    max_epochs=15,
    verbose=True)

Write out the results.

point_gamma_res = pd.concat({
  k: pd.DataFrame(np.vstack(point_gamma_res[k][:-1]).T, columns=['log_mu', 'neg_log_phi', 'logodds'], index=genes['Gene'])
  for k in point_gamma_res}, axis=1)
point_gamma_res.to_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-point-gamma.txt.gz', sep='\t')

Test for GOF.

point_gamma_gof = []
for k, g in point_gamma_res.groupby(level=0, axis=1):
  query = y_csr[donor.codes == k].tocsc()
  s = query.sum(axis=1)
  for j, (gene, (log_mu, neg_log_phi, logodds)) in enumerate(g[k].iterrows()):
    d, p = scmodes.benchmark.gof._gof(
      x=query[:,j].A.ravel(),
      cdf=scmodes.benchmark.gof._zig_cdf,
      pmf=scmodes.benchmark.gof._zig_pmf,
      size=s,
      log_mu=log_mu,
      log_phi=-neg_log_phi,
      logodds=logodds)
    point_gamma_gof.append([gene, k, d, p])
point_gamma_gof = pd.DataFrame(point_gamma_gof, columns=['gene', 'donor', 'stat', 'p'])

Write out the GOF results.

point_gamma_gof.to_csv('/project2/mstephens/aksarkar/projects/singlecell-modes/data/gof/immune-census-point-gamma.txt.gz')

plt.clf()
plt.gcf().set_size_inches(2, 2)
plt.hist(point_gamma_gof.pivot_table(index='gene', columns='donor', values='p').values.ravel(),
         bins=np.linspace(0, 1, 11), color='0.7', density=True)
plt.axhline(y=1, lw=1, ls=':', c='k')
plt.xlim(0, 1)
plt.xlabel('$p$-value')
plt.ylabel('Density')
plt.tight_layout()

Examples

Read the fitted Gamma distributions.

gamma_res = pd.read_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-gamma.txt.gz', sep='\t', header=[0, 1], index_col=0)
point_gamma_res = pd.read_csv('/scratch/midway2/aksarkar/modes/immune-cell-census-point-gamma.txt.gz', sep='\t', header=[0, 1], index_col=0)

Get the highest expressed genes.

(gamma_res
 .xs('log_mu', level=1, axis=1)
 .mean(axis=1)
 .sort_values(ascending=False)
 .head())

Gene
MALAT1   -2.718048
RPS27    -4.359568
RPL10    -4.504897
RPL13    -4.565336
RPS18    -4.632655
dtype: float64

y_csc = y_csr.tocsc()

yj = y_csc[:,genes.reset_index()[(genes['Gene'] == 'MALAT1').values].index[0]].A.ravel()
s = y_csr.sum(axis=1).A.ravel()
lam = yj / s

plt.clf()
fig, ax = plt.subplots(3, 1)
fig.set_size_inches(4, 6)

ax[0].hist(yj / s, bins=100, color='k')
ax[0].set_xlabel('$x_i / s_i$')
ax[0].set_ylabel('Number of cells')
ax[0].set_title('MALAT1')

grid = np.linspace(lam.min(), lam.max(), 1000)
for k, g in gamma_res.groupby(level=0, axis=1):
  F = g.loc['MALAT1']
  log_mu = F.xs('log_mu', level=1)[0]
  neg_log_phi = F.xs('neg_log_phi', level=1)[0]
  ax[1].plot(grid, st.gamma(a=np.exp(neg_log_phi), scale=np.exp(log_mu - neg_log_phi)).cdf(grid), c='k', lw=1, alpha=0.25)
ax[1].set_xlim(0, 0.3)
ax[1].set_xlabel('Latent gene expression')
ax[1].set_ylabel('Gamma CDF')

for k, g in point_gamma_res.groupby(level=0, axis=1):
  F = g.loc['MALAT1']
  log_mu = F.xs('log_mu', level=1)[0]
  neg_log_phi = F.xs('neg_log_phi', level=1)[0]
  logodds = F.xs('logodds', level=1)[0]
  ax[2].plot(grid, sp.expit(logodds) + sp.expit(-logodds) * st.gamma(a=np.exp(neg_log_phi), scale=np.exp(log_mu - neg_log_phi)).cdf(grid), c='k', lw=1, alpha=0.25)
ax[2].set_xlim(0, 0.3)
ax[2].set_xlabel('Latent gene expression')
ax[2].set_ylabel('Point-Gamma CDF')

fig.tight_layout()