Experimental fine-mapping of QTLs from scRNA-seq data

Introduction
Setup
Method
- Regulatory network inference from scRNA-seq data
Results
- Proof of concept

Introduction

One of the fundamental problems in translating genetic associations with disease into actionable insights is identifying the molecular mechanism by which they operate. What do we mean by “identifying a molecular mechanism”? Ideally, we would identify a cell type of action, a collection of regulatory elements, a target gene, and collection of upstream regulators (e.g., Smemo et al. 2013, Claussnitzer et al. 2015, Sekar et al. 2016).

Considerable progress has been made on identifying genomic annotations enriched for associations, which can potentially be used to identify putative cell types of action (e.g., Maurano et al. 2012, Gusev et al. 2015, Finucane et al. 2015, Finucane et al. 2016). Progress has also been made on statistically identifying putative causal variants directly from genotype and phenotype, possibly using enrichments as prior information (e.g., Hormozdiari et al. 2014, Pickrell 2014, Kichaev et al. 2014, Hormozdiari et al. 2014, Chung et al. 2014, Li and Kellis 2016). Recent attention has focused on the statistical problem of identifying the target gene from genotype, phenotype, and gene expression data (e.g., Gamazon et al. 2015, Gusev et al. 2016, Mancuso et al. 2017, Zhu et al. 2018).

Together, these statistical approaches can potentially be combined to identify a putative mechanism by which a disease-associated locus causes disease risk/onset. However, they are fundamentally limited by the availability of cell type–specific molecular data at the transcriptomic and epigenomic levels, which are required to derive relevant genomic annotations. These limitations have, in part, motivated recent efforts to generate relevant single cell genomic data (e.g., Regev et al. 2017).

In parallel, experimental approaches have been developed to identify functional regulatory elements using transfected reporter constructs (e.g., Melkinov et al. 2012, Arnold et al. 2013, Tewhey et al. 2016, Ernst et al. 2016, Wang et al. 2018), or more recently, combining CRISPR-perturbation with scRNA-seq phenotyping (e.g., Adamson et al. 2016, Dixit et al. 2016, Jaitin et al. 2016, Xie et al. 2017, Datlinger et al. 2017). These approaches have been limited by the cell type in which experiments are performed.

Here, we outline an alternative approach, which exploits the fact that naturally occuring stochastic variation in protein abundances of upstream regulators will lead to downstream variation in mRNA levels of target genes. This fact suggests that it could possible to identify upstream regulators, functional regulatory elements, and target genes directly from scRNA-seq data. Such an approach could avoid problems of collecting large numbers of donor individuals, and of picking the correct cell type/developmental time point a priori. Further, it could more directly answer the relevant questions for interpreting disease-associated loci. Longer term, this methodology could be extended to spatial transcriptomic measurements, allowing us to account for microenvironments and cell-cell interactions.

Setup

import anndata
import numpy as np
import pandas as pd
import rpy2.robjects.packages
import rpy2.robjects.pandas2ri
import txpred.models.susie
import torch
import torch.utils.data as td

rpy2.robjects.pandas2ri.activate()
mrash = rpy2.robjects.packages.importr('mr.ash.alpha')
susie = rpy2.robjects.packages.importr('susieR')

%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])

import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'

Method

Regulatory network inference from scRNA-seq data

We assume (Sarkar and Stephens 2020) \( \DeclareMathOperator\Pois{Poisson} \DeclareMathOperator\N{\mathcal{N}} \newcommand\xiplus{x_{i+}} \newcommand\mh{\mathbf{H}} \newcommand\vb{\mathbf{b}} \newcommand\mi{\mathbf{I}} \)

\begin{equation} x_{ij} \mid \xiplus, \eta_{ij} \sim \Pois(\xiplus \exp(\eta_{ij})), \end{equation}

where \(x_{ij}\) denotes the number of molecules of gene \(j\) observed in cell \(i\) and \(\xiplus \triangleq \sum_j x_{ij}\). We further assume

\begin{align} \eta_{\cdot j} &\sim \N([\mh]_{\cdot j} \vb_j, \sigma^2 \mi)\\ b_{j j^{\prime}} &\sim \pi \N(0, \sigma^2 \sigma_b^2) + (1 - \pi) \delta_0(\cdot), \end{align}

where \(\mh = [\eta_{ij}]\) and \([\mh]_{ij}\) denotes the submatrix of \(\mh\) obtained by deleting the \(i\)th row and \(j\)th column. Then, the vectors \(\vb_j\) give the off-diagonal elements of the adjacency matrix of the gene-gene-interaction network. We can estimate \(\eta_{ij}\) from the data using e.g., mpebpm (Sarkar et al. 2019), and estimate \(\vb_j\) from the data using e.g., susie (Wang et al. 2018).

Results

Proof of concept

Read the iPSC data (Sarkar et al. 2019).

dat = anndata.read_h5ad('/project2/mstephens/aksarkar/projects/singlecell-ideas/data/ipsc/ipsc.h5ad')

Read the estimated expression models.

log_mu = np.load('/scratch/midway2/aksarkar/ideas/mpebpm-ipsc-design-log-mu.npy')
neg_log_phi = np.load('/scratch/midway2/aksarkar/ideas/mpebpm-ipsc-design-neg-log-phi.npy')
logodds = np.load('/scratch/midway2/aksarkar/ideas/mpebpm-ipsc-design-logodds.npy')

Compute \(E[\ln\lambda_{ij}]\) to zeroth order. (This is needed because the expectation is not finite when \(\pi_j > 0\).)

onehot = pd.get_dummies(dat.obs['chip_id'], sparse=True)
del onehot['NA18498']
onehot = onehot.sparse.to_coo().tocsr()
# Important: this is cells by genes, not donors by genes
log_mean = (onehot @ -np.log1p(-logodds)) + np.log1p(dat.X + onehot @ np.exp(neg_log_phi)) - np.log(dat.X.sum(axis=1) + onehot @ np.exp(-log_mu + neg_log_phi))

Download a curated database of TF genes (Lambert et al. 2018)

curl -OL "http://humantfs.ccbr.utoronto.ca/download/v_1.01/DatabaseExtract_v_1.01.csv"

tfs = pd.read_csv('/scratch/midway2/aksarkar/singlecell/DatabaseExtract_v_1.01.csv', index_col=0)
query = dat.var.merge(tfs.loc[tfs['Is TF?'] == 'Yes'], left_index=True, right_on='Ensembl ID', how='inner').index
design = log_mean[:,query]

Focus on SKP1, an eQTL found in the previous study. First, fit sgvbvs, and report genes with non-trivial PIP.

gene = 'SKP1'
batch_size = 128
seed = 3
torch.manual_seed(seed)
j = np.where(dat.var['name'] == gene)[0][0]
data = td.DataLoader(
  td.TensorDataset(
    torch.tensor(design),
    torch.tensor(log_mean[:,j])),
  batch_size=batch_size,
  shuffle=True)
fit0 = (txpred.models.susie.GaussianRegression(txpred.models.susie.SpikeSlab(design.shape[1]))
        .fit(data, n_epochs=100, log_dir=f'runs/txpred/{gene}-{batch_size}-{seed}'))
dat[:,query].var.loc[(torch.sigmoid(fit0.prior.logits) > 0.1).numpy()]

chr     start       end name strand      source
index
ENSG00000087086  hs19  49468558  49470135  FTL      +  H. sapiens

Fit susie to get credible sets.

fit = susie.susie(design, log_mean[:,j], L=5)
fit.rx2('V')

array([0.04996618, 0.0089835 , 0.00727507, 0.00513533, 0.00376015])

Get the credible sets.

cs = [susie.susie_get_cs(fit, coverage=0.95).rx2('cs').rx2(i + 1) for i in range(5)]

dat[:,query].var.iloc[np.array(cs).ravel()]

chr     start       end     name strand      source
index
ENSG00000060656   hs1  29563028  29653325    PTPRU      +  H. sapiens
ENSG00000103145  hs16   3072621   3074287  HCFC1R1      -  H. sapiens
ENSG00000054116   hs1  36602173  36615098  TRAPPC3      -  H. sapiens
ENSG00000059122  hs16   2961938   3001209  FLYWCH1      +  H. sapiens
ENSG00000083544  hs13  60970591  61148012    TDRD3      +  H. sapiens

Get all predicted TF binding motifs within 1MB of SKP1.

module load htslib
curl -OL "http://compbio.mit.edu/encode-motifs/enrichments.txt.gz"
curl -OL "http://compbio.mit.edu/encode-motifs/matches.txt.gz"
zcat matches.txt.gz | sort -k2,2 -k3,3n -k4,4n | bgzip >motif-matches.txt.gz
tabix -s 2 -b 3 -e 4 motif-matches.txt.gz