Experimental fine-mapping of QTLs from scRNA-seq data
Table of Contents
Introduction
One of the fundamental problems in translating genetic associations with disease into actionable insights is identifying the molecular mechanism by which they operate. What do we mean by “identifying a molecular mechanism”? Ideally, we would identify a cell type of action, a collection of regulatory elements, a target gene, and collection of upstream regulators (e.g., Smemo et al. 2013, Claussnitzer et al. 2015, Sekar et al. 2016).
Considerable progress has been made on identifying genomic annotations enriched for associations, which can potentially be used to identify putative cell types of action (e.g., Maurano et al. 2012, Gusev et al. 2015, Finucane et al. 2015, Finucane et al. 2016). Progress has also been made on statistically identifying putative causal variants directly from genotype and phenotype, possibly using enrichments as prior information (e.g., Hormozdiari et al. 2014, Pickrell 2014, Kichaev et al. 2014, Hormozdiari et al. 2014, Chung et al. 2014, Li and Kellis 2016). Recent attention has focused on the statistical problem of identifying the target gene from genotype, phenotype, and gene expression data (e.g., Gamazon et al. 2015, Gusev et al. 2016, Mancuso et al. 2017, Zhu et al. 2018).
Together, these statistical approaches can potentially be combined to identify a putative mechanism by which a disease-associated locus causes disease risk/onset. However, they are fundamentally limited by the availability of cell type–specific molecular data at the transcriptomic and epigenomic levels, which are required to derive relevant genomic annotations. These limitations have, in part, motivated recent efforts to generate relevant single cell genomic data (e.g., Regev et al. 2017).
In parallel, experimental approaches have been developed to identify functional regulatory elements using transfected reporter constructs (e.g., Melkinov et al. 2012, Arnold et al. 2013, Tewhey et al. 2016, Ernst et al. 2016, Wang et al. 2018), or more recently, combining CRISPR-perturbation with scRNA-seq phenotyping (e.g., Adamson et al. 2016, Dixit et al. 2016, Jaitin et al. 2016, Xie et al. 2017, Datlinger et al. 2017). These approaches have been limited by the cell type in which experiments are performed.
Here, we outline an alternative approach, which exploits the fact that naturally occuring stochastic variation in protein abundances of upstream regulators will lead to downstream variation in mRNA levels of target genes. This fact suggests that it could possible to identify upstream regulators, functional regulatory elements, and target genes directly from scRNA-seq data. Such an approach could avoid problems of collecting large numbers of donor individuals, and of picking the correct cell type/developmental time point a priori. Further, it could more directly answer the relevant questions for interpreting disease-associated loci. Longer term, this methodology could be extended to spatial transcriptomic measurements, allowing us to account for microenvironments and cell-cell interactions.
Setup
import anndata import numpy as np import pandas as pd import rpy2.robjects.packages import rpy2.robjects.pandas2ri import txpred.models.susie import torch import torch.utils.data as td rpy2.robjects.pandas2ri.activate() mrash = rpy2.robjects.packages.importr('mr.ash.alpha') susie = rpy2.robjects.packages.importr('susieR')
%matplotlib inline %config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt plt.rcParams['figure.facecolor'] = 'w' plt.rcParams['font.family'] = 'Nimbus Sans'
Method
Regulatory network inference from scRNA-seq data
We assume (Sarkar and Stephens 2020) \( \DeclareMathOperator\Pois{Poisson} \DeclareMathOperator\N{\mathcal{N}} \newcommand\xiplus{x_{i+}} \newcommand\mh{\mathbf{H}} \newcommand\vb{\mathbf{b}} \newcommand\mi{\mathbf{I}} \)
\begin{equation} x_{ij} \mid \xiplus, \eta_{ij} \sim \Pois(\xiplus \exp(\eta_{ij})), \end{equation}where \(x_{ij}\) denotes the number of molecules of gene \(j\) observed in cell \(i\) and \(\xiplus \triangleq \sum_j x_{ij}\). We further assume
\begin{align} \eta_{\cdot j} &\sim \N([\mh]_{\cdot j} \vb_j, \sigma^2 \mi)\\ b_{j j^{\prime}} &\sim \pi \N(0, \sigma^2 \sigma_b^2) + (1 - \pi) \delta_0(\cdot), \end{align}
where \(\mh = [\eta_{ij}]\) and \([\mh]_{ij}\) denotes the submatrix of
\(\mh\) obtained by deleting the \(i\)th row and \(j\)th column. Then, the
vectors \(\vb_j\) give the off-diagonal elements of the adjacency matrix of
the gene-gene-interaction network. We can estimate \(\eta_{ij}\) from the
data using e.g., mpebpm (Sarkar et al. 2019), and
estimate \(\vb_j\) from the data using e.g., susie
(Wang et al. 2018).
Results
Proof of concept
Read the iPSC data (Sarkar et al. 2019).
dat = anndata.read_h5ad('/project2/mstephens/aksarkar/projects/singlecell-ideas/data/ipsc/ipsc.h5ad')
Read the estimated expression models.
log_mu = np.load('/scratch/midway2/aksarkar/ideas/mpebpm-ipsc-design-log-mu.npy') neg_log_phi = np.load('/scratch/midway2/aksarkar/ideas/mpebpm-ipsc-design-neg-log-phi.npy') logodds = np.load('/scratch/midway2/aksarkar/ideas/mpebpm-ipsc-design-logodds.npy')
Compute \(E[\ln\lambda_{ij}]\) to zeroth order. (This is needed because the expectation is not finite when \(\pi_j > 0\).)
onehot = pd.get_dummies(dat.obs['chip_id'], sparse=True) del onehot['NA18498'] onehot = onehot.sparse.to_coo().tocsr() # Important: this is cells by genes, not donors by genes log_mean = (onehot @ -np.log1p(-logodds)) + np.log1p(dat.X + onehot @ np.exp(neg_log_phi)) - np.log(dat.X.sum(axis=1) + onehot @ np.exp(-log_mu + neg_log_phi))
Download a curated database of TF genes (Lambert et al. 2018)
curl -OL "http://humantfs.ccbr.utoronto.ca/download/v_1.01/DatabaseExtract_v_1.01.csv"
tfs = pd.read_csv('/scratch/midway2/aksarkar/singlecell/DatabaseExtract_v_1.01.csv', index_col=0) query = dat.var.merge(tfs.loc[tfs['Is TF?'] == 'Yes'], left_index=True, right_on='Ensembl ID', how='inner').index design = log_mean[:,query]
Focus on SKP1, an eQTL found in the previous study. First, fit sgvbvs
,
and report genes with non-trivial PIP.
gene = 'SKP1' batch_size = 128 seed = 3 torch.manual_seed(seed) j = np.where(dat.var['name'] == gene)[0][0] data = td.DataLoader( td.TensorDataset( torch.tensor(design), torch.tensor(log_mean[:,j])), batch_size=batch_size, shuffle=True) fit0 = (txpred.models.susie.GaussianRegression(txpred.models.susie.SpikeSlab(design.shape[1])) .fit(data, n_epochs=100, log_dir=f'runs/txpred/{gene}-{batch_size}-{seed}')) dat[:,query].var.loc[(torch.sigmoid(fit0.prior.logits) > 0.1).numpy()]
chr start end name strand source index ENSG00000087086 hs19 49468558 49470135 FTL + H. sapiens
Fit susie
to get credible sets.
fit = susie.susie(design, log_mean[:,j], L=5) fit.rx2('V')
array([0.04996618, 0.0089835 , 0.00727507, 0.00513533, 0.00376015])
Get the credible sets.
cs = [susie.susie_get_cs(fit, coverage=0.95).rx2('cs').rx2(i + 1) for i in range(5)]
dat[:,query].var.iloc[np.array(cs).ravel()]
chr start end name strand source index ENSG00000060656 hs1 29563028 29653325 PTPRU + H. sapiens ENSG00000103145 hs16 3072621 3074287 HCFC1R1 - H. sapiens ENSG00000054116 hs1 36602173 36615098 TRAPPC3 - H. sapiens ENSG00000059122 hs16 2961938 3001209 FLYWCH1 + H. sapiens ENSG00000083544 hs13 60970591 61148012 TDRD3 + H. sapiens
Get all predicted TF binding motifs within 1MB of SKP1.
module load htslib curl -OL "http://compbio.mit.edu/encode-motifs/enrichments.txt.gz" curl -OL "http://compbio.mit.edu/encode-motifs/matches.txt.gz" zcat matches.txt.gz | sort -k2,2 -k3,3n -k4,4n | bgzip >motif-matches.txt.gz tabix -s 2 -b 3 -e 4 motif-matches.txt.gz