fQTL models

Introduction
Setup
Simulation
- Generate the data
Multi-tissue eQTL mapping
- Read the data
- Relationship of sample size to number of eGenes

Introduction

We previously fit fqtl models for 39,998 genes across 44 tissues from GTEx (version 6p). Here, we analyze the fitted fqtl models.

Setup

import glob
import itertools as it
import os.path
import numpy as np
import pandas as pd
import scipy.linalg as sl
import scipy.special as sp
import scipy.stats as st

%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])

import colorcet
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Nimbus Sans'

Simulation

Generate the data

parallel --dry-run ./sim_data.sh {} /broad/hptmp/aksarkar/simulation/{1}/{#} :::: /broad/compbio/ypp/gtex/analysis/fqtl-gtex/result/simulation/selected.genes ::: 1 2 3 ::: 0.05 0.1 0.15 0.2 0.3 ::: 1 3 5 ::: 5 10 20 30 ::: 1001 >joblist

qsub -cwd -V -terse -sync n -j y -N sim -l h_vmem=2G -t 1
readarray -O1 tasks < /broad/hptmp/aksarkar/simulation/joblist
source activate /broad/compbio/aksarkar/.conda/envs/fqtl
pushd /broad/compbio/ypp/gtex/analysis/fqtl-gtex/
exec ${tasks[$SGE_TASK_ID]}

Multi-tissue eQTL mapping

Read the data

Extract the GTEx univariate summary statistics.

qsub -cwd -V -terse -j y
cp /broad/compbio/data/GTEx/v6_GTEx_Analysis_2015-01-12/eqtl_data/GTEx_Analysis_2015-01-12_eGenesFiltered.tar.gz .
mkdir -p egenes-filtered
tar xf GTEx_Analysis_2015-01-12_eGenesFiltered.tar.gz -C egenes-filtered

Extract the gene expression matrices to get sample sizes.

qsub -cwd -V -terse -j y
cp /broad/compbio/data/GTEx/v6_GTEx_Analysis_2015-01-12/eqtl_data/GTEx_Analysis_2015-01-12_eQTLInputFiles_geneLevelNormalizedExpressionMatrices.tar.gz .
mkdir -p expr
tar xf GTEx_Analysis_2015-01-12_eQTLInputFiles_geneLevelNormalizedExpressionMatrices.tar.gz -C expr
parallel --halt=now,fail=1 -X gzip ::: expr/*.txt

qsub -cwd -V -terse -j y
function z () { zcat $1 | awk -vf=$(basename $1 _Analysis.expr.txt.gz) 'NR == 1 {n = NF - 1} END {print f, n, NR - 1; exit}'; }
export -f z
parallel -j1 z ::: *.expr.txt.gz | awk 'BEGIN {print "tissue", "num_samples", "num_genes"} {print}' >sample-sizes.txt

sample_sizes = pd.read_csv('/broad/hptmp/aksarkar/fqtl/expr/sample-sizes.txt', sep=' ')

Read the GTEx tissue colors.

colors = pd.read_csv('/broad/compbio/aksarkar/projects/gtex-fqtl/data/tissues.colors.txt', header=None, sep='\t')

Parse the fqtl results.

def unpack(X):
  if '|' in X.iloc[3]:
    T = pd.DataFrame(it.zip_longest(*X.apply(lambda x: str(x).split('|')).values, fillvalue=np.nan))
    T.columns = X.index
    T.fillna(T.iloc[0], inplace=True)
    return T
  else:
    return X.to_frame().T

fqtl_res = pd.concat([pd.read_csv(f'/broad/compbio/aksarkar/projects/gtex-fqtl/result/stat/chr{i}/50/combined.txt.gz', header=None, sep='\t') for i in range(1, 23)])
fqtl_res.columns = ['gene', 'chr', 'tss', 'tissue_idx', 'tissue_name', 'tissue_theta', 'tissue_se', 'tissue_lodds', 'snp_name', 'snp_theta', 'snp_se', 'snp_lodds', 'factor', 'pip']
fqtl_res = pd.concat([unpack(x) for _, x in fqtl_res[fqtl_res['pip'] == 0.95].iterrows()])
fqtl_res.to_csv('/broad/compbio/aksarkar/projects/gtex-fqtl/result/stat/fqtl-combined.txt.gz', sep='\t', compression='gzip')

Read the fqtl results.

fqtl_res = pd.read_csv('/broad/compbio/aksarkar/projects/gtex-fqtl/result/stat/fqtl-combined.txt.gz', sep='\t', index_col=0)

Count how many genes we retained at least one factor/loading pair for.

len(fqtl_res['gene'].unique())

Relationship of sample size to number of eGenes

Read the univariate eGenes (\(q < 0.05\)).

univariate_res = {os.path.basename(f[:-24]): pd.read_csv(f, sep='\t') for f in glob.glob('/broad/hptmp/aksarkar/fqtl/egenes-filtered/*.filteredEGenes')}

Read the list of genes for which we fit fqtl models.

valid_genes = pd.concat([pd.read_csv(f'/broad/compbio/aksarkar/projects/gtex-fqtl/data/fqtl-{i}-valid.genes.txt', sep='\t', header=None) for i in range(1, 23)])

Plot the relationship between eGenes and sample size.

fqtl_egenes = fqtl_res.groupby('tissue_name')['gene'].agg(lambda x: len(set(x)))
univariate_egenes = pd.DataFrame.from_dict({k: univariate_res[k].shape[0] for k in univariate_res}, orient='index')
J = (sample_sizes
     .merge(colors, left_on='tissue', right_on=0)
     .merge(fqtl_egenes, left_on='tissue', right_on='tissue_name')
     .merge(univariate_egenes, left_on='tissue', right_index=True)
     .rename({'gene': 'fqtl', '0_y': 'univariate', 1: 'color'}, axis='columns'))

plt.clf()
plt.gcf().set_size_inches(4, 3)
for _, row in J.iterrows():
  base = row['univariate']
  top = row['fqtl']
  plt.scatter(row['num_samples'], base, c=f'{row["color"]}', s=8, marker='.')
  plt.scatter(row['num_samples'], top, c=f'{row["color"]}', s=8, marker='^')
  plt.arrow(row['num_samples'], base, 0, top - base, lw=1, color=f'{row["color"]}')
plt.xlabel('Sample size')
plt.ylabel('Number of discovered eGenes')
plt.ylim(0, 11000)
plt.tight_layout()

fQTL finds more eGenes in brain tissues, but appears to over-regularize tissues with larger sample sizes. This is in constrast to what we observe in applying fQTL to GTEx v8.