Gaussian methods

Table of Contents

Introduction

We focused our attention on count models for single- and multi-gene observation models. However, one commonly used strategy is to use Gaussian methods on (log-transformed) counts. Here, we investigate some of the issues related this strategy.

Setup

%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'

Results

PCA on (log) counts

Investigate whether PCA on counts or log counts gives better log likelihood. We can write principal components analysis as a generative model (Tipping 1999):

\[ \mathbf{y}_i \sim \mathcal{N}(\mathbf{y}_i; \mathbf{W z}_i, \sigma^2 \mathbf{I}) \]

Now, suppose \(y_{ij} = \ln(x_{ij} + \epsilon)\). Then, via change of variables we have:

\[ p(\mathbf{x}_i) = \frac{1}{\mathbf{x}_i + \epsilon} \mathcal{N}(\ln(\mathbf{x}_i + \epsilon); \mathbf{W z}_i, \sigma^2 \mathbf{I}) \]

where we abuse notation for element-wise operations. In general, this means we can compare the likelihood of the observed data under models of different transformations of the observations.

<<imports>>
import scipy.stats as st
import sklearn.decomposition as skd

ipsc = scmodes.dataset.ipsc('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/')
res0 = skd.PCA(n_components=10).fit(ipsc)
z_0 = res0.transform(ipsc)
llik0 = st.norm(loc=z_0.dot(res0.components_), scale=np.sqrt(res0.noise_variance_)).logpdf(ipsc).sum()

log_ipsc = np.log(ipsc + 1)
res1 = skd.PCA(n_components=10).fit(log_ipsc)
z_1 = res1.transform(ipsc)
llik1 = (-log_ipsc + st.norm(loc=z_1.dot(res1.components_), scale=np.sqrt(res1.noise_variance_)).logpdf(log_ipsc)).sum()
print(f'PCA of untransformed counts: {llik0:.4g}')
print(f'PCA of log1p counts: {llik1:.4g}')
sbatch --partition=mstephens --mem=8G
#!/bin/bash
source activate scmodes
python <<EOF
<<pca>>
EOF
PCA of untransformed counts: -4.712e+08
PCA of log1p counts: -3.57e+09

Factor analysis of log-transformed counts

By Taylor expansion, we have

\[ \ln(x + \epsilon) \approx \ln(\mu + \epsilon) - \frac{\sigma^2}{2(\mu + \epsilon)^2}, \]

where \(\mu = E[x]\), \(\sigma^2 = V[x]\), and \(\epsilon\) is the pseudocount. This result suggests a connection between factor analysis on \(\ln(x + \epsilon)\) and the class of multi-gene observation models on \(x\) we discussed

\begin{align*} x_{ij} &\sim \operatorname{Poisson}(\lambda_{ij})\\ \lambda_{ij} &= \mu_{ij} u_{ij}\\ \mu_{ij} &= (\mathbf{L}\mathbf{F}')_{ij}\\ u_{ij} &\sim p(u_{ij}) \end{align*}

Author: Abhishek Sarkar

Created: 2020-03-05 Thu 23:06

Validate