Gaussian methods
Table of Contents
Introduction
We focused our attention on count models for single- and multi-gene observation models. However, one commonly used strategy is to use Gaussian methods on (log-transformed) counts. Here, we investigate some of the issues related this strategy.
Setup
%matplotlib inline %config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt plt.rcParams['figure.facecolor'] = 'w' plt.rcParams['font.family'] = 'Nimbus Sans'
Results
PCA on (log) counts
Investigate whether PCA on counts or log counts gives better log likelihood. We can write principal components analysis as a generative model (Tipping 1999):
\[ \mathbf{y}_i \sim \mathcal{N}(\mathbf{y}_i; \mathbf{W z}_i, \sigma^2 \mathbf{I}) \]
Now, suppose \(y_{ij} = \ln(x_{ij} + \epsilon)\). Then, via change of variables we have:
\[ p(\mathbf{x}_i) = \frac{1}{\mathbf{x}_i + \epsilon} \mathcal{N}(\ln(\mathbf{x}_i + \epsilon); \mathbf{W z}_i, \sigma^2 \mathbf{I}) \]
where we abuse notation for element-wise operations. In general, this means we can compare the likelihood of the observed data under models of different transformations of the observations.
<<imports>> import scipy.stats as st import sklearn.decomposition as skd ipsc = scmodes.dataset.ipsc('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/') res0 = skd.PCA(n_components=10).fit(ipsc) z_0 = res0.transform(ipsc) llik0 = st.norm(loc=z_0.dot(res0.components_), scale=np.sqrt(res0.noise_variance_)).logpdf(ipsc).sum() log_ipsc = np.log(ipsc + 1) res1 = skd.PCA(n_components=10).fit(log_ipsc) z_1 = res1.transform(ipsc) llik1 = (-log_ipsc + st.norm(loc=z_1.dot(res1.components_), scale=np.sqrt(res1.noise_variance_)).logpdf(log_ipsc)).sum() print(f'PCA of untransformed counts: {llik0:.4g}') print(f'PCA of log1p counts: {llik1:.4g}')
sbatch --partition=mstephens --mem=8G #!/bin/bash source activate scmodes python <<EOF <<pca>> EOF
PCA of untransformed counts: -4.712e+08 PCA of log1p counts: -3.57e+09
Factor analysis of log-transformed counts
By Taylor expansion, we have
\[ \ln(x + \epsilon) \approx \ln(\mu + \epsilon) - \frac{\sigma^2}{2(\mu + \epsilon)^2}, \]
where \(\mu = E[x]\), \(\sigma^2 = V[x]\), and \(\epsilon\) is the pseudocount. This result suggests a connection between factor analysis on \(\ln(x + \epsilon)\) and the class of multi-gene observation models on \(x\) we discussed
\begin{align*} x_{ij} &\sim \operatorname{Poisson}(\lambda_{ij})\\ \lambda_{ij} &= \mu_{ij} u_{ij}\\ \mu_{ij} &= (\mathbf{L}\mathbf{F}')_{ij}\\ u_{ij} &\sim p(u_{ij}) \end{align*}