Link functions in Poisson LRA
Introduction
We are interested in fitting the model: \( \newcommand\ml{\mathbf{L}} \newcommand\mf{\mathbf{F}} \)
\begin{align*} x_{ij} &\sim \operatorname{Poisson}(s_i \lambda_{ij})\\ h(\lambda_{ij}) &= (\ml\mf)_{ij} \end{align*}

GLM-PCA fits this model assuming \(h(x) = \ln x\), using Fisher scoring (Newton-Raphson updates). More precisely, GLM-PCA writes \(\lambda_{ij} = \exp(\ln s_i + (\mathbf{LF}')_{ij})\). If \([\ln\lambda_{ij}]\) has rank \(K\), then \([\ln s_i + \ln\lambda_{ij}]\) has rank at most \(K + 1\) (by subadditivity of rank), which we exploit in implementations that do not include size factors \(s_i\).
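To make the rank argument explicit, write \(\ln\mathbf{s}\) for the vector of log size factors and \(\mathbf{1}\) for a vector of ones (a worked derivation added here for convenience):

\begin{align*} [\ln s_i + (\ml\mf')_{ij}] = \ln\mathbf{s}\,\mathbf{1}' + \ml\mf' = \begin{bmatrix} \ln\mathbf{s} & \ml \end{bmatrix} \begin{bmatrix} \mathbf{1} & \mf \end{bmatrix}', \end{align*}

which is the product of an \(n \times (K + 1)\) matrix and the transpose of a \(p \times (K + 1)\) matrix, and therefore has rank at most \(K + 1\).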
In benchmarking GLM-PCA on the cortex
dataset (Zeisel et al. 2015), we
found severe numerical problems. Here, we investigate whether this is because
of the link function or the optimization algorithm.
Setup
import numpy as np
import pandas as pd
import scipy.stats as st
import scmodes
import wlra
import wlra.grad
%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'
Results
Cortex data
Read the data.
X = scmodes.dataset.cortex('/project2/mstephens/aksarkar/projects/singlecell-ideas/data/zeisel-2015/GSE60361_C1-3005-Expression.txt.gz', return_df=True)
GLM-PCA
Run GLM-PCA for different choices of rank \(K\) and report the per-sample training log likelihood.
Remarks.
- There is a bug in the software when fitting a rank 1 model.
- Townes et al. did not analyze this dataset in their paper.
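For reference, the reported score is the training log likelihood of the data under the fitted Poisson rates. A minimal sketch of that computation follows, assuming a hypothetical fitted rate matrix lam of the same shape as X; scmodes.benchmark.training_score_glmpca fits GLM-PCA and computes this internally, and its exact normalization (per sample versus per entry) may differ.

def training_llik(x, lam):
  # Element-wise Poisson log likelihood under hypothetical fitted rates `lam`,
  # averaged over the training data
  return st.poisson(mu=lam).logpmf(x).mean()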
llik = []
for k in range(2, 11):
  try:
    res = scmodes.benchmark.training_score_glmpca(X, n_components=k)
  except:
    res = np.nan
  llik.append(res)
llik = np.array(llik)
pd.Series({k: l for k, l in zip(np.arange(2, 11), llik)})
2    -4.805854e+02
3    -3.209583e+03
4    -5.343589e+01
5    -1.073886e+02
6             NaN
7    -7.383533e+01
8    -2.566765e+09
9             NaN
10            NaN
dtype: float64
Is the problem initialization? GLM-PCA uses a random initialization, so try multiple restarts.
opt = -np.inf
for i in range(5):
  try:
    res = scmodes.benchmark.training_score_glmpca(X, n_components=10)
    if res > opt:
      opt = res
  except:
    continue
opt
-53.37463058913195
PLRA1
Fit the model via expectation maximization.
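As intuition for what an iterative fit of this model looks like, here is a minimal sketch of one common scheme: Fisher scoring on the working response, with an unweighted truncated SVD as the rank-\(K\) projection. This is illustrative only, under those stated simplifications; the actual EM updates in wlra.plra may differ (e.g., by weighting the low rank step by the Fisher information).

def pois_lra_sketch(x, rank, max_iters=100, tol=1e-4):
  # Initialize the log mean at a constant
  eta = np.full(x.shape, np.log(x.mean() + 1e-8))
  obj = -np.inf
  for _ in range(max_iters):
    lam = np.exp(eta)
    # Fisher scoring working response for the log link
    z = eta + (x - lam) / lam
    # Unweighted rank-K projection via truncated SVD (a simplification; a
    # weighted low rank approximation would weight by the Fisher information)
    u, d, vt = np.linalg.svd(z, full_matrices=False)
    eta = (u[:, :rank] * d[:rank]) @ vt[:rank]
    update = st.poisson(mu=np.exp(eta)).logpmf(x).sum()
    if abs(update - obj) < tol:
      break
    obj = update
  return eta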
res = wlra.plra(X.values, rank=10, max_iters=50000, verbose=True)
Estimate the training log likelihood.
l = st.poisson(mu=np.exp(res)).logpmf(X.values)
l.mean()
-inf
Find data where we estimated \(\lambda_{ij} = 0\) for \(x_{ij} > 0\).
np.where(~np.isfinite(l))
(array([2928]), array([0]))
X.iloc[2928, 0]
2
Estimate the log likelihood of the remaining data.
np.ma.masked_invalid(l).mean()
-13.571587027584892
Gradient descent
Fit the model using gradient descent, accelerated by Adam.
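The fit below uses wlra.grad.PoissonFA. As a rough sketch of the underlying idea, here is a self-contained full-batch Adam loop on the negative Poisson log likelihood, assuming \(\eta = \ml\mf\) with \(\ml\) \(n \times K\), \(\mf\) \(K \times p\), and a hand-rolled Adam update; the actual PoissonFA implementation may differ (e.g., in initialization and stopping rules).

def adam_poisson_fa(x, rank, lr=1e-2, n_epochs=1000, b1=0.9, b2=0.999, eps=1e-8):
  rng = np.random.default_rng(0)
  n, p = x.shape
  # Loadings L (n x K) and factors F (K x p)
  params = [rng.normal(scale=0.1, size=(n, rank)),
            rng.normal(scale=0.1, size=(rank, p))]
  m = [np.zeros_like(t) for t in params]
  v = [np.zeros_like(t) for t in params]
  for t in range(1, n_epochs + 1):
    L, F = params
    lam = np.exp(L @ F)
    # Gradients of the negative log likelihood sum(lam - x * eta) w.r.t. L and F
    grads = [(lam - x) @ F.T, L.T @ (lam - x)]
    for j, g in enumerate(grads):
      # Adam: bias-corrected first and second moment estimates
      m[j] = b1 * m[j] + (1 - b1) * g
      v[j] = b2 * v[j] + (1 - b2) * g ** 2
      mhat = m[j] / (1 - b1 ** t)
      vhat = v[j] / (1 - b2 ** t)
      params[j] = params[j] - lr * mhat / (np.sqrt(vhat) + eps)
  return params

For example, L, F = adam_poisson_fa(X.values, rank=10) would give an estimate of \([\ln\lambda_{ij}]\) as L @ F.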
m = (wlra.grad.PoissonFA(n_samples=X.shape[0], n_features=X.shape[1],
                         n_components=10, log_link=True)
     .fit(X.values, lr=1e-2, max_epochs=10000, atol=1, verbose=True))
Estimate the training log likelihood.
st.poisson(mu=np.exp(m.L.dot(m.F))).logpmf(X.values).mean()
-2.9916342744465516
This result appears to be in line with our previous GLM-PCA results.