Link functions in Poisson LRA

Introduction

We are interested in fitting the model: \( \newcommand\ml{\mathbf{L}} \newcommand\mf{\mathbf{F}} \)

\begin{align*} x_{ij} &\sim \operatorname{Poisson}(s_i \lambda_{ij})\\ h(\lambda_{ij}) &= (\ml\mf)_{ij} \end{align*}

GLM-PCA fits this model assuming \(h(\lambda) = \ln \lambda\), using Fisher scoring (Newton-Raphson with the expected information). More precisely, GLM-PCA writes \(s_i\lambda_{ij} = \exp(\ln s_i + (\mathbf{LF}')_{ij})\). If \([\ln\lambda_{ij}]\) has rank \(K\), then \([\ln s_i + \ln\lambda_{ij}]\) has rank at most \(K + 1\) (by subadditivity of rank, since \([\ln s_i]\) has rank one). We will use this fact in implementations that do not explicitly model size factors \(s_i\): the size factors can be absorbed by fitting one extra rank.
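
To make the rank bookkeeping concrete, here is a small numpy check (dimensions and seed arbitrary): \([\ln s_i]\) is itself a rank-one matrix, so adding it to a rank-\(K\) matrix yields rank at most \(K + 1\).

import numpy as np
rng = np.random.default_rng(1)
n, p, K = 50, 30, 3
log_lam = rng.normal(size=(n, K)) @ rng.normal(size=(K, p))  # rank K
log_s = rng.normal(size=(n, 1))  # broadcasts to the rank-one matrix [ln s_i]
np.linalg.matrix_rank(log_s + log_lam)  # generically K + 1 = 4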

In benchmarking GLM-PCA on the cortex dataset (Zeisel et al. 2015), we found severe numerical problems. Here, we investigate whether these problems are due to the link function or to the optimization algorithm.

Setup

import numpy as np
import pandas as pd
import scipy.stats as st
import scmodes
import wlra
import wlra.grad
%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'

Results

Cortex data

Read the data.

X = scmodes.dataset.cortex('/project2/mstephens/aksarkar/projects/singlecell-ideas/data/zeisel-2015/GSE60361_C1-3005-Expression.txt.gz', return_df=True)

GLM-PCA

Run GLM-PCA for different choices of rank \(K\) and report the per-sample training log likelihood.

Remarks.

  • There is a bug in the software when fitting a rank-one model, so we start at \(K = 2\).
  • Townes et al. (2019) did not analyze this dataset in their paper.

llik = []
for k in range(2, 11):
  try:
    res = scmodes.benchmark.training_score_glmpca(X, n_components=k)
  except Exception:
    # GLM-PCA can fail outright; record NaN and move on
    res = np.nan
  llik.append(res)
llik = np.array(llik)
pd.Series(llik, index=np.arange(2, 11))
2    -4.805854e+02
3    -3.209583e+03
4    -5.343589e+01
5    -1.073886e+02
6              NaN
7    -7.383533e+01
8    -2.566765e+09
9              NaN
10             NaN
dtype: float64

Is the problem the initialization? GLM-PCA uses a random initialization, so try multiple restarts.

opt = -np.inf
for i in range(5):
  try:
    res = scmodes.benchmark.training_score_glmpca(X, n_components=10)
    if res > opt:
      opt = res
  except Exception:
    # discard failed restarts
    continue
opt
-53.37463058913195

PLRA1

Fit the model via expectation maximization.

res = wlra.plra(X.values, rank=10, max_iters=50000, verbose=True)
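
One way such a scheme can be organized is as iteratively reweighted low rank approximation: linearize the Poisson log likelihood around the current log-mean matrix, then solve a weighted LRA subproblem, which is itself amenable to EM (Srebro and Jaakkola 2003). The sketch below illustrates the idea only; it is not necessarily wlra.plra's exact update, and an unweighted truncated SVD stands in for the weighted solver.

def plra_step(x, log_mu, rank):
  # One Fisher scoring-style outer iteration for Poisson LRA with a log link
  lam = np.exp(log_mu)
  z = log_mu + (x - lam) / lam  # IRLS working response (requires lam > 0)
  u, d, vt = np.linalg.svd(z, full_matrices=False)
  return u[:, :rank] * d[:rank] @ vt[:rank]  # project back to rank K

Iterating plra_step from, e.g., log_mu = np.log(X.values + 1) until the log likelihood stabilizes gives the flavor of the procedure.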

Estimate the training log likelihood.

l = st.poisson(mu=np.exp(res)).logpmf(X.values)  # res is the fitted matrix of log means
l.mean()
-inf

Find the entries where we estimated \(\lambda_{ij} = 0\) but observed \(x_{ij} > 0\).

np.where(~np.isfinite(l))
(array([2928]), array([0]))

X.iloc[2928, 0]
2

Estimate the log likelihood of the remaining data.

np.ma.masked_invalid(l).mean()
-13.571587027584892
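
The \(-\infty\) could also be avoided at evaluation time by flooring the fitted log means, so that every \(\lambda_{ij}\) is strictly positive (a hypothetical workaround, not something done in this analysis):

# Hypothetical workaround: floor the fitted log means before exponentiating
st.poisson(mu=np.exp(np.clip(res, -30, None))).logpmf(X.values).mean()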

Gradient descent

Fit the model by gradient descent on the negative log likelihood, using the Adam optimizer.

m = (wlra.grad.PoissonFA(n_samples=X.shape[0], n_features=X.shape[1], n_components=10, log_link=True)
     .fit(X.values, lr=1e-2, max_epochs=10000, atol=1, verbose=True))
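
For reference, here is a minimal self-contained numpy sketch of this approach (an illustration of Adam applied to the Poisson log likelihood under a log link, not the wlra.grad.PoissonFA implementation; hyperparameter defaults follow Kingma and Ba 2015):

def poisson_lra_adam(x, rank, lr=1e-2, max_epochs=1000, b1=0.9, b2=0.999,
                     eps=1e-8, seed=0):
  """Fit x_ij ~ Poisson(exp((LF)_ij)) by maximizing the log likelihood with Adam"""
  rng = np.random.default_rng(seed)
  n, p = x.shape
  params = [rng.normal(scale=0.1, size=(n, rank)),  # L
            rng.normal(scale=0.1, size=(rank, p))]  # F
  m = [np.zeros_like(a) for a in params]  # first moment estimates
  v = [np.zeros_like(a) for a in params]  # second moment estimates
  for t in range(1, max_epochs + 1):
    L, F = params
    resid = np.exp(L @ F) - x  # gradient of the negative log likelihood wrt LF
    grads = [resid @ F.T, L.T @ resid]
    for i, g in enumerate(grads):
      m[i] = b1 * m[i] + (1 - b1) * g
      v[i] = b2 * v[i] + (1 - b2) * g ** 2
      mhat = m[i] / (1 - b1 ** t)  # bias-corrected moment estimates
      vhat = v[i] / (1 - b2 ** t)
      params[i] = params[i] - lr * mhat / (np.sqrt(vhat) + eps)
  return params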

Estimate the training log likelihood.

st.poisson(mu=np.exp(m.L.dot(m.F))).logpmf(X.values).mean()
-2.9916342744465516

This result appears to be in line with our previous GLM-PCA results.

Author: Abhishek Sarkar

Created: 2019-09-05 Thu 21:09
